Failing to save checkpoints using experimental PyTorch API

Observed Error

When the experimental API is enabled and either checkpoint_steps is set to 1 or save_initial_checkpoint is set to True in the runconfig section of the params.yaml file, the checkpoints saved at steps 0 and 1 are invalid and generally unusable.

For example, in params.yaml:

...
runconfig:
    ...
    experimental_api: True
    checkpoint_steps: 1   # not supported in 1.8 when using experimental API
    save_initial_checkpoint: True   # not supported in 1.8 when using experimental API
    ...
...

Explanation

A bug affects the format in which these two checkpoints are saved. As a result, they do not fully conform to Cerebras’s H5 checkpoint format and may fail to load at all. The bug may also affect the state of the rest of the run.

Workaround

If checkpoints are desired, ensure that:

  • checkpoint_steps is greater than 1

  • save_initial_checkpoint is not set to True.

For example, in params.yaml:

...
runconfig:
    ...
    experimental_api: True
    checkpoint_steps: 1000   # must be greater than 1
    save_initial_checkpoint: False   # must not be True
    ...
...
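
The two conditions above can also be checked before launching a run. The following is a minimal sketch, not part of any Cerebras API: the helper name find_unsupported_checkpoint_settings is hypothetical, and it assumes the runconfig section has already been loaded into a Python dict (for instance with a YAML parser of your choice).

```python
def find_unsupported_checkpoint_settings(runconfig):
    """Return a list of problems with the checkpoint settings (empty if none).

    Hypothetical helper for illustration only; `runconfig` is assumed to be
    the already-parsed runconfig section of params.yaml.
    """
    problems = []

    # The restriction described here applies only with the experimental API.
    if not runconfig.get("experimental_api", False):
        return problems

    # checkpoint_steps: 1 produces an invalid checkpoint at step 1.
    if runconfig.get("checkpoint_steps") == 1:
        problems.append("checkpoint_steps must be greater than 1")

    # save_initial_checkpoint: True produces an invalid checkpoint at step 0.
    if runconfig.get("save_initial_checkpoint", False) is True:
        problems.append("save_initial_checkpoint must not be True")

    return problems


# Settings from the failing example.
bad = {
    "experimental_api": True,
    "checkpoint_steps": 1,
    "save_initial_checkpoint": True,
}

# Settings from the workaround example.
good = {
    "experimental_api": True,
    "checkpoint_steps": 1000,
    "save_initial_checkpoint": False,
}

print(find_unsupported_checkpoint_settings(bad))
print(find_unsupported_checkpoint_settings(good))
```

An empty list means neither of the two affected settings is in use. The check deliberately ignores configurations that do not enable experimental_api, since the issue is only described for that mode.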