Failing to save checkpoints using experimental PyTorch API
Observed Error
When checkpoint_steps is set to 1 or save_initial_checkpoint is set to True in the runconfig section of the params.yaml file, the resulting checkpoints at steps 0 and 1 are invalid and won’t be generally usable.
e.g. params.yaml
...
runconfig:
    ...
    experimental_api: True
    checkpoint_steps: 1 # not supported in 1.8 when using experimental API
    save_initial_checkpoint: True # not supported in 1.8 when using experimental API
    ...
...
Explanation
A bug was found in the format used to save these two checkpoints. As a result, they do not fully conform to Cerebras’s H5 checkpointing format and may fail to load at all. The bug may also affect the state of the rest of the run.
Workaround
If checkpoints are desired, ensure that checkpoint_steps is greater than 1 and that save_initial_checkpoint is not set to True.
e.g. params.yaml
...
runconfig:
    ...
    experimental_api: True
    checkpoint_steps: 1000 # must be greater than 1
    save_initial_checkpoint: False # must not be True
    ...
...
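To catch the problematic settings before launching a run, the two conditions above can be checked in a small pre-flight script. The following is a minimal sketch, not part of any Cerebras package: the helper name find_unsafe_checkpoint_settings is hypothetical, and it assumes the contents of params.yaml have already been loaded into a plain Python dict. It flags only the two settings described on this page, and only when experimental_api is enabled.

```python
from typing import Any, Dict, List


def find_unsafe_checkpoint_settings(params: Dict[str, Any]) -> List[str]:
    """Return a list of problems found in the runconfig checkpoint settings.

    Hypothetical helper for illustration only. It encodes just the two
    conditions described on this page and nothing else.
    """
    runconfig = params.get("runconfig") or {}
    problems: List[str] = []

    # The issue is only documented for runs using the experimental API.
    if not runconfig.get("experimental_api", False):
        return problems

    # checkpoint_steps set to 1 produces an invalid checkpoint at step 1.
    if runconfig.get("checkpoint_steps") == 1:
        problems.append("checkpoint_steps must be greater than 1")

    # save_initial_checkpoint set to True produces an invalid checkpoint at step 0.
    if runconfig.get("save_initial_checkpoint") is True:
        problems.append("save_initial_checkpoint must not be True")

    return problems


# Example: the settings from the "Observed Error" section above.
example_params = {
    "runconfig": {
        "experimental_api": True,
        "checkpoint_steps": 1,
        "save_initial_checkpoint": True,
    }
}
for problem in find_unsafe_checkpoint_settings(example_params):
    print(problem)
```

An empty list means neither of the two conditions was found. The check is deliberately narrow: it does not validate any other runconfig option, and it treats a missing key as safe, which is an assumption about the defaults rather than something this page states.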