Failing to save checkpoints using experimental PyTorch API
When using the experimental PyTorch API, if checkpoint_steps is set to 1 or save_initial_checkpoint is set to True in the runconfig section of the params.yaml file, the resulting checkpoints on steps 0 and 1 are invalid and won't be generally usable.
    ...
    runconfig:
        ...
        experimental_api: True
        checkpoint_steps: 1            # not supported in 1.8 when using experimental API
        save_initial_checkpoint: True  # not supported in 1.8 when using experimental API
        ...
    ...
A bug was found in the format used to save these two checkpoints. As a result, they do not fully conform to Cerebras's H5 checkpointing format and may fail to load at all. The bug may also affect the state of the rest of the run.
If checkpoints are desired, please ensure that checkpoint_steps is greater than 1 and save_initial_checkpoint is not set to True.
    ...
    runconfig:
        ...
        experimental_api: True
        checkpoint_steps: 1000          # must be greater than 1
        save_initial_checkpoint: False  # must not be True
        ...
    ...
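The two conditions above can be checked before launching a run. The following is a minimal sketch, not part of any Cerebras tooling: the helper name find_unsafe_checkpoint_settings is hypothetical, and it assumes the runconfig section has already been loaded from params.yaml into a plain Python dictionary. It flags only the two settings described in this note and leaves any other value of checkpoint_steps alone.

```python
from typing import Any, Dict, List


def find_unsafe_checkpoint_settings(runconfig: Dict[str, Any]) -> List[str]:
    """Return a description of each affected setting (hypothetical helper).

    An empty list means neither setting described in this note is present.
    """
    problems: List[str] = []

    # The issue is only described for runs that use the experimental API.
    if not runconfig.get("experimental_api", False):
        return problems

    # checkpoint_steps: 1 produces an invalid checkpoint on step 1.
    steps = runconfig.get("checkpoint_steps")
    if isinstance(steps, int) and not isinstance(steps, bool) and steps == 1:
        problems.append("checkpoint_steps is 1; use a value greater than 1")

    # save_initial_checkpoint: True produces an invalid checkpoint on step 0.
    if bool(runconfig.get("save_initial_checkpoint", False)):
        problems.append("save_initial_checkpoint is True; set it to False")

    return problems


if __name__ == "__main__":
    # Values mirror the two example configurations in this note.
    affected = {
        "experimental_api": True,
        "checkpoint_steps": 1,
        "save_initial_checkpoint": True,
    }
    safe = {
        "experimental_api": True,
        "checkpoint_steps": 1000,
        "save_initial_checkpoint": False,
    }
    for name, cfg in (("affected", affected), ("safe", safe)):
        print(name, find_unsafe_checkpoint_settings(cfg))
```

Returning a list of messages rather than raising keeps the check usable both as a pre-launch warning and as a hard failure, depending on how the caller handles a non-empty result.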