Failing to automatically load checkpoints
After adding a custom checkpoint to model_dir, the run command does not automatically pick it up and run with it. Instead, it reports that no checkpoint is found.
Cerebras ModelZoo PyTorch runs have a feature (enabled by default) that auto-loads the last available checkpoint in model_dir if --checkpoint_path is not explicitly provided.
Only files that follow a specific naming scheme are considered when looking for the latest checkpoint.
All files in model_dir that match the format checkpoint_<step>.mdl are checked.
If one or more are found, the file with the highest value of <step> is chosen, and the model weights are initialized from that checkpoint.
A checkpoint saved under any other name is not detected, which is why a custom checkpoint can be reported as missing.
This feature can be turned off by setting the corresponding option to False in the params YAML file.
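As a rough illustration of the auto-load lookup described above, the following Python sketch scans a directory for files named checkpoint_<step>.mdl and returns the one with the highest step. It is not the ModelZoo implementation; the function name find_latest_checkpoint is hypothetical, and the sketch assumes <step> is a non-negative integer that is compared numerically.

```python
import re
from pathlib import Path
from typing import Optional

# Naming scheme from the text above: checkpoint_<step>.mdl
# Assumption: <step> consists only of decimal digits.
CHECKPOINT_PATTERN = re.compile(r"^checkpoint_(\d+)\.mdl$")


def find_latest_checkpoint(model_dir: str) -> Optional[Path]:
    """Hypothetical helper: return the matching file with the highest step.

    Returns None when no file in model_dir follows the naming scheme.
    """
    best_step = -1
    best_path: Optional[Path] = None
    for path in Path(model_dir).iterdir():
        match = CHECKPOINT_PATTERN.match(path.name)
        # Skip directories and anything that does not follow the scheme.
        if match is None or not path.is_file():
            continue
        # Compare steps as integers so that 2000 ranks above 900.
        step = int(match.group(1))
        if step > best_step:
            best_step, best_path = step, path
    return best_path
```

Calling find_latest_checkpoint on a directory that contains only a file such as my_custom.mdl returns None under this sketch, which mirrors the "no checkpoint is found" symptom described at the top of this section.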
You can either
Provide a checkpoint inside model_dir with the naming format described above
Specify checkpoint path by using the