Upgrade checkpoints from previous releases

Overview

Cerebras software version 2.0 introduces a new checkpointing implementation, so checkpoints from older versions (prior to 2.0) may not work with 2.0+ software. To address this, we offer two scripts, fix_checkpoint_spec.py and data_iter_state_conversion.py, which upgrade and convert older checkpoints for use with 2.0+ software.

fix_checkpoint_spec.py fixes the checkpoint itself. It was introduced because a backward-incompatible change in the checkpoint format prevents older checkpoints from being loaded in 2.0.

data_iter_state_conversion.py converts dataloader state files saved in release 1.9 to the dataloader checkpoint format used by the new map and iterable dataloaders in Model Zoo in release 2.0. This allows a deterministic restart of the dataloader for model training jobs that move from release 1.9 to release 2.0.

fix_checkpoint_spec.py

Purpose

Fixes the specification of checkpoints taken with pre-2.0 software so that they can be converted for use with 2.0+ software.

Usage

The fix_checkpoint_spec.py script is invoked as follows:

fix_checkpoint_spec.py [-h] [--fixed_checkpoint_path FIXED_CHECKPOINT_PATH] [--fix_inplace] checkpoint_path

Description

Fixes the specification of checkpoints taken in releases < 2.0 so that they are functional in releases >= 2.0.

The arguments are listed below:

Positional argument	Description
checkpoint_path	Path to the checkpoint to fix.

Optional argument	Description
-h, --help	Show this help message and exit.
--fixed_checkpoint_path FIXED_CHECKPOINT_PATH	Path to save the fixed checkpoint to. If not provided, the fixed checkpoint is saved to the same path as the original checkpoint, with '_fixed' appended to the name.
--fix_inplace	Fix the checkpoint in place instead of saving to a new file.
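
As an illustration of how the arguments above combine, here is a minimal Python sketch that assembles the command line for fix_checkpoint_spec.py. The helper name build_fix_command and the checkpoint file names are hypothetical, and the sketch assumes the script is launched with python from the directory that contains it. Treating --fixed_checkpoint_path and --fix_inplace as mutually exclusive is also an assumption made here for clarity; the script itself may handle that combination differently.

```python
from typing import List, Optional


def build_fix_command(
    checkpoint_path: str,
    fixed_checkpoint_path: Optional[str] = None,
    fix_inplace: bool = False,
    script: str = "fix_checkpoint_spec.py",
) -> List[str]:
    """Assemble an argument list for fix_checkpoint_spec.py (illustrative helper)."""
    # Assumption: the two output options are alternatives, so this helper
    # refuses to build a command that requests both.
    if fixed_checkpoint_path and fix_inplace:
        raise ValueError("choose either fixed_checkpoint_path or fix_inplace")

    cmd = ["python", script]
    if fixed_checkpoint_path:
        cmd += ["--fixed_checkpoint_path", fixed_checkpoint_path]
    if fix_inplace:
        cmd.append("--fix_inplace")
    # The checkpoint to fix is the single positional argument.
    cmd.append(checkpoint_path)
    return cmd


# Hypothetical file names, shown only to illustrate the resulting command.
print(" ".join(build_fix_command("old_ckpt.mdl", fixed_checkpoint_path="new_ckpt.mdl")))
```

The returned list can be passed to subprocess.run(cmd, check=True) to execute the script. Building the command as a list rather than a single string avoids shell quoting problems with paths that contain spaces.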

Note

fix_checkpoint_spec.py only fixes the checkpoint specification. You still need to use convert_checkpoint.py to upgrade your checkpoint to Cerebras software version 2.0+.

Run the steps in this order: first use fix_checkpoint_spec.py on your old CS-compatible checkpoint, then convert your dataloader state with data_iter_state_conversion.py, and finally use convert_checkpoint.py to upgrade your checkpoint one release at a time (e.g., release 1.8 to 1.9 to 2.0).
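
To make the "one release at a time" upgrade concrete, the following Python sketch lists the individual hops between two releases. It is only an illustration: the RELEASES list contains just the releases named in the example above, and upgrade_hops is a hypothetical helper, not part of any Cerebras tool. Each hop would correspond to one convert_checkpoint.py run, whose arguments are not covered here.

```python
from typing import List, Tuple

# Release order taken from the example in the note; extend it for other releases.
RELEASES = ["1.8", "1.9", "2.0"]


def upgrade_hops(src: str, dst: str) -> List[Tuple[str, str]]:
    """Return the consecutive (from, to) release pairs between src and dst."""
    i, j = RELEASES.index(src), RELEASES.index(dst)
    if i > j:
        raise ValueError("target release must not precede the source release")
    # Pair each release with its immediate successor along the path.
    return list(zip(RELEASES[i:j], RELEASES[i + 1 : j + 1]))


for old, new in upgrade_hops("1.8", "2.0"):
    print(f"convert checkpoint from release {old} to release {new}")
```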

Location

Download from here.

data_iter_state_conversion.py

Purpose

Converts the dataloader state saved with pre-2.0 checkpoints to the 2.0 format, so that the dataloader restarts deterministically and training started on a release prior to 2.0 can continue on release 2.0+.

Usage

The data_iter_state_conversion.py script is invoked as follows:

data_iter_state_conversion.py [-h] --old_checkpoint OLD_CHECKPOINT --worker_data_iter_files_dir WORKER_DATA_ITER_FILES_DIR --output_file OUTPUT_FILE [--dataloader_type {map,iterable}] [--shuffle_seed SHUFFLE_SEED]

Description

Converts Release 1.9 dataloader state to Release 2.0 dataloader state.

The arguments are listed below. In the usage line, --old_checkpoint, --worker_data_iter_files_dir, and --output_file appear without brackets, which means they are required.

Argument	Description
-h, --help	Show this help message and exit.
--old_checkpoint OLD_CHECKPOINT, -c OLD_CHECKPOINT	Path to the R1.9 checkpoint file.
--worker_data_iter_files_dir WORKER_DATA_ITER_FILES_DIR, -w WORKER_DATA_ITER_FILES_DIR	Path to the directory containing the data step file data_iter_checkpoint_state_file_global and the worker checkpoint files of the format data_iter_state_file_worker__step_.txt.
--output_file OUTPUT_FILE, -o OUTPUT_FILE	Path where the output R2.0 checkpoint file with the converted DataLoader state should be saved.
--dataloader_type {map,iterable}, -d {map,iterable}	The Model Zoo DataLoader for which state is being converted. Use map for the map-style dataloader and iterable for the iterable-style dataloader. Defaults to the map-style dataloader.
--shuffle_seed SHUFFLE_SEED, -s SHUFFLE_SEED	The seed value to be captured in the DataLoader state for the map-style dataloader. The seed is only relevant for deterministically restarting the map-style dataloader if dataset shuffling/mixing is enabled.
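
The following Python sketch shows one way the arguments above might be combined into a command line for data_iter_state_conversion.py. The helper name build_data_iter_command and all paths in the example are hypothetical, and the sketch assumes the script is launched with python from the directory that contains it. The helper checks the dataloader type against the two documented choices before building the command.

```python
from typing import List, Optional


def build_data_iter_command(
    old_checkpoint: str,
    worker_data_iter_files_dir: str,
    output_file: str,
    dataloader_type: str = "map",
    shuffle_seed: Optional[int] = None,
    script: str = "data_iter_state_conversion.py",
) -> List[str]:
    """Assemble an argument list for data_iter_state_conversion.py (illustrative helper)."""
    # Only the two documented dataloader styles are accepted.
    if dataloader_type not in ("map", "iterable"):
        raise ValueError("dataloader_type must be 'map' or 'iterable'")

    cmd = [
        "python", script,
        "--old_checkpoint", old_checkpoint,
        "--worker_data_iter_files_dir", worker_data_iter_files_dir,
        "--output_file", output_file,
        "--dataloader_type", dataloader_type,
    ]
    # The seed matters only for the map-style dataloader with shuffling/mixing
    # enabled, so it is passed only when the caller supplies one.
    if shuffle_seed is not None:
        cmd += ["--shuffle_seed", str(shuffle_seed)]
    return cmd


# Hypothetical paths, shown only to illustrate the resulting command.
print(" ".join(build_data_iter_command(
    "r19_ckpt.mdl", "data_iter_files", "r20_ckpt.mdl", shuffle_seed=1234)))
```

As with any argument list of this form, the result can be handed to subprocess.run(cmd, check=True) to run the conversion.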

Location

Download from here.
