Upgrade checkpoints from previous releases

Overview

Cerebras software version 2.0 introduces a new checkpointing implementation, so checkpoints saved by earlier versions may not load in 2.0 and later. Two scripts address this: fix_checkpoint_spec.py and data_iter_state_conversion.py. Use them to prepare older checkpoints and dataloader state for the newer software.

fix_checkpoint_spec.py fixes the checkpoint itself. It exists because a backward-incompatible change to the checkpoint format prevents older checkpoints from being loaded in 2.0. data_iter_state_conversion.py converts dataloader state files saved in release 1.9 to the dataloader checkpoint format used by the new map and iterable dataloaders in Model Zoo in release 2.0. This allows a deterministic restart of the dataloader for model training jobs that move from release 1.9 to release 2.0.

fix_checkpoint_spec.py

Purpose

Fixes a checkpoint taken with pre-2.0 software so that it can be converted for use with software version 2.0 or later.

Usage

The following code block shows the usage of the fix_checkpoint_spec.py script:

fix_checkpoint_spec.py [-h] [--fixed_checkpoint_path FIXED_CHECKPOINT_PATH] [--fix_inplace] checkpoint_path

Description

Fixes the specification of checkpoints taken in releases < 2.0 so that they work in releases >= 2.0.

The arguments are listed below:

| positional arguments | Description |
|---|---|
| checkpoint_path | Path to the checkpoint to fix. |

| optional arguments | Description |
|---|---|
| -h, --help | Show this help message and exit. |
| --fixed_checkpoint_path FIXED_CHECKPOINT_PATH | Path to save the fixed checkpoint to. If not provided, the fixed checkpoint will be saved to the same path as the checkpoint, with '_fixed' appended to the name. |
| --fix_inplace | Fix the checkpoint in place instead of saving to a new file. |
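As an illustration of how the arguments above fit together, here is a minimal Python sketch that assembles the command line for one run of the script. The helper name build_fix_command, the default script location, and the checkpoint file names are hypothetical. Treating --fixed_checkpoint_path and --fix_inplace as alternatives is an assumption of this sketch, not something the script is documented to enforce.

```python
import sys
from typing import List, Optional


def build_fix_command(
    checkpoint_path: str,
    fixed_checkpoint_path: Optional[str] = None,
    fix_inplace: bool = False,
    script: str = "fix_checkpoint_spec.py",  # assumed: path to the downloaded script
) -> List[str]:
    """Return the argument list for one fix_checkpoint_spec.py invocation."""
    if fix_inplace and fixed_checkpoint_path is not None:
        # Assumption of this sketch: writing in place and writing to a
        # separate output path are treated as alternatives.
        raise ValueError("choose either fixed_checkpoint_path or fix_inplace")

    cmd = [sys.executable, script]
    if fixed_checkpoint_path is not None:
        cmd += ["--fixed_checkpoint_path", fixed_checkpoint_path]
    if fix_inplace:
        cmd.append("--fix_inplace")
    cmd.append(checkpoint_path)  # positional argument goes last
    return cmd


if __name__ == "__main__":
    # Hypothetical file names, printed rather than executed.
    print(build_fix_command("old_checkpoint.mdl", "old_checkpoint_fixed.mdl"))
```

The returned list can be passed to subprocess.run(cmd, check=True) once the script has been downloaded. Passing an explicit output path avoids relying on how the default '_fixed' name is derived.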

Note

fix_checkpoint_spec.py does not upgrade the checkpoint by itself. You still need to use convert_checkpoint.py to upgrade your checkpoint to Cerebras software version 2.0+.

First use fix_checkpoint_spec.py on your old CS-compatible checkpoint, then convert your dataloader state with data_iter_state_conversion.py, and finally use convert_checkpoint.py to upgrade your checkpoint one release at a time (e.g., release 1.8 to 1.9 to 2.0).
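The ordering in this note can be sketched as a small planning helper. This is only an illustration: the release list is a hypothetical example, the function names are made up for this sketch, and the arguments of convert_checkpoint.py are not described on this page, so the sketch only names the steps and the release pairs rather than building real commands.

```python
from typing import List, Sequence, Tuple

# Hypothetical ordered list of releases; adjust it to the releases you use.
RELEASES = ("1.8", "1.9", "2.0")


def plan_hops(
    start: str, target: str, releases: Sequence[str] = RELEASES
) -> List[Tuple[str, str]]:
    """Return consecutive (from, to) release pairs between start and target."""
    i, j = releases.index(start), releases.index(target)
    if i > j:
        raise ValueError("start release must not be newer than target release")
    return [(releases[k], releases[k + 1]) for k in range(i, j)]


def plan_upgrade(start: str, target: str = "2.0") -> List[str]:
    """List the steps in the order given by the note above."""
    steps = [
        "fix_checkpoint_spec.py",
        "data_iter_state_conversion.py",
    ]
    # One convert_checkpoint.py run per release hop, oldest hop first.
    steps += [
        f"convert_checkpoint.py ({src} to {dst})"
        for src, dst in plan_hops(start, target)
    ]
    return steps


if __name__ == "__main__":
    for step in plan_upgrade("1.8"):
        print(step)
```

Starting from release 1.8, the plan contains two convert_checkpoint.py steps, one per hop, after the two preparation scripts.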

Location

Download from here.

data_iter_state_conversion.py

Purpose

Converts the dataloader checkpoints of pre-2.0 releases to the 2.0 format. This allows a deterministic restart of the dataloader, so that training started on a release prior to 2.0 can continue on release 2.0+.

Usage

The following code block shows the usage of the data_iter_state_conversion.py script:

data_iter_state_conversion.py [-h] --old_checkpoint OLD_CHECKPOINT --worker_data_iter_files_dir WORKER_DATA_ITER_FILES_DIR --output_file OUTPUT_FILE [--dataloader_type {map,iterable}] [--shuffle_seed SHUFFLE_SEED]
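For illustration, the usage line above can be turned into a small Python helper that assembles the argument list. The helper name build_data_iter_command, the default script location, and the file names in the example are hypothetical. The flags themselves are taken from the usage line, and the set of allowed dataloader types mirrors the {map,iterable} choices shown there.

```python
import sys
from typing import List, Optional

DATALOADER_TYPES = ("map", "iterable")  # choices shown in the usage line


def build_data_iter_command(
    old_checkpoint: str,
    worker_data_iter_files_dir: str,
    output_file: str,
    dataloader_type: Optional[str] = None,
    shuffle_seed: Optional[int] = None,
    script: str = "data_iter_state_conversion.py",  # assumed: path to the downloaded script
) -> List[str]:
    """Return the argument list for one data_iter_state_conversion.py run."""
    cmd = [
        sys.executable,
        script,
        "--old_checkpoint", old_checkpoint,
        "--worker_data_iter_files_dir", worker_data_iter_files_dir,
        "--output_file", output_file,
    ]
    if dataloader_type is not None:
        if dataloader_type not in DATALOADER_TYPES:
            raise ValueError(f"dataloader_type must be one of {DATALOADER_TYPES}")
        cmd += ["--dataloader_type", dataloader_type]
    if shuffle_seed is not None:
        # Only passed when given; the seed matters for the map-style
        # dataloader when shuffling or mixing is enabled.
        cmd += ["--shuffle_seed", str(shuffle_seed)]
    return cmd


if __name__ == "__main__":
    # Hypothetical paths, printed rather than executed.
    print(build_data_iter_command(
        "checkpoint_r19.mdl", "data_iter_files", "checkpoint_r20.mdl",
        dataloader_type="map", shuffle_seed=1,
    ))
```

The returned list can be handed to subprocess.run(cmd, check=True) once the script is available locally.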

Description

Converts Release 1.9 dataloader state to Release 2.0 dataloader state.

The arguments are listed below:

| optional arguments | Description |
|---|---|
| -h, --help | Show this help message and exit. |
| --old_checkpoint OLD_CHECKPOINT, -c OLD_CHECKPOINT | Path to the r1.9 checkpoint file. |
| --worker_data_iter_files_dir WORKER_DATA_ITER_FILES_DIR, -w WORKER_DATA_ITER_FILES_DIR | Path to the directory containing the data step file data_iter_checkpoint_state_file_global and the worker checkpoint files of the format data_iter_state_file_worker_*_step_*.txt. |
| --output_file OUTPUT_FILE, -o OUTPUT_FILE | Path where the output R2.0 checkpoint file with the converted DataLoader state should be saved. |
| --dataloader_type {map,iterable}, -d {map,iterable} | The MZ DataLoader for which state is being converted. Use map for the map-style dataloader and iterable for the iterable-style dataloader. Defaults to the map-style dataloader. |
| --shuffle_seed SHUFFLE_SEED, -s SHUFFLE_SEED | The seed value to be captured in the DataLoader state for the map-style dataloader. The seed is only relevant for deterministically restarting the map-style dataloader if dataset shuffling/mixing is enabled. |

Although they are listed under optional arguments, --old_checkpoint, --worker_data_iter_files_dir, and --output_file appear without brackets in the usage line, which indicates that they are required.
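Before running the conversion, it can help to check that the worker directory holds the files described for --worker_data_iter_files_dir. The following is a minimal sketch of such a pre-flight check. The function name find_worker_state_files is hypothetical and the check is not part of the script itself; the two file name patterns are the ones given in the argument description.

```python
from pathlib import Path
from typing import List

# File names taken from the --worker_data_iter_files_dir description.
GLOBAL_STATE_FILE = "data_iter_checkpoint_state_file_global"
WORKER_FILE_PATTERN = "data_iter_state_file_worker_*_step_*.txt"


def find_worker_state_files(worker_dir: str) -> List[str]:
    """Return the sorted worker state file names found in worker_dir.

    Raises FileNotFoundError if the global data step file or all worker
    files are missing, so the problem surfaces before the conversion runs.
    """
    root = Path(worker_dir)
    if not (root / GLOBAL_STATE_FILE).is_file():
        raise FileNotFoundError(f"missing {GLOBAL_STATE_FILE} in {root}")

    workers = sorted(p.name for p in root.glob(WORKER_FILE_PATTERN))
    if not workers:
        raise FileNotFoundError(
            f"no files matching {WORKER_FILE_PATTERN} in {root}"
        )
    return workers
```

Calling find_worker_state_files on the directory you intend to pass with -w either lists the worker files it found or fails early with a message naming what is missing.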

Location

Download from here.