Convert checkpoints & configurations between Hugging Face and Cerebras Model Zoo, and between Cerebras software releases

Motivation

Our Cerebras Model Zoo deep learning models are designed to be highly generalizable, allowing users to make architectural modifications from a single configuration file. Because the implementations are this general, model configs & checkpoints from other code repositories (e.g., Hugging Face) cannot be used directly with Cerebras Model Zoo, or vice versa.

The checkpoint & config converter tool was built in order to allow users to easily convert model implementations between Cerebras Model Zoo and other code repositories. Example use cases include:

Take a pretrained checkpoint from another code repository and convert it into the equivalent Cerebras Model Zoo compatible config & checkpoint so that you can continue training on the CS system.
Train a model within the Cerebras ecosystem using Cerebras Model Zoo, and then convert to another equivalent implementation (ex: Hugging Face) in order to run inference.
“Upgrade” an old Cerebras Model Zoo config & checkpoint to a new one (ex: convert Cerebras Model Zoo rel 1.6 checkpoints to 1.7). This ensures that old checkpoints can continue to be used in new releases in the event that our model implementations evolve.

Note

The checkpoint converter tool is for PyTorch models only. We currently only support conversions between Hugging Face (HF) and Cerebras Model Zoo (CS) implementations.

How It Works

The tool is located at modelzoo/common/pytorch/model_utils/convert_checkpoint.py. It offers three different commands:

list displays all the available conversions (models & formats).
convert-config performs only config conversion. If you intend to convert from Cerebras Model Zoo to another repository at any point in time, we highly recommend that you run config conversion before training the model. This will allow you to determine if the configuration you are using within Cerebras Model Zoo can be converted. Conversions are not always possible as other repositories are less general than our Cerebras Model Zoo implementations (ex: many Hugging Face NLP model implementations support a limited range of positional embeddings).
convert performs both config & checkpoint conversion. In other words, the tool is supplied the old config & checkpoint and produces a new config & checkpoint.

Note

Cerebras configuration files contain not only model parameters but also configurations for the optimizer, train_input, eval_input, and runconfig. Most other open source repositories (ex: Hugging Face) don’t have this information. Since these cannot be inferred by the converter tool, you will need to modify the output config with these additional properties. You can look at the example configs in Cerebras Model Zoo as a starter.

Usage

Each of the three commands introduced above can be used as follows.

To get a list of all models/conversions that we support, use the following command:

python modelzoo/common/pytorch/model_utils/convert_checkpoint.py list
Note

It is essential that you read the notes section of the list command output before using the converter! This section explains the exact model classes that are being converted from/to. It also lists any caveats about the conversion process. For example, many NLP models offer -headless variants which are missing a language model head.

To convert a config file only, use the following command:

python modelzoo/common/pytorch/model_utils/convert_checkpoint.py convert-config --model <model name> --src-fmt <format of input config> --tgt-fmt <format of output config> --output-dir <location to save output config> <config file path>

To convert a checkpoint and its corresponding config, use the following command:

python modelzoo/common/pytorch/model_utils/convert_checkpoint.py convert --model <model name> --src-fmt <format of input checkpoint> --tgt-fmt <format of output checkpoint> --output-dir <location to save output checkpoint> <input checkpoint file path> --config <input config file path>

To learn more about the usage and optional parameters of a particular subcommand, pass the -h flag. For example:

python modelzoo/common/pytorch/model_utils/convert_checkpoint.py convert -h
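
When conversions are scripted (for example, across several checkpoints), the `convert` invocation above can be assembled in Python. The following is a minimal sketch: the script path and flags are the ones shown in this section, while the helper names `build_convert_command` and `run_convert` are hypothetical and not part of Model Zoo. It assumes the script is launched from the directory that contains `modelzoo/`.

```python
import subprocess
import sys
from typing import List, Optional

# Path of the converter script as given in this document (relative to the
# directory that contains modelzoo/).
CONVERTER = "modelzoo/common/pytorch/model_utils/convert_checkpoint.py"


def build_convert_command(
    model: str,
    src_fmt: str,
    tgt_fmt: str,
    checkpoint: str,
    config: str,
    output_dir: Optional[str] = None,
) -> List[str]:
    """Assemble the argument list for the `convert` subcommand.

    Hypothetical helper: it only mirrors the command line documented above.
    """
    cmd = [
        sys.executable, CONVERTER, "convert",
        "--model", model,
        "--src-fmt", src_fmt,
        "--tgt-fmt", tgt_fmt,
    ]
    # --output-dir is optional; the examples in this document omit it when
    # the outputs should be written next to the input files.
    if output_dir is not None:
        cmd += ["--output-dir", output_dir]
    cmd += [checkpoint, "--config", config]
    return cmd


def run_convert(**kwargs) -> None:
    """Run the converter; raises CalledProcessError on a non-zero exit code."""
    subprocess.run(build_convert_command(**kwargs), check=True)
```

Passing the arguments as a list (rather than a single shell string) avoids quoting problems when paths contain spaces.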

Examples

1. Converting Eleuther AI GPT-J 6B (from model card) to Cerebras Model Zoo

Eleuther’s final GPT-J checkpoint can be accessed on Hugging Face at EleutherAI/gpt-j-6B. Rather than manually entering the values from the model architecture table into a config file and writing a script to convert their checkpoint, we can auto-generate both with a single command!

First we need to download the config & checkpoint files from the model card locally:

mkdir opensource_checkpoints
wget -P opensource_checkpoints https://huggingface.co/EleutherAI/gpt-j-6B/raw/main/config.json
wget -P opensource_checkpoints https://huggingface.co/EleutherAI/gpt-j-6B/resolve/main/pytorch_model.bin

Hugging Face configs contain the architectures property, which lists the class that the checkpoint was generated with. According to config.json, the HF checkpoint is from the GPTJForCausalLM class. Using this information, we can use the checkpoint converter tool’s list command to find the appropriate converter. In this case, we want to use the gptj model, with a source format of hf, and a target format of cs-1.8.
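
The class lookup described above can be done programmatically by reading the `architectures` list from `config.json`. The sketch below is illustrative only: the function names are hypothetical, and the mapping contains just the two class-to-converter pairings named in this document. Any other entry should be taken from the notes printed by the `list` command rather than guessed.

```python
import json
from typing import List

# Only the pairings stated in this document. Extend this from the notes in
# the `list` command output; do not infer entries from class names alone.
ARCHITECTURE_TO_MODEL = {
    "GPTJForCausalLM": "gptj",
    "BertForPreTraining": "bert",
}


def read_architectures(config_path: str) -> List[str]:
    """Return the class names listed under `architectures` in an HF config."""
    with open(config_path, encoding="utf-8") as f:
        config = json.load(f)
    return list(config.get("architectures", []))


def suggest_model_flag(config_path: str) -> str:
    """Suggest a value for --model, or raise if the class is not in the map."""
    architectures = read_architectures(config_path)
    for arch in architectures:
        if arch in ARCHITECTURE_TO_MODEL:
            return ARCHITECTURE_TO_MODEL[arch]
    raise LookupError(
        f"No known converter for {architectures}; check the `list` command."
    )
```

For the GPT-J download above, `suggest_model_flag("opensource_checkpoints/config.json")` would look up `GPTJForCausalLM` in the mapping.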

Now to convert the config & checkpoint, run the following command:

python modelzoo/common/pytorch/model_utils/convert_checkpoint.py convert --model gptj --src-fmt hf --tgt-fmt cs-1.8 --output-dir opensource_checkpoints/ opensource_checkpoints/pytorch_model.bin --config opensource_checkpoints/config.json

This produces two files:

opensource_checkpoints/pytorch_model_to_cs-1.8.mdl
opensource_checkpoints/config_to_cs-1.8.yaml

The output yaml config file contains the auto-generated model parameters from the Eleuther implementation. Note that before we can train/eval the model on a Cerebras cluster, we need to add the train_input, eval_input, optimizer, and runconfig parameters to the yaml. Examples for these parameters can be found in the configs/ folder for each model within Model Zoo. In this case, we can copy the missing information from modelzoo/transformers/pytorch/gptj/configs/params_gptj_6B.yaml into opensource_checkpoints/config_to_cs-1.8.yaml. Make sure you modify the dataset paths under train_input and eval_input if they are stored elsewhere.
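
Copying the missing sections can also be scripted. The following is a minimal sketch that works on configs already loaded into dictionaries; `fill_missing_sections` is a hypothetical helper, and the keys inside each section are placeholders rather than real Model Zoo parameters. It only adds top-level sections that the converted config lacks and never overwrites the converted `model` section.

```python
import copy
from typing import Any, Dict, Iterable

# The four sections this document says must be added by hand.
REQUIRED_SECTIONS = ("train_input", "eval_input", "optimizer", "runconfig")


def fill_missing_sections(
    converted: Dict[str, Any],
    reference: Dict[str, Any],
    sections: Iterable[str] = REQUIRED_SECTIONS,
) -> Dict[str, Any]:
    """Return a copy of `converted` with absent sections taken from `reference`."""
    merged = copy.deepcopy(converted)
    for name in sections:
        # Never overwrite anything the converter produced.
        if name not in merged and name in reference:
            merged[name] = copy.deepcopy(reference[name])
    return merged


# Placeholder data standing in for the two YAML files.
converted = {"model": {"placeholder_param": 1}}
reference = {
    "model": {"placeholder_param": 2},
    "train_input": {"data_dir": "<path to training data>"},
    "eval_input": {"data_dir": "<path to eval data>"},
    "optimizer": {"placeholder_setting": 0},
    "runconfig": {"placeholder_setting": 0},
}
merged = fill_missing_sections(converted, reference)
```

In practice the two dictionaries would come from a YAML library (for example PyYAML's `yaml.safe_load`), and the merged result would be written back with `yaml.safe_dump`. The dataset paths under `train_input` and `eval_input` still need to be reviewed by hand afterwards.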

The following command demonstrates using the converted config & checkpoint for continuous pretraining:

python run.py \
CSX \
weight_streaming \
--mode train \
--params opensource_checkpoints/config_to_cs-1.8.yaml \
--checkpoint_path opensource_checkpoints/pytorch_model_to_cs-1.8.mdl \
--model_dir gptj6b_continuous_pretraining \
--mount_dirs {paths to modelzoo and to data} \
--python_paths {paths to modelzoo and other python code if used}

Additional details about the run.py command can be found on the Launch your job page.

2. Converting a Hugging Face model without a model card to Cerebras Model Zoo

Not all pretrained checkpoints on Hugging Face have corresponding model card webpages. We can still download these checkpoints & configs in order to convert them into a Model Zoo compatible format.

For example, Hugging Face has a model card for BertForMaskedLM accessible through the name bert-base-uncased. However, it doesn’t have a webpage for BertForPreTraining which we’re interested in.

We can manually get the config and checkpoint for this model as follows:

from transformers import BertForPreTraining
model = BertForPreTraining.from_pretrained("bert-base-uncased")
model.save_pretrained("bert_checkpoint")

This saves two files: bert_checkpoint/config.json and bert_checkpoint/pytorch_model.bin

Now that we’ve downloaded the required files, we can convert the checkpoints. We need to use the --model bert flag since the Hugging Face checkpoint is from the BertForPreTraining class. If you want to use another checkpoint which is from a different variant (such as a finetuning model), see the other bert- model converters.

The final conversion command is:

python modelzoo/common/pytorch/model_utils/convert_checkpoint.py convert --model bert --src-fmt hf --tgt-fmt cs-1.8 bert_checkpoint/pytorch_model.bin --config bert_checkpoint/config.json
...
Checkpoint saved to bert_checkpoint/pytorch_model_to_cs-1.8.mdl
Config saved to bert_checkpoint/config_to_cs-1.8.yaml

3. Converting Cerebras Model Zoo GPT-2 checkpoint to Hugging Face

Suppose we’ve just finished training GPT-2 on CS and now want to run the model within the Hugging Face ecosystem. In this example, the configuration file is saved at model_dir/train/params_train.yaml and the checkpoint (corresponding to step 10k) is at model_dir/checkpoint_10000.mdl

To convert to Hugging Face, all we need to do is run the following command:

python modelzoo/common/pytorch/model_utils/convert_checkpoint.py convert --model gpt2 --src-fmt cs-1.8 --tgt-fmt hf model_dir/checkpoint_10000.mdl --config model_dir/train/params_train.yaml

Since the --output-dir flag is omitted, the two output files are saved to the same directories as the original files: model_dir/train/params_train_to_hf.json and model_dir/checkpoint_10000_to_hf.bin
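
The three examples follow a consistent naming pattern for output files: the input file stem, then `_to_<target format>`, then an extension that depends on the target (`.bin`/`.json` for `hf`, `.mdl`/`.yaml` for `cs-*`). The sketch below encodes that pattern so a script can locate the outputs. It is inferred from the examples in this document only, not from the converter's source code, so treat the result as a guess to verify; `predict_output_paths` is a hypothetical name.

```python
from pathlib import PurePosixPath
from typing import Optional, Tuple


def _extensions(tgt_fmt: str) -> Tuple[str, str]:
    """Checkpoint and config extensions observed in this document's examples."""
    if tgt_fmt == "hf":
        return ".bin", ".json"
    if tgt_fmt.startswith("cs-"):
        return ".mdl", ".yaml"
    raise ValueError(f"No example in this document covers format {tgt_fmt!r}")


def predict_output_paths(
    checkpoint: str,
    config: str,
    tgt_fmt: str,
    output_dir: Optional[str] = None,
) -> Tuple[str, str]:
    """Guess where the converted checkpoint and config will be written."""
    ckpt, cfg = PurePosixPath(checkpoint), PurePosixPath(config)
    ckpt_ext, cfg_ext = _extensions(tgt_fmt)
    # Without --output-dir, each output lands next to its own input file.
    ckpt_dir = PurePosixPath(output_dir) if output_dir else ckpt.parent
    cfg_dir = PurePosixPath(output_dir) if output_dir else cfg.parent
    return (
        str(ckpt_dir / f"{ckpt.stem}_to_{tgt_fmt}{ckpt_ext}"),
        str(cfg_dir / f"{cfg.stem}_to_{tgt_fmt}{cfg_ext}"),
    )
```

Checking that both predicted files exist after a run is a cheap way for a pipeline to confirm that the conversion completed.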

Frequently Asked Questions

Question	Answer
Which models, formats, classes, etc are supported?	See the `list` command under the usage section.
Which frameworks are supported?	PyTorch only.
Does optimizer state get converted?	No. Hugging Face checkpoints contain model state information only; unlike CS, they do not contain optimizer state information.
Sometimes when I run the checkpoint converter tool, it runs for a while before saying `Killed`. What happened?	The program hit the memory limit. Unfortunately, PyTorch pickling stores a whole checkpoint in a single file, forcing everything to be read into memory at once. Make sure that the system you’re running on has at least as much RAM as the size of the checkpoint file.
Conversion failed with a `ConfigConversionError`. Why did this happen?	Conversions are not always possible as other repositories are less general than our Model Zoo implementations (ex: many Hugging Face NLP model implementations support limited types of positional embeddings while Model Zoo includes an expanded range). For this reason, we highly recommend that you run config conversion before training the model if you intend to convert a Model Zoo model to another repository at any point in time. This will allow you to determine if the configuration you are using within Model Zoo can be converted before you go ahead with training the model. Additionally, you can use the information in the error message to modify the config file to generate a configuration that can be converted.
Sometimes during config conversion, I see the following: `WARNING:root:Key not matched:` Should I be concerned?	No. Not all keys in one config format need to be converted to the other. This warning message simply prints the keys that will be ignored. For example, HF configs contain the `use_cache` parameter, which isn’t relevant on CS.
Model conversion failed with the following error: `AssertionError: Unable to match all keys. If you want to proceed by dropping keys that couldn't matched, rerun with --drop-unmatched-keys` What should I do?	The checkpoint contains keys that weren’t expected, and therefore couldn’t be converted. The converters are heavily tested and so this error message highlights an issue with either the input checkpoint or the command being run, not the converter itself. Make sure that you are using the correct `--model` and `--src-fmt` flags which correspond to the checkpoint that you are using. To double check, you can look at the `notes` column displayed by the checkpoint converter tool’s `list` command. A misspecified `--model` or `--src-fmt` will lead to this error. All unexpected keys in the checkpoint are displayed with `WARNING:root:Key not matched:`. If you determine that these keys do not need to be converted, you can bypass the assertion using the `--drop-unmatched-keys` flag. You should never have to use this feature unless you’re using a custom checkpoint that deviates from the `--src-fmt` format.
I am unable to use a converted checkpoint because I get the following errors: `Error(s) in loading state_dict for <model name>:` `Missing key(s) in state_dict:...` `Unexpected key(s) in state_dict:...` What should I do?	There is a discrepancy between the format of the converted checkpoint and the format expected by the model you’re loading it into. This is caused by a misspecified `--model` or `--tgt-fmt` flag. To double check that you’re using the correct flags, look at the `notes` column displayed by the checkpoint converter tool’s `list` command.
I have a sharded checkpoint. How do I use the checkpoint converter tool?	The tool currently doesn’t support sharded checkpoints because there is no standardized implementation for this. You can, however, use a sharded checkpoint by first coalescing its shards into a single file: `import torch` `shard1 = torch.load("<shard 1>.bin")` `shard2 = torch.load("<shard 2>.bin")` `shard1.update(shard2)` `torch.save(shard1, "full_checkpoint.bin")` `# now use converter tool on full checkpoint`
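
The `Killed` answer in the table above gives a simple rule: the machine needs at least as much RAM as the checkpoint file is large. That rule can be checked before launching a conversion. The sketch below is a rough pre-flight check under stated assumptions: it uses `os.sysconf`, which is available on Linux and other Unix-like systems but not on Windows, and it compares against total physical RAM rather than the memory that is currently free. The function names are hypothetical.

```python
import os


def total_ram_bytes() -> int:
    """Total physical memory in bytes (Unix-like systems only)."""
    return os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES")


def fits_in_ram(checkpoint_path: str) -> bool:
    """Apply the FAQ rule: checkpoint size must not exceed total RAM.

    This is a necessary condition only. Other processes and the converted
    copy of the weights also consume memory, so leave generous headroom.
    """
    return os.path.getsize(checkpoint_path) <= total_ram_bytes()
```

A `False` result means the conversion is very likely to be killed; a `True` result does not guarantee success.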

Upgrading Checkpoints & Configs to the Current Release

As our Model Zoo implementations evolve over time, the changes may sometimes break out-of-the-box compatibility when moving to a new release. To ensure that you can continue using your old checkpoints, we offer converters that allow you to “upgrade” configs & checkpoints when necessary. The section below covers conversions that are required when moving to a particular release. If a converter doesn’t exist, no explicit conversion is necessary.

Release 1.8

T5 / Vanilla Transformer

As described in the release notes, the behavior of the use_pre_encoder_decoder_layer_norm flag has been flipped. In order to continue using rel 1.7 checkpoints in rel 1.8, you’ll need to update the config to reflect this change. You can do this automatically using the config converter tool as follows:

python modelzoo/common/pytorch/model_utils/convert_checkpoint.py convert-config --model <model type> --src-fmt cs-1.7 --tgt-fmt cs-1.8 <config file path>

In the command above, --model should be either t5 or transformer depending on which model you’re using. The config file path should point to the train/params_train.yaml file within your model directory.

Bert

As described in the release notes, we expanded the bert model configurations to expose two additional parameters: pooler_nonlinearity and mlm_nonlinearity. Due to a change in the default value of the mlm_nonlinearity parameter, you will need to update the config when using a rel 1.7 checkpoint in rel 1.8. You can do this automatically using the config converter tool as follows:

python modelzoo/common/pytorch/model_utils/convert_checkpoint.py convert-config --model bert --src-fmt cs-1.7 --tgt-fmt cs-1.8 <config file path>

The config file path should point to the train/params_train.yaml file within your model directory.
