Work with Cerebras checkpoints#
When moving to extremely large models, reading, writing, and manipulating checkpoints becomes a bottleneck. For that reason, Cerebras has moved to an HDF5-based file format for storing checkpoints. The content of the checkpoints remains the same, so they are convertible to the native formats for use outside of Cerebras. Specific details about the utilities and interfaces are provided for each supported framework.
TensorFlow Checkpoint Format#
Our use of the TensorFlow Estimator interface leads to the use of the TensorFlow Saver for checkpoint interactions. Unfortunately, the Saver does not support iterative updates when writing to a single file.
Usage#
Starting with an existing checkpoint#
Our appliance estimator will use an existing checkpoint if one is provided in the model_dir or as a warm-start path. However, we also provide a utility to convert from the TensorFlow format to our H5 format. It is available both as a command-line utility (installed with the wheel) and as a package.
$ tensorflow-to-h5 --help
usage: tensorflow-to-h5 [-h] saver_path h5_path

Convert Tensorflow Saver Checkpoint to Cerebras H5 Format

positional arguments:
  saver_path  Path to existing saver checkpoint
  h5_path     Path to store converted h5 checkpoint
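For example, with illustrative paths:

$ tensorflow-to-h5 model_dir/model.ckpt-10000 model_dir/checkpoint.h5

The same conversion is also available from the package: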
$ python
>>> from cerebras_tensorflow.saver import tf_h5_saver
Converting a Cerebras checkpoint#
After training on the appliance, converting from the Cerebras format back to the TensorFlow Saver format can also be performed via a command-line utility or the package.
$ h5-to-tensorflow --help
usage: h5-to-tensorflow [-h] h5_path saver_path

Convert Cerebras H5 Checkpoint to Tensorflow Saver Format

positional arguments:
  h5_path     Path to existing h5 checkpoint
  saver_path  Path to store converted saver checkpoint
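For example, with illustrative paths:

$ h5-to-tensorflow model_dir/checkpoint.h5 model_dir/converted.ckpt

The package interface can be used instead: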
$ python
>>> from cerebras_tensorflow.saver import tf_h5_saver
For models larger than 20B parameters, writing a TensorFlow Saver checkpoint can use a prohibitive amount of RAM or swap. In this case, the package provides more flexibility: the checkpoint can be sharded by weight names into the desired layout.
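As a rough illustration, the stored weight names can be enumerated straight from the H5 file and partitioned into shards. This is a minimal sketch, assuming (as in the PyTorch format described below) that the checkpoint stores one dataset per flattened weight name; the round-robin grouping rule is illustrative, not a Cerebras API.

import h5py

def shard_names(h5_path, num_shards):
    # Partition the stored weight names into round-robin groups so
    # each shard can be converted and written out independently.
    with h5py.File(h5_path, "r") as f:
        names = sorted(f.keys())
    return [names[i::num_shards] for i in range(num_shards)]

Each group of names can then be written as its own Saver checkpoint, so only one shard's tensors need to be resident in memory at a time.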
PyTorch Checkpoint Format#
In the new appliance mode, we define a new checkpoint format. This was necessary because the pre-existing PyTorch checkpoint format could not support saving the extremely large models for which appliance mode was designed.
The new checkpoint format is based on the H5 file format.
At a high level, we take a PyTorch state dict, flatten it, and store it in an H5 file. For example, the following state dict:
{
    "a": {
        "b": 0.1,
        "c": 0.001,
    },
    "d": [0.1, 0.2, 0.3]
}
This would be flattened and stored in the H5 file as follows:
{
    "a.b": 0.1,
    "a.c": 0.001,
    "d.0": 0.1,
    "d.1": 0.2,
    "d.2": 0.3,
}
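The flattening scheme is simple: nested dictionary keys are joined with a dot, and list entries use their index as the key. Below is a minimal sketch of that scheme; it reproduces the mapping shown above but is an illustration, not the actual Cerebras implementation.

def flatten(obj, prefix=""):
    # Flatten nested dicts and lists into {"a.b": value} pairs,
    # joining keys with "." and using list indices as keys.
    if isinstance(obj, dict):
        pairs = obj.items()
    elif isinstance(obj, (list, tuple)):
        pairs = ((str(i), v) for i, v in enumerate(obj))
    else:
        return {prefix: obj}
    flat = {}
    for key, value in pairs:
        full_key = f"{prefix}.{key}" if prefix else str(key)
        flat.update(flatten(value, full_key))
    return flat

# Reproduces the mapping shown above.
assert flatten({"a": {"b": 0.1, "c": 0.001}, "d": [0.1, 0.2, 0.3]}) == {
    "a.b": 0.1, "a.c": 0.001, "d.0": 0.1, "d.1": 0.2, "d.2": 0.3,
}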
A model/optimizer state dict can be saved in the new checkpoint format using the cbtorch.save method, e.g.
import cerebras_pytorch as cbtorch

...

state_dict = {
    "model": model.state_dict(),
    "optimizer": optimizer.state_dict(),
}

cbtorch.save(state_dict, "path/to/checkpoint")

...
A checkpoint saved using the above can be loaded using the cbtorch.load method, e.g.
import cerebras_pytorch as cbtorch
...
state_dict = cbtorch.load("path/to/checkpoint")
model.load_state_dict(state_dict["model"])
optimizer.load_state_dict(state_dict["optimizer"])
...
Note
If you are using the run.py scripts provided in the ModelZoo, the above is already taken care of by the ModelZoo runners.
Converting Checkpoint Formats#
If cbtorch.load is not a sufficient solution for loading the checkpoint into memory, a simple conversion to the pickle format that PyTorch uses can be done as follows:
import torch
import cerebras_pytorch as cbtorch
state_dict = cbtorch.load("path/to/checkpoint")
torch.save(state_dict, "path/to/new/checkpoint")
Warning
This will not work for extremely large models whose state dict is too large to fit into memory. Sufficient RAM must be available to load the full checkpoint in order to save it in the PyTorch pickle format.