Work with Cerebras checkpoints

When moving to extremely large models, reading, writing, and manipulating checkpoints becomes a bottleneck. For that reason, Cerebras has moved to an HDF5-based file format for storing checkpoints. The content of the checkpoints remains the same, so they can be converted to native formats for use outside of Cerebras. Specific details about utilities and interfaces are provided for each supported framework.

PyTorch Checkpoint Format

In the new appliance mode, we define a new checkpoint format. This was necessary because the existing PyTorch checkpoint format could not support saving the extremely large models that appliance mode was designed for.

The new checkpoint format is based on the H5 file format.

At a high level, we take a PyTorch state dict, flatten it, and store it in an H5 file. For example, the following state dict:

    {
        "a": {
            "b": 0.1,
            "c": 0.001,
        },
        "d": [0.1, 0.2, 0.3]
    }

would be flattened and stored in the H5 file as follows:

    {
        "a.b": 0.1,
        "a.c": 0.001,
        "d.0": 0.1,
        "d.1": 0.2,
        "d.2": 0.3,
    }
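
The flattening step can be illustrated with a short pure-Python sketch. This is not the Cerebras implementation; the helper name `flatten_state_dict` is hypothetical, and the sketch only assumes what the example above shows: nested dict keys and list indices are joined with dots.

```python
from typing import Any, Dict


def flatten_state_dict(state: Any, prefix: str = "") -> Dict[str, Any]:
    """Flatten nested dicts and lists into one dict with dot-separated keys.

    Hypothetical helper for illustration only. Empty dicts and lists
    produce no entries in the result.
    """
    flat: Dict[str, Any] = {}
    if isinstance(state, dict):
        items = state.items()
    elif isinstance(state, (list, tuple)):
        # List positions become key components: "d.0", "d.1", ...
        items = enumerate(state)
    else:
        # Leaf value: store it under the key accumulated so far
        flat[prefix] = state
        return flat
    for key, value in items:
        child = f"{prefix}.{key}" if prefix else str(key)
        flat.update(flatten_state_dict(value, child))
    return flat


nested = {"a": {"b": 0.1, "c": 0.001}, "d": [0.1, 0.2, 0.3]}
flat = flatten_state_dict(nested)
print(flat)
# {'a.b': 0.1, 'a.c': 0.001, 'd.0': 0.1, 'd.1': 0.2, 'd.2': 0.3}
```

In a real checkpoint the leaves would be tensors rather than floats, and each flattened key would map to its own entry in the H5 file.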

A model/optimizer state dict can be saved in the new checkpoint format using the cbtorch.save method, e.g.

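import cerebras.pytorch as cbtorch

# Collect the model and optimizer state into a single state dict
state_dict = {
    "model": model.state_dict(),
    "optimizer": optimizer.state_dict(),
}
# Write the state dict to a checkpoint in the new H5-based format
cbtorch.save(state_dict, "path/to/checkpoint")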


A checkpoint saved using the above can be loaded using the cbtorch.load method, e.g.

import cerebras.pytorch as cbtorch


state_dict = cbtorch.load("path/to/checkpoint")




If you use the scripts provided in the ModelZoo, the ModelZoo runners already take care of all of the above.

Converting Checkpoint Formats

If loading the checkpoint into memory with cbtorch.load is not sufficient, the checkpoint can be converted to the pickle format that PyTorch uses, as follows:

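import torch
import cerebras.pytorch as cbtorch

# Load the Cerebras H5-based checkpoint fully into memory
state_dict = cbtorch.load("path/to/checkpoint")
# Re-save it in PyTorch's native pickle format
torch.save(state_dict, "path/to/new/checkpoint")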


This will not work for extremely large models whose state dict is too large to fit into memory. Sufficient RAM must be available to load the checkpoint into memory in order to be able to save it into the PyTorch pickle format.
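
As a rough pre-flight check before attempting the conversion, the memory constraint above can be sketched as a simple comparison. Everything here is an assumption for illustration: the helper names `fits_in_memory` and `available_ram_bytes` are hypothetical, the safety factor of 2.0 is an arbitrary choice, and the on-disk checkpoint size is only an approximation of the in-memory size of the state dict.

```python
import os


def fits_in_memory(checkpoint_bytes: int, available_bytes: int,
                   headroom: float = 2.0) -> bool:
    """Return True if the checkpoint, scaled by a safety factor, fits in RAM.

    The headroom factor is an assumed margin, not a documented requirement.
    """
    return checkpoint_bytes * headroom <= available_bytes


def available_ram_bytes() -> int:
    """Best-effort estimate of free physical memory (Linux-specific sysconf names)."""
    return os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_AVPHYS_PAGES")


# Example with made-up numbers: a 40 GiB checkpoint with 64 GiB of free RAM.
GIB = 1024 ** 3
print(fits_in_memory(40 * GIB, 64 * GIB))  # False: 40 GiB * 2.0 exceeds 64 GiB
```

On a Linux machine, the same check could be run against a real file with `fits_in_memory(os.path.getsize("path/to/checkpoint"), available_ram_bytes())`.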