Porting PyTorch Models to Cerebras#

Overview#

The Cerebras Model Zoo repository offers PyTorch-based reference implementations of well-known neural networks, including BERT, GPT-2, GPT-3, T5, and UNet. These implementations are modular: data preprocessing, the core model, and other execution-related functions are kept separate.

We offer two tailored pathways based on the level of complexity you’re comfortable with.

1. A straightforward approach for those who plan to adapt existing models from the Cerebras Model Zoo by changing the model’s architecture or its data preprocessing.

2. A moderately complex pathway ideal for users looking to create new models and data preprocessing scripts, where we advise starting with the robust foundation provided by the Cerebras Model Zoo.

Customizing models from the Cerebras Model Zoo#

If you intend to use a model from the Cerebras Model Zoo but need to modify its architecture or data preprocessing, start from the implementation available in the Model Zoo and make the required customizations there. This section shows how to adapt these pre-built models to your needs, whether that means changing the model architecture or adjusting the data preprocessing steps.

Example: Modifying the data loader:

The data loader is a crucial element that feeds data into the model for training and evaluation. By altering the data loader, users can customize how the data is processed, presented, or augmented before being used by the model.

Implementing a custom data loader#

In this example, we are focusing on modifying the data loader within the PyTorch implementation of FC_MNIST from the Cerebras Model Zoo. Our goal is to create a synthetic data loader that will help us assess the network’s performance with varying input sizes and class counts.

In data.py, we’re going to define a function named get_random_dataloader:

import torch
import numpy as np

def get_random_dataloader(input_params, shuffle, num_classes):
    num_examples = input_params.get("num_examples")
    batch_size = input_params.get("batch_size")
    seed = input_params.get("seed", 1)
    image_size = input_params.get("image_size", [1, 28, 28])

    # Seed NumPy so the same images and labels are generated for a given seed.
    np.random.seed(seed)
    image = np.random.random(size=[num_examples] + image_size).astype(np.float32)
    # Note: Cast the labels to `np.int32` when running on CS-X systems and to `np.int64` when running on CPUs/GPUs.
    label = np.random.randint(low=0, high=num_classes, size=num_examples).astype(np.int32)

    dataset = torch.utils.data.TensorDataset(
        torch.from_numpy(image),
        torch.from_numpy(label)
    )

    return torch.utils.data.DataLoader(
        dataset,
        batch_size=batch_size,
        shuffle=shuffle,
        num_workers=input_params.get("num_workers", 0),
    )

def get_train_dataloader(params):
    return get_random_dataloader(
        params["train_input"],
        params["train_input"].get("shuffle"),
        params["model"].get("num_classes")
    )

def get_eval_dataloader(params):
    return get_random_dataloader(
        params["eval_input"],
        False,
        params["model"].get("num_classes")
    )

This function is responsible for generating random images and labels, simulating a dataset for our experiments. The key feature of this function is its configurability via the params.yaml file.

In the params.yaml, we can specify several important parameters that will influence the behavior of our synthetic data loader:

num_examples: This parameter sets how many random images and labels the data loader should generate.
batch_size: This defines how many examples will be included in a single batch during the training or evaluation of the model.
seed: A seed for the random number generator ensures the reproducibility of our experiments by generating the same sequence of random images and labels for a given seed value.
image_size: This specifies the dimensions of the generated random images.
num_classes: This determines how many different classes the labels can take, which is crucial for classification tasks.

By adjusting these parameters in the params.yaml file, we can tailor the behavior of the get_random_dataloader function to meet our specific experimental needs, allowing for a flexible and dynamic approach to evaluating the network’s performance under different conditions.
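
To make the interaction between num_examples and batch_size concrete, the sketch below counts how many batches a loader would yield per epoch. The helper name batches_per_epoch is hypothetical, and its drop_last flag mirrors the drop_last argument of torch.utils.data.DataLoader. Note that get_random_dataloader as written above does not forward drop_last, so it keeps the trailing partial batch unless you add that argument yourself.

```python
import math


def batches_per_epoch(num_examples: int, batch_size: int, drop_last: bool) -> int:
    """Number of batches one pass over the dataset yields (hypothetical helper)."""
    if drop_last:
        # The trailing partial batch is discarded.
        return num_examples // batch_size
    # The trailing partial batch is kept.
    return math.ceil(num_examples / batch_size)


print(batches_per_epoch(1000, 128, drop_last=False))  # 8
print(batches_per_epoch(1000, 128, drop_last=True))   # 7
```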

Adapting the model#

In model.py, replace the fixed number of classes with a parameter read from the params.yaml file:

import torch.nn as nn

class MNIST(nn.Module):
    def __init__(self, model_params):
        super().__init__()
        self.loss_fn = nn.NLLLoss()
        self.fc_layers = []
        input_size = model_params.get("input_size",784)
        num_classes = model_params.get("num_classes",10)
        ...
        self.last_layer = nn.Linear(input_size, num_classes)
        ...

In configs/params.yaml, add the additional fields used in the dataloader and model definition.

train_input:
    batch_size: 128
    drop_last_batch: True
    num_examples: 1000
    seed: 123
    image_size: [1,28,28]
    shuffle: True

eval_input:
    data_dir: "./data/mnist/val"
    batch_size: 128
    num_examples: 1000
    drop_last_batch: True
    seed: 1234
    image_size: [1,28,28]

model:
    name: "fc_mnist"
    mixed_precision: True
    input_size: 784 #1*28*28
    num_classes: 10
    ...
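
Since the input size is now configurable, input_size in the model section has to stay in step with image_size in the input sections (784 = 1*28*28 in the configuration above). A small check along these lines could catch a mismatch before a run is launched. The function name check_input_size is an assumption made for illustration and is not part of the Model Zoo.

```python
import math


def check_input_size(params: dict) -> int:
    """Verify model.input_size matches the flattened image_size (hypothetical helper)."""
    expected = math.prod(params["train_input"]["image_size"])
    actual = params["model"]["input_size"]
    if expected != actual:
        raise ValueError(
            f"input_size is {actual}, but image_size flattens to {expected}"
        )
    return expected


params = {
    "train_input": {"image_size": [1, 28, 28]},
    "model": {"input_size": 784, "num_classes": 10},
}
print(check_input_size(params))  # 784
```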

Developing new models and preprocessing scripts#

Utilizing the Cerebras Run Function#

If you aim to create new models and data preprocessing scripts, we recommend beginning with the Cerebras Model Zoo’s shared foundation, specifically the run function. This approach allows you to utilize an established structure, streamlining the development process for your custom models and preprocessing routines.

All PyTorch models in the Cerebras Model Zoo share a standard framework that facilitates running them on the Cerebras Systems (CS) platform or other hardware types like CPUs/GPUs. This framework handles the modifications required to compile and execute a model on a Cerebras cluster. It offers a unified training and evaluation interface, allowing users to integrate their models and data preprocessing scripts seamlessly. With this setup, users don’t need to make detailed code adjustments for compatibility with Cerebras systems.

Porting a PyTorch dense neural network for MNIST#

To utilize the run function effectively, ensure that the Cerebras Model Zoo repository, compatible with your target Cerebras cluster’s release, is installed. You can import the run function with the following code snippet:

from cerebras.modelzoo.common.run_utils import run

The code for the run function lives inside the Cerebras Model Zoo, in the common folder.

The run function simplifies and organizes various aspects of your model’s workflow, including its implementation, data loading processes, hyperparameter settings, and overall execution. To effectively use the run function, you should have:

- A params YAML file, which specifies the optimizer and the runtime configuration.
- An implementation that encompasses:
  - The definition of your model.
  - Data loaders for both training and evaluation.

Define Model#

To define the model architecture, the run function requires a callable class or function that takes as input a dictionary of params and returns a torch.nn.Module whose forward implementation returns a loss tensor.

For example, let’s implement FC_MNIST parametrized by the depth and the hidden size of the network. We assume that the input size is 784 and the last output dimension is 10. We use ReLU as the nonlinearity and a negative log-likelihood loss.

In model.py:

import torch
import torch.nn as nn
import torch.nn.functional as F

class MNISTModel(nn.Module):
    def __init__(self, model_params):
        super().__init__()
        self.fc_layers = []
        input_size = 784

        # Depth is len(hidden_sizes)
        model_params["hidden_sizes"] = [
            model_params["hidden_size"]
        ] * model_params["depth"]

        for hidden_size in model_params["hidden_sizes"]:
            fc_layer = nn.Linear(input_size, hidden_size)
            self.fc_layers.append(fc_layer)
            input_size = hidden_size
        self.fc_layers = nn.ModuleList(self.fc_layers)
        self.last_layer = nn.Linear(input_size, 10)

        self.nonlin = nn.ReLU()

        self.dropout = nn.Dropout(model_params["dropout"])

        self.loss_fn = nn.NLLLoss()

    def forward(self, batch):
        inputs, targets = batch

        x = torch.flatten(inputs, 1)
        for fc_layer in self.fc_layers:
            x = fc_layer(x)
            x = self.nonlin(x)
            x = self.dropout(x)

        pred_logits = self.last_layer(x)

        outputs = F.log_softmax(pred_logits, dim=1)

        loss = self.loss_fn(outputs, targets)

        return loss

Note

The input to the torch.nn.Module passed to the `run` function is the whole batch, which includes both the inputs and the labels needed to compute the loss. It is up to the model to extract the inputs and labels from the batch before using them.
The output of the model is expected to be the loss of that forward pass.

Define dataloaders#

To define the data loaders, the run function requires a callable (either class or function) that takes as input a dictionary of params, and returns a torch.utils.data.DataLoader. When running training, the train_data_fn must be provided. When running evaluation, the eval_data_fn must be provided.

For example, to implement FC_MNIST, we create two functions, one for training and one for evaluation. We use torchvision.datasets to download the MNIST dataset. Each of these functions returns a torch.utils.data.DataLoader.

In data.py:

import torch
from torchvision import datasets, transforms

def get_train_dataloader(params):
    input_params = params["train_input"]

    batch_size = input_params.get("batch_size")
    dtype = torch.float16 if input_params["to_float16"] else torch.float32
    shuffle = input_params["shuffle"]

    train_dataset = datasets.MNIST(
        input_params["data_dir"],
        train=True,
        download=True,
        transform=transforms.Compose(
            [
                transforms.ToTensor(),
                transforms.Normalize((0.1307,), (0.3081,)),
                transforms.Lambda(
                    lambda x: torch.as_tensor(x, dtype=dtype)
                ),
            ]
        ),
        target_transform=transforms.Lambda(
            lambda x: torch.as_tensor(x, dtype=torch.int32)
        ),
    )

    train_loader = torch.utils.data.DataLoader(
        train_dataset,
        batch_size=batch_size,
        drop_last=input_params["drop_last_batch"],
        shuffle=shuffle,
        num_workers=input_params.get("num_workers", 0),
    )
    return train_loader

def get_eval_dataloader(params):
    input_params = params["eval_input"]

    batch_size = input_params.get("batch_size")
    dtype = torch.float16 if input_params["to_float16"] else torch.float32

    eval_dataset = datasets.MNIST(
        input_params["data_dir"],
        train=False,
        download=True,
        transform=transforms.Compose(
            [
                transforms.ToTensor(),
                transforms.Normalize((0.1307,), (0.3081,)),
                transforms.Lambda(
                    lambda x: torch.as_tensor(x, dtype=dtype)
                ),
            ]
        ),
        target_transform=transforms.Lambda(
            lambda x: torch.as_tensor(x, dtype=torch.int32)
        ),
    )

    eval_loader = torch.utils.data.DataLoader(
        eval_dataset,
        batch_size=batch_size,
        drop_last=input_params["drop_last_batch"],
        shuffle=False,
        num_workers=input_params.get("num_workers", 0),
    )
    return eval_loader

Set up the `run` function#

The run function must be imported from run_utils.py. Remember to append the parent directory of the Cerebras Model Zoo repository to sys.path in your local setup.

All the input parameters of the run function are callables that take a dictionary called params as input. This dictionary contains all of the model and data parameters specified by the model’s params YAML file.

Parameter	Type	Notes
`model_fn`	`Callable[[dict], torch.nn.Module]`	Required. A callable that takes in a dictionary of parameters. Returns a `torch.nn.Module`.
`train_data_fn`	`Callable[[dict], torch.utils.data.DataLoader]`	Required during training run.
`eval_data_fn`	`Callable[[dict], torch.utils.data.DataLoader]`	Required during evaluation run.
`default_params_fn`	`Callable[[dict], Optional[dict]]`	Optional. A callable that takes in a dictionary of parameters. Sets default parameters.

For the FC_MNIST example, with all of the elements in place, we now import the run function from modelzoo.common.run_utils after appending the parent directory of the Cerebras Model Zoo to sys.path.

In run.py:

import os
import sys

# Append the path to the parent directory of the Cerebras Model Zoo repository
sys.path.append(os.path.join(os.path.dirname(__file__), ".."))
from modelzoo.common.run_utils import run

from data import (
    get_train_dataloader,
    get_eval_dataloader,
)
from model import MNISTModel

def main():
    run(MNISTModel, get_train_dataloader, get_eval_dataloader)

if __name__ == '__main__':
    main()

Manage common params for multiple experiments#

To avoid replicating params across multiple similar experiments, the run function has an optional input parameter called default_params_fn. This callable modifies the dictionary loaded from the params YAML file, adding default values for unspecified params.

Setting up a default_params_fn could be beneficial if the user is planning multiple experiments in which only a small subset of the params YAML file changes. The default_params_fn sets up the values shared in all of the experiments. The user can create different configuration YAML files to only address the changes between experiments.

The default_params_fn should be a callable that takes in the params dictionary and returns a new dictionary. If the default_params_fn is omitted, the params dictionary will be used as is.
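
As a minimal sketch of such a callable, the function below fills in values that every experiment shares while leaving anything set in the YAML file untouched. The function name set_defaults and the particular default values are assumptions chosen for illustration, not defaults defined by the Model Zoo.

```python
import copy


def set_defaults(params: dict) -> dict:
    """Return a copy of params with shared defaults filled in (hypothetical example)."""
    params = copy.deepcopy(params)
    for section in ("train_input", "eval_input"):
        section_params = params.setdefault(section, {})
        # setdefault only writes a value when the YAML file did not provide one.
        section_params.setdefault("batch_size", 128)
        section_params.setdefault("drop_last_batch", True)
        section_params.setdefault("to_float16", True)
    params.setdefault("model", {}).setdefault("dropout", 0.0)
    return params


experiment = {"train_input": {"batch_size": 256}, "model": {"depth": 4}}
merged = set_defaults(experiment)
print(merged["train_input"]["batch_size"])  # 256, the experiment's own value wins
print(merged["eval_input"]["batch_size"])   # 128, filled in from the defaults
```

A function like this would then be supplied as the default_params_fn argument of run, alongside the model and data loader callables.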

Create params YAML file#

At runtime, the run function requires a separate params YAML. This file is specified during execution with the flag --params in the command line.

For example, this is the params.yaml file for the FC_MNIST implementation. We customize the fields in train_input, eval_input, model, to be used inside get_train_dataloader, get_eval_dataloader, MNISTModel. We also specify the required optimizer and runconfig params.

train_input:
    data_dir: "./data/mnist/train"
    batch_size: 128
    drop_last_batch: True
    shuffle: True
    to_float16: True

eval_input:
    data_dir: "./data/mnist/val"
    batch_size: 128
    drop_last_batch: True
    to_float16: True

model:
    name: "fc_mnist"
    mixed_precision: True
    depth: 10
    hidden_size: 50
    dropout: 0.0
    activation_fn: "relu"

optimizer:
    optimizer_type: "SGD"
    learning_rate: 0.001
    momentum: 0.9
    loss_scaling_factor: 1.0

runconfig:
    max_steps: 10000
    checkpoint_steps: 2000
    log_steps: 50
    seed: 1

The params YAML file typically includes several key sections to configure different aspects of the model, training, and execution. These sections can include:

Section	Required	Notes
`runconfig`	Yes	Used by run to set up logging and execution. It expects fields: `max_steps`, `checkpoint_steps`, `log_steps`.
`optimizer`	Yes	Specify the optimizer using the `optimizer_type` field. The value must correspond to one of the optimizers available in the cerebras.pytorch.optim package.
`model`	No	By convention, it is used to customize the model architecture in `nn.Module`. Fields are tailored to needs inside the model.
`train_input`	No	By convention, it is used to customize train_data_fn. Fields are tailored to needs inside train_data_fn.
`eval_input`	No	By convention, it is used to customize eval_data_fn. Fields are tailored to needs inside eval_data_fn.
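
The required entries in the table above can be checked before launching a job. The following is only a sketch: the helper missing_entries and the REQUIRED_FIELDS mapping are hypothetical and are built from the table, so the validation that run itself performs may differ.

```python
# Required sections and fields, as listed in the table above (assumed complete).
REQUIRED_FIELDS = {
    "runconfig": ("max_steps", "checkpoint_steps", "log_steps"),
    "optimizer": ("optimizer_type",),
}


def missing_entries(params: dict) -> list:
    """List required sections or fields absent from params (hypothetical helper)."""
    missing = []
    for section, fields in REQUIRED_FIELDS.items():
        if section not in params:
            missing.append(section)
            continue
        missing.extend(
            f"{section}.{name}" for name in fields if name not in params[section]
        )
    return missing


params = {"runconfig": {"max_steps": 10000, "log_steps": 50}, "model": {"depth": 10}}
print(missing_entries(params))  # ['runconfig.checkpoint_steps', 'optimizer']
```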

Optimizer#

Within the context of the cerebras.pytorch.optim package, you can configure the optimizer for your training process using various parameters.

To select an optimizer, use the optimizer_type parameter and then adjust the necessary and optional parameters to tailor the optimizer to your needs.

Example of a YAML file:

optimizer:
    optimizer_type: SGD  # Replace SGD with your chosen optimizer
    lr: 0.001            # Set the learning rate
    momentum: 0.9        # Set the momentum (note: not all optimizers use this)

For the full list of supported optimizers, see the cerebras.pytorch.optim API documentation. Make sure the additional arguments you pass for each optimizer type match those detailed there.

Learning Rate Scheduler#

The cerebras.pytorch.optim package provides a suite of learning rate schedulers. You configure them in your training configuration through the learning_rate sub-parameter. The available scheduler subclasses are listed in Learning Rate Schedulers in cerebras.pytorch.

Loss scaling#

We support static and dynamic loss scaling which are configurable through the optimizer’s subparameters:

`loss_scaling_factor`	A constant scalar value configures static loss scaling. Passing the string `"dynamic"` configures dynamic loss scaling. (Default: `1`, which applies no loss scaling.)
`initial_loss_scale`	The initial loss scale value if `loss_scaling_factor == "dynamic"`. (Default: `2e15`)
`steps_per_increase`	The number of steps after which the loss scale is increased. (Default: `2000`)
`min_loss_scale`	The minimum loss scale value that can be chosen by dynamic loss scaling. (Default: `2e-14`)
`max_loss_scale`	The maximum loss scale value that can be chosen by dynamic loss scaling. (Default: `2e15`)
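
To show how these subparameters typically interact, here is a generic dynamic loss scaling policy. It is not the Cerebras implementation: the class name DynamicLossScale and the halving and doubling factor of 2 are assumptions based on the common convention, and the small numbers in the usage example are chosen only to keep the trace short.

```python
from dataclasses import dataclass


@dataclass
class DynamicLossScale:
    """Generic dynamic loss scaling policy (illustrative, not the Cerebras code)."""

    scale: float
    steps_per_increase: int
    min_loss_scale: float
    max_loss_scale: float
    good_steps: int = 0

    def update(self, found_overflow: bool) -> float:
        if found_overflow:
            # Gradients contained inf/NaN: shrink the scale and restart the count.
            self.scale = max(self.scale / 2.0, self.min_loss_scale)
            self.good_steps = 0
        else:
            self.good_steps += 1
            if self.good_steps == self.steps_per_increase:
                # Enough clean steps in a row: try a larger scale.
                self.scale = min(self.scale * 2.0, self.max_loss_scale)
                self.good_steps = 0
        return self.scale


scaler = DynamicLossScale(
    scale=8.0, steps_per_increase=2, min_loss_scale=1.0, max_loss_scale=16.0
)
history = [scaler.update(overflow) for overflow in (False, False, False, False, True)]
print(history)  # [8.0, 16.0, 16.0, 16.0, 8.0]
```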

Global Gradient Clipping#

We support global gradient clipping by value or by norm. Both are configurable through the optimizer’s subparameters:

`max_gradient_norm`	max norm of the gradients
`max_gradient_value`	max value of the gradients

Note

The above subparameters are mutually exclusive. They cannot both be specified at the same time.
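
The difference between the two modes, and the mutual exclusion, can be illustrated on a plain list of gradient values. This is a sketch of the general technique only: the function clip_gradients is hypothetical and operates on Python floats rather than on tensors.

```python
import math


def clip_gradients(grads, max_gradient_norm=None, max_gradient_value=None):
    """Clip a flat list of gradient values by global norm or by value (illustrative)."""
    if max_gradient_norm is not None and max_gradient_value is not None:
        raise ValueError(
            "max_gradient_norm and max_gradient_value are mutually exclusive"
        )
    if max_gradient_norm is not None:
        total_norm = math.sqrt(sum(g * g for g in grads))
        # Rescale all gradients together, and only when the global norm is too large.
        factor = min(1.0, max_gradient_norm / total_norm) if total_norm > 0 else 1.0
        return [g * factor for g in grads]
    if max_gradient_value is not None:
        # Clamp each gradient independently to [-max_gradient_value, max_gradient_value].
        return [max(-max_gradient_value, min(max_gradient_value, g)) for g in grads]
    return list(grads)


print(clip_gradients([3.0, 4.0], max_gradient_norm=2.5))    # [1.5, 2.0]
print(clip_gradients([3.0, -4.0], max_gradient_value=3.5))  # [3.0, -3.5]
```

Clipping by norm preserves the direction of the gradient vector, while clipping by value can change it because each component is clamped separately.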

Execute script with run function#

All models in the Cerebras Model Zoo use the run function inside the script run.py for both PyTorch and TensorFlow implementations. Therefore, once you have ported your model to use the run function, you can follow the steps in the Launch your job section to launch your training or evaluation job.

Logging and Evaluation metrics#

By default, the run function logs training information to the console and to TensorBoard, as explained in Measure throughput of your model. You can also define your own scalar and tensor summaries.

The Cerebras Model Zoo git repository uses a base class to compute evaluation metrics called CBMetric. Metrics already defined in the Model Zoo git repository can be imported as:

from cerebras.pytorch.metrics import (
    AccuracyMetric,
    DiceCoefficientMetric,
    MeanIOUMetric,
    PerplexityMetric,
)

As an example, the GPT2 implementation in PyTorch uses some of these metrics.

How to use evaluation metrics:

Registration: All metrics must be registered with the corresponding torch.nn.Module class. This is automatically done when the CBMetric object is constructed. That is, to register a metric to a torch.nn.Module class, construct the metric object in the torch.nn.Module class’ constructor.
Update: The metrics are stateful. This means that every call to the metric object with the appropriate arguments automatically computes the latest metric value and saves it in the metric’s internal state.
Logging: At the very end of the run, the final metrics values will be computed and then logged both to the console and to the TensorBoard SummaryWriter.
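
The update behavior described above can be pictured with a small stand-in class. This sketch does not use the CBMetric API: the class RunningAccuracy and its methods are hypothetical and only illustrate what it means for a metric to be stateful across calls.

```python
class RunningAccuracy:
    """Stateful accuracy metric (hypothetical stand-in, not the CBMetric API)."""

    def __init__(self):
        self.correct = 0
        self.total = 0

    def __call__(self, predictions, labels):
        # Update: fold this batch into the running state and return the latest value.
        self.correct += sum(int(p == y) for p, y in zip(predictions, labels))
        self.total += len(labels)
        return self.compute()

    def compute(self):
        # Final value over everything seen so far.
        return self.correct / self.total if self.total else 0.0


metric = RunningAccuracy()
metric([1, 0, 2, 2], [1, 0, 1, 2])  # batch 1: 3 of 4 correct
metric([0, 0, 1, 1], [0, 1, 1, 1])  # batch 2: 3 of 4 correct
print(metric.compute())             # 0.75
```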