Porting PyTorch Model to CS

Option 1 (Easiest): Modify reference models in Cerebras Reference Implementations

The Cerebras Reference Implementations repository contains PyTorch implementations of popular neural networks such as BERT, GPT-2, and T5. These implementations are modularized to separate data preprocessing, model implementation, and the additional functions needed for execution.

If your primary goal is to use one of these models, even with some model or data preprocessing changes, we recommend starting from the existing Cerebras Reference Implementations code and adding the changes you need.

Example 1: Changing the data loader

For this example, we work with the PyTorch implementation of FC_MNIST in the Cerebras Reference Implementations. We create a synthetic dataloader to evaluate the performance of the network with respect to different input sizes and numbers of classes.

In data.py, we create a function called get_random_dataloader that creates random images and labels. The function reads the number of examples, the batch size, the seed, the image size, and the number of classes of the dataset from the params.yaml file.

import torch
import numpy as np

def get_random_dataloader(input_params, shuffle, num_classes):
    num_examples = input_params.get("num_examples")
    batch_size = input_params.get("batch_size")
    seed = input_params.get("seed", 1)
    image_size = input_params.get("image_size", [1, 28, 28])

    np.random.seed(seed)
    image = np.random.random(size=[num_examples] + image_size).astype(np.float32)
    label = np.random.randint(low=0, high=num_classes, size=num_examples).astype(np.int32)

    dataset = torch.utils.data.TensorDataset(
        torch.from_numpy(image),
        torch.from_numpy(label)
    )

    return torch.utils.data.DataLoader(
        dataset,
        batch_size=batch_size,
        shuffle=shuffle,
        num_workers=input_params.get("num_workers", 0),
    )

def get_train_dataloader(params):
    return get_random_dataloader(
        params["train_input"],
        params["train_input"].get("shuffle"),
        params["model"].get("num_classes")
    )

def get_eval_dataloader(params):
    return get_random_dataloader(
        params["eval_input"],
        False,
        params["model"].get("num_classes")
    )
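
As a quick sanity check outside the Cerebras harness, the dataloader above can be exercised locally by handing it a params dictionary that mirrors params.yaml. The snippet below is illustrative only; at runtime the run harness constructs params from the YAML file for you.

# Illustrative local check of the synthetic dataloader defined above.
params = {
    "train_input": {
        "num_examples": 1000,
        "batch_size": 128,
        "seed": 123,
        "image_size": [1, 28, 28],
        "shuffle": True,
    },
    "model": {"num_classes": 10},
}

loader = get_train_dataloader(params)
images, labels = next(iter(loader))
print(images.shape, labels.shape)  # torch.Size([128, 1, 28, 28]) torch.Size([128])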

In model.py, we change the fixed number of classes into a parameter read from the params.yaml file.

class MNIST(nn.Module):
    def __init__(self, model_params):
        super().__init__()
        self.loss_fn = nn.NLLLoss()
        self.fc_layers = []
        input_size = model_params.get("input_size",784)
        num_classes = model_params.get("num_classes",10)
        ...
        self.last_layer = nn.Linear(input_size, num_classes)
        ...

In configs/params.yaml, we add the additional fields used in the dataloader and model definition.

train_input:
    batch_size: 128
    drop_last_batch: True
    num_examples: 1000
    seed: 123
    image_size: [1,28,28]
    shuffle: True

eval_input:
    data_dir: "./data/mnist/val"
    batch_size: 128
    num_examples: 1000
    drop_last_batch: True
    seed: 1234
    image_size: [1,28,28]

model:
    name: "fc_mnist"
    mixed_precision: True
    input_size: 784 # 1*28*28
    num_classes: 10
    ...

Option 2 (Easy): Create new models leveraging the Cerebras run function available in the Reference Implementations

All PyTorch implementations in the Cerebras Reference Implementations use a common harness to manage execution on the CS system and on other hardware. This harness implements the code changes necessary to compile a model for a Cerebras system, run a compiled model on a Cerebras system, or run the model on CPU/GPU. It therefore provides a training/evaluation interface into which models and data preprocessing scripts can be plugged without line-by-line modifications to make the code Cerebras-friendly.

If your primary goal is to develop new model and data preprocessing scripts, we suggest starting from the common backbone in the Cerebras Reference Implementations: the run function.

Prerequisites

To use the run function, you must have a copy of the Cerebras Reference Implementations that is compatible with the release installed on the target CS system. The run function can be imported as follows:

from cerebras_reference_implementations.common.pytorch.run_utils import run

All the code related to the run function lives inside the Cerebras Reference Implementations and can be found in the common/pytorch folder.

How to use the run function

The run function modularizes the model implementation, the data loaders, the hyperparameters, and the execution. To use the run function, you need:

  1. Params YAML file. This file will be used at runtime.

  2. Implementation that includes the following:

    1. Model definition

    2. Data loaders for training and evaluation

Code Skeleton

import os
import sys

import torch

#Append path to parent directory of Cerebras Reference Implementations
sys.path.append(os.path.join(os.path.dirname(__file__), ".."))
from cerebras_reference_implementations.common.pytorch.run_utils import run
from cerebras_reference_implementations.common.pytorch.PyTorchBaseModel import PyTorchBaseModel

#Step 1: Define Model
#Step 1.1 Define Module
class Model(torch.nn.Module):
    def __init__(self, params):
        ...
    def forward(self, inputs):
        ...
        return outputs
    ...

#Step 1.2 Define PyTorchBaseModel
class BaseModel(PyTorchBaseModel):
    def __init__(self, params, device=None):
        self.model = Model(params)
        self.loss_fn = ...
        ...
        super().__init__(params=params, model=self.model, device=device)
    def __call__(self, data):
        ...
        inputs, targets = data
        outputs = self.model(inputs)
        loss = self.loss_fn(outputs, targets)
        return loss

#Step 2: Define dataloaders
def get_train_dataloader(params):
    ...
    loader = torch.utils.data.DataLoader(...)
    return loader

def get_eval_dataloader(params):
    ...
    loader = torch.utils.data.DataLoader(...)
    return loader

#Step 3: Setup run function
def main():
    run(BaseModel, get_train_dataloader, get_eval_dataloader)

if __name__ == '__main__':
    main()

Step 1: Define Model

To define the model architecture, the run function requires a callable (either a class or a function) that takes as input a dictionary of params and returns a PyTorchBaseModel. To construct this callable:

  1. First, define the model architecture with torch.nn.Module.

  2. Then, wrap it by defining a PyTorchBaseModel. This class also takes care of defining the optimizer.

To customize the model, the run function creates a dictionary of params from the params YAML file such that:

  1. The model section defines architecture hyperparameters (optional).

  2. The optimizer section defines learning rates and other optimizer details (required).

Creating a PyTorchBaseModel

A PyTorchBaseModel object is a light wrapper around a torch.nn.Module. The PyTorchBaseModel class configures the optimization parameters defined in the params YAML file. This class can be imported from the Cerebras Reference Implementations as follows:

from cerebras_reference_implementations.common.pytorch.PyTorchBaseModel import PyTorchBaseModel

The implementation can be found at https://github.com/Cerebras/cerebras_reference_implementations/blob/master/common/pytorch/PyTorchBaseModel.py

Initializing a PyTorchBaseModel requires:

Parameter | Type | Notes
--- | --- | ---
params | dict | Dictionary constructed from the params YAML file.
model | torch.nn.Module | Definition of the model architecture.
device | torch.device | Optional; defaults to None. In this case, the runner code inside the run function figures out the proper device for the run.

In addition, any child class of PyTorchBaseModel must implement the __call__ function. Given one iteration of a dataloader as input, the __call__ function should return the loss associated with one forward pass of that batch.

Step 2: Define dataloaders

To define the data loaders, the run function requires a callable (either a class or a function) that takes as input a dictionary of params and returns a torch.utils.data.DataLoader. When running training, the train_data_fn must be provided. When running evaluation, the eval_data_fn must be provided.

Step 3: Set up the run function

The run function must be imported from cerebras_reference_implementations.common.pytorch.run_utils. Always remember to append the parent directory of the Cerebras Reference Implementations to sys.path.

All the input parameters of the run function are callables that take as input a dictionary called params. params is a dictionary containing all of the model and data parameters specified in the params YAML file of the model.

Parameter | Type | Notes
--- | --- | ---
model_fn | Callable[[dict], PyTorchBaseModel] | Required. A callable that takes in a dictionary of parameters and returns a PyTorchBaseModel.
train_data_fn | Callable[[dict], torch.utils.data.DataLoader] | Required during a training run.
eval_data_fn | Callable[[dict], torch.utils.data.DataLoader] | Required during an evaluation run.
default_params_fn | Callable[[dict], Optional[dict]] | Optional. A callable that takes in a dictionary of parameters and sets default parameters.

Manage common params for multiple experiments

To avoid replicating params between multiple similar experiments, the run function has an optional input parameter called default_params_fn. This callable modifies the dictionary built from the params YAML file, adding default values for unspecified params.

Setting up a default_params_fn could be beneficial if the user is planning multiple experiments in which only a small subset of the params YAML file changes. The default_params_fn sets up the values shared in all of the experiments. The user can create different configuration YAML files to only address the changes between experiments.

The default_params_fn should be a callable that takes in the params dictionary and returns a new dictionary. If the default_params_fn is omitted, the params dictionary will be used as is.
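
For illustration, a default_params_fn could look like the following sketch; the function name and the specific defaults are hypothetical:

def set_default_params(params):
    # Fill in values shared across experiments so that individual YAML
    # files only need to specify what changes between runs.
    params["model"].setdefault("dropout", 0.0)
    params["train_input"].setdefault("num_workers", 0)
    params["eval_input"].setdefault("num_workers", 0)
    return params

It would then be passed to run as the default_params_fn parameter, alongside the model and dataloader callables.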

Step 4: Create params YAML file

At runtime, the run function requires a separate params YAML file. This file is specified during execution with the --params flag on the command line.

Parameters skeleton:

train_input:
    ...

eval_input:
    ...

model:
    ...

optimizer:
    optimizer_type: ...
    learning_rate: ...
    loss_scaling_factor: ...

runconfig:
    max_steps: ...
    checkpoint_steps: ...
    log_steps: ...
    seed: ...
    save_losses: ...

The params YAML file has the following sections:

Section | Required | Notes
--- | --- | ---
runconfig | Yes | Used by run to set up logging and execution. It expects the fields max_steps, checkpoint_steps, log_steps, and save_losses.
optimizer | Yes | Used by PyTorchBaseModel to set up the optimizer. It expects the fields optimizer_type, learning_rate, and loss_scaling_factor.
model | No | By convention, used to customize the model architecture in nn.Module. Fields are tailored to the needs inside the model.
train_input | No | By convention, used to customize train_data_fn. Fields are tailored to the needs inside train_data_fn.
eval_input | No | By convention, used to customize eval_data_fn. Fields are tailored to the needs inside eval_data_fn.

Optimizer

There are a number of optimizer parameters that can be used to configure the optimizer for the run.

Currently, the only supported optimizers are SGD and AdamW. The optimizer type can be specified via the optimizer_type subparameter. Below are the required and optional params that can be used to configure them.

optimizer_type | Parameters | Description
--- | --- | ---
SGD | learning_rate | See the "Learning Rate Scheduler" subsection.
SGD | momentum | The momentum factor.
SGD | weight_decay_rate | Optional. Weight decay (L2 penalty). (Default: 0.0)
AdamW | learning_rate | See the "Learning Rate Scheduler" subsection.
AdamW | beta1 | Optional. Adam's first beta parameter. (Default: 0.9)
AdamW | beta2 | Optional. Adam's second beta parameter. (Default: 0.999)
AdamW | correct_bias | Optional. Whether or not to correct bias in Adam. (Default: False)
AdamW | exclude_from_weight_decay | Parameters to exclude from weight decay.

All of the above parameters are subparameters of the top-level optimizer parameter. Refer to the Reference Implementations for examples of how to configure the optimizer.
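
For illustration, an AdamW configuration might look like the following; the values shown are placeholders, not recommendations:

optimizer:
    optimizer_type: "AdamW"
    learning_rate: 0.0001
    beta1: 0.9
    beta2: 0.999
    correct_bias: True
    loss_scaling_factor: 1.0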

Learning Rate Scheduler

We also support various learning rate schedulers. They are configurable using the learning_rate subparameter. Valid configurations include the following:

learning_rate | Parameters | Description
--- | --- | ---
Constant | | A floating point number specifying the learning rate to be used throughout the run.
PieceWiseConstant | values | The constant values to use.
PieceWiseConstant | boundaries | The steps at which to change the learning rate values.
Linear | initial_learning_rate | The starting learning rate value.
Linear | end_learning_rate | The final learning rate value.
Linear | steps | The number of steps over which to transition from the starting learning rate to the final learning rate.
Exponential | initial_learning_rate | The starting learning rate value.
Exponential | decay_steps | The number of steps to decay the learning rate.
Exponential | decay_rate | The rate at which to decay the learning rate.
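
As a sketch, a linear schedule might be configured as follows. The nested layout and the scheduler selector key are assumptions on our part; check the config files shipped with the Reference Implementations for your release for the exact key names.

optimizer:
    optimizer_type: "SGD"
    momentum: 0.9
    learning_rate:
        scheduler: "Linear"           # assumed selector key
        initial_learning_rate: 0.001
        end_learning_rate: 0.0001
        steps: 10000
    loss_scaling_factor: 1.0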

Loss scaling

We support static and dynamic loss scaling, which is configurable through the optimizer's subparameters:

Parameter | Description
--- | ---
loss_scaling_factor | A constant scalar value configures static loss scaling; passing the string "dynamic" configures dynamic loss scaling. (Default: 1.0, i.e., no loss scaling.)
initial_loss_scale | The initial loss scale value when loss_scaling_factor == "dynamic". (Default: 2e15)
steps_per_increase | The number of steps after which to increase the loss scaling condition. (Default: 2000)
min_loss_scale | The minimum loss scale value that can be chosen by dynamic loss scaling. (Default: 2e-14)
max_loss_scale | The maximum loss scale value that can be chosen by dynamic loss scaling. (Default: 2e15)
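
For example, dynamic loss scaling could be enabled as follows; the values other than "dynamic" simply write out the documented defaults:

optimizer:
    optimizer_type: "SGD"
    learning_rate: 0.001
    momentum: 0.9
    loss_scaling_factor: "dynamic"
    initial_loss_scale: 2e15
    steps_per_increase: 2000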

Global Gradient Clipping

We support global gradient clipping by value or by norm. It is configurable through the optimizer's subparameters:

Parameter | Description
--- | ---
max_gradient_norm | Maximum norm of the gradients.
max_gradient_value | Maximum value of the gradients.

Note

The above subparameters are mutually exclusive. They cannot both be specified at the same time.
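
For example, to clip gradients by global norm (the threshold value is illustrative):

optimizer:
    optimizer_type: "SGD"
    learning_rate: 0.001
    momentum: 0.9
    max_gradient_norm: 1.0
    loss_scaling_factor: 1.0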

Step 5: Execute script with run function

The run function is instrumented to parse the command line arguments.

Required arguments:

-p PARAMS, --params PARAMS
Path to the .yaml file with model parameters. These parameters usually define the architecture of the model, the details of data preprocessing, and the logging frequencies.

-m {train,eval,train_and_eval}, --mode {train,eval,train_and_eval}
Execution mode. Depending on this value, the runner exercises either the training loop or the evaluation loop.

Optional arguments:

-h, --help
Shows the help message and exits.

-cs CS_IP, --cs_ip CS_IP
IP address of the Cerebras system. This argument specifies the IP address of the Cerebras system and the port that the connection manager is listening on. If this parameter is not provided, the runner checks whether a GPU is available and automatically uses it unless the --cpu argument is also provided.

--compile_only
Enables the compile-only workflow. This workflow compiles the model and creates executables for the Cerebras system, but it does not launch a job on the system. It is particularly useful to verify whether the model compiles without using system resources.

-o MODEL_DIR, --model_dir MODEL_DIR
Model directory where checkpoints are written. This is the directory where all of the logs and artifacts generated by compilation and execution (including checkpoints) are stored.

--checkpoint_path CHECKPOINT_PATH
Checkpoint to initialize weights from. This is useful for running evaluation or continuing training from pretrained weights.

--is_pretrained_checkpoint
Flag indicating that the provided checkpoint is from a pre-training run. If set, training begins from step 0 after loading the matching weights from the checkpoint, and any optimizer state present in the checkpoint is ignored.

--logging LOGGING
Specifies the default logging level. Defaults to INFO.

Execution on a CS system

To execute the model on a CS system, we can use the instrumentation inside the run function.

Compile:

csrun_cpu python-pt run.py --mode <train,eval> --params params.yaml --compile_only --cs_ip <CS_IP:port>

Execute:

csrun_wse python-pt run.py --mode <train,eval> --params params.yaml --cs_ip <CS_IP:port>

Execution on different hardware

Using the run function enables executing the same code on a Cerebras system and on CPU/GPU without any changes. You can use the command line arguments to specify the type of device used for training and evaluation.

If you are interested in:

Execution | Devices Available | Flags Needed
--- | --- | ---
Compilation | Compilation is done on CPU and is only required when executing code on a Cerebras system. | --compile_only --cs_ip CS_IP --mode {train,eval}
Training | Cerebras system | --cs_ip CS_IP --mode train
Training | CPU | --cpu --mode train
Training | GPU | --mode train
Evaluation | Cerebras system | --cs_ip CS_IP --mode eval
Evaluation | CPU | --cpu --mode eval
Evaluation | GPU | --mode eval
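
For example, training on GPU or CPU uses the same script invoked directly with Python; the paths below are illustrative:

python run.py --mode train --params configs/params.yaml --model_dir model_dir        # GPU, if available
python run.py --mode train --params configs/params.yaml --model_dir model_dir --cpu  # CPU only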

Example using run with FC_MNIST

In this example, we port a PyTorch implementation of a fully connected dense neural network for the MNIST dataset to CS-friendly code using the run function.

Step 1: Define model

In this example, we construct an FC_MNIST implementation given the depth and the hidden size of the network. We assume that the input size is 784 and the last output dimension is 10. We use ReLU as the nonlinearity and a negative log likelihood loss.

In fc_mnist.py, we define a child class of torch.nn.Module called MNIST.

import torch
import torch.nn as nn
import torch.nn.functional as F

class MNIST(nn.Module):
    def __init__(self, model_params):
        super().__init__()
        self.fc_layers = []
        input_size = 784

        # Depth is len(hidden_sizes)
        model_params["hidden_sizes"] = [
            model_params["hidden_size"]
        ] * model_params["depth"]

        for hidden_size in model_params["hidden_sizes"]:
            fc_layer = nn.Linear(input_size, hidden_size)
            self.fc_layers.append(fc_layer)
            input_size = hidden_size
        self.fc_layers = nn.ModuleList(self.fc_layers)
        self.last_layer = nn.Linear(input_size, 10)

        self.nonlin = nn.ReLU()

        self.dropout = nn.Dropout(model_params["dropout"])

    def forward(self, inputs):
        x = torch.flatten(inputs, 1)
        for fc_layer in self.fc_layers:
            x = fc_layer(x)
            x = self.nonlin(x)
            x = self.dropout(x)

        pred_logits = self.last_layer(x)

        outputs = F.log_softmax(pred_logits, dim=1)
        return outputs

Then, in model.py, we create a child class of PyTorchBaseModel called MNISTModel that wraps the MNIST module. In addition to initialization, we implement two functions in MNISTModel: build_model, which creates a MNIST object, and __call__, which returns the loss associated with one forward pass of a given dataloader iteration.

import torch
import torch.nn as nn

from cerebras_reference_implementations.common.pytorch.PyTorchBaseModel import PyTorchBaseModel
from fc_mnist import MNIST

class MNISTModel(PyTorchBaseModel):
    def __init__(self, params, device=None):
        self.params = params
        model_params = params["model"].copy()
        self.model = self.build_model(model_params)
        self.loss_fn = nn.NLLLoss()

        super().__init__(params=params, model=self.model, device=device)

    def build_model(self, model_params):
        dtype = torch.float32
        model = MNIST(model_params)
        model.to(dtype)
        return model

    def __call__(self, data):
        inputs, labels = data
        outputs = self.model(inputs)
        loss = self.loss_fn(outputs, labels)
        return loss

Step 2: Define dataloaders

In this example, we create two different functions for training and evaluation. We use torchvision.datasets to download the MNIST dataset. Each of these functions returns a torch.utils.data.DataLoader.

In data.py:

import torch
from torchvision import datasets, transforms

def get_train_dataloader(params):
    input_params = params["train_input"]

    batch_size = input_params.get("batch_size")
    dtype = torch.float16 if input_params["to_float16"] else torch.float32
    shuffle = input_params["shuffle"]

    train_dataset = datasets.MNIST(
        input_params["data_dir"],
        train=True,
        download=True,
        transform=transforms.Compose(
            [
                transforms.ToTensor(),
                transforms.Normalize((0.1307,), (0.3081,)),
                transforms.Lambda(
                    lambda x: torch.as_tensor(x, dtype=dtype)
                ),
            ]
        ),
        target_transform=transforms.Lambda(
            lambda x: torch.as_tensor(x, dtype=torch.int32)
        ),
    )

    train_loader = torch.utils.data.DataLoader(
        train_dataset,
        batch_size=batch_size,
        drop_last=input_params["drop_last_batch"],
        shuffle=shuffle,
        num_workers=input_params.get("num_workers", 0),
    )
    return train_loader

def get_eval_dataloader(params):
    input_params = params["eval_input"]

    batch_size = input_params.get("batch_size")
    dtype = torch.float16 if input_params["to_float16"] else torch.float32

    eval_dataset = datasets.MNIST(
        input_params["data_dir"],
        train=False,
        download=True,
        transform=transforms.Compose(
            [
                transforms.ToTensor(),
                transforms.Normalize((0.1307,), (0.3081,)),
                transforms.Lambda(
                    lambda x: torch.as_tensor(x, dtype=dtype)
                ),
            ]
        ),
        target_transform=transforms.Lambda(
            lambda x: torch.as_tensor(x, dtype=torch.int32)
        ),
    )

    eval_loader = torch.utils.data.DataLoader(
        eval_dataset,
        batch_size=batch_size,
        drop_last=input_params["drop_last_batch"],
        shuffle=False,
        num_workers=input_params.get("num_workers", 0),
    )
    return eval_loader

Step 3: Set up run function

With all of the elements in place, we now import the run function from cerebras_reference_implementations.common.pytorch.run_utils. We append the parent directory of the Cerebras Reference Implementations to sys.path.

In run.py:

import os
import sys

#Append path to parent directory of Cerebras Reference Implementations
sys.path.append(os.path.join(os.path.dirname(__file__), ".."))
from cerebras_reference_implementations.common.pytorch.run_utils import run

from data import (
    get_train_dataloader,
    get_eval_dataloader,
)
from model import MNISTModel

def main():
    run(MNISTModel, get_train_dataloader, get_eval_dataloader)

if __name__ == '__main__':
    main()

Step 4: Set up the params YAML file

We customize the fields in train_input, eval_input, and model to be used inside get_train_dataloader, get_eval_dataloader, and MNISTModel. We also specify the required optimizer and runconfig params.

In params.yaml:

train_input:
    data_dir: "./data/mnist/train"
    batch_size: 128
    drop_last_batch: True
    shuffle: True
    to_float16: True

eval_input:
    data_dir: "./data/mnist/val"
    batch_size: 128
    drop_last_batch: True
    to_float16: True

model:
    name: "fc_mnist"
    mixed_precision: True
    depth: 10
    hidden_size: 50
    dropout: 0.0
    activation_fn: "relu"

optimizer:
    optimizer_type: "SGD"
    learning_rate: 0.001
    momentum: 0.9
    loss_scaling_factor: 1.0

runconfig:
    max_steps: 10000
    checkpoint_steps: 2000
    log_steps: 50
    seed: 1
    save_losses: True

Additional functionality

Logging

By default, the run function logs training information to the console and to TensorBoard.

  • Console logging: Given the frequency defined by log_steps in the runconfig section of the params YAML file, the training displays the step number, the current loss, the number of samples per second, and the current time. As an example:

| Train Device=xla:0 Step=2 Loss=0.00000 Rate=361.86 GlobalRate=361.53 Time=08:29:00

  • TensorBoard logging: A TensorBoard SummaryWriter is created. It contains information such as the loss and samples per second. This information is stored inside the model_dir/{mode} directory.
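
To view these summaries, you can point TensorBoard at the corresponding directory; the path below is illustrative:

tensorboard --logdir model_dir/train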

Evaluation metrics

The Cerebras Reference Implementations use a base class called CBMetric to compute evaluation metrics. Metrics already defined in the Cerebras Reference Implementations can be imported as follows:

from cerebras_reference_implementations.common.pytorch.metrics import (
    AccuracyMetric,
    FBetaScoreMetric,
    PerplexityMetric,
    RougeScoreMetric,
)

As an example, the BERT implementation in PyTorch (cerebras_reference_implementations/transformers/pytorch/bert/model.py) uses some of these metrics.

How to use evaluation metrics

  1. Registration: All metrics must be registered with the corresponding PyTorchBaseModel class. This is done automatically when the CBMetric object is constructed. That is, to register a metric to a PyTorchBaseModel class, construct the metric object in the PyTorchBaseModel class's constructor (see the sketch after this list).

  2. Update: The metrics are stateful. This means that every call to the metric object with the appropriate arguments automatically computes the latest metric value and saves it in the metric's internal state.

  3. Logging: At the very end of the run, the final metric values are computed and then logged both to the console and to the TensorBoard SummaryWriter.
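
The sketch below shows how a metric might be registered and updated inside a PyTorchBaseModel subclass, extending the FC_MNIST example above. The class name MNISTModelWithMetrics, the metric name string, and the labels/predictions keyword arguments are assumptions; check the metric implementations in common/pytorch/metrics for the exact signatures.

import torch.nn as nn

from cerebras_reference_implementations.common.pytorch.PyTorchBaseModel import PyTorchBaseModel
from cerebras_reference_implementations.common.pytorch.metrics import AccuracyMetric
from fc_mnist import MNIST

class MNISTModelWithMetrics(PyTorchBaseModel):
    def __init__(self, params, device=None):
        self.model = MNIST(params["model"].copy())
        self.loss_fn = nn.NLLLoss()
        # Registration: constructing the metric in the constructor registers
        # it with this PyTorchBaseModel instance (name argument is assumed).
        self.accuracy = AccuracyMetric(name="eval/accuracy")
        super().__init__(params=params, model=self.model, device=device)

    def __call__(self, data):
        inputs, labels = data
        outputs = self.model(inputs)
        loss = self.loss_fn(outputs, labels)
        # Update: each call stores the latest metric value in the metric's
        # internal state; it is logged at the end of the run.
        predictions = outputs.argmax(dim=1).int()
        self.accuracy(labels=labels, predictions=predictions)
        return loss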

More on evaluation metrics

The implementation of the CBMetric class can be found at https://github.com/Cerebras/cerebras_reference_implementations/blob/master/common/pytorch/metrics/cb_metric.py. The CBMetric class is a base class for creating metrics on CS devices. Subclasses must override its methods to provide the full functionality of the metric. These methods are meant to split the computation graph into two portions:

  1. update_on_device: Compiles and runs on the device (i.e., CS system).

  2. update_on_host: Runs on the host (i.e., CPU).

These metrics also support running on CPU and GPU.