Porting PyTorch Model to CS¶
Option 1 (Easiest): Modify reference models in Cerebras Reference Implementations¶
The Cerebras Reference Implementations repository contains PyTorch implementations of popular neural networks such as BERT, GPT-2, and T5. These implementations are modularized to separate data preprocessing, model implementation, and the additional functions needed for execution.
If your primary goal is to use one of these models, even with some model or data preprocessing changes, we recommend starting from the Cerebras Reference Implementations version and adding the changes you need.
Example 1: Changing the data loader¶
For this example, we work with the PyTorch implementation of FC_MNIST in the Cerebras Reference Implementations. We create a synthetic dataloader to evaluate the performance of the network with respect to different input sizes and numbers of classes.
In data.py, we create a function called get_random_dataloader that generates random images and labels. We instrument the function so that the number of examples, the batch size, the seed, the image size, and the number of classes of this dataset can be specified in the params.yaml file.
import torch
import numpy as np

def get_random_dataloader(input_params, shuffle, num_classes):
    num_examples = input_params.get("num_examples")
    batch_size = input_params.get("batch_size")
    seed = input_params.get("seed", 1)
    image_size = input_params.get("image_size", [1, 28, 28])

    np.random.seed(seed)
    image = np.random.random(size=[num_examples,] + image_size).astype(np.float32)
    label = np.random.randint(low=0, high=num_classes, size=num_examples).astype(np.int32)

    dataset = torch.utils.data.TensorDataset(
        torch.from_numpy(image),
        torch.from_numpy(label),
    )
    return torch.utils.data.DataLoader(
        dataset,
        batch_size=batch_size,
        shuffle=shuffle,
        num_workers=input_params.get("num_workers", 0),
    )

def get_train_dataloader(params):
    return get_random_dataloader(
        params["train_input"],
        params["train_input"].get("shuffle"),
        params["model"].get("num_classes"),
    )

def get_eval_dataloader(params):
    return get_random_dataloader(
        params["eval_input"],
        False,
        params["model"].get("num_classes"),
    )
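For instance, the training dataloader can be exercised locally with a params dictionary that mirrors the YAML below (this dictionary is hypothetical, written only to illustrate the call):

import torch

params = {
    "train_input": {
        "num_examples": 1000,
        "batch_size": 128,
        "seed": 123,
        "image_size": [1, 28, 28],
        "shuffle": True,
    },
    "model": {"num_classes": 10},
}
loader = get_train_dataloader(params)
images, labels = next(iter(loader))
print(images.shape)  # torch.Size([128, 1, 28, 28])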
In model.py, we change the fixed number of classes to a parameter in the params.yaml file.
class MNIST(nn.Module):
    def __init__(self, model_params):
        super().__init__()
        self.loss_fn = nn.NLLLoss()
        self.fc_layers = []
        input_size = model_params.get("input_size", 784)
        num_classes = model_params.get("num_classes", 10)
        ...
        self.last_layer = nn.Linear(input_size, num_classes)
        ...
In configs/params.yaml, we add the additional fields used in the dataloader and model definition.
train_input:
    batch_size: 128
    drop_last_batch: True
    num_examples: 1000
    seed: 123
    image_size: [1, 28, 28]
    shuffle: True
eval_input:
    data_dir: "./data/mnist/val"
    batch_size: 128
    num_examples: 1000
    drop_last_batch: True
    seed: 1234
    image_size: [1, 28, 28]
model:
    name: "fc_mnist"
    mixed_precision: True
    input_size: 784 # 1*28*28
    num_classes: 10
...
Option 2 (Easy): Create new models leveraging the Cerebras run function available in Reference Implementations¶
All PyTorch implementations in the Cerebras Reference Implementations use a common harness to manage execution on CS systems and other hardware. This harness implements the code changes necessary to compile a model for a Cerebras system, run a compiled model on a Cerebras system, or run the model on CPU/GPU. It therefore provides a training/evaluation interface into which models and data preprocessing scripts can be plugged, without worrying about line-by-line modifications to make the code Cerebras-friendly.
If your primary goal is to develop new models and data preprocessing scripts, we suggest starting from the common backbone of the Cerebras Reference Implementations: the run function.
Prerequisites¶
To use the run function, you must have the version of the Cerebras Reference Implementations that is compatible with the release installed on the target CS system. The run function can be imported as
from cerebras_reference_implementations.common.pytorch.run_utils import run
All the code related to the run function lives inside the Cerebras Reference Implementations and can be found in the common/pytorch folder.
How to use the run function¶
The run function modularizes the model implementation, the data loaders, the hyperparameters, and the execution. To use the run function, you need:
- A params YAML file. This file will be used at runtime.
- An implementation that includes the following:
  - Model definition
  - Data loaders for training and evaluation
Code Skeleton¶
import os
import sys

import torch

# Append path to parent directory of Cerebras Reference Implementations
sys.path.append(os.path.join(os.path.dirname(__file__), ".."))
from cerebras_reference_implementations.common.pytorch.run_utils import run
from cerebras_reference_implementations.common.pytorch.PyTorchBaseModel import PyTorchBaseModel

# Step 1: Define Model
# Step 1.1: Define Module
class Model(torch.nn.Module):
    def __init__(self, params):
        ...

    def forward(self, inputs):
        ...
        return outputs

# Step 1.2: Define PyTorchBaseModel
class BaseModel(PyTorchBaseModel):
    def __init__(self, params, device=None):
        self.model = Model(params)
        self.loss_fn = ...
        ...
        super().__init__(params=params, model=self.model, device=device)

    def __call__(self, data):
        ...
        inputs, targets = data
        outputs = self.model(inputs)
        loss = self.loss_fn(outputs, targets)
        return loss

# Step 2: Define dataloaders
def get_train_dataloader(params):
    ...
    loader = torch.utils.data.DataLoader(...)
    return loader

def get_eval_dataloader(params):
    ...
    loader = torch.utils.data.DataLoader(...)
    return loader

# Step 3: Set up run function
def main():
    run(BaseModel, get_train_dataloader, get_eval_dataloader)

if __name__ == '__main__':
    main()
Step 1: Define Model¶
To define the model architecture, the run function requires a callable (either a class or a function) that takes as input a dictionary of params and returns a PyTorchBaseModel. To construct this callable you can:
- First, define the model architecture with torch.nn.Module.
- Then, wrap it by defining a PyTorchBaseModel. This class also takes care of defining the optimizer.
To customize the model, the run function creates a dictionary of params from the params YAML file such that:
- The model section defines architecture hyperparameters. (optional)
- The optimizer section defines learning rates and optimizer details. (required)
Creating a PyTorchBaseModel¶
A PyTorchBaseModel object is a light wrapper around a torch.nn.Module. The PyTorchBaseModel class configures the optimization parameters defined in the params YAML file. This class can be imported from the Cerebras Reference Implementations by
from cerebras_reference_implementations.common.pytorch.PyTorchBaseModel import PyTorchBaseModel
The implementation can be found at https://github.com/Cerebras/cerebras_reference_implementations/blob/master/common/pytorch/PyTorchBaseModel.py
Initializing a PyTorchBaseModel requires:

| Parameter | Type | Notes |
|---|---|---|
| `params` | `dict` | Dictionary constructed from the params YAML file |
| `model` | `torch.nn.Module` | Definition of the model architecture |
| `device` | `torch.device` | The default value is `None` |
In addition, any child class of PyTorchBaseModel must implement the __call__ function. Given one iteration of a dataloader as input, the __call__ function should return the loss associated with one forward pass of that batch.
Step 2: Define dataloaders¶
To define the data loaders, the run function requires a callable (either a class or a function) that takes as input a dictionary of params and returns a torch.utils.data.DataLoader. When running training, the train_data_fn must be provided. When running evaluation, the eval_data_fn must be provided.
Step 3: Set up the run function¶
The run function must be imported from cerebras_reference_implementations.common.pytorch.run_utils. Always remember to append the parent directory of the Cerebras Reference Implementations to the path.
All the input parameters of the run function are callables that take as input a dictionary called params. params is a dictionary containing all of the model and data parameters specified by the params YAML file of the model.
| Parameter | Type | Notes |
|---|---|---|
| `model_fn` | callable | Required. Takes in a dictionary of parameters; returns a `PyTorchBaseModel`. |
| `train_data_fn` | callable | Required during a training run. Takes in a dictionary of parameters; returns a `torch.utils.data.DataLoader`. |
| `eval_data_fn` | callable | Required during an evaluation run. Takes in a dictionary of parameters; returns a `torch.utils.data.DataLoader`. |
| `default_params_fn` | callable | Optional. Takes in a dictionary of parameters; sets default parameters. |
Manage common params for multiple experiments¶
To avoid replicating params across multiple similar experiments, the run function has an optional input parameter called default_params_fn. This parameter modifies the dictionary of the params YAML file, adding default values for unspecified params.
Setting up a default_params_fn can be beneficial if the user is planning multiple experiments in which only a small subset of the params YAML file changes. The default_params_fn sets up the values shared by all of the experiments, and the user can create different configuration YAML files that only address the changes between experiments.
The default_params_fn should be a callable that takes in the params dictionary and returns a new dictionary. If the default_params_fn is omitted, the params dictionary will be used as is.
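As a minimal sketch (the specific fields filled in here are hypothetical; only the callable's signature, a dictionary in and a dictionary out, is prescribed):

def default_params_fn(params):
    # Defaults shared across all experiments; experiment-specific
    # YAML files only override what differs. (Hypothetical fields.)
    params["model"].setdefault("dropout", 0.0)
    params["train_input"].setdefault("num_workers", 0)
    return params

# Passed as the optional fourth argument:
# run(BaseModel, get_train_dataloader, get_eval_dataloader, default_params_fn)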
Step 4: Create params YAML file¶
At runtime, the run function requires a separate params YAML file. This file is specified during execution with the --params flag in the command line.
Parameters skeleton:
train_input:
    ...
eval_input:
    ...
model:
    ...
optimizer:
    optimizer_type: ...
    learning_rate: ...
    loss_scaling_factor: ...
runconfig:
    max_steps: ...
    checkpoint_steps: ...
    log_steps: ...
    seed: ...
    save_losses: ...
The params YAML file has the following sections:
| Section | Required | Notes |
|---|---|---|
| `runconfig` | Yes | Used by run to set up logging and execution. It expects the fields shown in the skeleton above (`max_steps`, `checkpoint_steps`, `log_steps`, `seed`, `save_losses`). |
| `optimizer` | Yes | Used by `PyTorchBaseModel` to set up the optimizer. |
| `model` | No | By convention, it is used to customize the model architecture in `model_fn`. |
| `train_input` | No | By convention, it is used to customize `train_data_fn`. Fields are tailored to needs inside `train_data_fn`. |
| `eval_input` | No | By convention, it is used to customize `eval_data_fn`. Fields are tailored to needs inside `eval_data_fn`. |
Optimizer¶
A number of optimizer parameters can be used to configure the optimizer for the run. Currently, the only supported optimizers are SGD and AdamW. The optimizer type can be specified via the optimizer_type sub-parameter. Below are the required and optional params that can be used to configure them.
| Optimizer | Parameters | Descriptions |
|---|---|---|
| `SGD` | `learning_rate` | See "Learning Rate Scheduling" subsection. |
| | `momentum` | The momentum factor. |
| | `weight_decay` | Optional. Weight decay (L2 penalty). (Default: 0.0) |
| `AdamW` | `learning_rate` | See "Learning Rate Scheduling" subsection. |
| | `beta1` | Optional. Adam's first beta parameter. (Default: 0.9) |
| | `beta2` | Optional. Adam's second beta parameter. (Default: 0.999) |
| | `correct_bias` | Optional. Whether or not to correct bias in Adam. (Default: False) |
| | `exclude_from_weight_decay` | Parameters to exclude from weight decay. |

All of the above are sub-parameters of the top-level optimizer parameter. Refer to the Reference Implementations for examples of how to configure the optimizer.
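For example, an AdamW configuration might look like the following sketch (sub-parameter names as listed in the table above; the values are illustrative only):

optimizer:
    optimizer_type: "AdamW"
    learning_rate: 0.001
    beta1: 0.9
    beta2: 0.999
    correct_bias: False
    loss_scaling_factor: 1.0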
Learning Rate Scheduler¶
We also support various learning rate schedulers, configurable via the learning_rate sub-parameter. Valid configurations include the following:
| Scheduler | Parameters | Descriptions |
|---|---|---|
| Constant | | A floating point number specifying the learning rate to be used throughout. |
| PiecewiseConstant | `values` | The constant values to use. |
| | `boundaries` | The steps on which to change the learning rate values. |
| Linear | `initial_learning_rate` | The starting learning rate value. |
| | `end_learning_rate` | The final learning rate value. |
| | `steps` | The number of steps over which to transition from the starting learning rate to the final learning rate. |
| Exponential | `initial_learning_rate` | The starting learning rate value. |
| | `decay_steps` | The number of steps to decay the learning rate. |
| | `decay_rate` | The rate at which to decay the learning rate. |
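As an illustrative sketch only (the scheduler and field names follow the table above, and the nesting under learning_rate is an assumption; check the Reference Implementations for the exact schema):

optimizer:
    optimizer_type: "SGD"
    momentum: 0.9
    learning_rate:
        scheduler: "Linear"
        initial_learning_rate: 0.001
        end_learning_rate: 0.0001
        steps: 10000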
Loss scaling¶
We support static and dynamic loss scaling, configurable through the optimizer's sub-parameters:

| Parameter | Description |
|---|---|
| `loss_scaling_factor` | A constant scalar value configures static loss scaling. Passing in the string `"dynamic"` configures dynamic loss scaling. |
| `initial_loss_scale` | The initial loss scale value if `loss_scaling_factor` is `"dynamic"`. |
| `steps_per_increase` | The number of steps after which to increase the loss scaling condition. |
| `min_loss_scale` | The minimum loss scale value that can be chosen by dynamic loss scaling. |
| `max_loss_scale` | The maximum loss scale value that can be chosen by dynamic loss scaling. |
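For instance, switching the FC_MNIST example below from static to dynamic loss scaling would only change the optimizer section (a sketch, using the sub-parameter names above):

optimizer:
    optimizer_type: "SGD"
    learning_rate: 0.001
    momentum: 0.9
    loss_scaling_factor: "dynamic"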
Global Gradient Clipping¶
We support global gradient clipping by value or by norm, configurable through the optimizer's sub-parameters:

| Parameter | Description |
|---|---|
| `max_gradient_norm` | Max norm of the gradients. |
| `max_gradient_value` | Max value of the gradients. |

Note
The above sub-parameters are mutually exclusive; they cannot both be specified at the same time.
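A sketch of clipping by norm (the parameter name is taken from the table above):

optimizer:
    ...
    max_gradient_norm: 1.0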
Step 5: Execute script with run function¶
The run function is instrumented to parse the command line arguments.
Required arguments:

| Argument | Description |
|---|---|
| `--params` | Path to the .yaml file with model parameters. These parameters usually define the architecture of the model, details of data preprocessing, and logging frequencies. |
| `--mode` | Execution mode: train, eval, or train_and_eval. Depending on this value, the runner exercises either the training loop or the evaluation loop. |

Optional arguments:

| Argument | Description |
|---|---|
| `-h, --help` | Shows the help message and exits. |
| `--cs_ip` | Specifies the IP address of the Cerebras system and the port that the connection manager is listening on. If this argument is not provided, the runner checks whether a GPU is available and automatically uses it if so. |
| `--compile_only` | Enables the compile-only workflow. This compiles the model and creates executables for the Cerebras system but does not launch a job on the system. It is particularly useful to verify whether the model compiles without using system resources. |
| `--model_dir` | Model directory where checkpoints are written, along with all of the logs and artifacts generated by compilation and execution. |
| `--checkpoint_path` | Checkpoint to initialize weights from. This is useful to execute evaluation or to continue training from pretrained weights. |
| `--is_pretrained_checkpoint` | Flag indicating that the provided checkpoint is from a pre-training run. If set, training begins from step 0 after loading the matching weights from the checkpoint, ignoring any optimizer state present in the checkpoint. |
| `--logging` | Specifies the default logging level. Defaults to INFO. |
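For example, a local CPU/GPU training run might be launched as follows (the model directory path is illustrative):

python run.py --mode train --params params.yaml --model_dir ./model_dir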
Execution on a CS system¶
To execute the model on a CS system, we can use the instrumentation inside the run function.
Compile:
csrun_cpu python-pt run.py --mode <train,eval> --params params.yaml --compile_only --cs_ip <CS_IP:port>
Execute:
csrun_wse python-pt run.py --mode <train,eval> --params params.yaml --cs_ip <CS_IP:port>
Execution on different hardware¶
Using the run function enables the execution of the same code on a Cerebras system and on CPU/GPU without any changes. You can use the command line arguments to specify the type of device used for training and evaluation.
If you are interested in:

| Execution | Devices Available | Flags Needed |
|---|---|---|
| Compilation | Compilation is done on CPU and is only required when executing code on a Cerebras system. | `--compile_only --cs_ip <CS_IP:port>` |
| Training | Cerebras system | `--mode train --cs_ip <CS_IP:port>` |
| | CPU | `--mode train` |
| | GPU | `--mode train` |
| Evaluation | Cerebras system | `--mode eval --cs_ip <CS_IP:port>` |
| | CPU | `--mode eval` |
| | GPU | `--mode eval` |

When --cs_ip is not provided, the runner automatically uses a GPU if one is available; otherwise it runs on CPU.
Example using run with FC_MNIST¶
In this example, we port a PyTorch implementation of a fully connected dense neural network for the MNIST dataset to CS-friendly code using the run function.
Step 1: Define model¶
We construct an FC_MNIST implementation given the depth and the hidden size of the network. We assume that the input size is 784 and the last output dimension is 10. We use ReLU as the nonlinearity and a negative log likelihood loss.
In fc_mnist.py we define a child class of torch.nn.Module called MNIST.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MNIST(nn.Module):
    def __init__(self, model_params):
        super().__init__()
        self.fc_layers = []
        input_size = 784

        # Depth is len(hidden_sizes)
        model_params["hidden_sizes"] = [
            model_params["hidden_size"]
        ] * model_params["depth"]

        for hidden_size in model_params["hidden_sizes"]:
            fc_layer = nn.Linear(input_size, hidden_size)
            self.fc_layers.append(fc_layer)
            input_size = hidden_size
        self.fc_layers = nn.ModuleList(self.fc_layers)
        self.last_layer = nn.Linear(input_size, 10)

        self.nonlin = nn.ReLU()
        self.dropout = nn.Dropout(model_params["dropout"])

    def forward(self, inputs):
        x = torch.flatten(inputs, 1)
        for fc_layer in self.fc_layers:
            x = fc_layer(x)
            x = self.nonlin(x)
            x = self.dropout(x)
        pred_logits = self.last_layer(x)
        outputs = F.log_softmax(pred_logits, dim=1)
        return outputs
Then, in model.py, we create a child class of PyTorchBaseModel called MNISTModel that wraps the MNIST module. In MNISTModel, in addition to initialization, we implement two functions: build_model, to create a MNIST object, and __call__, to return the loss associated with one forward pass of a given dataloader iteration.
import torch
import torch.nn as nn

from cerebras_reference_implementations.common.pytorch.PyTorchBaseModel import PyTorchBaseModel
from fc_mnist import MNIST

class MNISTModel(PyTorchBaseModel):
    def __init__(self, params, device=None):
        self.params = params
        model_params = params["model"].copy()
        self.model = self.build_model(model_params)
        self.loss_fn = nn.NLLLoss()
        super().__init__(params=params, model=self.model, device=device)

    def build_model(self, model_params):
        dtype = torch.float32
        model = MNIST(model_params)
        model.to(dtype)
        return model

    def __call__(self, data):
        inputs, labels = data
        outputs = self.model(inputs)
        loss = self.loss_fn(outputs, labels)
        return loss
Step 2: Define dataloaders¶
In this example, we create two different functions for training and evaluation. We use torchvision.datasets functionality to download the MNIST dataset. Each of these functions returns a torch.utils.data.DataLoader.
In data.py:
import torch
from torchvision import datasets, transforms

def get_train_dataloader(params):
    input_params = params["train_input"]

    batch_size = input_params.get("batch_size")
    dtype = torch.float16 if input_params["to_float16"] else torch.float32
    shuffle = input_params["shuffle"]

    train_dataset = datasets.MNIST(
        input_params["data_dir"],
        train=True,
        download=True,
        transform=transforms.Compose(
            [
                transforms.ToTensor(),
                transforms.Normalize((0.1307,), (0.3081,)),
                transforms.Lambda(
                    lambda x: torch.as_tensor(x, dtype=dtype)
                ),
            ]
        ),
        target_transform=transforms.Lambda(
            lambda x: torch.as_tensor(x, dtype=torch.int32)
        ),
    )

    train_loader = torch.utils.data.DataLoader(
        train_dataset,
        batch_size=batch_size,
        drop_last=input_params["drop_last_batch"],
        shuffle=shuffle,
        num_workers=input_params.get("num_workers", 0),
    )
    return train_loader

def get_eval_dataloader(params):
    input_params = params["eval_input"]

    batch_size = input_params.get("batch_size")
    dtype = torch.float16 if input_params["to_float16"] else torch.float32

    eval_dataset = datasets.MNIST(
        input_params["data_dir"],
        train=False,
        download=True,
        transform=transforms.Compose(
            [
                transforms.ToTensor(),
                transforms.Normalize((0.1307,), (0.3081,)),
                transforms.Lambda(
                    lambda x: torch.as_tensor(x, dtype=dtype)
                ),
            ]
        ),
        target_transform=transforms.Lambda(
            lambda x: torch.as_tensor(x, dtype=torch.int32)
        ),
    )

    eval_loader = torch.utils.data.DataLoader(
        eval_dataset,
        batch_size=batch_size,
        drop_last=input_params["drop_last_batch"],
        shuffle=False,
        num_workers=input_params.get("num_workers", 0),
    )
    return eval_loader
Step 3: Set up run function¶
With all of the elements in place, we now import the run function from cerebras_reference_implementations.common.pytorch.run_utils and append the parent directory of the Cerebras Reference Implementations to the path.
In run.py:
import os
import sys

# Append path to parent directory of Cerebras Reference Implementations
sys.path.append(os.path.join(os.path.dirname(__file__), ".."))
from cerebras_reference_implementations.common.pytorch.run_utils import run

from data import (
    get_train_dataloader,
    get_eval_dataloader,
)
from model import MNISTModel

def main():
    run(MNISTModel, get_train_dataloader, get_eval_dataloader)

if __name__ == '__main__':
    main()
Step 4: Set up params YAML file¶
We customize the fields in train_input, eval_input, and model to be used inside get_train_dataloader, get_eval_dataloader, and MNISTModel. We also specify the required optimizer and runconfig params.
In params.yaml:
train_input:
    data_dir: "./data/mnist/train"
    batch_size: 128
    drop_last_batch: True
    shuffle: True
    to_float16: True
eval_input:
    data_dir: "./data/mnist/val"
    batch_size: 128
    drop_last_batch: True
    to_float16: True
model:
    name: "fc_mnist"
    mixed_precision: True
    depth: 10
    hidden_size: 50
    dropout: 0.0
    activation_fn: "relu"
optimizer:
    optimizer_type: "SGD"
    learning_rate: 0.001
    momentum: 0.9
    loss_scaling_factor: 1.0
runconfig:
    max_steps: 10000
    checkpoint_steps: 2000
    log_steps: 50
    seed: 1
    save_losses: True
Additional functionality¶
Logging¶
By default, the run function logs training information to the console and to TensorBoard.
Console logging: At the frequency defined by log_steps in the runconfig section of the params YAML file, the training run displays the step number, the current loss, the number of samples per second, and the current time. As an example:
| Train Device=xla:0 Step=2 Loss=0.00000 Rate=361.86 GlobalRate=361.53 Time=08:29:00
TensorBoard logging: A TensorBoard SummaryWriter is created. It contains information such as the loss and samples per second, and is stored inside the model_dir/{mode} directory.
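Those logs can then be inspected with TensorBoard; for instance (assuming the default model directory layout described above):

tensorboard --logdir model_dir/train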
Evaluation metrics¶
The Cerebras Reference Implementations use a base class called CBMetric to compute evaluation metrics. Metrics already defined in the Cerebras Reference Implementations can be imported as:
from cerebras_reference_implementations.common.pytorch.metrics import (
    AccuracyMetric,
    FBetaScoreMetric,
    PerplexityMetric,
    RougeScoreMetric,
)
As an example, the BERT implementation in PyTorch (cerebras_reference_implementations/transformers/pytorch/bert/model.py) uses some of these metrics.
How to use evaluation metrics¶
- Registration: All metrics must be registered with the corresponding PyTorchBaseModel class. This is done automatically when the CBMetric object is constructed. That is, to register a metric to a PyTorchBaseModel class, construct the metric object in the PyTorchBaseModel class' constructor.
- Update: The metrics are stateful. Every call to the metric object with the appropriate arguments automatically computes the latest metric value and saves it in the metric's internal state.
- Logging: At the very end of the run, the final metric values are computed and then logged both to the console and to the TensorBoard SummaryWriter.
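As a sketch of this pattern, extending the MNISTModel from the example above (the metric name and keyword arguments here are assumptions; check the metric implementations for the exact call signature):

from cerebras_reference_implementations.common.pytorch.metrics import AccuracyMetric

class MNISTModel(PyTorchBaseModel):
    def __init__(self, params, device=None):
        ...
        # Constructing the metric inside the constructor registers it
        # with this PyTorchBaseModel automatically.
        self.accuracy_metric = AccuracyMetric(name="eval/accuracy")
        ...

    def __call__(self, data):
        inputs, labels = data
        outputs = self.model(inputs)
        loss = self.loss_fn(outputs, labels)
        # Each call updates the metric's internal state; the final value
        # is computed and logged at the end of the run.
        self.accuracy_metric(labels=labels, predictions=outputs.argmax(-1))
        return loss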
More on evaluation metrics¶
The implementation of the CBMetric class can be found at https://github.com/Cerebras/cerebras_reference_implementations/blob/master/common/pytorch/metrics/cb_metric.py.
The CBMetric class is a base class for creating metrics on CS devices. Subclasses must override methods to provide the full functionality of the metric. These methods are meant to split the computation graph into two portions:
- update_on_device: Compiles and runs on the device (i.e., the CS system).
- update_on_host: Runs on the host (i.e., CPU).
These metrics also support running on CPU and GPU.