PyTorch Runners#
The run function that was described in Porting PyTorch Model to CS exists as a wrapper around the PyTorch runners. The run function’s true purpose is to act as an interface between the user and the PyTorchBaseRunner.
The PyTorchBaseRunner is, as the name suggests, the base runner class. It contains all of the common elements required to train or evaluate any PyTorch model. There are a number of PyTorchBaseRunner subclasses that customize the behavior of the run to fit the needs of the specific run being conducted.
PyTorchRunner
: ThePyTorchBaseRunner
subclass that facilitates CPU/GPU runsPyTorchCSCompiler
: ThePyTorchBaseRunner
subclass that facilitates compiling a PyTorch model for the Cerebras systemPyTorchCSRunner
: ThePyTorchBaseRunner
subclass that facilitates executing a training/evaluation run of a PyTorch model on a Cerebras system
The run function’s job is to parse the command line arguments, configure the appropriate runner, and start the actual training or evaluation run. However, the bulk of the training/evaluation loop code lives inside the PyTorchBaseRunner.
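The division of labor described above can be sketched roughly as follows. This is only an illustration built from stand-in classes: the class names, the `--mode` and `--target` flags, and the `run_sketch` function are all hypothetical and are not the actual modelzoo API.

```python
import argparse


class BaseRunnerSketch:
    """Stand-in for PyTorchBaseRunner: owns the shared train/eval loop."""

    def __init__(self, params):
        self.params = params

    def train(self):
        return f"{type(self).__name__}: train"

    def evaluate(self):
        return f"{type(self).__name__}: eval"


class CpuGpuRunnerSketch(BaseRunnerSketch):
    """Stand-in for PyTorchRunner (CPU/GPU runs)."""


class CsRunnerSketch(BaseRunnerSketch):
    """Stand-in for PyTorchCSRunner (runs on a Cerebras system)."""


def run_sketch(argv):
    # 1. Parse the command line (flag names are hypothetical).
    parser = argparse.ArgumentParser()
    parser.add_argument("--mode", choices=["train", "eval"], required=True)
    parser.add_argument("--target", choices=["cpu_gpu", "cs"], default="cpu_gpu")
    args = parser.parse_args(argv)

    # 2. Configure the appropriate runner.
    runner_cls = CsRunnerSketch if args.target == "cs" else CpuGpuRunnerSketch
    runner = runner_cls(params={"mode": args.mode})

    # 3. Start the run; the loop itself lives in the base class.
    return runner.train() if args.mode == "train" else runner.evaluate()
```

The wrapper only selects and starts a runner; everything that happens during the run is inherited from the base class.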
Creating Custom PyTorch Runners (Advanced)#
We highly recommend using the run function as provided. However, if it is insufficient for your needs, you can customize the existing PyTorch runners to add behaviors such as custom logging and performance profiling.
All custom runners should subclass either the PyTorchBaseRunner or one of its subclasses. We recommend the latter, as the subclasses are already tailored to their use cases.
A number of hooks are provided that can be overridden in the respective subclass to customize the runs:
| | A method that is run at the very beginning of training, right before the training loop is entered. |
| | A method that is run at the very end of training, right after the training loop is finished (but only if the training loop finished successfully and without exceptions). |
| | A method that is run at the very beginning of evaluation, right before the evaluation loop is entered. |
| | A method that is run at the very end of evaluation, right after the evaluation loop is finished (but only if the evaluation loop finished successfully and without exceptions). |
| | A method that is run at the very beginning of each training epoch, right before the training epoch is started. |
| | A method that is run at the very end of each training epoch, right after the epoch ends (but only if the epoch finished successfully and without exceptions). |
| | A method that is run at the very beginning of each evaluation epoch, right before the evaluation epoch is started. |
| | A method that is run at the very end of each evaluation epoch, right after the epoch ends (but only if the epoch finished successfully and without exceptions). |
| | A method that is run at the very beginning of each training step, right before the forward/backward pass and optimizer step are performed. |
| | A method that is run at the very end of each training step, right after the optimizer step is performed. |
| | A method that is run at the very beginning of each evaluation step, right before the forward pass. |
| | A method that is run at the very end of each evaluation step, right after the forward pass. |
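To make the ordering of the training hooks concrete, here is a minimal sketch of a loop that fires them in the sequence described. The class is a stand-in, not the real PyTorchBaseRunner, and the hook method names used here are assumptions chosen for illustration; consult the PyTorchBaseRunner source for the actual names.

```python
class HookOrderSketch:
    """Stand-in runner that records when each training hook fires."""

    def __init__(self):
        self.calls = []

    # Hypothetical hook names; each one just records that it was called.
    def on_train_start(self):
        self.calls.append("train_start")

    def on_train_end(self):
        self.calls.append("train_end")

    def on_train_epoch_start(self):
        self.calls.append("epoch_start")

    def on_train_epoch_end(self):
        self.calls.append("epoch_end")

    def on_train_batch_start(self):
        self.calls.append("step_start")

    def on_train_batch_end(self):
        self.calls.append("step_end")

    def train(self, num_epochs, steps_per_epoch):
        self.on_train_start()
        for _ in range(num_epochs):
            self.on_train_epoch_start()
            for _ in range(steps_per_epoch):
                self.on_train_batch_start()
                # Forward/backward pass and optimizer step would go here.
                self.on_train_batch_end()
            # Reached only if the epoch raised no exception.
            self.on_train_epoch_end()
        # Reached only if the whole loop raised no exception.
        self.on_train_end()
        return self.calls
```

The evaluation hooks would follow the same nesting, with a forward pass only in place of the full training step.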
Note
Many of these hooks already have overrides defined in the PyTorchBaseRunner, PyTorchRunner, PyTorchCSCompiler, and PyTorchCSRunner. We highly recommend calling the superclass’s implementation of the hook in any override to prevent unexpected failures.
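As an illustration of this advice, the sketch below adds simple wall-clock profiling to a runner while still calling the parent's hooks. `BaseRunnerStandIn` is a placeholder for whichever predefined runner you subclass, and the hook names `on_train_start` and `on_train_end` are assumed for the example rather than taken from the text above.

```python
import time


class BaseRunnerStandIn:
    """Stand-in for a predefined runner whose hooks already do work."""

    def __init__(self):
        self.events = []

    def on_train_start(self):  # hypothetical hook name
        self.events.append("base setup")

    def on_train_end(self):  # hypothetical hook name
        self.events.append("base teardown")


class TimedRunner(BaseRunnerStandIn):
    """Custom runner that measures how long training takes."""

    def on_train_start(self):
        # Run the parent's hook first so its setup is not skipped.
        super().on_train_start()
        self._start = time.perf_counter()

    def on_train_end(self):
        self.elapsed = time.perf_counter() - self._start
        self.events.append("logged elapsed time")
        # Let the parent finish its own end-of-training work.
        super().on_train_end()
```

Omitting either `super()` call would silently drop the parent's behavior, which is the kind of failure the note warns about.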
Warning
We do not recommend overriding any PyTorchBaseRunner method other than these hooks. An incorrect override of another method is likely to cause failures in either compilation or execution.
Using Custom Runner#
To use a custom PyTorch runner, you must modify the PyTorchBaseRunner’s create static method in https://github.com/Cerebras/modelzoo/blob/main/modelzoo/common/pytorch/pytorch_base_runner.py to make use of your custom runner instead of the predefined ones.
Note
It is important that only the runner initialization is modified. The create method contains other components that are integral to configuring the run and ensuring that it works on the Cerebras system.
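The shape of such a change might look like the sketch below. Everything here is hypothetical: the `create` signature, the `use_cs` argument, and the class names are invented for illustration and do not reflect the real method in pytorch_base_runner.py. The point is only that the line choosing the runner class changes, while the surrounding configuration is left as it was.

```python
class BaseRunnerStandIn:
    """Stand-in for PyTorchBaseRunner with a simplified create method."""

    def __init__(self, model, params):
        self.model = model
        self.params = params

    @staticmethod
    def create(model, params, use_cs=False):
        # Existing configuration steps stay untouched (placeholder here).
        configured = dict(params, configured=True)

        # The only modified part: which runner class is instantiated.
        runner_cls = MyCsRunner if use_cs else MyCpuGpuRunner
        return runner_cls(model, configured)


class MyCpuGpuRunner(BaseRunnerStandIn):
    """Hypothetical custom runner for CPU/GPU runs."""


class MyCsRunner(BaseRunnerStandIn):
    """Hypothetical custom runner for Cerebras system runs."""
```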