.. _pytorch-custom-runner:

PyTorch Runners
===============

The :code:`run` function described in :ref:`porting-pytorch-to-cs` is a wrapper
around the PyTorch runners. Its purpose is to act as an interface between the
user and the :code:`PyTorchBaseRunner`.

The :code:`PyTorchBaseRunner` is, as the name suggests, the base runner class.
It contains all of the common elements required to train or evaluate any PyTorch
model. Several :code:`PyTorchBaseRunner` subclasses customize this behavior to
fit the needs of a specific kind of run:

- :code:`PyTorchRunner`: The :code:`PyTorchBaseRunner` subclass that facilitates
  CPU/GPU runs

- :code:`PyTorchCSCompiler`: The :code:`PyTorchBaseRunner` subclass that facilitates
  compiling a PyTorch model for the Cerebras system

- :code:`PyTorchCSRunner`: The :code:`PyTorchBaseRunner` subclass that facilitates
  executing a training/evaluation run of a PyTorch model on a Cerebras system

- :code:`PyTorchCSAppliance`: The :code:`PyTorchBaseRunner` subclass that
  facilitates compiling and executing a weight streaming training/evaluation run
  of a PyTorch model on a Cerebras system in appliance mode


The :code:`run` function's job is to parse the command line arguments,
configure the appropriate runner, and start the actual training or evaluation
run. However, the bulk of the training/evaluation loop code lives inside the
:code:`PyTorchBaseRunner`.
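
As a rough illustration of this flow, the sketch below parses arguments, picks a
runner class, and starts the run. It is not the actual :code:`run`
implementation: the stub classes, the :code:`run_sketch` function, and the
command line flags are hypothetical stand-ins used only to show the three steps.

.. code-block:: python

    import argparse


    class BaseRunnerStub:
        """Hypothetical stand-in for PyTorchBaseRunner (holds the shared loop code)."""

        def __init__(self, params):
            self.params = params

        def train(self):
            return f"{type(self).__name__}: train"

        def evaluate(self):
            return f"{type(self).__name__}: eval"


    class CpuGpuRunnerStub(BaseRunnerStub):
        """Hypothetical stand-in for a CPU/GPU runner subclass."""


    class CSRunnerStub(BaseRunnerStub):
        """Hypothetical stand-in for a Cerebras system runner subclass."""


    def run_sketch(argv):
        # 1. Parse the command line arguments (illustrative flags only).
        parser = argparse.ArgumentParser()
        parser.add_argument("--mode", choices=["train", "eval"], required=True)
        parser.add_argument("--target", choices=["cpu_gpu", "cs"], default="cpu_gpu")
        args = parser.parse_args(argv)

        # 2. Configure the appropriate runner subclass.
        runner_cls = {"cpu_gpu": CpuGpuRunnerStub, "cs": CSRunnerStub}[args.target]
        runner = runner_cls(params={"mode": args.mode})

        # 3. Start the run; the loop itself lives in the base class.
        return runner.train() if args.mode == "train" else runner.evaluate()


    result = run_sketch(["--mode", "train", "--target", "cs"])

The point of the sketch is the division of labor: the wrapper only selects and
configures a runner, while the training and evaluation logic stays in the base
class.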


Using Custom Models with the PyTorch Runners
--------------------------------------------

If you have a custom model and optimizer that you want to use with the PyTorch
runners, you can wrap them using :code:`PyTorchBaseModel`.

To create the wrapper, subclass :code:`PyTorchBaseModel` and assign the model
and optimizer accordingly:

.. code-block:: python

    import torch
    from modelzoo.common.pytorch.PyTorchBaseModel import PyTorchBaseModel

    class BaseModel(PyTorchBaseModel):
        def __init__(
            self,
            params: dict,
            model: torch.nn.Module,
            optimizer: torch.optim.Optimizer,
        ):
            self.custom_optimizer = optimizer
            self.loss_fn = ...

            super().__init__(params, model, device=None)

        def _configure_optimizer(self, params):
            """
            Override default optimizer configuration and return custom optimizer
            """
            return self.custom_optimizer

        def _configure_lr_scheduler(self, params):
            """
            Override default lr scheduler configuration and return custom
            lr scheduler
            """
            return None

        # An example implementation of the __call__ function
        def __call__(self, data):
            inputs, targets = data
            # Pass only the inputs to the model, not the (inputs, targets) pair
            outputs = self.model(inputs)
            loss = self.loss_fn(outputs, targets)
            return loss

With this :code:`BaseModel` definition, you can wrap it in a :code:`model_fn`
function, which can then be used with the common :code:`run` interface:

.. code-block:: python

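    import torch
    from modelzoo.common.pytorch.run_utils import run

    # Define your model and optimizer here
    model: torch.nn.Module = ...
    optimizer: torch.optim.Optimizer = ...

    def model_fn(params):
        return BaseModel(params, model, optimizer)

    # train_input_dataloader_fn and eval_input_dataloader_fn are assumed to be
    # defined elsewhere, as for any other model used with run
    run(
        model_fn,
        train_input_dataloader_fn,
        eval_input_dataloader_fn,
    )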

The common CLI arguments are still expected to be present, and they are parsed
and used in the run. The only difference is that the :code:`"optimizer"` key
from :code:`params.yaml` is not used to configure the optimizer and learning
rate scheduler.

If you are implementing your own custom optimizers and learning rate schedulers,
see the :ref:`cbtorch-limitations` page for the constraints your
implementations must satisfy.


Creating Custom PyTorch Runners (Advanced)
------------------------------------------

We highly recommend using the :code:`run` function as it is provided. However,
if it is insufficient for your needs, you can customize the existing PyTorch
runners to add behaviors such as custom logging and performance profiling.

All custom runners should subclass either :code:`PyTorchBaseRunner` or one of
its subclasses. We recommend the latter, as the subclasses are already tailored
to their use cases.

A number of hooks are provided that can be overridden in the respective
subclass to customize the run:

+------------------------------+--------------------------------------------------------------------------------+
| :code:`on_train_start`       | A method that is run at the very beginning of training, right before           |
|                              | the training loop is entered.                                                  |
+------------------------------+--------------------------------------------------------------------------------+
| :code:`on_train_end`         | A method that is run at the very end of training, right after                  |
|                              | the training loop is finished,                                                 |
|                              | (but only if the training loop finished successfully and without exceptions).  |
+------------------------------+--------------------------------------------------------------------------------+
| :code:`on_eval_start`        | A method that is run at the very beginning of evaluation,                      |
|                              | right before the evaluation loop is entered.                                   |
+------------------------------+--------------------------------------------------------------------------------+
| :code:`on_eval_end`          | A method that is run at the very end of evaluation, right after                |
|                              | the evaluation loop is finished,                                               |
|                              | (but only if the evaluation loop finished successfully and without exceptions).|
+------------------------------+--------------------------------------------------------------------------------+
| :code:`on_train_epoch_start` | A method that is run at the very beginning of each training epoch,             |
|                              | right before the training epoch is started.                                    |
+------------------------------+--------------------------------------------------------------------------------+
| :code:`on_train_epoch_end`   | A method that is run at the very end of each training epoch, right after       |
|                              | the epoch ends,                                                                |
|                              | (but only if the epoch finished successfully and without exceptions).          |
+------------------------------+--------------------------------------------------------------------------------+
| :code:`on_eval_epoch_start`  | A method that is run at the very beginning of each evaluation epoch,           |
|                              | right before the evaluation epoch is started.                                  |
+------------------------------+--------------------------------------------------------------------------------+
| :code:`on_eval_epoch_end`    | A method that is run at the very end of each evaluation epoch, right after     |
|                              | the epoch ends,                                                                |
|                              | (but only if the epoch finished successfully and without exceptions).          |
+------------------------------+--------------------------------------------------------------------------------+
| :code:`on_train_batch_start` | A method that is run at the very beginning of each training step,              |
|                              | right before the forward/backward pass and optimizer step are performed.       |
+------------------------------+--------------------------------------------------------------------------------+
| :code:`on_train_batch_end`   | A method that is run at the very end of each training step, right after        |
|                              | the optimizer step is performed.                                               |
+------------------------------+--------------------------------------------------------------------------------+
| :code:`on_eval_batch_start`  | A method that is run at the very beginning of each evaluation step,            |
|                              | right before the forward pass.                                                 |
+------------------------------+--------------------------------------------------------------------------------+
| :code:`on_eval_batch_end`    | A method that is run at the very end of each evaluation step, right after      |
|                              | the forward pass.                                                              |
+------------------------------+--------------------------------------------------------------------------------+

.. note::

    Many of these hooks already have overrides defined in the
    :code:`PyTorchBaseRunner`, :code:`PyTorchRunner`, :code:`PyTorchCSCompiler`, and
    :code:`PyTorchCSRunner`. We highly recommend calling the superclass's
    implementation of the hook to prevent any unexpected failures.

.. warning::

    We do not recommend overriding any method of :code:`PyTorchBaseRunner`
    other than these hooks. Overriding other methods incorrectly is likely to
    cause failures in either compilation or execution.
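
As an illustration of the hook pattern, the following sketch times each
training step by overriding :code:`on_train_batch_start` and
:code:`on_train_batch_end`. To stay self-contained, it uses a hypothetical
:code:`BaseRunnerStub` in place of a real runner; the stub's :code:`train`
method and the argument-free hook signatures are assumptions, so the overrides
forward :code:`*args, **kwargs` rather than guessing the real signatures.

.. code-block:: python

    import time


    class BaseRunnerStub:
        """Hypothetical stand-in for a runner; it only mimics the hook order."""

        def __init__(self):
            self.events = []

        def on_train_start(self, *args, **kwargs):
            self.events.append("on_train_start")

        def on_train_end(self, *args, **kwargs):
            self.events.append("on_train_end")

        def on_train_batch_start(self, *args, **kwargs):
            self.events.append("on_train_batch_start")

        def on_train_batch_end(self, *args, **kwargs):
            self.events.append("on_train_batch_end")

        def train(self, num_steps):
            self.on_train_start()
            for _ in range(num_steps):
                self.on_train_batch_start()
                # The forward/backward pass and optimizer step would run here.
                self.on_train_batch_end()
            self.on_train_end()


    class TimingRunner(BaseRunnerStub):
        """Custom runner that records the wall-clock time of each step."""

        def __init__(self):
            super().__init__()
            self.step_times = []
            self._step_start = None

        def on_train_batch_start(self, *args, **kwargs):
            # Call the parent hook first so its behavior is preserved.
            super().on_train_batch_start(*args, **kwargs)
            self._step_start = time.perf_counter()

        def on_train_batch_end(self, *args, **kwargs):
            self.step_times.append(time.perf_counter() - self._step_start)
            super().on_train_batch_end(*args, **kwargs)


    runner = TimingRunner()
    runner.train(num_steps=3)

With a real runner, the same overrides would be placed in a subclass of one of
the predefined runners, and the parent hook calls would keep the behavior those
runners rely on.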


Using a Custom Runner
---------------------

To use a custom PyTorch runner, you must modify the
:code:`PyTorchBaseRunner`'s :code:`create` static method in
`pytorch_base_runner.py <https://github.com/Cerebras/modelzoo/blob/master/modelzoo/common/pytorch/pytorch_base_runner.py>`_
to make use of your custom runner instead of the predefined ones.

.. note::

    It is important that only the runner initialization is modified. The other
    components of the :code:`create` method are integral to configuring the run
    and ensuring that it works on the Cerebras system.
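
The shape of that change might look like the sketch below. It is not the
contents of :code:`pytorch_base_runner.py`: :code:`BaseRunnerStub`,
:code:`MyCustomRunner`, and the :code:`create(model_fn, params)` signature are
hypothetical, and the configuration code that must stay untouched is omitted.

.. code-block:: python

    class BaseRunnerStub:
        """Hypothetical stand-in for PyTorchBaseRunner."""

        def __init__(self, model_fn, params):
            self.model = model_fn(params)
            self.params = params

        @staticmethod
        def create(model_fn, params):
            # In the real method, the surrounding configuration code stays
            # exactly as it is; it is left out of this sketch.

            # The only change: instantiate the custom runner instead of one
            # of the predefined runners.
            return MyCustomRunner(model_fn, params)


    class MyCustomRunner(BaseRunnerStub):
        """Hypothetical custom runner; hook overrides would go here."""


    runner = BaseRunnerStub.create(lambda params: "model", {"mode": "train"})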