Train using dynamic loss scaling#
Concept#
When you use mixed precision to train a neural network, you can obtain a considerable speedup in the training computation while requiring less memory bandwidth. Mixed-precision training uses a combination of half-precision floating point (FP16) for computation and FP32 (also called single precision or full precision) for storage.
However, simply converting data from FP32 to FP16 results in increased training loss and, eventually, training divergence. This is because during the backward-pass computation of the loss gradients, many of the small gradient values underflow to zero when they undergo FP32-to-FP16 conversion.
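To see why, note that the smallest positive FP16 subnormal is about 6e-8 (2^-24), and values much smaller than that round to zero. A minimal illustration in standard PyTorch (not CS-specific):

import torch

# A small but meaningful gradient value represented in FP32
g = torch.tensor(1e-8, dtype=torch.float32)

# Converting to FP16 underflows to zero, losing the gradient entirely
print(g.half())  # tensor(0., dtype=torch.float16)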
The solution to this problem is to:
1. Scale up the loss values before starting the backpropagation computation of the loss gradients, and
2. Unscale the weight gradients before the weight update begins, in order to maintain the magnitude of the updates.
This process is called loss scaling, and it helps preserve small gradient values.
While you can choose the scaling factor manually, it often takes several rounds of experimentation to find the correct loss scale for your network. To simplify this process, the CS system supports dynamic loss scaling (DLS) during training. DLS will automatically determine an appropriate loss scale for your network, making it easy for you to enable mixed precision training with just a few lines of code.
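As a rough illustration of how a dynamic loss scaler typically behaves (a sketch only, not the CS system's internal implementation): the scale is halved whenever the scaled gradients overflow, and raised again after a fixed number of overflow-free steps. The parameter names below mirror those documented for the PyTorch GradScaler later in this section.

class DynamicLossScaleSketch:
    """Illustrative dynamic loss scaling policy (not the CS implementation)."""

    def __init__(self, initial_loss_scale=2e15, steps_per_increase=2000,
                 min_loss_scale=2e-14, max_loss_scale=2e15):
        self.scale = initial_loss_scale
        self.steps_per_increase = steps_per_increase
        self.min_scale = min_loss_scale
        self.max_scale = max_loss_scale
        self.good_steps = 0

    def update(self, overflow):
        if overflow:
            # Overflow detected: halve the scale and skip this weight update
            self.scale = max(self.scale / 2, self.min_scale)
            self.good_steps = 0
        else:
            # Stable: after enough overflow-free steps, try a larger scale
            self.good_steps += 1
            if self.good_steps >= self.steps_per_increase:
                self.scale = min(self.scale * 2, self.max_scale)
                self.good_steps = 0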
Supported precision#
For the precision formats supported on the CS system, see Precision optimization level and data formats. The CS system supports training in mixed precision with either IEEE FP16 or CB16, which is Cerebras' 16-bit floating-point format.
Note
We recommend you use mixed-precision in your models while training on the CS system.
PyTorch Dynamic Loss Scaling#
Dynamic loss scaling is supported for PyTorch. It is configurable via the cbtorch.amp.GradScaler module. The following are the supported configuration parameters:
- loss_scale: Must be "dynamic" for dynamic loss scaling.
- initial_loss_scale: Default value: 2e15.
- steps_per_increase: Default value: 2000.
- min_loss_scale: Default value: 2e-14.
- max_loss_scale: Default value: 2e15.
- max_gradient_norm: For dynamic loss scaling with global gradient clipping.
These can be passed in via the amp.GradScaler constructor. For example:
from cerebras.framework.torch import amp

scaler = amp.GradScaler(
    loss_scale="dynamic",
    # DLS optimizer settings (loss_scale == "dynamic")
    initial_loss_scale=2e15,
    steps_per_increase=2000,
    min_loss_scale=2e-14,
    max_loss_scale=2e15,
    max_gradient_norm=...,
)
The GradScaler is used to wrap the loss tensor and scale it before the backward pass occurs:
from cerebras.framework.torch import amp
...
scaler = amp.GradScaler(...)
...
for inputs in dataloader:
    loss = model(inputs)
    scaler(loss).backward()
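For comparison, standard (non-CS) PyTorch exposes a similar dynamic loss scaling API through torch.cuda.amp.GradScaler, where the optimizer step and scale update are explicit. A minimal, self-contained sketch (the toy model and data are illustrative only; requires a CUDA device):

import torch
from torch import nn

model = nn.Linear(10, 1).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.MSELoss()
dataloader = [(torch.randn(8, 10).cuda(), torch.randn(8, 1).cuda())]

scaler = torch.cuda.amp.GradScaler()

for inputs, targets in dataloader:
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():
        loss = loss_fn(model(inputs), targets)
    scaler.scale(loss).backward()  # scale the loss before backprop
    scaler.step(optimizer)         # unscales gradients; skips step on inf/nan
    scaler.update()                # adjusts the loss scale dynamically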
TensorFlow Dynamic Loss Scaling#
To enable dynamic loss scaling (DLS) with TensorFlow, use the Trainer optimizer supported by the CS system.
Trainer#
The Trainer optimizer builds the train ops based on the given configuration parameters. This optimizer initializes several parameters that apply to DLS, such as the initial loss scaling factor and the number of steps before changing the loss scale factor. These settings are optimized for the CS system.
Parameters#
- params: Input. Datatype dict. Configuration parameters for the Trainer optimizer.
- tf_summary: Input. Datatype bool. The flag for summaries. Defaults to False.
- mixed_precision: Input. Datatype bool. The flag for mixed precision. Defaults to False.
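The exact schema of params is defined by the model configuration in the Cerebras Model Zoo; the dict below is a hypothetical illustration only, and its key names are assumptions rather than the authoritative schema:

# Hypothetical configuration; actual key names come from the
# Model Zoo model configuration in use.
params = {
    "optimizer": {
        "loss_scaling_factor": "dynamic",  # assumed key for enabling DLS
        "initial_loss_scale": 2e15,
        "steps_per_increase": 2000,
    },
    "training": {
        "mixed_precision": True,
    },
}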
Example#
The following is an example showing how to use the Trainer optimizer in your code. First, create an instance of the Trainer optimizer in the __init__(self) section of your model code:
# Model trainer
self.trainer = Trainer(
    params=params["optimizer"],
    tf_summary=tf_summary,
    mixed_precision=params["training"]["mixed_precision"],
)
Then build the train ops:
def build_train_ops(self, total_loss):
    """
    Set up the optimizer and build the train ops.
    """
    return self.trainer.build_train_ops(total_loss)
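In a typical TensorFlow Estimator setup, the returned train op then feeds the EstimatorSpec. A minimal sketch, assuming a model object built as in the example above; build_total_loss is a hypothetical helper standing in for your model's loss computation:

import tensorflow as tf

def model_fn(features, labels, mode, params):
    # model is assumed to be constructed as in the example above
    total_loss = model.build_total_loss(features, labels)  # hypothetical helper
    train_op = model.build_train_ops(total_loss)
    return tf.estimator.EstimatorSpec(
        mode=mode,
        loss=total_loss,
        train_op=train_op,
    )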
For more details on the CSDynamicLossScale and the Trainer optimizer, refer to the code in the Cerebras Model Zoo repository.
Note
To access the Python code for CSDynamicLossScale and the Trainer optimizer, you will need read permission for the Cerebras Model Zoo Git repository.
- The CSDynamicLossScale object in the Cerebras Graph Compiler (CGC) implements the dynamic loss scaling. See LossScale.py.
- This CSDynamicLossScale object is used by the Trainer optimizer. See Trainer.py.