Train with dynamic loss scaling#

Overview#

When you train a neural network with mixed precision, you can obtain a considerable speedup in the training computation and reduce the memory bandwidth required. Mixed-precision training uses half-precision floating point (FP16) for computations and FP32 (also called single precision or full precision) for storing the information.

However, simply converting data from FP32 to FP16 results in increased training loss and, eventually, training divergence. This is because during the backward pass, most loss gradient values are too small to be represented in FP16, so they become 0 when converted from FP32 to FP16.
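The underflow described above can be reproduced with the Python standard library alone. This is a minimal sketch: `to_fp16` is a hypothetical helper (not part of any Cerebras API) that rounds a value to IEEE FP16 using the `struct` module's half-precision format, and the gradient magnitudes are made-up values chosen for illustration.

```python
import struct


def to_fp16(x: float) -> float:
    """Round x to the nearest IEEE FP16 value and return it as a Python float."""
    # "e" is the struct format character for IEEE 754 half precision.
    return struct.unpack("<e", struct.pack("<e", x))[0]


small_gradient = 1e-8   # hypothetical gradient magnitude, below the FP16 range
larger_gradient = 1e-3  # hypothetical gradient magnitude, inside the FP16 range

print(to_fp16(small_gradient))   # underflows to zero
print(to_fp16(larger_gradient))  # stays close to 0.001
```

The smallest positive value FP16 can represent is about 6e-8, so any gradient well below that is rounded to zero and contributes nothing to the weight update.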

The solution to this problem is to:

Scale up the loss values before starting the backpropagation computations of loss gradients, and
Unscale the weight gradients before the weight update begins, in order to maintain the magnitude of the updates.

This process is called loss scaling and it helps to preserve small gradient values.
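The two steps can be sketched numerically using only the Python standard library. This is an illustration under stated assumptions, not the CS system's implementation: `to_fp16` is a hypothetical helper that rounds to IEEE FP16, the scale factor `2.0 ** 15` and the gradient value are chosen only for the example, and multiplying the gradient directly stands in for scaling the loss (by the chain rule, scaling the loss by a constant scales every gradient by the same constant).

```python
import struct


def to_fp16(x: float) -> float:
    """Round x to the nearest IEEE FP16 value and return it as a Python float."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


loss_scale = 2.0 ** 15   # illustrative scale factor (assumption)
true_gradient = 1e-8     # hypothetical gradient that is too small for FP16

# Without scaling, the gradient is lost when stored in FP16.
unscaled_fp16 = to_fp16(true_gradient)

# With scaling, the gradient lands inside the FP16 range...
scaled_fp16 = to_fp16(true_gradient * loss_scale)

# ...and is unscaled in higher precision before the weight update.
recovered = scaled_fp16 / loss_scale

print(unscaled_fp16)  # zero: the information is gone
print(recovered)      # close to the original 1e-8
```

Because the unscaling divides by the same factor, the weight update keeps its original magnitude, up to FP16 rounding error.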

While you can choose the scaling factor manually, it often takes several rounds of experimentation to find the correct loss scale for your network. To simplify this process, the CS system supports dynamic loss scaling (DLS) during training. DLS automatically determines an appropriate loss scale for your network, so you can enable mixed-precision training with just a few lines of code.
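To give an intuition for how a loss scale can be adjusted automatically, here is a minimal sketch of a commonly used scheme: halve the scale and skip the step when a gradient overflows, and double the scale after a run of clean steps. The `DynamicLossScale` class is hypothetical and is not the CS system's implementation; its argument names are borrowed from the configuration parameters on this page only for readability, and the factor of 2 and all numeric values are assumptions.

```python
import math


class DynamicLossScale:
    """Illustrative loss-scale controller (hypothetical, for explanation only)."""

    def __init__(self, initial_loss_scale, steps_per_increase,
                 min_loss_scale, max_loss_scale):
        self.scale = initial_loss_scale
        self.steps_per_increase = steps_per_increase
        self.min_loss_scale = min_loss_scale
        self.max_loss_scale = max_loss_scale
        self.good_steps = 0  # consecutive steps without overflow

    def update(self, grads):
        """Adjust the scale; return True if the weight update should be applied."""
        if any(not math.isfinite(g) for g in grads):
            # Overflow: the scale was too large, so shrink it and skip this step.
            self.scale = max(self.scale / 2.0, self.min_loss_scale)
            self.good_steps = 0
            return False
        self.good_steps += 1
        if self.good_steps >= self.steps_per_increase:
            # A run of clean steps: try a larger scale to protect small gradients.
            self.scale = min(self.scale * 2.0, self.max_loss_scale)
            self.good_steps = 0
        return True


scaler = DynamicLossScale(initial_loss_scale=1024.0, steps_per_increase=2,
                          min_loss_scale=1.0, max_loss_scale=4096.0)
applied = [
    scaler.update([0.1, float("inf")]),  # overflow: scale halves, step skipped
    scaler.update([0.1, 0.2]),           # clean step
    scaler.update([0.1, 0.2]),           # second clean step: scale doubles
]
print(applied, scaler.scale)
```

In this sketch the scale settles just below the point where gradients overflow, which is what makes manual tuning unnecessary.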

Supported precision#

For the precision formats supported on the CS system, see Control numerical precision level. The CS system supports training in mixed precision with either IEEE FP16 or CB16. CB16 is Cerebras’ 16-bit floating point format.

Note

We recommend that you use mixed precision in your models while training on the Cerebras Wafer-Scale Cluster.

Configuration parameters#

Dynamic loss scaling is supported for PyTorch. It is configurable via the cbtorch.amp.GradScaler module. The following are the supported configuration parameters:

loss_scale: Must be "dynamic" for dynamic loss scaling.
initial_loss_scale: Default value: 2e15.
steps_per_increase: Default value: 2000.
min_loss_scale: Default value: 2e-14.
max_loss_scale: Default value: 2e15.
max_gradient_norm: For dynamic loss scaling with global gradient clipping.

Procedure#

The above parameters can be passed in via the amp.GradScaler constructor. For example:

from cerebras.framework.torch import amp

scaler = amp.GradScaler(
    loss_scale="dynamic",
    # The parameters below apply to DLS (loss_scale == "dynamic")
    initial_loss_scale=2e15,
    steps_per_increase=2000,
    min_loss_scale=2e-14,
    max_loss_scale=2e15,
    max_gradient_norm=...,
)

The GradScaler wraps the loss Tensor and scales it before the backward pass occurs:

from cerebras.framework.torch import amp

...
scaler = amp.GradScaler(...)
...

for inputs in dataloader:
    loss = model(inputs)
    scaler(loss).backward()

Conclusion#

Dynamic loss scaling (DLS) addresses the precision loss and training divergence that can occur when training neural networks with mixed precision. By automatically adjusting the loss scale, DLS preserves small gradient values, enabling faster and more memory-efficient training without sacrificing accuracy. The CS system supports DLS through the cbtorch.amp.GradScaler module, so you can benefit from mixed-precision training with minimal setup.
