Dynamic Loss Scaling

Concept

Training a neural network in mixed precision can considerably speed up the training computation and reduce the memory bandwidth required. Mixed-precision training combines half-precision floating point (FP16) for computations with FP32 (also called single precision or full precision) for storing information.

However, simply converting data from FP32 to FP16 results in increased training loss and, eventually, training divergence. This is because, during the backward pass, the loss gradient values undergo FP32-to-FP16 conversion, and most of them are too small to be represented in FP16, so they become 0.
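The underflow described above can be reproduced with a minimal sketch in plain Python. It assumes IEEE FP16 and uses the standard library's `struct` half-precision format (`"e"`) for the conversion; the helper name `to_fp16` and the gradient value are hypothetical and chosen only for illustration.

```python
import struct

def to_fp16(x: float) -> float:
    """Round x to the nearest IEEE FP16 value (returned as a Python float)."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

# Hypothetical gradient magnitude, smaller than FP16 can represent.
small_grad = 1e-8

# The smallest positive FP16 value is 2**-24 (about 6e-8), so this underflows to 0.0.
converted = to_fp16(small_grad)

# The same value multiplied by a large factor survives the conversion.
converted_scaled = to_fp16(small_grad * 2.0 ** 16)

print(converted, converted_scaled)
```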

The solution to this problem is to:

  • Scale up the loss values before starting the backpropagation computations of the loss gradients, and

  • Unscale the weight gradients before the weight update begins, in order to maintain the magnitude of the updates.

This process is called loss scaling and it helps to preserve small gradient values.
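The two steps above can be sketched as a toy example in plain Python. This is an illustration under stated assumptions, not the CS system's implementation: the "backward pass" is reduced to a single gradient value, the names `to_fp16`, `backward_in_fp16`, and `LOSS_SCALE` are hypothetical, and the scale of 2^16 is an arbitrary illustrative choice.

```python
import struct

def to_fp16(x: float) -> float:
    """Round x to the nearest IEEE FP16 value (returned as a Python float)."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def backward_in_fp16(true_grad: float, loss_scale: float = 1.0) -> float:
    """Toy backward pass: return the gradient recovered after an FP16 round trip."""
    # Chain rule: multiplying the loss by loss_scale multiplies every gradient by it.
    scaled_grad_fp16 = to_fp16(true_grad * loss_scale)
    # Unscale in higher precision before the weight update.
    return scaled_grad_fp16 / loss_scale

LOSS_SCALE = 2.0 ** 16  # illustrative value, not a recommendation
true_grad = 1e-8        # hypothetical small gradient

unscaled = backward_in_fp16(true_grad)            # lost: underflows to 0.0
scaled = backward_in_fp16(true_grad, LOSS_SCALE)  # preserved, close to 1e-8

weight, lr = 0.5, 0.1
weight -= lr * scaled  # the update uses the unscaled gradient, so its magnitude is unchanged
```

Because the gradient is divided by the same factor that multiplied the loss, the weight update has the magnitude it would have had in full precision. Only the intermediate FP16 values are shifted into a representable range.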

While you can choose the scaling factor manually, it often takes several rounds of experimentation to find the correct loss scale for your network. To simplify this process, the CS system supports dynamic loss scaling (DLS) during training. DLS will automatically determine an appropriate loss scale for your network, making it easy for you to enable mixed precision training with just a few lines of code.
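To give an intuition for how a loss scale can be found automatically, the following is a minimal sketch of a commonly used scheme: reduce the scale and skip the step when the gradients overflow, and increase the scale after a run of successful steps. This is an assumption about the general technique, not a description of how the CS system implements DLS; the class name, parameters, and default values are all hypothetical.

```python
import math

class DynamicLossScaler:
    """Toy dynamic loss scaler (hypothetical, for illustration only)."""

    def __init__(self, init_scale=2.0 ** 15, growth_factor=2.0,
                 backoff_factor=0.5, growth_interval=2000, min_scale=1.0):
        self.scale = init_scale
        self.growth_factor = growth_factor
        self.backoff_factor = backoff_factor
        self.growth_interval = growth_interval
        self.min_scale = min_scale
        self.good_steps = 0  # consecutive steps without overflow

    def update(self, grads) -> bool:
        """Adjust the scale from this step's gradients.

        Returns True if the weight update should be applied,
        False if the step should be skipped because of overflow.
        """
        if any(not math.isfinite(g) for g in grads):
            # Overflow (inf or NaN): the scale was too large, so back off and skip.
            self.scale = max(self.scale * self.backoff_factor, self.min_scale)
            self.good_steps = 0
            return False
        self.good_steps += 1
        if self.good_steps >= self.growth_interval:
            # A long run of clean steps: try a larger scale.
            self.scale *= self.growth_factor
            self.good_steps = 0
        return True

scaler = DynamicLossScaler(init_scale=1024.0, growth_interval=2)
applied = [scaler.update([float("inf")]),  # overflow: scale halves to 512.0
           scaler.update([0.1, -0.2]),     # clean step
           scaler.update([0.3])]           # second clean step: scale doubles to 1024.0
```

In this sketch the scale settles near the largest value that does not cause overflow, which is what removes the need for manual experimentation.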

Supported precision

For the precision formats supported on the CS system, see Data Formats. The CS system supports training in mixed precision with either IEEE FP16 or CB16. CB16 is Cerebras’ 16-bit floating-point format.

Note

We recommend that you use mixed precision in your models when training on the CS system.

DLS with TensorFlow

See TensorFlow Dynamic Loss Scaling.

DLS with PyTorch

See PyTorch Dynamic Loss Scaling.