Train using dynamic loss scaling#
Concept#
When you use mixed precision to train a neural network, you can obtain a considerable speedup in the training computation while requiring less memory bandwidth. Mixed-precision training uses a combination of half-precision floating point (FP16) for computation and FP32 (also called single precision or full precision) for storage.
However, simply converting data from FP32 to FP16 results in increased training loss and, eventually, training divergence. This is because during the backward-pass computation of the loss gradients, many of the small gradient values underflow to zero when they undergo FP32-to-FP16 conversion.
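To see why, note that the smallest positive FP16 subnormal is about 6e-8 (2^-24), and values much smaller than that round to zero. A minimal illustration in standard PyTorch (not CS-specific):

import torch

# A small but meaningful gradient value represented in FP32
g = torch.tensor(1e-8, dtype=torch.float32)

# Converting to FP16 underflows to zero, losing the gradient entirely
print(g.half())  # tensor(0., dtype=torch.float16)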
The solution to this problem is to:
1. Scale up the loss values before starting the backpropagation computation of the loss gradients, and
2. Unscale the weight gradients before the weight update begins, in order to maintain the magnitude of the updates.
This process is called loss scaling, and it helps preserve small gradient values.
While you can choose the scaling factor manually, it often takes several rounds of experimentation to find the correct loss scale for your network. To simplify this process, the CS system supports dynamic loss scaling (DLS) during training. DLS will automatically determine an appropriate loss scale for your network, making it easy for you to enable mixed precision training with just a few lines of code.
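As a rough illustration of how a dynamic loss scaler typically behaves (a sketch only, not the CS system's internal implementation): the scale is halved whenever the scaled gradients overflow, and raised again after a fixed number of overflow-free steps. The parameter names below mirror those documented for the PyTorch GradScaler later in this section.

class DynamicLossScaleSketch:
    """Illustrative dynamic loss scaling policy (not the CS implementation)."""

    def __init__(self, initial_loss_scale=2e15, steps_per_increase=2000,
                 min_loss_scale=2e-14, max_loss_scale=2e15):
        self.scale = initial_loss_scale
        self.steps_per_increase = steps_per_increase
        self.min_scale = min_loss_scale
        self.max_scale = max_loss_scale
        self.good_steps = 0

    def update(self, overflow):
        if overflow:
            # Overflow detected: halve the scale and skip this weight update
            self.scale = max(self.scale / 2, self.min_scale)
            self.good_steps = 0
        else:
            # Stable: after enough overflow-free steps, try a larger scale
            self.good_steps += 1
            if self.good_steps >= self.steps_per_increase:
                self.scale = min(self.scale * 2, self.max_scale)
                self.good_steps = 0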
Supported precision#
For the precision formats supported on the CS system, see Precision optimization level and data formats. The CS system supports training in mixed precision with either IEEE FP16 or CB16, which is Cerebras' 16-bit floating-point format.
Note
We recommend you use mixed-precision in your models while training on the CS system.
PyTorch Dynamic Loss Scaling#
Dynamic loss scaling is supported for PyTorch. It is configurable via the cbtorch.amp.GradScaler module. The following are the supported configuration parameters:
- loss_scale: Must be "dynamic" for dynamic loss scaling.
- initial_loss_scale: Default value: 2e15.
- steps_per_increase: Default value: 2000.
- min_loss_scale: Default value: 2e-14.
- max_loss_scale: Default value: 2e15.
- max_gradient_norm: For dynamic loss scaling with global gradient clipping.
These can be passed in via the amp.GradScaler constructor. For example:
from cerebras.framework.torch import amp

scaler = amp.GradScaler(
    loss_scale="dynamic",
    # DLS optimizer settings (loss_scale == "dynamic")
    initial_loss_scale=2e15,
    steps_per_increase=2000,
    min_loss_scale=2e-14,
    max_loss_scale=2e15,
    max_gradient_norm=...,
)
The GradScaler is used to wrap the loss tensor and scale it before the backward pass occurs:
from cerebras.framework.torch import amp
...
scaler = amp.GradScaler(...)
...
for inputs in dataloader:
    loss = model(inputs)
    scaler(loss).backward()
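For comparison, standard (non-CS) PyTorch exposes a similar dynamic loss scaling API through torch.cuda.amp.GradScaler, where the optimizer step and scale update are explicit. A minimal, self-contained sketch (the toy model and data are illustrative only; requires a CUDA device):

import torch
from torch import nn

model = nn.Linear(10, 1).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.MSELoss()
dataloader = [(torch.randn(8, 10).cuda(), torch.randn(8, 1).cuda())]

scaler = torch.cuda.amp.GradScaler()

for inputs, targets in dataloader:
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():
        loss = loss_fn(model(inputs), targets)
    scaler.scale(loss).backward()  # scale the loss before backprop
    scaler.step(optimizer)         # unscales gradients; skips step on inf/nan
    scaler.update()                # adjusts the loss scale dynamically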
TensorFlow Dynamic Loss Scaling#
To enable dynamic loss scaling (DLS) with TensorFlow, use the Trainer optimizer supported by the CS system.
Trainer#
The Trainer optimizer builds the train ops based on the given configuration parameters. This optimizer initializes several parameters that apply to DLS, such as the initial loss scaling factor and the number of steps before changing the loss scale factor. These settings are optimized for the CS system.
Parameters#
- params: Input. Datatype dict. Configuration parameters for the Trainer optimizer.
- tf_summary: Input. Datatype bool. The flag for summaries. Defaults to False.
- mixed_precision: Input. Datatype bool. The flag for mixed precision. Defaults to False.
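The exact schema of params is defined by the model configuration in the Cerebras Model Zoo; the dict below is a hypothetical illustration only, and its key names are assumptions rather than the authoritative schema:

# Hypothetical configuration; actual key names come from the
# Model Zoo model configuration in use.
params = {
    "optimizer": {
        "loss_scaling_factor": "dynamic",  # assumed key for enabling DLS
        "initial_loss_scale": 2e15,
        "steps_per_increase": 2000,
    },
    "training": {
        "mixed_precision": True,
    },
}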
Example#
The following is an example showing how to use the Trainer optimizer in your code. First, create an instance of the Trainer optimizer in the __init__(self) section of your model code:
# Model trainer
self.trainer = Trainer(
    params=params["optimizer"],
    tf_summary=tf_summary,
    mixed_precision=params["training"]["mixed_precision"],
)
Then build the train ops:
def build_train_ops(self, total_loss):
    """
    Set up the optimizer and build the train ops.
    """
    return self.trainer.build_train_ops(total_loss)
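In a typical TensorFlow Estimator setup, the returned train op then feeds the EstimatorSpec. A minimal sketch, assuming a model object built as in the example above; build_total_loss is a hypothetical helper standing in for your model's loss computation:

import tensorflow as tf

def model_fn(features, labels, mode, params):
    # model is assumed to be constructed as in the example above
    total_loss = model.build_total_loss(features, labels)  # hypothetical helper
    train_op = model.build_train_ops(total_loss)
    return tf.estimator.EstimatorSpec(
        mode=mode,
        loss=total_loss,
        train_op=train_op,
    )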
For more details on the CSDynamicLossScale and the Trainer optimizer, refer to the code in the Cerebras Model Zoo repository.
Note
To access the Python code for CSDynamicLossScale and the Trainer optimizer, you will need read permission for the Cerebras Model Zoo Git repository.
- The CSDynamicLossScale object in the Cerebras Graph Compiler (CGC) implements the dynamic loss scaling. See LossScale.py.
- This CSDynamicLossScale object is used by the Trainer optimizer. See Trainer.py.