Dynamic Loss Scaling#
Training a neural network in mixed precision yields a considerable speedup in training computation and reduces memory bandwidth requirements. Mixed-precision training uses a combination of half-precision floating point (FP16) for computation and FP32 (also called single-precision or full-precision) for storing information.
However, simply converting data from FP32 to FP16 results in increased training loss and, eventually, training divergence. This is because many of the loss gradient values computed during the backward pass are too small to be represented in FP16, so the FP32-to-FP16 conversion flushes them to zero.
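The underflow is easy to demonstrate with NumPy, which implements IEEE 754 half precision. A gradient magnitude that FP32 represents without trouble is below FP16's smallest subnormal (about 6e-8) and becomes exactly zero:

```python
import numpy as np

# A gradient value that is perfectly representable in FP32...
grad_fp32 = np.float32(1e-8)

# ...flushes to zero in FP16, whose smallest subnormal is ~5.96e-8.
grad_fp16 = np.float16(grad_fp32)

print(grad_fp32, grad_fp16)
```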
The solution to this problem is to:

1. Scale up the loss value before starting the backpropagation computation of the loss gradients, and
2. Unscale the weight gradients before the weight update begins, in order to maintain the magnitude of the updates.

This process is called loss scaling, and it helps preserve small gradient values.
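The two steps above can be sketched with NumPy. Scaling the loss scales every gradient by the same factor (here 2**16, an illustrative choice, not a prescribed value), moving small gradients into FP16's representable range; unscaling in FP32 then restores the original magnitude before the weight update:

```python
import numpy as np

loss_scale = np.float32(2 ** 16)  # illustrative scaling factor
grad = np.float32(1e-8)

# Without scaling, this gradient underflows to zero in FP16.
assert np.float16(grad) == 0.0

# Scaled by the loss scale, it survives the FP32-to-FP16 conversion.
scaled_fp16 = np.float16(grad * loss_scale)

# Unscale in FP32 before the weight update to recover the magnitude.
recovered = np.float32(scaled_fp16) / loss_scale

print(recovered)  # close to 1e-8, no longer zero
```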
While you can choose the scaling factor manually, it often takes several rounds of experimentation to find the correct loss scale for your network. To simplify this process, the CS system supports dynamic loss scaling (DLS) during training. DLS will automatically determine an appropriate loss scale for your network, making it easy for you to enable mixed precision training with just a few lines of code.
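The adjustment policy behind dynamic loss scaling can be sketched in plain Python. The rule shown here (halve the scale when gradients overflow, raise it after a long run of overflow-free steps) is the standard scheme used by DLS implementations; the parameter names mirror the GradScaler options listed below, but the exact update rule used on the CS system is an assumption of this sketch:

```python
def update_loss_scale(loss_scale, grads_overflowed, good_steps,
                      steps_per_increase=2000,
                      min_loss_scale=2e-14, max_loss_scale=2e15):
    """One dynamic-loss-scaling update.

    Returns the new (loss_scale, good_steps) pair: halve the scale on
    overflow, double it after `steps_per_increase` consecutive
    overflow-free steps, and clamp it to [min_loss_scale, max_loss_scale].
    """
    if grads_overflowed:
        # Overflow: this step's weight update is skipped and the scale backs off.
        return max(loss_scale / 2, min_loss_scale), 0
    good_steps += 1
    if good_steps >= steps_per_increase:
        # Stable for long enough: try a larger scale to protect small gradients.
        return min(loss_scale * 2, max_loss_scale), 0
    return loss_scale, good_steps
```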
For the precision formats supported on the CS system, see Mixed-Precision Training. The CS system supports training in mixed precision with either IEEE FP16 or CB16, Cerebras' 16-bit floating point format.
We recommend using mixed precision in your models when training on the CS system.
PyTorch Dynamic Loss Scaling#
Dynamic loss scaling is supported for PyTorch. It is configurable via the cbtorch.amp.GradScaler module. The following configuration parameters are supported (the default values were elided in the source text):

- loss_scale: Must be "dynamic" for dynamic loss scaling.
- initial_loss_scale: The loss scale value to start training with.
- steps_per_increase: The number of consecutive overflow-free steps before the loss scale is increased.
- min_loss_scale: The minimum value the loss scale may be lowered to.
- max_loss_scale: The maximum value the loss scale may be raised to.
- max_gradient_norm: For dynamic loss scaling with global gradient clipping.
These can be passed in via the amp.GradScaler constructor. For example:

```python
from cerebras.framework.torch import amp

scaler = amp.GradScaler(
    loss_scale="dynamic",  # DLS optimizer (loss_scale == 'dynamic')
    initial_loss_scale=2e15,
    steps_per_increase=2000,
    min_loss_scale=2e-14,
    max_loss_scale=2e15,
    max_gradient_norm=...,
)
```
GradScaler is used to wrap the loss tensor and scale it before the backward pass occurs:

```python
from cerebras.framework.torch import amp

...
scaler = amp.GradScaler(...)
...

for inputs in dataloader:
    loss = model(inputs)
    scaler(loss).backward()
```
TensorFlow Dynamic Loss Scaling#
To enable dynamic loss scaling (DLS) with TensorFlow, use the CS system-supported Trainer optimizer. The Trainer optimizer builds the train ops based on the given configuration parameters. It initializes several parameters that apply to DLS, such as the initial loss scaling factor and the number of steps before changing the loss scale factor. These settings are optimized for the CS system.
- params: Input. Datatype dict. Configuration parameters for the Trainer optimizer.
- tf_summary: Input. Datatype bool. The flag for enabling summaries.
- mixed_precision: Input. Datatype bool. The flag for enabling mixed precision.
The following example shows how to use the Trainer optimizer in your code.

First, create an instance of the Trainer optimizer in the __init__(self) section of your code:

```python
# Model trainer
self.trainer = Trainer(
    params=params["optimizer"],
    tf_summary=tf_summary,
    mixed_precision=params["training"]["mixed_precision"],
)
```
Then build the train ops:

```python
def build_train_ops(self, total_loss):
    """Setup optimizer and build train ops."""
    return self.trainer.build_train_ops(total_loss)
```
For more details on CSDynamicLossScale and the Trainer optimizer, refer to the code in the Cerebras Model Zoo repository. To access this Python code, you will need read permission for the Cerebras Model Zoo Git repository.

- The CSDynamicLossScale object in the Cerebras Graph Compiler (CGC) implements dynamic loss scaling. See LossScale.py.
- The CSDynamicLossScale object is used by the Trainer optimizer. See Trainer.py.