PyTorch Dynamic Loss Scaling
Attention
This document presents dynamic loss scaling for PyTorch. For TensorFlow, see TensorFlow Dynamic Loss Scaling.
See also
Dynamic Loss Scaling on Cerebras system.
Dynamic loss scaling is supported for PyTorch and is configured through the cbtorch.amp.GradScaler
module. The following configuration parameters are supported:
loss_scale
: Must be "dynamic" for dynamic loss scaling.
initial_loss_scale
: Default value: 2e15.
steps_per_increase
: Default value: 2000.
min_loss_scale
: Default value: 2e-14.
max_loss_scale
: Default value: 2e15.
max_gradient_norm
: For dynamic loss scaling with global gradient clipping. See PyTorch Gradient Clipping.
These parameters are passed to the amp.GradScaler
constructor. For example:
from cerebras.framework.torch import amp

scaler = amp.GradScaler(
    loss_scale="dynamic",
    # DLS optimizer parameters (used when loss_scale == "dynamic")
    initial_loss_scale=2e15,
    steps_per_increase=2000,
    min_loss_scale=2e-14,
    max_loss_scale=2e15,
    max_gradient_norm=...,
)
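Together, these parameters describe the usual dynamic loss scaling policy: the scale starts at initial_loss_scale, is halved (down to min_loss_scale) whenever scaled gradients overflow, and is doubled (up to max_loss_scale) after steps_per_increase consecutive overflow-free steps. The sketch below is a minimal, framework-agnostic illustration of that policy; it is not the cbtorch.amp.GradScaler implementation.

# Illustrative sketch of a dynamic loss scaling policy,
# not the cbtorch.amp.GradScaler internals.
class DynamicLossScale:
    def __init__(self, initial_loss_scale=2e15, steps_per_increase=2000,
                 min_loss_scale=2e-14, max_loss_scale=2e15):
        self.scale = initial_loss_scale
        self.steps_per_increase = steps_per_increase
        self.min_loss_scale = min_loss_scale
        self.max_loss_scale = max_loss_scale
        self.good_steps = 0  # consecutive steps without overflow

    def update(self, found_overflow: bool):
        if found_overflow:
            # Gradients overflowed: halve the scale and restart the counter.
            self.scale = max(self.scale / 2, self.min_loss_scale)
            self.good_steps = 0
        else:
            self.good_steps += 1
            if self.good_steps >= self.steps_per_increase:
                # Stable long enough: try a larger scale to preserve precision.
                self.scale = min(self.scale * 2, self.max_loss_scale)
                self.good_steps = 0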
The GradScaler
is used to wrap the loss tensor, scaling it before the backward pass:
from cerebras.framework.torch import amp
...
scaler = amp.GradScaler(...)
...
for inputs in dataloader:
    loss = model(inputs)
    scaler(loss).backward()
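Only the scaled backward pass is shown above. For comparison, the stock torch.cuda.amp.GradScaler follows the same dynamic loss scaling pattern end to end; the sketch below uses that upstream PyTorch API with a toy model and data, and is not the Cerebras cbtorch workflow.

# Comparison sketch: dynamic loss scaling with the upstream PyTorch API,
# not the Cerebras cbtorch API. Model, optimizer, and data are toy examples.
import torch
from torch.cuda.amp import GradScaler

model = torch.nn.Linear(16, 4).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = torch.nn.MSELoss()
dataloader = [(torch.randn(8, 16).cuda(), torch.randn(8, 4).cuda())
              for _ in range(10)]

# growth_interval plays the same role as steps_per_increase above.
scaler = GradScaler(init_scale=2.0**16, growth_interval=2000)

for inputs, targets in dataloader:
    optimizer.zero_grad()
    with torch.autocast("cuda", dtype=torch.float16):
        loss = loss_fn(model(inputs), targets)
    scaler.scale(loss).backward()  # scale the loss, then backpropagate
    scaler.step(optimizer)         # unscales gradients; skips the step on inf/NaN
    scaler.update()                # grows or shrinks the loss scale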