PyTorch Dynamic Loss Scaling

Attention

This document presents dynamic loss scaling for PyTorch. For TensorFlow, see TensorFlow Dynamic Loss Scaling.

See also

Dynamic Loss Scaling on the Cerebras system.

Dynamic loss scaling is supported for PyTorch and is configured via the cbtorch.amp.GradScaler module. The supported configuration parameters are:

  • loss_scale: Set to "dynamic" to enable dynamic loss scaling.

  • initial_loss_scale: The starting loss scale value. Default value: 2e15.

  • steps_per_increase: The number of consecutive overflow-free steps before the loss scale is increased. Default value: 2000.

  • min_loss_scale: The lower bound on the loss scale. Default value: 2e-14.

  • max_loss_scale: The upper bound on the loss scale. Default value: 2e15.

  • max_gradient_norm: Enables global gradient clipping alongside dynamic loss scaling. See PyTorch Gradient Clipping.
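To see how these parameters interact, here is a minimal pure-Python sketch of the usual dynamic loss scaling policy (halve the scale on overflow, double it after a run of clean steps). This is an illustration of the standard scheme, not the cbtorch implementation; the class and method names are hypothetical.

```python
class DynamicScaleTracker:
    """Toy model of a dynamic loss scaling policy (not the cbtorch internals)."""

    def __init__(self, initial_loss_scale=2e15, steps_per_increase=2000,
                 min_loss_scale=2e-14, max_loss_scale=2e15):
        self.scale = initial_loss_scale
        self.steps_per_increase = steps_per_increase
        self.min_scale = min_loss_scale
        self.max_scale = max_loss_scale
        self.good_steps = 0  # consecutive steps without overflow

    def update(self, found_overflow: bool):
        if found_overflow:
            # Gradients overflowed: halve the scale and restart the counter.
            self.scale = max(self.scale / 2.0, self.min_scale)
            self.good_steps = 0
        else:
            self.good_steps += 1
            if self.good_steps >= self.steps_per_increase:
                # A long run of clean steps: try a larger scale.
                self.scale = min(self.scale * 2.0, self.max_scale)
                self.good_steps = 0
```

The bounds min_loss_scale and max_loss_scale keep the scale from collapsing to zero after a burst of overflows or growing without limit during a long stable run.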

These parameters are passed to the amp.GradScaler constructor. For example:

from cerebras.framework.torch import amp

scaler = amp.GradScaler(
    loss_scale="dynamic",
    # The following parameters apply only when loss_scale == "dynamic"
    initial_loss_scale=2e15,
    steps_per_increase=2000,
    min_loss_scale=2e-14,
    max_loss_scale=2e15,
    max_gradient_norm=...,
)

The GradScaler wraps the loss Tensor, scaling it before the backward pass runs:

from cerebras.framework.torch import amp

...
scaler = amp.GradScaler(...)
...

for inputs in dataloader:
    loss = model(inputs)
    scaler(loss).backward()
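Why scale the loss at all? In reduced precision, very small gradients underflow to zero. By the chain rule, multiplying the loss by a scale S multiplies every gradient by S, lifting them into the representable range; dividing by S before the optimizer step recovers the original values exactly when S is a power of two. The following standalone sketch mimics fp16 underflow with a hypothetical cutoff constant (real fp16 also has subnormals down to about 6e-8; this is a simplification for illustration):

```python
# Smallest positive normal float16 value is about 6.1e-5; treat anything
# smaller as flushed to zero to mimic fp16 underflow (a simplifying
# assumption -- real fp16 keeps subnormal values down to ~6e-8).
FP16_TINY = 6.1e-5

def to_fp16_like(x):
    """Flush values below the fp16 normal range to zero."""
    return 0.0 if abs(x) < FP16_TINY else x

grad = 1e-6    # a gradient too small for the fp16 normal range
S = 1024.0     # loss scale; a power of two, so scaling is exact

lost = to_fp16_like(grad)          # without scaling: underflows to 0.0
kept = to_fp16_like(grad * S) / S  # with scaling: survives, then unscaled
```

Here `lost` is flushed to zero while `kept` recovers the original gradient, which is why the scaler multiplies the loss before backward and the framework divides the gradients by the same scale before they reach the optimizer.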