Limitations of PyTorch on Cerebras

Floating Point Precision

Only mixed precision is supported for training a model on the Cerebras Wafer-Scale cluster. Weights are stored as float32, but all computation other than the weight update happens in a combination of float32 and either bfloat16 or float16. The required casts are inserted automatically; users do not need to insert them manually. See Control numerical precision level for more information on switching between precision optimization levels.
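To illustrate what this means in stock PyTorch terms, the sketch below uses torch.autocast on CPU as an analogy: the master weights stay in float32 while the matrix multiply runs in bfloat16. This is only an analogy, not the Cerebras mechanism; on the cluster the casts are inserted for you and no autocast context is needed.

```python
import torch

# Illustrative analogy using stock PyTorch autocast on CPU.
# On the Cerebras cluster the equivalent casts are inserted automatically.
model = torch.nn.Linear(16, 4)   # weights are stored as float32
x = torch.randn(8, 16)

with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    y = model(x)                 # the matmul is computed in bfloat16

print(model.weight.dtype)        # torch.float32 (master weights)
print(y.dtype)                   # torch.bfloat16 (activation)
```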

Note

At the moment, our primary focus is on the precision modes we currently offer. While we are not actively working to support additional precision modes, we remain open to doing so in the future.

Static Graphs

As of the 2.0.0 software release, we do not officially support reprogramming the Cerebras Wafer-Scale cluster after it has been initially programmed. This means that multiple compiles are not supported, and therefore the PyTorch compute graph must not change between iterations.
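For illustration, the generic PyTorch sketch below (the function names are hypothetical, not part of the Cerebras API) contrasts a training step whose operations change partway through training, which would require a recompile, with an equivalent step whose graph is identical on every iteration.

```python
import torch

model = torch.nn.Linear(16, 4)

def unsupported_step(x: torch.Tensor, step: int) -> torch.Tensor:
    # The set of ops changes at step 100, so the compute graph would
    # have to be recompiled mid-run -- not supported.
    if step < 100:
        return model(x)
    return torch.relu(model(x))

def supported_step(x: torch.Tensor, step: torch.Tensor) -> torch.Tensor:
    # The same ops run on every iteration; the branch is folded into
    # tensor arithmetic, so the compiled graph never changes.
    out = model(x)
    gate = (step >= 100).to(out.dtype)
    return gate * torch.relu(out) + (1.0 - gate) * out
```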

Learning Rate Scheduler

Currently, we do not support the typical PyTorch learning rate scheduler paradigm. A typical PyTorch learning rate scheduler computes a learning rate scalar and sets the values of the learning rates in the optimizer's parameter groups. We cannot support this behavior because the system currently requires static graphs.
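For reference, this is the standard PyTorch pattern that is not supported on the cluster: the scheduler computes a Python float on the host and writes it back into the optimizer's parameter groups every step.

```python
import torch

model = torch.nn.Linear(16, 4)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.5)

for step in range(30):
    # ... forward, backward, optimizer.step() ...
    # StepLR computes a host-side Python float and overwrites
    # optimizer.param_groups[i]["lr"], which a static graph cannot absorb.
    scheduler.step()
```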

Scheduling Learning Rate

We must specify the entire learning rate schedule as a function of the global step. This means that the learning rate is no longer a scalar value but a tensor whose value depends on the global step. See Learning Rate Scheduling for more details.
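As a minimal sketch of what a schedule expressed as a function of the global step looks like (the function name and constants below are illustrative, not part of the Cerebras API; the actual schedulers are described in Learning Rate Scheduling):

```python
import torch

def lr_schedule(global_step: torch.Tensor) -> torch.Tensor:
    # Linear warmup to base_lr over the first 1000 steps, constant afterwards.
    # The result is a tensor computed from the global step, not a Python float.
    base_lr = 1e-3
    warmup_steps = 1000.0
    return base_lr * torch.clamp(global_step.float() / warmup_steps, max=1.0)

print(lr_schedule(torch.tensor(250)))   # tensor(0.00025)
print(lr_schedule(torch.tensor(5000)))  # tensor(0.0010)
```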

This also means that any optimizer being used must be written so that the learning rate is treated as a tensor rather than a scalar value. See cerebras_pytorch.optim.AdamBase for an example of this.
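A rough sketch of the difference from the optimizer's point of view, using plain SGD (this is not the cerebras_pytorch.optim.AdamBase implementation, only an illustration of keeping the learning rate as a tensor):

```python
import torch

@torch.no_grad()
def sgd_step_with_tensor_lr(params, lr: torch.Tensor) -> None:
    # `lr` is consumed as a tensor end to end; it is never converted to a
    # Python float, so the same compiled graph serves every schedule value,
    # e.g. sgd_step_with_tensor_lr(model.parameters(), lr_schedule(global_step)).
    for p in params:
        if p.grad is not None:
            p.sub_(lr * p.grad)
```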