Limitations of PyTorch on Cerebras#
Floating Point Precision#
Only mixed-precision training is supported on the Cerebras Wafer-Scale cluster. Weights are stored as
float32, but computation other than the weight update happens in a combination of
float16 and float32. Casts are inserted automatically; users do not need to insert them manually. See Control numerical precision level for more information on switching between precision optimization levels.
Our primary focus is on the precision modes we currently offer. While we are not actively pursuing support for other precision modes, we remain open to adding them in the future.
As of the 2.0.0 software release, we do not officially support reprogramming the Cerebras Wafer-Scale cluster after initial programming. This means that multiple compiles are not supported, and therefore the PyTorch compute graph must not change between iterations.
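One common way a compute graph changes between iterations is Python-level branching on tensor values. The sketch below (plain PyTorch, illustrative only) shows such a data-dependent branch and a static alternative that keeps the branch inside the graph with `torch.where`:

```python
import torch

x = torch.tensor([-1.0, 2.0, -3.0])

# Problematic: a Python `if` on a tensor value can select a different
# set of ops on different iterations, changing the traced graph.
def dynamic_fn(t):
    if t.sum() > 0:          # data-dependent Python branch
        return t * 2
    return t * -1

# Static alternative: both branches are part of one fixed graph and
# `torch.where` selects between them at run time.
def static_fn(t):
    return torch.where(t.sum() > 0, t * 2, t * -1)

assert torch.equal(dynamic_fn(x), static_fn(x))
```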
Learning Rate Scheduler#
Currently, we do not support the typical PyTorch learning rate scheduler paradigm. A typical PyTorch learning rate scheduler computes a learning rate scalar and sets the learning rates in the optimizer's parameter groups. We cannot support this behavior because the system currently requires static graphs.
Scheduling Learning Rate#
We must specify the entire learning rate schedule as a function of the global step. This means that the learning rate becomes less of a scalar value and more of a tensor that depends on the value of the global step. See Learning Rate Scheduling for more details.
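A minimal sketch of this idea (a hypothetical linear-warmup schedule written with plain PyTorch tensor ops, so the schedule itself lives inside the graph as a function of the global step; the constants `warmup_steps` and `base_lr` are illustrative):

```python
import torch

def lr_tensor(global_step: torch.Tensor) -> torch.Tensor:
    """Linear warmup followed by a constant learning rate, expressed
    entirely with tensor ops on the global step."""
    warmup_steps = 100.0   # illustrative constant
    base_lr = 0.001        # illustrative constant
    warmup = base_lr * global_step.float() / warmup_steps
    return torch.where(global_step < warmup_steps,
                       warmup,
                       torch.tensor(base_lr))

# The learning rate is a tensor whose value depends on the step tensor.
print(lr_tensor(torch.tensor(50)))   # mid-warmup
print(lr_tensor(torch.tensor(200)))  # past warmup: constant base_lr
```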
This also means that any optimizers being used need to be written so that the
learning rate is treated not as a scalar value but as a tensor. See
cerebras_pytorch.optim.AdamBase for an example of
an optimizer written this way.
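To make the distinction concrete, here is a toy SGD-style step (not the AdamBase implementation) where the learning rate enters the update as a tensor rather than a Python float, so a step-dependent schedule stays inside the graph:

```python
import torch

def sgd_step(param: torch.Tensor, grad: torch.Tensor,
             lr: torch.Tensor) -> torch.Tensor:
    # `lr` is a tensor, so the schedule that produced it remains part of
    # the compute graph; a Python float here would bake in a constant.
    return param - lr * grad

w = torch.ones(3)
g = torch.full((3,), 2.0)
lr = torch.tensor(0.1)   # in practice, computed from the global step
w = sgd_step(w, g, lr)
print(w)  # tensor([0.8000, 0.8000, 0.8000])
```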