Limitations of PyTorch on Cerebras

Floating Point Precision

Only mixed precision is supported on the Cerebras system. This means that the weights are stored as float32 but the computations happen in float16. Casts are inserted automatically, so there is no need to insert them manually.

This is due to the architecture of the system itself, so there are no plans to support other precision modes at this time.
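To make the float32/float16 split concrete, the following minimal sketch uses only the Python standard library to round a value to each format. It illustrates a common general motivation for keeping weights in float32 (small updates can vanish in float16) and is not a statement about Cerebras internals. The helper `round_to` and the example values are assumptions made for illustration.

```python
import struct

def round_to(fmt: str, x: float) -> float:
    """Round a Python float to an IEEE 754 format: 'e' = float16, 'f' = float32."""
    return struct.unpack(fmt, struct.pack(fmt, x))[0]

weight = 1.0
update = 1e-4  # illustrative small update, e.g. learning_rate * gradient

# Near 1.0, float16 values are spaced about 0.001 apart, so the update is lost.
fp16_weight = round_to("e", weight + update)

# float32 has enough precision near 1.0 to keep the update.
fp32_weight = round_to("f", weight + update)

print(fp16_weight == weight)  # the float16 weight did not move
print(fp32_weight > weight)   # the float32 weight did
```

On the Cerebras system none of this is written by hand, since the casts are inserted automatically.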

Static Graphs

As of the 1.7.0 software release, it is not permitted to reprogram the fabric after initial programming. This means that multiple compiles are not supported and therefore the PyTorch compute graph must not change between iterations.

As a result, there are a number of caveats on how the training loop can be constructed, all of which are already addressed in our custom PyTorch runner classes. Refer to our implementations of the various hooks mentioned in PyTorch Runners.
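As an illustration of what "the compute graph must not change between iterations" can mean in practice, the sketch below contrasts a Python-level branch on a tensor value with an equivalent formulation that records the same operations every time. It is plain PyTorch, not Cerebras-specific API; the function names are hypothetical, and whether any particular construct is accepted by the Cerebras compiler is not something this sketch establishes.

```python
import torch

def scale_dynamic(loss: torch.Tensor) -> torch.Tensor:
    # Python-level branch on a tensor's value: which ops run depends on the
    # data, so the sequence of recorded ops can differ between iterations.
    if loss.item() > 1.0:
        return loss * 0.5
    return loss

def scale_static(loss: torch.Tensor) -> torch.Tensor:
    # Both candidates are always computed and the selection happens inside
    # the graph, so the same ops are recorded on every iteration.
    return torch.where(loss > 1.0, loss * 0.5, loss)

# The two formulations agree numerically on both sides of the threshold.
for value in (0.3, 2.0):
    loss = torch.tensor(value)
    assert torch.equal(scale_dynamic(loss), scale_static(loss))
```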

Modes

Only mode = "train" and mode = "eval" are supported in this release. mode = "train_and_eval" is not supported because the fabric cannot be reprogrammed after initial programming: once it has been programmed for train, it cannot be reprogrammed for eval in the same run.

Note

Even once Cerebras supports reprogramming the fabric after initial programming, it will still be best to avoid recompiling where possible, because recompiling and reprogramming the fabric can take a very long time.

Learning Rate Scheduler

Currently, we do not support the typical PyTorch learning rate scheduler paradigm. A typical PyTorch learning rate scheduler computes a learning rate scalar and sets the learning rates in the optimizer parameter groups. However, because the system currently requires static graphs, we cannot support this behaviour.
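For reference, the unsupported paradigm looks roughly like the following in stock PyTorch. This is a minimal sketch using `torch.optim.SGD` and `torch.optim.lr_scheduler.StepLR`; the toy parameter and loss are assumptions chosen only to make the loop runnable.

```python
import torch

weight = torch.nn.Parameter(torch.ones(1))
optimizer = torch.optim.SGD([weight], lr=0.1)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=2, gamma=0.5)

observed_lrs = []
for _ in range(4):
    optimizer.zero_grad()
    loss = (weight ** 2).sum()
    loss.backward()
    # The learning rate is a plain Python scalar stored in the param group.
    observed_lrs.append(optimizer.param_groups[0]["lr"])
    optimizer.step()
    # The scheduler computes the next scalar and writes it back into
    # optimizer.param_groups from Python, outside the compute graph.
    scheduler.step()

print(observed_lrs)
```

The point to notice is that the learning rate is updated from Python between iterations rather than being part of the graph itself.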

The supported PyTorch learning rate schedulers are listed on this page: Supported PyTorch Learning Rate Schedulers.

There are two different ways of scheduling a learning rate, depending on whether you are running in pipeline mode or in weight streaming mode.

Pipeline mode

We provide custom LR scheduler classes which pass the scheduler information directly to the system’s communication manager so that the scheduler can be programmed into the fabric.

See modelzoo.common.pytorch.optim.lr_scheduler for more details.

Weight Streaming mode

For weight streaming mode, the entire learning rate schedule must be specified as a function of the global step. The learning rate is then no longer a scalar value but a tensor that depends on the value of the global step. See modelzoo.common.pytorch.optim.lr_scheduler for examples of this.
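A minimal sketch of a schedule written as a function of the global step is shown below. It is plain PyTorch and is not taken from modelzoo.common.pytorch.optim.lr_scheduler; the function name `linear_warmup_lr`, the warmup-then-constant shape, and the parameter values are assumptions for illustration.

```python
import torch

def linear_warmup_lr(global_step: torch.Tensor, base_lr: float, warmup_steps: int) -> torch.Tensor:
    """Learning rate as a tensor-valued function of the global step."""
    step = global_step.to(torch.float32)
    # Linear ramp that reaches base_lr on the last warmup step.
    warmup_lr = base_lr * (step + 1.0) / warmup_steps
    constant_lr = torch.full_like(step, base_lr)
    # The choice between the two phases is made with tensor ops,
    # not with a Python `if` on the step value.
    return torch.where(step < warmup_steps, warmup_lr, constant_lr)

# Evaluate the schedule for steps 0..5 in one call.
lrs = linear_warmup_lr(torch.arange(6), base_lr=0.1, warmup_steps=4)
print(lrs)
```

Because the whole schedule is one expression in the step, nothing needs to be rewritten from Python between iterations.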

This also means that any optimizer in use must be written so that the learning rate is treated as a tensor rather than as a scalar value. See modelzoo.common.pytorch.optim.AdamBase for an example of this.
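The idea can be sketched with a plain SGD update, used here for brevity in place of Adam. This is not the AdamBase implementation; `sgd_step` is a hypothetical helper, and the only point it makes is that the update uses tensor arithmetic on the learning rate instead of converting it to a Python number.

```python
import torch

@torch.no_grad()
def sgd_step(params, lr: torch.Tensor) -> None:
    """One SGD update in which the learning rate stays a tensor throughout."""
    for param in params:
        if param.grad is None:
            continue
        # Tensor arithmetic keeps the learning rate as a tensor; nothing
        # here converts it to a Python number (no lr.item() or float(lr)).
        param.sub_(lr * param.grad)

weight = torch.nn.Parameter(torch.tensor([1.0, -2.0]))
(weight ** 2).sum().backward()           # gradient is 2 * weight
sgd_step([weight], lr=torch.tensor(0.1))
print(weight.data)
```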

Eval metrics

Eval metrics are only allowed to return a single result value. Therefore, only the final metric value should be returned. No intermediate state or values can be retrieved at this time.
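As a rough illustration of the single-value constraint, the hypothetical metric below returns only its final result rather than a tuple of intermediate counts. It is plain Python and does not use or represent the Cerebras metric API.

```python
def accuracy(predictions, labels) -> float:
    """Return one final value; intermediate counts are not exposed."""
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    total = len(labels)
    # Return only the final metric, not (correct, total) or per-batch values.
    return correct / total

result = accuracy([1, 0, 1, 1], [1, 1, 1, 0])
print(result)
```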