Model is too large to fit on the device

The memory requirements of your model are too large to fit on the device. Please see below for model-specific workarounds:

  • Transformer models: compile again on a single CS-2 system with the batch size set to 1 to determine whether the specified maximum sequence length is feasible. If it is, try a smaller batch size per device, or enable gradient accumulation for supported models (all Transformer models except T5) by setting the use_cs_grad_accum parameter to True in the runconfig section of your model’s yaml file.

  • Vision models (CNNs): try manually decreasing the batch size and/or the image/volume size.
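
As a rough sketch of the gradient accumulation workaround for Transformer models, the setting might look like the fragment below. It assumes that runconfig is a top-level section of the model's yaml file; every other key in that section is omitted, so treat this as an illustration of where the flag goes rather than a complete configuration.

```yaml
# Fragment of a model's yaml file. Only the key named in the text is shown;
# the rest of the runconfig section is left out.
runconfig:
  # Enables gradient accumulation on supported models
  # (all Transformer models except T5).
  use_cs_grad_accum: True
```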


For more information on using gradient accumulation while training on the Cerebras cluster, see Train with gradient accumulation.


The batch size set in the yaml configuration is the global batch size. The batch size per CS-2 system is therefore the global batch size divided by the number of CS-2 systems used. For example, a global batch size of 256 run on 4 CS-2 systems gives a batch size of 64 per system.

Observed Error

Model is too large to fit on the device. This can happen because of a large batch size, large input tensor dimensions, or other network parameters. Please refer to the Troubleshooting section in the documentation for potential workarounds.