Resource requirements for parallel training and compilation#

The Cerebras Wafer-Scale Cluster supports parallel compilations and training runs, for maximum utilisation of the CS-2 systems in the cluster, depending on the cluster's configuration. It supports explicit resource management with strict limits on memory and CPU requests. You can override these limits to run a model that needs more resources.


Note: Swap has been removed from all CPU nodes in the Cerebras Wafer-Scale Cluster.

CPU Requirements#

The number of cores is set using a Kubernetes (K8s) resource request. Additional memory requirements are managed by cluster management as needed.

Memory Requirements#

Management Node#

R1.9 supports a pool of management nodes, which allows a greater number of parallel compiles and trains on a cluster. The default compile and train coordinator memory limits for management nodes can optionally be overridden at cluster deployment time, based on the memory of the management nodes. The default limits are tailored for a management node with 128Gi of memory; overrides can provide additional headroom on management nodes with more memory. Some typical examples are as follows:

  • 125Gi single management node: 67Gi for Compile and 32Gi for Train. This allows for 1 compile and 1 train in parallel.

  • 503Gi single management node: 90Gi for Compile and 90Gi for Train. This allows for 2 compiles and 2 trains in parallel.

Some outlier models may need more memory, in which case you will see an OOM message corresponding to the compile or train job, propagated to the client-side logs. These errors are also visible in the software errors section of the wsjob dashboard.

You can override the default limits in the runconfig section of the YAML configuration file using the compile_crd_memory_gi and execute_crd_memory_gi options, as follows:

  runconfig:
    compile_crd_memory_gi: 100
    execute_crd_memory_gi: 120

Maximum override value#

Increasing the memory limits for management nodes may reduce the number of parallel compiles and trains. Note that not all physical memory on a management node is available for compile/train coordinator consumption: the cluster management overhead must be subtracted from the total physical memory before determining a possible maximum override value. Estimated cluster management overheads are as follows:

  • Cluster with single management node setup: 28 Gi

  • Cluster with multiple management node setup: 45 Gi

It is recommended to subtract a small additional buffer on top of this overhead before determining the maximum possible override.
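The guidance above can be expressed as a small helper. This is a sketch of the arithmetic only; the 5Gi safety buffer is an assumption, not a Cerebras-specified value:

```python
def max_override_gi(node_memory_gi, single_management_node=True, buffer_gi=5):
    """Estimate the maximum memory-limit override for a management node.

    Subtracts the estimated cluster management overhead (28Gi for a
    single-management-node cluster, 45Gi for multiple management nodes)
    plus a small safety buffer from the node's physical memory.
    """
    overhead_gi = 28 if single_management_node else 45
    return node_memory_gi - overhead_gi - buffer_gi

# e.g. a 503Gi single management node leaves roughly 470Gi for overrides
print(max_override_gi(503))  # 470
```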

MemoryX Nodes#

The servers running on MemoryX nodes set their memory limits dynamically, based on estimated memory usage derived from compile artifacts. Cluster management automatically handles job scheduling when more memory is requested, so no overrides are required for these nodes.

Worker Nodes#

Worker servers have a peak memory limit of 60Gi. Because worker code is user-defined, memory usage is outside Cerebras' control. As a reference, no worker server has exceeded this limit when running Cerebras internal tests.

In the scenario where a worker requires an exceptional amount of memory, the memory limit can be changed manually in the runconfig section of the YAML configuration file:

  runconfig:
    wrk_memory_gi: 80

SwarmX Nodes#

The SwarmX servers have a peak memory limit of 60Gi. No known outliers exist.

User Nodes#

A 128GB user node can support up to 2 compiles and 2 trains in parallel, as long as the cluster’s capacity supports it. If jobs are launched beyond what the cluster can support, an error message is reported.

Supported number of parallel compilations and trainings in a cluster#

The number of parallel compilations and trainings that can be supported in a cluster depends on:

  • Number of CS-2 systems

  • Number of management nodes

  • Number of fully populated MemoryX racks

  • Memory and CPU capacity of the management nodes

  • Number of fully populated MemoryX node groups in the cluster

  • Largest size model required to be supported

  • Model characteristics and parameters that drive up memory usage

The table below shows the number of parallel compilations and trainings supported, based on the type of management nodes.

Table 1 Feasible Number of Parallel Compiles and Trains#

| Number of management nodes | Number of CS-2 systems | Compile, Train memory limit | Parallel compiles (C) and trains (T) |
|---|---|---|---|
|   |   | 67Gi, 32Gi | 1C, 1T |
|   |   | 75Gi, 75Gi | 1C, 1T |
|   |   | 75Gi, 75Gi | 1C, 2T |
|   |   | 75Gi, 75Gi | 2C, 2T |
|   |   | 75Gi, 75Gi | 8C, 8T |


The cluster management overhead per management node is ~45Gi for clusters with multiple management nodes and ~28Gi for a single management node; this is taken into account in the table above.

Monitoring Cluster Resource Usage#

The maximum CPU and memory used by a specific run can be seen in the Grafana wsjob dashboard.

Known Outliers#

Compile memory outliers#

T5 model variants with a batch size of 700 or larger are not supported.

Train memory outliers#

In general, the train memory requirement on the appliance increases with the maximum tensor size. A maximum tensor size larger than 3.5Gi can drive the peak memory requirement on a management node above the 32Gi limit. With the limit overridden to 75Gi, the maximum tensor size that can be supported is approximately 6.8Gi. For example, a model with a hidden size of ~6k and a vocabulary size of ~150k results in a tensor size of ~3.7Gi; since this exceeds the 32Gi limit's threshold, the memory limit should be overridden to 75Gi for execution.
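As an illustration of the arithmetic, the largest tensor in such a model is typically the hidden-size × vocabulary-size embedding or output-projection matrix. The helper below is a sketch assuming 4 bytes per element (fp32); the exact dimensions used are illustrative:

```python
def tensor_size_gib(hidden_size, vocab_size, bytes_per_element=4):
    # Size in GiB of a hidden x vocab weight tensor (fp32 assumed)
    return hidden_size * vocab_size * bytes_per_element / 2**30

# A ~6k hidden size with a ~150k vocabulary gives a tensor of roughly 3.5-3.7Gi,
# above the 3.5Gi threshold mentioned above
print(round(tensor_size_gib(6144, 160000), 1))  # 3.7
```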

The following is a list of known outliers, assuming the 75Gi limit, along with their peak memory usage:

  • GPT-2 1.3B parameter model with a vocabulary size of 1 million and a hidden size of 2280 requires 117Gi

Impact of checkpoint frequency on memory#

The frequency of checkpoints has an impact on memory usage. For models with 20B parameters and above, a checkpoint interval of at least 200 steps is recommended.
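For example, the checkpoint interval can typically be set in the runconfig section of the YAML configuration file. Note that the checkpoint_steps option name here is an assumption based on common Cerebras Model Zoo configurations, not something specified above:

  runconfig:
    checkpoint_steps: 200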


To learn more about how to troubleshoot issues with system resources, visit the Out of memory errors and system resources section.