Resource requirements for parallel training and compilation#

Overview#

The Cerebras Wafer-Scale cluster offers several key capabilities to maximize resource utilization and manage resources efficiently:

  • Parallel Compilation and Training: The cluster supports parallel compilation and training of machine learning models. This means multiple models or training jobs can run concurrently, making the most of the CS-X resources available in the cluster. Parallel execution enhances throughput and minimizes resource idle time.

  • Resource Management with Strict Limits: The cluster implements resource management with strict limits on both memory and CPU requests. This ensures that resource allocation is controlled and optimized for efficiency. Jobs are allocated resources according to predefined limits to prevent resource contention and maintain system stability.

  • Explicit Resource Management: Users have the option of explicitly managing resource requests and limits for their training jobs. This means you can specify the exact amount of memory and CPU resources required for your job. This fine-grained control allows you to tailor resource allocation to the specific needs of your machine learning models.

  • Resource Override: In situations where a model requires more resources than initially specified, you have the flexibility to override the resource limits. This ensures that high-complexity or resource-intensive models can still be accommodated.

Note

Swap has been removed from all the Cerebras Wafer-Scale Cluster CPU nodes.

CPU Requirements#

In the Cerebras Wafer-Scale cluster, the number of cores used by a job is typically set using Kubernetes (K8s) resource requests. When you submit a job or workload to the cluster, you specify the number of CPU cores it requires by setting a K8s resource request, which informs the cluster of the minimum number of CPU cores your job needs for execution. While you specify the CPU core requirements explicitly, the cluster management system dynamically manages additional memory requirements as necessary: if your job needs more memory than you explicitly requested, the cluster management system can allocate it as needed to ensure your job runs smoothly.
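
For illustration, a CPU request in generic Kubernetes form is sketched below. On a Cerebras cluster the appliance software populates these fields on your behalf; the pod name, image, and values here are hypothetical, not Cerebras-specific defaults.

apiVersion: v1
kind: Pod
metadata:
  name: example-train-job             # hypothetical name, for illustration only
spec:
  containers:
    - name: trainer
      image: example/trainer:latest   # placeholder image
      resources:
        requests:
          cpu: "16"       # minimum cores the scheduler must reserve for the job
          memory: 32Gi    # minimum memory reserved up front
        limits:
          cpu: "24"       # hard ceiling enforced at runtime
          memory: 64Gi    # hard memory ceiling; exceeding it OOM-kills the pod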

Memory Requirements#

Management Node#

As of Release 1.9.1, Cerebras supports a pool of management nodes, which allows a greater number of parallel compiles and trains on a cluster. The default train and compile coordinator memory limits for management nodes can optionally be overridden at cluster deployment time, based on the memory of the management node. The default limits are tailored to a management node with 128Gi of memory; overrides can provide additional headroom on management nodes with more memory. Some typical examples are as follows:

  • 125Gi single management node: 67Gi for Compile and 32Gi for Train. This allows for 1 compile and 1 train in parallel.

  • 503Gi single management node: 90Gi for Compile and 90Gi for Train. This allows for at least 2 compiles and 2 trains in parallel.

Some outlier models may need more memory. In that case, an OOM message corresponding to the compile or train job is propagated to the client-side logs. These errors are also visible in the software errors section of the wsjob dashboard.

You can override the default limits in the runconfig section of the YAML configuration file using the compile_crd_memory_gi and execute_crd_memory_gi options, as follows:

runconfig:
  compile_crd_memory_gi: 100   # compile coordinator memory limit, in GiB
  execute_crd_memory_gi: 120   # train (execute) coordinator memory limit, in GiB

Maximum override value#

Increasing the memory limits for servers running on the management node may reduce the number of parallel compiles and trains. Not all of the physical memory on a management node is available for compile/train coordinator consumption: the cluster management overhead must be subtracted from the total physical memory before determining a possible maximum override value. Estimated cluster management overheads are as follows:

  • Cluster with a single management node: 28Gi

  • Cluster with multiple management nodes: 45Gi

It is recommended to subtract a small additional buffer on top of this overhead before settling on the maximum override.
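
As a sketch of the arithmetic: a single 503Gi management node leaves roughly 503Gi - 28Gi ≈ 475Gi, less a small buffer, to divide among coordinators. An override dedicating most of that to one large compile and one large train might look as follows; the values are illustrative, not recommendations:

runconfig:
  compile_crd_memory_gi: 230   # illustrative value only
  execute_crd_memory_gi: 230   # 230 + 230 = 460Gi, leaving ~15Gi of buffer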

MemoryX Nodes#

The servers running on MemoryX nodes set their memory limits dynamically, based on estimated memory usage derived from compile artifacts. Cluster management automatically rebalances the scheduling when more memory is requested. Advanced users running large models may wish to tweak the automatically determined limits. You can override them in the runconfig section of the YAML configuration file using the following options:

runconfig:
  # Users can set any of these values to -1 to disable that memory limit entirely
  act_memory_gi: 128   # activation server memory limit, in GiB
  cmd_memory_gi: 1     # command server memory limit, in GiB
  wgt_memory_gi: 64    # weight server memory limit, in GiB

Worker Nodes#

Worker servers have a peak memory limit of 60Gi. Worker code is user-defined, so memory usage is outside of Cerebras’ control. For reference, no worker server has exceeded this limit when running Cerebras internal tests.

In the scenario where a worker requires an exceptional amount of memory, the memory limit can be changed manually in the runconfig section of the YAML configuration file:

runconfig:
  # Users can set this value to -1 to disable memory limits entirely
  wrk_memory_gi: 80

SwarmX Nodes#

The SwarmX servers have a peak memory limit of 60Gi. No known outliers exist.

User Nodes#

A 128GB user node can support up to 2 compiles and 2 trains in parallel, as long as the cluster’s capacity supports it. If more jobs are launched than the cluster can support, an error message is reported.

Parallel compilations and trainings in clusters#

The number of parallel compilations and trainings that can be supported in a cluster depends on:

  • Number of CS-X systems

  • Number of management nodes

  • Number of fully populated MemoryX racks

  • Memory and CPU capacity of the management nodes

  • Number of fully populated MemoryX node groups in the cluster

  • Largest size model required to be supported

  • Model characteristics and parameters that drive up the memory

The table below shows the feasible number of parallel compilations and trainings based on the number and type of management nodes.

Table 5 Feasible Number of Parallel Compiles and Trains#

| Number of management nodes           | 1x128GB    | 2x128GB    | 3x128GB    | 1x512GB    | 4x512GB    |
|--------------------------------------|------------|------------|------------|------------|------------|
| Number of CS-X systems               | 1          | 1          | 2          | 2          | 16         |
| Compile, Train memory limit          | 67Gi, 32Gi | 75Gi, 75Gi | 75Gi, 75Gi | 75Gi, 75Gi | 75Gi, 75Gi |
| Parallel compiles (C) and trains (T) | 1C, 1T     | 1C, 1T     | 1C, 2T     | 2C, 2T     | 8C, 8T     |

Note

The cluster management overhead per management node is ~45Gi for multiple management nodes and ~28Gi for a single management node; this overhead is taken into account in the table above.

The maximum CPU and memory used by a specific run can be viewed in the Grafana wsjob dashboard.

Known Outliers#

Compile memory outliers#

T5 model variants with a batch size greater than or equal to 700 are not supported.

Train memory outliers#

In general, the train memory requirement on the appliance increases with the maximum tensor size. A maximum tensor size larger than 3.5Gi can drive the peak memory requirement on a management node above the 32Gi limit. With the limit overridden to 75Gi, the maximum tensor size that can be supported is approximately 6.8Gi. For example, a model with a hidden size of ~6k and a vocabulary size of ~150k produces a tensor size of ~3.7Gi; since this exceeds what the default 32Gi limit can support, the limit should be overridden to 75Gi for execution.
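
For such a model, the execute override described in the Management Node section applies; a minimal sketch:

runconfig:
  execute_crd_memory_gi: 75   # raise the train coordinator limit to cover ~3.7Gi tensors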

The following is a list of known outliers that exceed even the 75Gi limit, with their peak memory usage:

  • A GPT-2 model with 1.3B parameters, a vocabulary size of 1 million, and a hidden size of 2280 requires 117Gi.

Impact of checkpoint frequency on memory#

The frequency of checkpoints has an impact on memory usage. For models with 20B parameters and above, checkpointing more often than every 200 steps is not recommended.
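
As a sketch, assuming checkpoint frequency is controlled by a checkpoint_steps option in the runconfig section of your params file (as in Cerebras Model Zoo reference configurations):

runconfig:
  checkpoint_steps: 200   # checkpoint at most once every 200 steps for 20B+ parameter models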

Note

To learn more about how to troubleshoot issues with system resources, visit the Out of memory errors and system resources section.