Performance Optimization Practices#

This document describes optimization practices for obtaining the best performance from the Cerebras system.

Cerebras ML workflow#

When you are targeting the Cerebras system for your neural network jobs, the ML workflow you will follow consists of the following steps, in the order specified:

Port your code to CS.
Prepare input data.
Compile on CPU.
Run on the CS system.

Precision optimization level#

We introduce a new setting to control the level of numerical precision used for training runs for large NLP models in weight-streaming. This setting is controlled via the precision_opt_level flag in the .yaml configurations for the models that we ship in this release.

The flag reflects the tradeoff between model convergence and performance for long training runs depending on the numerical setting used. For this release, precision_opt_level: 0 refers to convergent numerics and makes use of single-precision (FP32) reductions and accumulations to ensure model convergence; whereas precision_opt_level: 1 is our high-performance setting and utilizes half-precision (FP16) for reductions and accumulations.
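The effect of accumulation precision can be illustrated outside the Cerebras stack. The following is a minimal sketch in plain Python, not the Cerebras implementation: it uses the standard `struct` module to emulate FP16 and FP32 rounding, and the helper names `round_to` and `accumulate` are hypothetical.

```python
import struct

def round_to(fmt, x):
    """Round a Python float to a struct format: 'e' is FP16, 'f' is FP32."""
    return struct.unpack(fmt, struct.pack(fmt, x))[0]

def accumulate(values, fmt):
    """Sum values, rounding the running total to `fmt` after every addition."""
    total = 0.0
    for v in values:
        total = round_to(fmt, total + round_to(fmt, v))
    return total

values = [0.01] * 10_000  # the exact sum is 100
fp16_sum = accumulate(values, "e")
fp32_sum = accumulate(values, "f")
print(f"FP16 accumulation: {fp16_sum}")
print(f"FP32 accumulation: {fp32_sum}")
```

In this toy case the FP16 running total stops growing once each increment is smaller than half the spacing between adjacent FP16 values, while the FP32 total stays close to 100. This is the kind of error that FP32 reductions and accumulations are meant to avoid on long training runs.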

By default, our TensorFlow WS models have precision_opt_level: 0 specified in the .yaml files and our PyTorch models have precision_opt_level: 1.

Prepare input data#

Because the CS system can train a neural network at a very high speed, your input pipeline must be fast enough to keep the CS system fed with input data. Low input data throughput becomes a bottleneck that prevents high CS system utilization. The following best practices help optimize the input pipeline throughput.

Place input data on a shared network file system#

Note this distinction:

The Cerebras accelerator is responsible only for running and accelerating the actual training and predictions on the neural network.
All the surrounding tasks, such as compiling the model, preprocessing the input data, running the input function, streaming the data, and managing the training loop, are executed in the Cerebras server cluster by the Cerebras software running on its CPU nodes. During runtime, the workers in the Cerebras server cluster stream the input data to the Cerebras accelerator.

Hence, it is critically important that you place your input data on a shared network file system and ensure that all the CPU nodes have access to it. This way, every worker can read the input data and stream it at a very high speed to the Cerebras accelerator.

Determine optimal input workers#

Achieving high input data throughput often requires running your input pipeline on multiple worker processes across many CPU nodes at once.

Use the cs_input_analyzer script. This script generates a report containing the recommended values for the following Slurm environment variables. These variables determine the number of CPU nodes to use in the cluster, the number of tasks to run per node, and the number of CPUs in a node assigned to each task.

nodes: $DEF_NODES
tasks_per_node: $DEF_TASKS
cpus_per_task: $DEF_CPUS

For optimal input data throughput, use the recommended values above in your csrun_wse script when you train your model on the CS system.
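As a rough sanity check, the throughput of an input pipeline can also be timed on its own before any training run. The sketch below is a generic illustration rather than part of the Cerebras tooling: `measure_throughput` and `fake_input_fn` are hypothetical names, and the generator stands in for a real input function.

```python
import time

def measure_throughput(batches, batch_size, warmup=2):
    """Return samples per second for an iterable of batches, skipping warm-up batches."""
    it = iter(batches)
    for _ in range(warmup):
        next(it, None)  # discard warm-up batches (file opens, cache fills)
    start = time.perf_counter()
    count = sum(1 for _ in it)
    elapsed = max(time.perf_counter() - start, 1e-9)  # guard against a zero interval
    return count * batch_size / elapsed

def fake_input_fn(num_batches, batch_size):
    """Hypothetical stand-in for a real input function."""
    for i in range(num_batches):
        yield [float(i + j) for j in range(batch_size)]

rate = measure_throughput(fake_input_fn(200, 64), batch_size=64)
print(f"{rate:,.0f} samples/s")
```

A single-process number like this only indicates how much work one worker can sustain. The aggregate rate depends on how many workers the report recommends.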

Shard your input data#

While it is necessary to use multiple CPU nodes as workers to stream the input data into the CS system, it is also critical that you configure the input pipeline to properly cycle through these multiple worker processes.

Optimal input data throughput performance can be achieved when you shard the input dataset across the N input worker nodes. This is a recommended option when your dataset is very large or the available local memory on your CPU nodes is limited.

In this approach, you split your dataset into multiple smaller, distinct files and allocate them across the worker nodes so that different workers have different slices of the dataset. Sharding can make managing the dataset easier. Sharding also makes shuffling cheaper, because the shuffle buffer only needs enough memory to shuffle the shards and not the entire dataset.
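One simple way to give each worker a distinct slice is a round-robin assignment of shard files. The following is a minimal sketch under that assumption. The function name `assign_shards` and the file names are hypothetical, and the actual input function would obtain its worker index from the framework in use.

```python
def assign_shards(shard_files, num_workers, worker_id):
    """Return the shard files that belong to one worker (round-robin split)."""
    if not 0 <= worker_id < num_workers:
        raise ValueError("worker_id must be in [0, num_workers)")
    # Sort first so every worker sees the same ordering of files.
    return sorted(shard_files)[worker_id::num_workers]

# Hypothetical shard names, for illustration only.
shards = [f"train-{i:05d}.tfrecords" for i in range(10)]
num_workers = 4
assignment = {w: assign_shards(shards, num_workers, w) for w in range(num_workers)}
for worker, files in assignment.items():
    print(worker, files)
```

Each file lands on exactly one worker, so no sample is read twice within an epoch. Having at least as many shards as workers keeps every worker busy.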

Enhance your input function code#

In addition to sharding your input data, as described above, the input function code can also be optimized for better performance. The compiler analyzes your input function. It provides a detailed log identifying any missing functions and provides recommendations on parameter values to enhance the training performance on the CS system.

Note

Make sure you modify your input function code following the recommendations of the Input Function Report.

Always compile on CPU first#

Before you run anything on the CS system, we recommend that you first compile your model on an Original Cerebras Support-Cluster CPU node. Compiling your model first on an Original Cerebras Support-Cluster CPU node has several benefits:

Validate only#

You can run in validate_only mode, which performs a fast, lightweight verification. See Validate only.

Compile#

After a successful validate_only run, you can run full compilation with compile_only mode. See Compile only.

Using compiled artifacts#

Before running on a CS system, you can run compile_only on an Original Cerebras Support-Cluster CPU node and save the compiled artifacts. You can then reuse these compiled artifacts later when you run this network on the CS system, which saves time in your workflow. To do this, run the compile_only mode with the --cs_ip flag and pass in the IP address of the CS system. This ensures that csrun_cpu optimizes the compile for your specific CS system.

Use dynamic loss scaling#

When you use mixed precision to train a neural network, you can obtain a considerable speedup in the training computation and require less memory bandwidth. However, simply converting data from FP32 to FP16 will result in increased training loss and eventually training divergence. This is because, during the backward pass, many loss gradient values are too small to be represented in FP16, so the FP32-to-FP16 conversion turns them into zeros. To avoid such increased training loss, use dynamic loss scaling.

The CS system supports dynamic loss scaling (DLS) during training. DLS will automatically determine an appropriate loss scale for your network, making it easy for you to enable mixed precision training with just a few lines of code.
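The idea behind loss scaling can be sketched in plain Python. This is a generic illustration of the technique, not the Cerebras DLS API: FP16 is emulated with the standard `struct` module, and `to_fp16`, `DynamicLossScaler`, and its parameters are hypothetical names with assumed default values.

```python
import math
import struct

def to_fp16(x):
    """Round a Python float to FP16; values beyond the FP16 range become infinity."""
    try:
        return struct.unpack("e", struct.pack("e", x))[0]
    except OverflowError:
        return math.copysign(math.inf, x)

class DynamicLossScaler:
    """Halve the scale on overflow; double it after a run of overflow-free steps."""

    def __init__(self, init_scale=2.0 ** 16, growth_interval=2000):
        self.scale = init_scale
        self.growth_interval = growth_interval
        self._good_steps = 0

    def unscale(self, true_grad):
        """Return the recovered gradient, or None if the step must be skipped."""
        scaled = to_fp16(true_grad * self.scale)  # gradient as stored in FP16
        if math.isinf(scaled) or math.isnan(scaled):
            self.scale /= 2.0
            self._good_steps = 0
            return None
        grad = scaled / self.scale  # unscale in higher precision
        self._good_steps += 1
        if self._good_steps >= self.growth_interval:
            self.scale *= 2.0
            self._good_steps = 0
        return grad

tiny = 1e-8
unscaled = to_fp16(tiny)                       # underflows without scaling
recovered = DynamicLossScaler().unscale(tiny)  # survives with scaling

scaler = DynamicLossScaler()
large = None
while large is None:                           # skipped steps shrink the scale
    large = scaler.unscale(10.0)
print(unscaled, recovered, large, scaler.scale)
```

The small gradient is lost when stored directly in FP16 but is recovered when multiplied by the scale first. The large gradient initially overflows, and the scaler lowers the scale until the step succeeds.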

Use `cbfloat16`#

Wherever it is supported, use the cbfloat16 data format. This is CB16 half precision, a Cerebras-specific 16-bit format. With one more exponent bit than FP16, CB16 provides a larger dynamic range, with the following benefits:

Denormals are far less frequent.
Dynamic loss scaling is not necessary on many networks.
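To get a feel for what one extra exponent bit buys, the sketch below compares the normal range of two IEEE-style layouts. It rests on assumptions the text above does not confirm: that cbfloat16 uses 1 sign bit, 6 exponent bits, and 9 mantissa bits, with an IEEE-style bias and a reserved top exponent. The actual Cerebras encoding may differ, and `ieee_style_range` is a hypothetical helper.

```python
def ieee_style_range(exp_bits, man_bits):
    """Smallest normal and largest finite value of an IEEE-style binary format."""
    bias = 2 ** (exp_bits - 1) - 1
    max_finite = (2 - 2.0 ** -man_bits) * 2.0 ** bias
    min_normal = 2.0 ** (1 - bias)
    return min_normal, max_finite

fp16_min, fp16_max = ieee_style_range(exp_bits=5, man_bits=10)
# Assumed layout for a 16-bit format with one extra exponent bit.
wide_min, wide_max = ieee_style_range(exp_bits=6, man_bits=9)

print(f"FP16:           {fp16_min:.3e} .. {fp16_max:.3e}")
print(f"6-bit exponent: {wide_min:.3e} .. {wide_max:.3e}")
```

Under these assumptions the smallest normal value drops from 2^-14 to 2^-30, which is consistent with denormals being far less frequent and with small gradients surviving without loss scaling, at the cost of one bit of mantissa precision.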

Recommended models for `cbfloat16`#

BERT for better throughput.

Monitor kernel performance metrics#

When you compile your model with csrun_cpu, for example:

csrun_cpu python run.py --mode=compile_only

the compiler writes performance estimates for your neural network to a text file, compile_report.txt, in the directory where you compiled. These estimates are generated immediately after the network is compiled. They give you valuable insight into how your network might perform on the CS system, without actually running it on the CS system.

See more at Compile Report.
