Performance Optimization Practices#
This document describes optimization practices for obtaining the best performance from the Cerebras system.
Cerebras ML workflow#
When you target the Cerebras system for your neural network jobs, your ML workflow consists of the following steps, in this order:
Port your code to CS.
Prepare input data.
Compile on CPU.
Run on the CS system.
Precision optimization level#
We introduce a new setting to control the level of numerical precision used for training runs of large NLP models in weight streaming (WS). This setting is controlled via the precision_opt_level flag in the .yaml configurations for the models that we ship in this release.
The flag reflects the tradeoff between model convergence and performance for long training runs, depending on the numerical setting used. For this release, precision_opt_level: 0 refers to convergent numerics and uses single-precision (FP32) reductions and accumulations to ensure model convergence, whereas precision_opt_level: 1 is our high-performance setting and uses half-precision (FP16) reductions and accumulations.
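To make the tradeoff concrete, the following is a minimal sketch of why the precision used for accumulations matters. It is not how the CS system implements reductions; it only emulates FP16 and FP32 rounding with Python's standard `struct` module, and the helper names `round_to` and `accumulate` are made up for this illustration.

```python
import struct


def round_to(fmt, x):
    """Round a Python float to a struct format's precision ('e' = FP16, 'f' = FP32)."""
    return struct.unpack(fmt, struct.pack(fmt, x))[0]


def accumulate(values, fmt):
    """Sum values, rounding the running total to the given precision after every add."""
    total = 0.0
    for v in values:
        total = round_to(fmt, total + round_to(fmt, v))
    return total


values = [0.001] * 10_000  # the exact sum is 10.0

fp32_sum = accumulate(values, "f")
fp16_sum = accumulate(values, "e")

print(f"FP32 accumulation: {fp32_sum:.4f}")
print(f"FP16 accumulation: {fp16_sum:.4f}")
```

In this toy case the FP16 running total stops growing once each addend is smaller than half the spacing between adjacent FP16 values, so it ends far from the exact sum, while the FP32 total stays close to it. Long training runs perform many such accumulations, which is the kind of effect the convergent setting guards against.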
By default, our TensorFlow WS models have precision_opt_level: 0 specified in the .yaml files and our PyTorch models have
Prepare input data#
Because the CS system can train a neural network at very high speed, your input pipeline must be fast enough to keep it fed with input data. Low input data throughput becomes a bottleneck that prevents high CS system utilization. The following best practices help optimize input pipeline throughput.
Determine optimal input workers#
Achieving high input data throughput often requires running your input pipeline on multiple worker processes across many CPU nodes at once.
Use the cs_input_analyzer script. This script generates a report containing the recommended values for the following Slurm environment variables. These variables determine the number of CPU nodes to use in the cluster, the number of tasks to run per node, and the number of CPUs in a node assigned to each task.
nodes: $DEF_NODES
tasks_per_node: $DEF_TASKS
cpus_per_task: $DEF_CPUS
For optimal input data throughput, use these recommended values with the csrun_wse script when you train your model on the CS system.
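As a rough illustration of how the three values relate, the sketch below computes the total number of input worker tasks and the total number of CPUs they reserve. The `InputWorkerPlan` class and the numbers passed to it are hypothetical; the real values come from the cs_input_analyzer report.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InputWorkerPlan:
    """Hypothetical container for the three recommended Slurm values."""

    nodes: int           # number of CPU nodes used in the cluster
    tasks_per_node: int  # input worker tasks started on each node
    cpus_per_task: int   # CPUs in a node assigned to each task

    def __post_init__(self):
        for name in ("nodes", "tasks_per_node", "cpus_per_task"):
            if getattr(self, name) < 1:
                raise ValueError(f"{name} must be a positive integer")

    @property
    def total_tasks(self):
        """Total number of input worker tasks across the cluster."""
        return self.nodes * self.tasks_per_node

    @property
    def total_cpus(self):
        """Total number of CPUs reserved by all tasks."""
        return self.total_tasks * self.cpus_per_task


# Made-up example values, not a recommendation.
plan = InputWorkerPlan(nodes=4, tasks_per_node=8, cpus_per_task=2)
print(f"{plan.total_tasks} worker tasks using {plan.total_cpus} CPUs in total")
```

A quick check like this can help confirm that a recommended configuration fits within the CPU nodes available to you before you submit a job.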
Enhance your input function code#
In addition to sharding your input data, as described above, the input function code can also be optimized for better performance. The compiler analyzes your input function. It provides a detailed log identifying any missing functions and provides recommendations on parameter values to enhance the training performance on the CS system.
Modify your input function code to follow the recommendations in the Input Function Report.
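The sharding that this section refers to can be pictured as each input worker reading a disjoint subset of the input files. The following is a generic sketch of that idea in plain Python, not the Cerebras input API; the function `shard_files` and the file names are hypothetical.

```python
def shard_files(files, num_workers, worker_id):
    """Return the subset of files that one worker should read.

    Files are sorted first so that every worker sees the same order,
    then dealt out round-robin so the shards are disjoint and balanced.
    """
    if num_workers < 1:
        raise ValueError("num_workers must be at least 1")
    if not 0 <= worker_id < num_workers:
        raise ValueError("worker_id must be in the range [0, num_workers)")
    return sorted(files)[worker_id::num_workers]


# Hypothetical file names, for illustration only.
files = [f"train-{i:03d}.tfrecords" for i in range(10)]
shards = [shard_files(files, num_workers=4, worker_id=w) for w in range(4)]

for worker_id, shard in enumerate(shards):
    print(worker_id, shard)
```

Because no file appears in more than one shard, the workers never duplicate reads, and together they cover the whole dataset.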
Always compile on CPU first#
Before you run anything on the CS system, we recommend that you first compile your model on an Original Cerebras Support-Cluster CPU node. Doing so has several benefits:
You can run in validate_only mode, which performs a fast, lightweight verification. See Validate only.
After a successful validate_only run, you can run a full compilation in compile_only mode. See Compile only.
Using compiled artifacts#
Before running on a CS system, you can run compile_only on an Original Cerebras Support-Cluster CPU node and save the compiled artifacts. You can then reuse these artifacts when you run the network on the CS system, which saves time in your workflow. To do this, run compile_only mode with the --cs_ip flag, passing in the IP address of the CS system. This ensures that csrun_cpu optimizes the compile for your specific CS system.
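The CPU-first sequence described in the last two sections can be sketched as a small helper that builds the command lines in order. Only `csrun_cpu python run.py --mode=compile_only` and the `--cs_ip` flag appear in this document; writing validate_only as `--mode=validate_only` is an assumption made by analogy, the helper `csrun_cpu_command` is hypothetical, and the IP address is a placeholder from the documentation address range.

```python
import shlex


def csrun_cpu_command(mode, cs_ip=None, script="run.py"):
    """Build (but do not run) a csrun_cpu command line as a list of arguments.

    The --mode=<mode> spelling for validate_only is an assumption; check the
    Validate only and Compile only pages for the exact form in your release.
    """
    cmd = ["csrun_cpu", "python", script, f"--mode={mode}"]
    if cs_ip is not None:
        # Pass the CS system's IP address so the compile targets that system.
        cmd += ["--cs_ip", cs_ip]
    return cmd


workflow = [
    csrun_cpu_command("validate_only"),                     # fast verification first
    csrun_cpu_command("compile_only"),                      # then a full compile
    csrun_cpu_command("compile_only", cs_ip="192.0.2.10"),  # placeholder IP address
]

for cmd in workflow:
    print(shlex.join(cmd))
```

The helper only prints the commands; running them in this order on a CPU node means that problems surface before any time is spent on the CS system.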
Use dynamic loss scaling#
When you use mixed precision to train a neural network, you can obtain a considerable speedup in the training computation and require less memory bandwidth. However, simply converting data from FP32 to FP16 will result in increased training loss and eventually training divergence. This is because, during the backward pass, most loss gradient values are too small to represent in FP16 and become 0 when converted from FP32. To avoid such increased training loss, use dynamic loss scaling.
The CS system supports dynamic loss scaling (DLS) during training. DLS will automatically determine an appropriate loss scale for your network, making it easy for you to enable mixed precision training with just a few lines of code.
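The following is a minimal sketch of the general idea behind dynamic loss scaling, written in plain Python. It is a generic version of the technique, not the Cerebras implementation or API: the class `DynamicLossScaler`, its parameters, and the default values are all assumptions chosen for illustration. FP16 rounding is emulated with the standard `struct` module.

```python
import math
import struct


def to_fp16(x):
    """Round a Python float to FP16; values beyond the FP16 range become inf."""
    try:
        return struct.unpack("e", struct.pack("e", x))[0]
    except OverflowError:
        return math.copysign(math.inf, x)


class DynamicLossScaler:
    """Generic dynamic loss scaling: back off on overflow, grow after a stable run."""

    def __init__(self, init_scale=2.0 ** 15, growth_interval=2000, factor=2.0):
        self.scale = init_scale
        self.growth_interval = growth_interval
        self.factor = factor
        self._good_steps = 0

    def update(self, scaled_grads):
        """Return True if the optimizer step can be applied, adjusting the scale."""
        if any(not math.isfinite(g) for g in scaled_grads):
            # Overflow: skip this step and try a smaller scale next time.
            self.scale /= self.factor
            self._good_steps = 0
            return False
        self._good_steps += 1
        if self._good_steps >= self.growth_interval:
            # A long run without overflow: try a larger scale.
            self.scale *= self.factor
            self._good_steps = 0
        return True


tiny_grad = 1e-8
scaler = DynamicLossScaler()

unscaled_fp16 = to_fp16(tiny_grad)               # too small for FP16
scaled_fp16 = to_fp16(tiny_grad * scaler.scale)  # scaled into the FP16 range

print(unscaled_fp16, scaled_fp16)
```

Multiplying the loss by the scale before the backward pass shifts small gradients into the representable FP16 range; the gradients are divided by the same scale before the weight update. The scale adapts automatically, which is why no hand-tuned constant is needed.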
Wherever it is supported, use the cbfloat16 data format. This is CB16 half-precision, a Cerebras-specific 16-bit format. With one more exponent bit than FP16, CB16 provides a larger range, with the following benefits:
Denormals are far less frequent.
Dynamic loss scaling is not necessary on many networks.
Recommended models for
BERT for better throughput.
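To get a feel for what one extra exponent bit buys, the sketch below compares the range of normal values for FP16 with that of a 16-bit layout that has one more exponent bit. It assumes an IEEE-style encoding (one sign bit, the usual exponent bias, and one mantissa bit given up for the extra exponent bit); the exact cbfloat16 encoding is not specified in this document, so treat the second set of numbers as an approximation. The function `normal_range` is a hypothetical helper.

```python
def normal_range(exponent_bits, mantissa_bits):
    """Smallest and largest normal values of an IEEE-style binary format."""
    bias = 2 ** (exponent_bits - 1) - 1
    min_normal = 2.0 ** (1 - bias)
    max_normal = (2 - 2.0 ** -mantissa_bits) * 2.0 ** bias
    return min_normal, max_normal


# FP16: 1 sign bit, 5 exponent bits, 10 mantissa bits.
fp16_min, fp16_max = normal_range(exponent_bits=5, mantissa_bits=10)

# Assumed layout with one more exponent bit: 1 sign, 6 exponent, 9 mantissa.
wide_min, wide_max = normal_range(exponent_bits=6, mantissa_bits=9)

print(f"FP16 normal range:            {fp16_min:.3e} .. {fp16_max:.3e}")
print(f"6-bit-exponent normal range:  {wide_min:.3e} .. {wide_max:.3e}")
```

Under these assumptions the smallest normal value drops by many orders of magnitude, which is consistent with denormals being far less frequent and with loss scaling being unnecessary on many networks.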
Monitor kernel performance metrics#
When you compile your model with csrun_cpu, for example:
csrun_cpu python run.py --mode=compile_only
the compiler writes estimated performance for your neural network to a text file, compile_report.txt, in the directory where you compiled. These estimates are generated immediately after the network is compiled. They give you valuable insight into how your network might perform on the CS system, without actually running it there.
See more at Compile Report.