Control numerical precision level#

Overview#

The Cerebras cluster currently supports a limited set of data types for model training. Specifically, models can be trained with float16, cbfloat16, or bfloat16 input data types, while arithmetic operations use float32 precision. This mixed precision approach (float16/cbfloat16/bfloat16 inputs with float32 operations) makes training more computationally efficient, especially when combined with dynamic loss scaling. When bfloat16 is the selected input data type, dynamic loss scaling is unnecessary. In summary, to maximize the performance of models trained on the Cerebras cluster, enable float16, cbfloat16, or bfloat16 inputs alongside float32 arithmetic operations.

Users can configure the numerical precision used when training large natural language processing (NLP) models on this system. Specifically, parameters exist to set the floating-point precision used for model weights, activations, and gradients during optimization. These settings are collectively referred to as the Precision and Optimization Level (POL).

Setting the numerical precision#

You can set the precision_opt_level by following instructions on this page: Precision Optimization Level (POL).

The precision_opt_level flag allows users to configure the numerical precision of operations during model training to achieve different precision and performance tradeoffs.

Setting precision_opt_level to 0 uses single-precision (float32) accumulations to ensure higher precision.
A precision_opt_level of 1 represents an optimal tradeoff between precision and performance. It employs a combination of float32 and bfloat16/cbfloat16/float16 reductions in matrix multiplication and attention layers to maximize performance while maintaining adequate precision to ensure model convergence.
A precision_opt_level of 2 is a high-performance setting. It combines float32 and bfloat16/float16 reductions in attention and matrix multiplication kernels with bfloat16/float16 softmax in attention, delivering the best performance.

The default configuration for large language models in the Cerebras Model Zoo is precision_opt_level: 1 with cbfloat16, which ensures convergence with high performance. For models whose input activations are in 16-bit, the default is to use bfloat16.

Note

A large number of Cerebras models have been trained using “precision_opt_level: 1”. This setting is recommended to obtain the best training performance while retaining good convergence behavior. If significant numeric issues arise during training, consider the more precise “precision_opt_level: 0” setting.

Additionally, “precision_opt_level: 2” is primarily used as an internal tool by Cerebras to evaluate the future performance of its hardware.

Supported precisions on Cerebras systems#

The CS system offers support for the following data formats:

32-bit floating-point format: IEEE single-precision, commonly known as FP32.
16-bit floating-point format: IEEE half-precision, commonly known as FP16.
Cerebras 16-bit floating-point format: specific to the Cerebras Wafer-Scale Engine, commonly known as cbfloat16.
Custom 16-bit floating-point format: bfloat16, which has eight exponent bits and is specifically designed for deep-learning applications.

Note that 16-bit arithmetic in the CS system uses 16-bit words and is always aligned to a 16-bit boundary. On the other hand, single-precision arithmetic uses even-aligned register pairs for register operands and requires 32-bit aligned addresses for memory operands.

Note

In the CS system, memory is 16-bit word addressable. It is not byte addressable, so you cannot directly access individual bytes within a word.

FP32 Single-Precision#

FP32, as used in the CS system, is equivalent to IEEE binary32, also known as the single-precision floating-point format. In this format, 8 bits are allocated for the exponent and 23 bits for the explicit mantissa.

Sign: 1

Exponent: 8

Mantissa: 23

FP16#

The FP16 implementation in the CS system adheres to the IEEE standard for binary16, commonly known as half-precision floating-point format. In this format, there are 5 bits reserved for the exponent and 10 bits for the explicit mantissa.

Sign: 1

Exponent: 5

Mantissa: 10

cbfloat16#

cbfloat16 is a custom 16-bit floating-point format available on the Cerebras Wafer-Scale Engine that further optimizes the tradeoff between range and precision for neural network training. It uses 6 exponent bits and 9 mantissa (fraction) bits. With one more exponent bit than FP16, it has double the exponent range of that format, and with two more mantissa bits than BF16, it has significantly more precision than that format.

Sign: 1

Exponent: 6

Mantissa: 9

Note

There is currently no support for cbfloat16 in Python or PyTorch. For this reason, we use float16 as a “proxy-type” in these higher-level representations, which is then converted to cbfloat16 as part of the compilation process. Because float16 is more precise than cbfloat16, using it as the proxy-type avoids any loss of precision, although some of the cbfloat16 range cannot be represented. Extensive testing has shown that this is not an issue.

bfloat16#

The bfloat16 type is a custom 16-bit floating-point format for deep learning that consists of a sign bit, 8 exponent bits, and 7 mantissa bits. This differs from the industry-standard IEEE 16-bit floating-point format, which was not designed with deep learning applications in mind. bfloat16 has a range similar to that of FP32, but with much lower precision because only 7 mantissa bits fit in the 16-bit format.

Sign: 1

Exponent: 8

Mantissa: 7
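The bit layouts listed above can be compared side by side with a short sketch. The `FloatFormat` class and its properties are hypothetical names used only for illustration. The `max_finite` calculation assumes an IEEE-style exponent bias with the top exponent code reserved for infinities and NaNs; this holds for FP32 and FP16 and is the usual convention for bfloat16, but it is only an assumption for cbfloat16, whose exact encoding is not described here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FloatFormat:
    """Bit layout of a floating-point format (sign bit is always 1)."""
    name: str
    exponent_bits: int
    mantissa_bits: int

    @property
    def total_bits(self) -> int:
        return 1 + self.exponent_bits + self.mantissa_bits

    @property
    def epsilon(self) -> float:
        """Gap between 1.0 and the next representable value."""
        return 2.0 ** -self.mantissa_bits

    @property
    def max_finite(self) -> float:
        """Largest finite value, ASSUMING an IEEE-style bias and a
        reserved top exponent code (an assumption for cbfloat16)."""
        max_exponent = 2 ** (self.exponent_bits - 1) - 1
        return (2.0 - self.epsilon) * 2.0 ** max_exponent


# Exponent and mantissa widths as listed in the sections above.
FORMATS = [
    FloatFormat("FP32", 8, 23),
    FloatFormat("FP16", 5, 10),
    FloatFormat("cbfloat16", 6, 9),
    FloatFormat("bfloat16", 8, 7),
]

for fmt in FORMATS:
    print(f"{fmt.name:<10} bits={fmt.total_bits:<3} "
          f"epsilon={fmt.epsilon:.3e} max_finite~{fmt.max_finite:.3e}")
```

Under these assumptions, the output makes the tradeoff visible: more exponent bits widen the range, while more mantissa bits shrink the gap between adjacent values.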

Automatic Mixed Precision#

Automatic mixed precision is a mode that enables the training of deep learning models using a combination of single-precision floating-point (float32) and half-precision floating-point formats, such as FP16, cbfloat16, or bfloat16.

The primary advantage of mixed precision mode is performance. It is an optimization technique that allows for faster network training without sacrificing quality. This efficiency stems from the fact that some layers within neural networks, such as convolutional or linear layers, can be computed with lower precision. These layers have been shown to be significantly faster when executed with FP16, cbfloat16, or bfloat16. However, certain operations, like reductions, often require a higher precision level to maintain the same quality of results.

This trade-off between casting certain operations to half-precision and maintaining others in single precision is part of the “automatic mixed precision algorithm.” In essence, this algorithm assesses the network’s performance in its default precision, then strategically introduces castings to execute the same network with mixed precision settings to optimize performance without compromising accuracy.
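To see why reductions are often kept in single precision, the following sketch sums the same 16-bit inputs twice: once with a 16-bit running total and once with a 32-bit running total. It uses Python's standard `struct` module to round values to IEEE float16 and float32. The helper names are illustrative, and the sequential loop is a simplification rather than a description of how any Cerebras kernel performs reductions.

```python
import struct


def to_float16(x: float) -> float:
    """Round a Python float to the nearest IEEE binary16 value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def to_float32(x: float) -> float:
    """Round a Python float to the nearest IEEE binary32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def accumulate(values, round_fn):
    """Sum values one by one, rounding the running total after each add."""
    total = 0.0
    for v in values:
        total = round_fn(total + v)
    return total


# 10,000 inputs stored in float16; their exact sum is close to 100.
inputs = [to_float16(0.01)] * 10_000

sum_fp16 = accumulate(inputs, to_float16)  # 16-bit accumulator
sum_fp32 = accumulate(inputs, to_float32)  # 32-bit accumulator

print(f"float16 accumulator: {sum_fp16}")
print(f"float32 accumulator: {sum_fp32}")
```

The float16 total stops growing once each addend becomes smaller than half the gap between adjacent float16 values near the running total, while the float32 total stays close to the true sum.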

It’s important to note that mixed precision doesn’t mandate the use of a specific half-precision floating point format. However, there are tradeoffs to be considered in this choice and we discuss this below.

Automatic Mixed Precision and cbfloat16#

The use of the cbfloat16 half-precision format further improves performance over bfloat16 as the additional precision allows it to be used for certain types of reductions.

Performance: cbfloat16 is approximately 19% faster than bfloat16.

Weight Growth: cbfloat16 shows similar weight growth as bfloat16.

Evaluation Scores: cbfloat16 demonstrates similar evaluation scores as bfloat16.

The cbfloat16 half-precision format provides benefits similar to those of bfloat16 with higher performance. However, because its range is more restricted than that of bfloat16, it is recommended that loss scaling be enabled to ensure that the format captures the numeric range of interest on the backward pass.
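The effect of loss scaling on a narrow-range format can be sketched as follows. Because cbfloat16 is not available in Python, this example uses IEEE float16 as a stand-in; the gradient value and the scale factor of 2^16 are illustrative assumptions, not Cerebras defaults.

```python
import struct


def to_float16(x: float) -> float:
    """Round a Python float to the nearest IEEE binary16 value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


LOSS_SCALE = 2.0 ** 16  # illustrative scale factor (assumption)
true_grad = 1e-8        # a small backward-pass value (assumption)

# Without scaling, the value is below the float16 range and becomes zero.
unscaled = to_float16(true_grad)

# With scaling, the value is multiplied into the representable range
# before the cast, then divided back out in higher precision.
scaled = to_float16(true_grad * LOSS_SCALE)
recovered = scaled / LOSS_SCALE

print(f"cast without scaling: {unscaled}")
print(f"cast with scaling, then unscaled: {recovered:.6e}")
```

Dynamic loss scaling adjusts the scale factor during training rather than fixing it up front; the fixed factor here only demonstrates the underflow that scaling is meant to prevent.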

Automatic Mixed Precision and bfloat16#

We compared bfloat16 and FP16 modes across a variety of deep learning networks. The experiments showed that bfloat16 offers several advantages:

Performance: bfloat16 is approximately 18% faster than FP16.

Weight Growth: bfloat16 is significantly less prone to weight growth.

Evaluation Scores: bfloat16 demonstrates improved evaluation scores.

These findings underscore the benefits of choosing bfloat16 over pure float32 or mixed precision with float16. bfloat16 improves training efficiency, conserves memory, and preserves the same level of accuracy. This is primarily because deep learning models are generally more sensitive to changes in the exponent than to changes in the mantissa of floating-point numbers.

Furthermore, training with bfloat16 is more robust and less susceptible to underflows, overflows, or other numerical instabilities, much like training with pure float32. This stability comes from the exponent size of bfloat16 matching that of float32, which gives bfloat16 the range of float32 at the storage cost of a 16-bit format.
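The range and precision differences between float16 and bfloat16 can be illustrated with a small emulation. In this sketch, float16 rounding uses the standard `struct` module, and bfloat16 is emulated by keeping the top 16 bits of the float32 encoding with round-to-nearest-even. The helper names and sample values are illustrative assumptions, and NaN inputs are not handled.

```python
import math
import struct


def to_float16(x: float) -> float:
    """Round to IEEE binary16; returns inf if the value is out of range."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        return math.copysign(math.inf, x)


def to_bfloat16(x: float) -> float:
    """Emulate bfloat16 by rounding the float32 bit pattern to 16 bits.

    Uses round-to-nearest-even on the discarded low 16 bits.
    Intended for finite inputs only (NaN is not handled).
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits += 0x7FFF + ((bits >> 16) & 1)
    bits &= 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]


samples = {
    "tiny (1e-8)": 1e-8,        # below the float16 range
    "large (70000)": 70000.0,   # above the float16 range
    "fine (1.001)": 1.001,      # needs more than 7 mantissa bits
}

for label, value in samples.items():
    print(f"{label:<14} float16={to_float16(value)!r:<22} "
          f"bfloat16={to_bfloat16(value)!r}")
```

The tiny and large samples survive in bfloat16 but not in float16, while the value near 1.0 keeps more detail in float16. This matches the observation that bfloat16 trades mantissa bits for the exponent range of float32.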

How to Enable FP16, cbfloat16, and bfloat16#

To enable FP16, cbfloat16, or bfloat16 in the mixed precision mode, follow the instructions on this page: Automatic Mixed Precision

When using fp16_type: bfloat16, loss scaling is not necessary. The format has the same range as single precision, so it behaves identically to single precision with respect to underflows, overflows, and other numeric instabilities during training. However, when using fp16_type: float16 or fp16_type: cbfloat16, loss scaling is recommended to ensure that values in the backward pass are captured correctly within the range of those formats. The Cerebras stack will throw an error if fp16_type: cbfloat16 is used and loss scaling is not enabled.
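The rules in this section can be summarized as a small pre-flight check. This is a hypothetical helper, not part of the Cerebras stack: the function name and the `loss_scaling_enabled` argument are illustrative assumptions, and only the `precision_opt_level` and `fp16_type` names and values come from this page.

```python
from typing import List

VALID_PRECISION_OPT_LEVELS = {0, 1, 2}
VALID_FP16_TYPES = {"float16", "cbfloat16", "bfloat16"}


def check_precision_settings(precision_opt_level: int,
                             fp16_type: str,
                             loss_scaling_enabled: bool) -> List[str]:
    """Return advisory notes; raise ValueError for invalid combinations.

    Hypothetical helper that mirrors the guidance on this page.
    """
    if precision_opt_level not in VALID_PRECISION_OPT_LEVELS:
        raise ValueError(f"unknown precision_opt_level: {precision_opt_level}")
    if fp16_type not in VALID_FP16_TYPES:
        raise ValueError(f"unknown fp16_type: {fp16_type}")
    # The page states that cbfloat16 without loss scaling is an error.
    if fp16_type == "cbfloat16" and not loss_scaling_enabled:
        raise ValueError("fp16_type cbfloat16 requires loss scaling")

    notes = []
    if fp16_type == "float16" and not loss_scaling_enabled:
        notes.append("loss scaling is recommended with float16")
    if fp16_type == "bfloat16" and loss_scaling_enabled:
        notes.append("loss scaling is not necessary with bfloat16")
    if precision_opt_level == 2:
        notes.append("precision_opt_level 2 is primarily an internal setting")
    return notes


# Example: the Model Zoo default described above, with loss scaling on.
print(check_precision_settings(1, "cbfloat16", loss_scaling_enabled=True))
```

Running a check like this before launching a job can surface a rejected combination early, though the authoritative validation is whatever the Cerebras stack itself performs.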

To experiment with networks using this setting, you can refer to the specific references provided for the gpt2, gpt3, and gptj models in the Model Zoo.

Conclusion#

Controlling the numerical precision level during training on the Cerebras Wafer-Scale cluster is a critical aspect of optimizing performance and efficiency. The support for float16, cbfloat16, and bfloat16 data types, in conjunction with float32 arithmetic operations, enables users to leverage mixed precision training effectively. By configuring the precision_opt_level and selecting the appropriate floating-point representation, users can achieve a balance between computational speed and model accuracy. The introduction of cbfloat16 and bfloat16 offers a nuanced choice between precision and performance, with cbfloat16 providing a unique advantage in terms of speed and precision balance. Through careful configuration and understanding of each precision type’s characteristics, users can tailor their training process to suit their specific needs, leading to faster training times without compromising the quality of the model outputs.