Multiple Models#

You can accelerate the training of a model on a Cerebras system using the multi-replica data parallel training feature. Separately, you can also deploy more than one model for inference with the multi-model inference feature. These two features differ in significant ways. This section describes these features and the differences between them.

In multi-replica data parallel training, the Cerebras compiler uses several copies (replicas) of the same model to run data parallel training. This is similar to how multiple GPUs are used to accelerate training of a single model. In the background, the compiler ensures that these replicas are initialized with the same weights, and during training the weights across all replicas are synchronized after every batch. A single trained model is available at the conclusion of multi-replica data parallel training. The multi-replica data parallel feature can be used only for training a model.
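The synchronization scheme described above can be illustrated with a small, self-contained sketch. This is not the Cerebras API; the model, gradient function, and learning rate below are toy assumptions chosen only to show the pattern: replicas start from identical weights, each computes gradients on its own shard of data, and an averaged ("all-reduced") gradient is applied identically everywhere, so the replicas never drift apart.

```python
import numpy as np

rng = np.random.default_rng(0)
num_replicas = 4
# every replica is initialized with the same weights
replicas = [np.zeros(3) for _ in range(num_replicas)]

def gradient(w, batch):
    # toy linear-regression gradient, for illustration only
    x, y = batch
    return 2 * x.T @ (x @ w - y) / len(y)

for step in range(5):
    # each replica sees a different shard of the global batch
    shards = [(rng.normal(size=(8, 3)), rng.normal(size=8))
              for _ in range(num_replicas)]
    grads = [gradient(w, shard) for w, shard in zip(replicas, shards)]
    # average gradients across replicas (the "all-reduce" step)
    avg_grad = np.mean(grads, axis=0)
    for w in replicas:
        w -= 0.01 * avg_grad  # identical update on every replica

# after training, all replicas hold the same weights:
# a single trained model remains
assert all(np.allclose(replicas[0], w) for w in replicas)
```

Because every replica applies the same averaged update, reading the weights from any one replica at the end yields the single trained model.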

In multi-model inference, models that are already trained are loaded onto the CS system to serve predictions. With this feature you can deploy copies of the same model with different weights, copies of the same model with the same weights, or entirely different models with different weights. No weight synchronization is done across the models, because the models are already trained and this feature is used only for inference.
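A minimal sketch of this deployment pattern, again assuming a toy model class and made-up model names rather than the Cerebras API: several already-trained models are held side by side, each keeps its own weights, and no synchronization ever happens between them.

```python
import numpy as np

class LinearModel:
    """Toy stand-in for an already-trained model."""
    def __init__(self, weights):
        self.weights = np.asarray(weights)

    def predict(self, x):
        return float(x @ self.weights)

# same architecture with different weights; these could just as well
# be entirely different model classes
deployed = {
    "model_v1": LinearModel([1.0, -1.0]),
    "model_v2": LinearModel([0.5, 0.5]),
}

x = np.array([2.0, 1.0])
# each deployed model answers independently from its own weights;
# nothing is synchronized across them
predictions = {name: m.predict(x) for name, m in deployed.items()}
```

Each entry in `deployed` is independent: updating or replacing one model's weights has no effect on the others, which is exactly the contrast with the training feature, where weights are forced to stay identical.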

The following table summarizes the differences between these two features:




| Feature | Training or inference | Models | Weights | Weight synchronization |
| --- | --- | --- | --- | --- |
| Multi-replica data parallel training | Training only. | A single model. Compiler manages the replica training. | Same weights across the model replicas. | Compiler synchronizes weights in the replicas after every batch. |
| Multi-model inference | Inference only. | Different or same trained models. | Different weights for each trained model, or same weights. | No weight synchronization is done. |

Multi-model inference#