The Cerebras Wafer-Scale cluster is designed to train neural networks with linear scaling across millions of cores, without the complexities of traditional distributed computing.
Fig. 1 illustrates the key components of the cluster, each of which is described below:
CS-2 Nodes
The CS-2 nodes are the backbone of the cluster. Each rack-mounted CS-2 node houses a Wafer-Scale Engine (WSE), which executes the core training and inference computations of a neural network. The node also handles power management, cooling, and data delivery to the WSE. For more detail, see the WSE-2 datasheet, the virtual tour of the CS-2, and the CS-2 white paper.
MemoryX
MemoryX stores a model’s weights and streams them to the CS-2 systems, so that the weights are available when the systems need them.
SwarmX
SwarmX connects multiple CS-2 nodes into a single Cerebras cluster so that they can train one model together. It broadcasts weights from MemoryX to every CS-2 node in the cluster and, in the opposite direction, reduces (sums) the gradients that the nodes produce.
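The broadcast and reduce pattern described above can be illustrated with a small sketch in plain Python. This is a toy model of the data flow only, not Cerebras software: the function names (`broadcast`, `reduce_sum`, `local_gradient`, `training_step`), the quadratic toy loss, and the learning-rate update are all assumptions made for illustration.

```python
from typing import List

Vector = List[float]


def broadcast(weights: Vector, num_nodes: int) -> List[Vector]:
    """Give every node its own copy of the same weights."""
    return [list(weights) for _ in range(num_nodes)]


def reduce_sum(per_node_grads: List[Vector]) -> Vector:
    """Sum gradients element-wise across all nodes."""
    return [sum(values) for values in zip(*per_node_grads)]


def local_gradient(weights: Vector, batch: List[float]) -> Vector:
    """Toy gradient: for loss 0.5 * (w - x)^2 summed over the batch,
    the gradient with respect to each weight w is sum(w - x)."""
    return [sum(w - x for x in batch) for w in weights]


def training_step(weights: Vector, batches: List[List[float]], lr: float) -> Vector:
    """One data-parallel step: broadcast, compute locally, reduce, update."""
    copies = broadcast(weights, len(batches))           # weights flow out
    grads = [local_gradient(w, b) for w, b in zip(copies, batches)]
    total = reduce_sum(grads)                           # gradients flow back
    num_samples = sum(len(b) for b in batches)
    return [w - lr * g / num_samples for w, g in zip(weights, total)]


if __name__ == "__main__":
    # Two hypothetical nodes, each holding a different shard of the data.
    print(training_step([0.0], [[1.0, 2.0], [3.0]], lr=0.5))
```

Because the gradients are summed before the update, the result matches what a single node would compute on the combined data, which is why the nodes can be treated as one logical trainer.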
Input Preprocessing Servers
These servers preprocess input data before it is dispatched to the CS-2 systems. This preprocessing step is needed for training, inference, and evaluation.
Management Servers
Management servers orchestrate and schedule the cluster’s resources, coordinating all components in the Cerebras cluster so that they are used efficiently.
After developing your code, you submit it for training or evaluation from a user node. The user node is not part of the cluster; it connects to the Cerebras cluster through the management server, as illustrated in Fig. 1. The management server handles all resource scheduling, so your main task is to specify the number of CS-2 systems you want to allocate for training or evaluation. You do not need to manage cluster allocation yourself.
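To make this division of labor concrete, here is a minimal sketch of a scheduler that accepts only a job name and a requested number of systems. It is a toy model under stated assumptions, not the Cerebras management server or its API: the class `ToyScheduler`, its methods, and the first-in, first-out policy are all hypothetical.

```python
from collections import deque
from typing import Deque, Dict, Tuple


class ToyScheduler:
    """Hypothetical stand-in for a management server with a fixed pool of systems."""

    def __init__(self, total_systems: int) -> None:
        self.total = total_systems
        self.free = total_systems
        self.running: Dict[str, int] = {}
        self.queue: Deque[Tuple[str, int]] = deque()

    def submit(self, job: str, num_systems: int) -> str:
        """The user states only how many systems the job needs."""
        if not 1 <= num_systems <= self.total:
            raise ValueError(f"cannot allocate {num_systems} of {self.total} systems")
        self.queue.append((job, num_systems))
        self._schedule()
        return "running" if job in self.running else "queued"

    def release(self, job: str) -> None:
        """Return a finished job's systems to the pool and start waiting jobs."""
        self.free += self.running.pop(job)
        self._schedule()

    def _schedule(self) -> None:
        # FIFO policy (an assumption): start jobs from the head of the queue
        # for as long as the next job fits in the free pool.
        while self.queue and self.queue[0][1] <= self.free:
            job, count = self.queue.popleft()
            self.free -= count
            self.running[job] = count


if __name__ == "__main__":
    scheduler = ToyScheduler(total_systems=4)
    print(scheduler.submit("train", 3))
    print(scheduler.submit("eval", 2))
    scheduler.release("train")
    print(scheduler.running)
```

The caller never chooses which systems a job runs on; it only states a count, and the scheduler decides when the request can be satisfied.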
For documentation on installing and administering the Cerebras Wafer-Scale cluster, see the Cerebras deployment documentation.