How Cerebras Works

This section presents a big-picture view of how CS systems are deployed and used. See ML Workflow on Cerebras for more details on how a Machine Learning (ML) developer runs ML jobs on CS systems.

Network-attached accelerator

Each CS system is deployed with a supporting CPU cluster. Unlike a conventional GPU, which is usually colocated with a host CPU in the same physical system and connected to it over a PCIe interface, a CS system is connected to a cluster of CPU nodes over a network: it is a network-attached accelerator. Each CS system contains one Wafer-Scale Engine (WSE) processor, and the system powers, cools, and delivers data to the WSE. Each CS system installation comprises one or more CS systems and a supporting CPU cluster.

  • The CS system is responsible only for accelerating the neural network computations, that is, for running the core training and inference computations.

  • All supporting tasks, such as running the TensorFlow or PyTorch framework, compiling the model, preprocessing the input data, and streaming the data to the CS system(s), are executed on the CPU nodes by the Cerebras software running on those nodes.
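
Conceptually, the split looks like the sketch below. The function names are hypothetical stand-ins used for illustration, not the Cerebras runtime API.

    import numpy as np

    def input_worker(images: np.ndarray, batch_size: int = 256):
        # Runs on a CPU node: preprocesses input data and yields batches
        # that are streamed over the network to the CS system.
        for start in range(0, len(images), batch_size):
            batch = images[start:start + batch_size]
            yield (batch / 255.0).astype(np.float16)

    def cs_system(batches):
        # Stand-in for the accelerator: only the core neural network
        # computation happens here; everything else stays on the CPU nodes.
        for batch in batches:
            _ = batch @ np.ones((batch.shape[-1], 10), dtype=np.float16)

    cs_system(input_worker(np.zeros((1024, 784))))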

ML users interact with the CPU nodes: they work with their code on the CPU nodes and launch all training and inference jobs from there.

Important

This documentation describes only how to use the already-installed CS system at your site. It does not present installation and administration instructions for your CS system.

As an ML developer, you use the Cerebras system as follows (a minimal script sketch follows the list):

  • Log in to a CPU node using the login credentials provided by the system administrator.

  • Port your existing TensorFlow or PyTorch ML code to Cerebras.

  • Compile your code using the Cerebras Graph Compiler.

  • Stage your input data on a network file system so it can be streamed at very high speed to the Cerebras accelerator.

  • Run your compiled code on the CS system by launching your job from the CPU node.

  • During runtime, the workers in the CPU cluster stream the input data to the Cerebras accelerator. During execution, training artifacts such as model checkpoints, summaries, and logs are streamed from the network-attached accelerator back to the CPU node, where you can monitor your runs with standard tools such as TensorBoard.
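
The skeleton below ties these steps together in plain PyTorch. It is a minimal sketch only: compile_for_cs is a hypothetical placeholder for the Cerebras Graph Compiler integration, not the CSoft API, and the data, model, and paths are illustrative.

    import torch
    from torch import nn
    from torch.utils.data import DataLoader, TensorDataset
    from torch.utils.tensorboard import SummaryWriter

    def compile_for_cs(model):
        # Hypothetical placeholder: in a real port, the Cerebras Graph
        # Compiler lowers the model to a binary that runs on the WSE.
        return model

    # Illustrative stand-in for input data staged on a network file system.
    data = TensorDataset(torch.randn(1024, 784), torch.randint(0, 10, (1024,)))
    loader = DataLoader(data, batch_size=64)

    model = compile_for_cs(nn.Sequential(nn.Linear(784, 10)))
    opt = torch.optim.SGD(model.parameters(), lr=0.1)
    writer = SummaryWriter("logs")  # monitor with TensorBoard on the CPU node

    for step, (x, y) in enumerate(loader):  # workers stream batches at runtime
        loss = nn.functional.cross_entropy(model(x), y)
        opt.zero_grad()
        loss.backward()
        opt.step()
        writer.add_scalar("loss", loss.item(), step)  # summary artifact
    torch.save(model.state_dict(), "checkpoint.pt")   # checkpoint artifact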

Wafer-Scale Cluster and Original Cerebras Installation

Today Cerebras supports two types of CS system installations.

  • The Cerebras Wafer-Scale Cluster is designed to support large-scale models (up to and well beyond 1 billion parameters) and large-scale inputs. The cluster can contain one or more CS-2 systems, with the ability to distribute jobs across all or a subset of the CS-2 systems in the cluster. The supporting CPU cluster in this installation consists of MemoryX, SwarmX, management, and input worker nodes. The Cerebras Wafer-Scale Cluster runs as a simple appliance: a user submits a job to the appliance, and the appliance manages data preprocessing and streaming, IO, and device orchestration internally. It provides simple, single-node programming via PyTorch and TensorFlow with easy data-parallel distribution. This installation supports both Pipelined execution, for models up to 1 billion parameters, and Weight Streaming execution, for models of 1 billion parameters and above. You can learn more in Cerebras Execution Modes.

  • The Original Cerebras Installation is designed for deployments with a single CS system and supports only models below 1 billion parameters, using the Pipelined execution mode. It consists of a CS system and the Original Cerebras Support Cluster, a CPU cluster whose nodes act as a coordinator and input workers.

The Cerebras Software Platform

The Cerebras Software Platform, CSoft, integrates with TensorFlow and PyTorch, so researchers can effortlessly bring their models to CS systems and clusters. It runs on the CS systems and the supporting CPU cluster. CSoft compiles user code into binaries that execute on the CS system(s) and provides the runtime environment. CSoft natively supports both the Weight Streaming and Pipelined execution modes, as described in Cerebras Execution Modes, and both the Wafer-Scale Cluster and the Original Cerebras Installation. CSoft takes care of optimal resource allocation and workload distribution across the hundreds of thousands of cores of a single CS system and across multiple CS systems in a Wafer-Scale Cluster.

Pipelined and Weight Streaming execution

CSoft supports two execution modes on CS systems. The execution mode refers to how the Cerebras runtime executes your neural network model: what data is loaded into on-chip memory, and when.

Let us use a simplified example of a 3-layer FC-MNIST network to demonstrate our two execution modes.

[Figure: the two execution modes illustrated on a 3-layer FC-MNIST network (cs-exec-mode-3l-fcmnist.png)]
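
For reference, such a network takes only a few lines of plain PyTorch; the hidden-layer sizes below are illustrative assumptions, since the example does not specify them.

    from torch import nn

    # A generic 3-layer fully connected MNIST classifier: 784 input pixels,
    # 10 output classes. Hidden sizes are illustrative.
    fc_mnist = nn.Sequential(
        nn.Linear(784, 256), nn.ReLU(),  # Layer 1
        nn.Linear(256, 128), nn.ReLU(),  # Layer 2
        nn.Linear(128, 10),              # Layer 3
    )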

The two supported execution modes are:

  • Pipelined: In this mode, all the layers of the neural network are loaded into on-chip memory and stay there for the entire duration of the training job, while the training data is streamed through the WSE. This execution mode relies on layer-pipelined parallelization: Layer 1, Layer 2, and Layer 3 execute simultaneously on different subsets of the WSE cores. Select this mode for neural networks below 1 billion parameters, which fit entirely into on-chip memory.

  • Weight Streaming: In this mode, a mini-batch of the training data is loaded into WSE memory, and the model weights are streamed through the WSE one layer at a time, first for the forward pass and then for backpropagation. This layer-by-layer mode is used to run extremely large models. The model weights are stored in MemoryX.
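
The scheduling difference between the two modes can be summarized in a short conceptual sketch. This is illustrative Python only, mimicking what is resident on the WSE versus what is streamed; memory_x.fetch is a hypothetical stand-in for the MemoryX service, not the CSoft runtime.

    def pipelined(layers, batches):
        # All layer weights stay in on-chip memory for the whole job while
        # the training data streams through. On the WSE the layers run
        # simultaneously on different core subsets; this loop is sequential
        # only because plain Python is.
        for batch in batches:
            x = batch
            for layer in layers:
                x = layer(x)

    def weight_streaming(layers, batches, memory_x):
        # One mini-batch at a time resides in WSE memory while weights are
        # streamed in, one layer at a time, forward and then backward.
        for batch in batches:
            x = batch
            for layer in layers:              # forward pass
                x = memory_x.fetch(layer)(x)  # stream this layer's weights in
            for layer in reversed(layers):    # backpropagation
                memory_x.fetch(layer)         # weights stream through again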

The Wafer-Scale Cluster supports both the Pipelined and Weight Streaming execution modes, while the Original Cerebras Installation supports only the Pipelined execution mode.

The execution modes are discussed in more detail in Cerebras Execution Modes.