Weight Streaming Quickstart

Weight streaming (WS) is one of the Cerebras execution modes and is ideal for training extreme-scale models. To learn more about the differences between pipelined mode and WS mode, see Cerebras Execution Modes.

This page covers two topics:

  1. Changing execution mode (only system admins)

  2. Training a model in WS mode (all users)

Section 1: Changing execution mode (only system admins)

Only system administrators can perform this step. If you need the execution mode changed to weight streaming, contact your system administrator.

A system administrator with access to the CS-2 system can log in to it and change the execution mode as follows:

Step 1: Log in to the CS system.

Step 2: Check the current execution mode (execmode) with the following command:

cs> config execmode show

The following message appears:

Configured Execution Mode : PIPELINED

Step 3: Before changing execmode, put the system in the STANDBY state with the following command:

cs> system standby

The following message appears:

This puts the system in standby. Do you want to proceed?

Select yes.

Step 4: Change execmode to weight streaming with the following command:

cs> config execmode setup

Select Weight Streaming. The following message appears:

Selected execution mode configuration: ✔ Weight Streaming

Step 5: Activate the system.

cs> system activate

The system reboots and comes up activated in Weight Streaming mode.
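If you check the mode from a script rather than by eye, for example by capturing the output of config execmode show before and after the change, the status line shown in Step 2 can be parsed. The following is a minimal Python sketch; parse_execmode is a hypothetical helper, and it assumes the CLI output has already been captured as text (how to capture it from the cs> prompt is not covered here). Only the PIPELINED value appears on this page, so the sketch returns whatever follows the colon instead of assuming the exact spelling of the weight streaming value.

```python
import re
from typing import Optional

# Matches the status line printed by `config execmode show`, e.g.
# "Configured Execution Mode : PIPELINED". Other mode spellings are not
# documented here, so the value is returned verbatim (trimmed).
_EXECMODE_RE = re.compile(
    r"^\s*Configured Execution Mode\s*:\s*(.+?)\s*$", re.MULTILINE
)


def parse_execmode(output: str) -> Optional[str]:
    """Hypothetical helper: return the configured mode, or None if absent."""
    match = _EXECMODE_RE.search(output)
    return match.group(1) if match else None
```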

Section 2: Training a model in WS mode (all users)

After the system administrator switches the system to WS mode, follow the steps below to run a training job on a single CS-2.

In CSoft R1.4.0, TensorFlow (TF) implementations of GPT-2, GPT-3XL (1.3B parameters), and GPT-J (6B parameters) are supported in WS mode on a single CS-2 with the existing support cluster, via compatibility mode. A TF reference implementation of GPT-J is available as an example in the Cerebras Reference Implementations repository.

Step 1: Clone Reference Implementations Repository

To clone the Cerebras Reference Implementations repository and change into the GPT-J directory, run the following commands:

git clone https://github.com/Cerebras/cerebras_reference_implementations.git
cd cerebras_reference_implementations/gptj

Step 2: Run the model on a CS system

Use the csrun_wse wrapper script to compile and execute the code on the CS-2 system. See The csrun_wse Script for more information.

csrun_wse --total-nodes 14 --tasks-per-node 8 --cpus-per-task 16 --single-task-nodes 2 --cyclic python-ws run.py --model_dir model_dir --cs_ip <CS IP> --params configs/params_continuous_pretraining.yaml --mode train --max_steps <num_train_steps>

This command trains the GPT-J model for the number of steps given by --max_steps, executing on the CS system at the IP address specified by --cs_ip. Note that weight streaming uses python-ws. Weight streaming execution requires at least two single-task CPU nodes, which is specified with --single-task-nodes 2.
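The command combines two groups of flags: those read by the csrun_wse wrapper (the node and task layout, including --single-task-nodes) and those passed to run.py through python-ws. A minimal Python sketch that assembles the same command as an argument list makes that split explicit and enforces the two-node minimum. build_csrun_wse_cmd is a hypothetical helper, and the resource values are copied from the example command above, not recommended settings.

```python
import shlex
from typing import List


def build_csrun_wse_cmd(
    cs_ip: str, max_steps: int, single_task_nodes: int = 2
) -> List[str]:
    """Hypothetical helper: assemble the GPT-J weight streaming train command."""
    # Weight streaming needs at least two single-task CPU nodes.
    if single_task_nodes < 2:
        raise ValueError("weight streaming requires --single-task-nodes >= 2")
    # Flags consumed by the csrun_wse wrapper (values from the example above).
    launcher = [
        "csrun_wse",
        "--total-nodes", "14",
        "--tasks-per-node", "8",
        "--cpus-per-task", "16",
        "--single-task-nodes", str(single_task_nodes),
        "--cyclic",
    ]
    # Weight streaming runs use python-ws; the rest are run.py flags.
    model = [
        "python-ws", "run.py",
        "--model_dir", "model_dir",
        "--cs_ip", cs_ip,
        "--params", "configs/params_continuous_pretraining.yaml",
        "--mode", "train",
        "--max_steps", str(max_steps),
    ]
    return launcher + model


if __name__ == "__main__":
    # 10.0.0.1 is a placeholder; substitute your CS system's IP address.
    print(shlex.join(build_csrun_wse_cmd("10.0.0.1", 1000)))
```

Keeping the command as a list avoids shell-quoting problems if you later pass it to subprocess.run; shlex.join (Python 3.8+) turns it back into a single copy-pasteable command line.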

When the command executes, you will see output similar to the following:

srun: job ... queued and waiting for resources
srun: job ... has been allocated resources
INFO:tensorflow:Checkpoints and summaries will be saved in: model_dir
INFO:tensorflow:Running the TF Client
INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Completed weight initialization on CPU in: ... seconds
INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Loading CPU pre-initialized weights took ... seconds
INFO:tensorflow:Saving checkpoint at global step 0
...: I tensorflow/compiler/jit/xla_compilation_cache.cc:241] Compiled cluster using XLA!  This line is logged at most once for the lifetime of the process.
INFO:tensorflow:global_step = 1, loss = ...
INFO:tensorflow:global_step = 2, loss = ...
INFO:tensorflow:global_step = 10, loss = ...
INFO:tensorflow:Saving checkpoint at global step ..
INFO:tensorflow:Training finished with ... samples in ... seconds, ... samples/second.
INFO:tensorflow:Loss for final step: ...
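Each training step is logged as a global_step = N, loss = X line. If you capture this output (for example, by redirecting it to a file), a minimal Python sketch like the following extracts the loss curve. parse_losses is a hypothetical helper and assumes the loss is printed as a plain decimal or scientific-notation number; lines that do not match, such as the elided ... values in the sample output, are skipped.

```python
import re
from typing import Iterable, List, Tuple

# Matches lines such as "INFO:tensorflow:global_step = 10, loss = 2.71".
_STEP_RE = re.compile(
    r"global_step = (\d+), loss = ([-+]?\d*\.?\d+(?:[eE][-+]?\d+)?)"
)


def parse_losses(lines: Iterable[str]) -> List[Tuple[int, float]]:
    """Hypothetical helper: return (global_step, loss) pairs from the output."""
    points = []
    for line in lines:
        match = _STEP_RE.search(line)
        if match:
            points.append((int(match.group(1)), float(match.group(2))))
    return points

# Usage (train.log is a hypothetical file holding the captured output):
#   with open("train.log") as f:
#       losses = parse_losses(f)
```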


Compiling large-scale models in WS mode typically takes a long time and can exceed an hour. Cerebras is actively working to reduce compile time.


Offline compilation is not supported in WS mode in CSoft R1.4: the --validation-only and --compile-only flags are not supported, and the model is recompiled on every execution.

Output files and artifacts

The output files and artifacts include:

  • Model directory (model_dir) - Contains all of the results and artifacts of the latest run, including:
    • Compile directory (tfcs_<checksum>)

    • performance.json file

    • Checkpoints

    • Tensorboard event files

    • yaml files

Model directory and its structure

The model directory (model_dir) contains all of the results and artifacts of the latest run. Its contents are described below.

Compile directory (tfcs_<checksum>)

Artifacts produced during and after compilation are stored in the <model_dir>/tfcs_<checksum> directory. The compilation logs and intermediate outputs there are useful for debugging compilation issues.
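Because the model is recompiled on every execution in WS mode, model_dir may hold more than one tfcs_<checksum> directory after several runs. A minimal Python sketch (find_compile_dirs is a hypothetical helper) lists them with the most recently modified first, which is usually the one to inspect when debugging the latest compile. Ordering by modification time is an assumption of the sketch; the checksum in the name says nothing about when the directory was produced.

```python
from pathlib import Path
from typing import List


def find_compile_dirs(model_dir: str) -> List[Path]:
    """Hypothetical helper: tfcs_<checksum> directories, newest first."""
    # Keep only directories; stray files that happen to match are ignored.
    dirs = [p for p in Path(model_dir).glob("tfcs_*") if p.is_dir()]
    # Most recently modified compile directory first.
    return sorted(dirs, key=lambda p: p.stat().st_mtime, reverse=True)
```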

performance.json file and its parameters

The performance subdirectory contains the performance.json file, at <model_dir>/performance/performance.json. It reports the following fields:

  • compile_time - The amount of time that it took to compile the model to generate the Cerebras executable.

  • est_samples_per_sec - The estimated performance in samples per second, as estimated during compilation. Note that this number is theoretical; actual performance may vary.

  • programming_time - The time taken to prepare the system and load it with the compiled model.

  • samples_per_sec - The actual performance of your run, in samples per second.

  • suspected_input_bottleneck - This is a beta feature. It indicates whether you are input-starved and need more input workers to feed the Cerebras system.

  • total_samples - The total number of samples processed during the execution.

  • total_time - The total time taken to process total_samples.
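These fields make it easy to compare the measured throughput with the compile-time estimate. The minimal Python sketch below loads the file and reports samples_per_sec as a fraction of est_samples_per_sec. load_performance and summarize are hypothetical helpers; the sketch assumes performance.json is a flat JSON object with numeric values under the field names listed above, and it prints compile_time, programming_time, and total_time as stored because their units are not stated here.

```python
import json
from pathlib import Path
from typing import Any, Dict


def load_performance(model_dir: str) -> Dict[str, Any]:
    """Hypothetical helper: read <model_dir>/performance/performance.json."""
    path = Path(model_dir) / "performance" / "performance.json"
    with path.open() as f:
        return json.load(f)


def summarize(perf: Dict[str, Any]) -> str:
    """Hypothetical helper: short report of actual vs. estimated throughput."""
    actual = perf["samples_per_sec"]
    estimate = perf["est_samples_per_sec"]
    # Avoid dividing by a zero estimate.
    ratio = f"{actual / estimate:.0%}" if estimate else "n/a"
    return "\n".join([
        f"compile_time:     {perf.get('compile_time')}",
        f"programming_time: {perf.get('programming_time')}",
        f"samples_per_sec:  {actual:.2f} ({ratio} of the {estimate:.2f} estimate)",
        f"total_samples:    {perf.get('total_samples')} in {perf.get('total_time')}",
        f"suspected_input_bottleneck (beta): {perf.get('suspected_input_bottleneck')}",
    ])
```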


Checkpoints

Checkpoints are stored in <model_dir>/model-ckpt*.
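A TensorFlow checkpoint is normally written as several files that share one prefix (an .index file, one or more .data-NNNNN-of-NNNNN shards, and often a .meta graph file), so the model-ckpt* pattern can match several files per checkpoint. The minimal Python sketch below groups the matching files by prefix; group_checkpoints is a hypothetical helper, and the assumption that these checkpoints follow the standard TF multi-file layout should be checked against your own model_dir. If TensorFlow is installed, tf.train.latest_checkpoint(model_dir) returns the prefix of the most recent checkpoint recorded in TF's checkpoint state file.

```python
import re
from pathlib import Path
from typing import Dict, List

# Assumed TF checkpoint suffixes: <prefix>.index, <prefix>.meta and
# <prefix>.data-00000-of-00001 style shards.
_SUFFIX_RE = re.compile(r"\.(index|meta|data-\d{5}-of-\d{5})$")


def group_checkpoints(model_dir: str) -> Dict[str, List[str]]:
    """Hypothetical helper: group <model_dir>/model-ckpt* files by prefix."""
    groups: Dict[str, List[str]] = {}
    for path in sorted(Path(model_dir).glob("model-ckpt*")):
        # Strip the known suffix to recover the shared checkpoint prefix.
        prefix = _SUFFIX_RE.sub("", path.name)
        groups.setdefault(prefix, []).append(path.name)
    return groups
```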

Tensorboard event files

TensorBoard event files are stored in the <model_dir> directory.
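To view these files, point TensorBoard at the model directory (tensorboard --logdir <model_dir>). Before doing so, you can confirm that summaries were written with a minimal Python sketch such as the one below. find_event_files is a hypothetical helper; it relies on TensorFlow's standard events.out.tfevents.* file naming and also searches subdirectories, in case summaries for different modes end up in separate folders.

```python
from pathlib import Path
from typing import List


def find_event_files(model_dir: str) -> List[Path]:
    """Hypothetical helper: TensorBoard event files under model_dir."""
    # TF names summary files "events.out.tfevents.<timestamp>.<hostname>...";
    # rglob also descends into subdirectories.
    return sorted(Path(model_dir).rglob("events.out.tfevents.*"))
```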

Contents of the yaml file after the run

The yaml file is stored in the train directory. It records the specifics of the run, including the model-specific configuration (e.g., dropout, activation_fn); the optimizer type and optimizer parameters; the input data configuration (e.g., batch_size and shuffle); and the run configuration (e.g., max_steps, checkpoint_steps, and num_epochs).
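To check these settings programmatically, load the yaml file (for example with PyYAML's yaml.safe_load) and read the values out of the resulting dict. The minimal Python sketch below takes the already-parsed dict, so it runs without PyYAML installed. run_settings is a hypothetical helper, and the section names it reads (runconfig, train_input, and optimizer, with optimizer_type inside optimizer) are assumptions about the params file layout; verify them against your own yaml file. Missing keys come back as None instead of raising.

```python
from typing import Any, Dict

# Assumed section and key names; verify them against your own params yaml.
# Typical usage (requires PyYAML):
#   import yaml
#   with open(path_to_yaml) as f:
#       params = yaml.safe_load(f)
RUN_KEYS = ("max_steps", "checkpoint_steps", "num_epochs")


def run_settings(params: Dict[str, Any]) -> Dict[str, Any]:
    """Hypothetical helper: pick run, input and optimizer settings."""
    runconfig = params.get("runconfig") or {}
    train_input = params.get("train_input") or {}
    optimizer = params.get("optimizer") or {}
    settings = {key: runconfig.get(key) for key in RUN_KEYS}
    settings["batch_size"] = train_input.get("batch_size")
    settings["shuffle"] = train_input.get("shuffle")
    settings["optimizer_type"] = optimizer.get("optimizer_type")
    return settings
```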