Weight Streaming Quickstart
Weight streaming (WS) is one of the Cerebras execution modes and is ideal for training extreme-scale models. To learn more about the differences between pipelined mode and WS mode, visit Cerebras Execution Modes.
This page covers the following two topics:
Changing execution mode (only system admins)
Training a model in WS mode (all users)
Section 1: Changing execution mode (only system admins)
This step can be performed by system administrators only. If you want the execution mode changed to Weight Streaming, contact your system admin.
A system administrator with access can log in to the CS-2 system and change the execution mode. To change the execution mode, follow these steps:
Step 1: Log in to the CS system.
Step 2: Check the current execmode using the command below:
cs> config execmode show
The following message appears:
Configured Execution Mode : PIPELINED
Step 3: To change the execmode, transition the system to the STANDBY state using the command below.
cs> system standby
The following message appears:
This puts the system in standby. Do you want to proceed?
Select yes.
Step 4: Change the execmode to Weight Streaming using the command below.
cs> config execmode setup
Select Weight Streaming. The following message appears:
Selected execution mode configuration: ✔ Weight Streaming
Step 5: Activate the system.
cs> system activate
The system reboots and is now activated in Weight Streaming mode.
Section 2: Training a model in WS mode (all users)
After the system admin updates the system to WS mode, follow the steps below to run a training job on a single CS-2.
In CSoft R1.4.0, TF implementations of GPT-2, GPT-3XL (1.3B params), and GPT-J (6B params) are supported in WS mode on a single CS-2 with the existing support cluster via compatibility mode. You can access a reference implementation of GPT-J in TF as an example in the Cerebras Reference Implementations repo.
Step 1: Clone Reference Implementations Repository
To clone the Cerebras Reference Implementations repository, use the following commands:
git clone https://github.com/Cerebras/cerebras_reference_implementations.git
cd cerebras_reference_implementations/gptj
Step 2: Run the model on the CS system
Here, we use the csrun_wse wrapper script to compile and execute the code on the CS-2 system. See The csrun_wse Script for more information.
csrun_wse --total-nodes 14 --tasks-per-node 8 --cpus-per-task 16 --single-task-nodes 2 --cyclic python-ws run.py --model_dir model_dir --cs_ip <CS IP> --params configs/params_continuous_pretraining.yaml --mode train --max_steps <num_train_steps>
The above command trains the GPT-J model for --max_steps steps by executing on the CS system at the IP address specified in the --cs_ip flag.
Note that for Weight Streaming, you use python-ws.
Weight Streaming execution requires at least two single-task CPU nodes. This is specified using --single-task-nodes 2.
When the command executes, you will see output similar to the following:
srun: job ... queued and waiting for resources
srun: job ... has been allocated resources
INFO:tensorflow:Checkpoints and summaries will be saved in: model_dir
...
INFO:tensorflow:Running the TF Client
INFO:tensorflow:Calling model_fn.
...
INFO:tensorflow:Done calling model_fn.
...
INFO:tensorflow:Completed weight initialization on CPU in: ... seconds
...
INFO:tensorflow:Calling model_fn.
...
INFO:tensorflow:Loading CPU pre-initialized weights took ... seconds
INFO:tensorflow:Saving checkpoint at global step 0
...
...: I tensorflow/compiler/jit/xla_compilation_cache.cc:241] Compiled cluster using XLA! This line is logged at most once for the lifetime of the process.
INFO:tensorflow:global_step = 1, loss = ...
INFO:tensorflow:global_step = 2, loss = ...
..
INFO:tensorflow:global_step = 10, loss = ...
...
INFO:tensorflow:Saving checkpoint at global step ..
INFO:tensorflow:Training finished with ... samples in ... seconds, ... samples/second.
INFO:tensorflow:Loss for final step: ...
Note
Compilation of large-scale models in WS mode typically takes a long time and can run over an hour. Reducing compile time is an active effort at Cerebras.
Note
Offline compilation is not supported in WS mode in CSoft R1.4. This includes no support for the --validation-only and --compile-only flags, and with every execution, the model is recompiled.
Output files and artifacts
The output files and artifacts include:
- Model directory (model_dir) - Contains all of the results and artifacts of the latest run, including:
  - Compile directory (tfcs_<checksum>)
  - performance.json file
  - Checkpoints
  - Tensorboard event files
  - yaml files
Model directory and its structure
The model directory (model_dir) contains all of the results and artifacts of the latest run. If you go into the model_dir directory, the following subdirectories are present.
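As a quick way to orient yourself, here is a minimal Python sketch (assuming model_dir is the directory passed to --model_dir in the training command; the exact file names inside vary from run to run) that walks the model directory and prints its layout:

import os

# Walk the model directory and print its contents. "model_dir" matches
# the --model_dir argument used in the training command above.
model_dir = "model_dir"
for root, dirs, files in os.walk(model_dir):
    depth = root[len(model_dir):].count(os.sep)
    print("  " * depth + os.path.basename(root) + "/")
    for name in files:
        print("  " * (depth + 1) + name)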
Compile directory (tfcs_<checksum>)
The compilation artifacts produced during and after compilation are stored in the <model_dir>/tfcs_<checksum> directory. Compilation logs and intermediate outputs are helpful for debugging compilation issues.
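For example, a minimal sketch to list the compile directories present in the model directory (assuming the layout described above):

import glob
import os

# Compile directories follow the tfcs_<checksum> naming described above;
# there may be more than one if the model was compiled multiple times.
for compile_dir in glob.glob(os.path.join("model_dir", "tfcs_*")):
    print(compile_dir)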
performance.json file and its parameters
There is a performance directory that contains the performance.json file at <model_dir>/performance/performance.json. It contains the information listed below:
- compile_time - The amount of time it took to compile the model and generate the Cerebras executable.
- est_samples_per_sec - The estimated performance in samples per second, based on the Cerebras compile. Note that this number is theoretical and actual performance may vary.
- programming_time - The time taken to prepare the system and load it with the compiled model.
- samples_per_sec - The actual performance of your run execution.
- suspected_input_bottleneck - This is a beta feature. It indicates whether you are input-starved and need more input workers to feed the Cerebras system.
- total_samples - The total gross samples that were iterated during the execution.
- total_time - The total time it took to complete the total samples.
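A minimal sketch for inspecting these values after a run (assuming the performance.json file exists at the path above):

import json
import os

# Load the performance summary written at <model_dir>/performance/performance.json.
perf_path = os.path.join("model_dir", "performance", "performance.json")
with open(perf_path) as f:
    perf = json.load(f)

# Print the documented fields that are present.
for key in ("compile_time", "est_samples_per_sec", "programming_time",
            "samples_per_sec", "total_samples", "total_time"):
    print(key, "=", perf.get(key))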
Checkpoints
Checkpoints are stored in <model_dir>/model-ckpt*.
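Because these are standard TensorFlow checkpoints, you can locate the most recent one with a sketch like the following (assuming a TensorFlow installation compatible with the run):

import tensorflow as tf

# Returns the prefix of the most recent model-ckpt* checkpoint in
# model_dir, or None if no checkpoint has been written yet.
latest = tf.train.latest_checkpoint("model_dir")
print("Latest checkpoint:", latest)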
Tensorboard event files
Tensorboard event files are stored in the <model_dir> directory.
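To read scalar summaries (for example, the training loss) back out of these event files, here is a minimal sketch, assuming TensorFlow is installed and that the run logged standard scalar summaries:

import glob
import os
import tensorflow as tf

# Event files use TensorFlow's standard events.out.tfevents.* naming.
# The tag names printed here depend on what the run actually logged.
for event_file in glob.glob(os.path.join("model_dir", "events.out.tfevents.*")):
    for event in tf.compat.v1.train.summary_iterator(event_file):
        for value in event.summary.value:
            if value.HasField("simple_value"):
                print(event.step, value.tag, value.simple_value)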
yaml files content after the run
The yaml file is stored in the train directory. This yaml file contains information about the specifics of the run, such as model-specific configuration (e.g., dropout, activation_fn), optimizer type and optimizer parameters, input data configuration (such as batch_size and shuffle), and run configuration (such as max_steps, checkpoint_steps, and num_epochs).
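To inspect the saved configuration, a minimal sketch using PyYAML (the file name params.yaml and the exact path under the train directory are illustrative assumptions; check your train directory for the actual file name):

import yaml  # requires PyYAML

# Load the yaml saved with the run and print its top-level sections.
# The path below is an illustrative assumption; the actual file lives
# in the train directory created for the run.
with open("model_dir/train/params.yaml") as f:
    params = yaml.safe_load(f)

for section, contents in params.items():
    print(section, "->", contents)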