Launch your job#
Running jobs in the Cerebras Wafer-Scale Cluster is as easy as running jobs on a single device. To start, you should have already followed the steps in Setup Cerebras virtual environment and Clone Cerebras Model Zoo.
1. Activate Cerebras virtual environment#
After you have completed Setup Cerebras virtual environment, activate the environment on the user node.
For PyTorch workloads:
source venv_cerebras_pt/bin/activate
For TensorFlow workloads:
source venv_cerebras_tf/bin/activate
We recommend having a different virtual environment for PyTorch and TensorFlow workloads. You will need to activate your virtual environment any time you run jobs on the Cerebras Wafer-Scale Cluster.
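For example, after activating the PyTorch environment, you can confirm which interpreter and packages are active before launching a job. The pip check below is only a rough sanity check, since the exact Cerebras package names vary by release:
source venv_cerebras_pt/bin/activate
which python                  # should point inside venv_cerebras_pt
pip list | grep -i cerebras   # lists installed Cerebras packages; names vary by release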
3. Prepare your datasets#
Each model in the Cerebras Model Zoo contains scripts to prepare your datasets.
In this example, we assume that you already have access to prepared data. Set the data path as an absolute path in the configuration file inside the configs/ folder, which contains YAML files for different model sizes:
train_input:
data_dir: "/absolute/path/to/training/dataset"
...
eval_input:
data_dir: "/absolute/path/to/evaluation/dataset/
4. Launch your job#
All models in the Cerebras Model Zoo contain a run.py script for both the PyTorch and TensorFlow implementations. These scripts are used to launch compilation, training, and evaluation of your models on the Cerebras cluster.
You will need to specify these flags:
Flag | Mandatory | Description
---|---|---
CSX | Yes | Specifies that the target device for execution is a Cerebras Cluster. (First positional argument.)
{pipeline, weight_streaming} | Yes | Whether to use the weight_streaming or pipeline execution strategy when running on CS-X. (Second positional argument.)
--params | Yes | Path to a YAML file containing model/run configuration options.
--mode | Yes | Whether to run train, eval, train_and_eval, or eval_all.
--mount_dirs | Yes | List of paths to be mounted to the Appliance containers. It should include parent paths for the Cerebras Model Zoo and other locations needed by the dataloader, including datasets and code. (Default: pulled from the path defined by the env variable CEREBRAS_WAFER_SCALE_CLUSTER_DEFAULTS)
--python_paths | Yes | List of paths to be exported to PYTHONPATH in the Appliance containers. It should include parent paths to the Cerebras Model Zoo and any Python packages needed by the dataloaders. (Default: pulled from the path defined by the env variable CEREBRAS_WAFER_SCALE_CLUSTER_DEFAULTS)
--credentials_path | No | Path to a TLS certificate used to authenticate the user against the Wafer-Scale Cluster. (Default: pulled from /opt/cerebras/config)
--mgmt_address | No | Address of the Wafer-Scale Cluster management server. (Default: pulled from /opt/cerebras/config)
--compile_only | No | Compile the model, including matching to Cerebras kernels and mapping to hardware; does not execute on the system. Mutually exclusive with validate_only. (Default: False)
--validate_only | No | Validate that the model can be matched to Cerebras kernels. This is a lightweight compilation; it does not map to hardware or execute on the system. Mutually exclusive with compile_only. (Default: False)
--model_dir | No | Path to store model checkpoints, TensorBoard event files, etc.
--compile_dir | No | Path to store the compile artifacts inside the Cerebras cluster.
--num_csx | No | Number of CS-X systems to use in weight streaming training. (Default: 1)
Note
The parameters credentials_path and mgmt_address are configured by default in /opt/cerebras/config and do not need to be explicitly provided. For example, /opt/cerebras/config could look as follows:
{
"clusters": [
{
"name": "system-name",
"server": "1.2.3.4:9000",
"authority": "cluster-server.system-name.example.com",
"certificateAuthority": "/opt/cerebras/certs/tls.crt"
}
],
"contexts": [
{
"cluster": "cluster-name",
"name": "system-name"
}
],
"currentContext": "system-name"
}
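If you want to verify what your installation is configured with, you can inspect the file and the referenced certificate directly on the user node:
cat /opt/cerebras/config
ls -l /opt/cerebras/certs/tls.crt   # path taken from the certificateAuthority field above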
A note on python_paths and mount_dirs
It is important to pass the paths needed by the dataloaders and by external Python packages, in addition to the path in which the Cerebras Model Zoo resides.
For example, let’s assume the following directory structure:
/path/to/datasets
    my_dataset/
/path/to/modelzoo
    modelzoo
/path/to/packages
    package_x
    package_y
Let’s further assume that the input workers need to access the my_dataset directory to read the data, and that they need Python modules from the packages modelzoo, package_x, and package_y. For the workers to access all of this, we specify the command as follows:
python run.py \
CSX weight_streaming \
--params params.yaml \
--mode {train,eval,eval_all,train_and_eval} \
--mount_dirs /path/to/datasets /path/to/modelzoo /path/to/packages \
--python_paths /path/to/packages /path/to/modelzoo
Note
If some paths share a common parent folder, only the parent folder needs to be specified for the given argument. For example, if the modelzoo path is /cb/home/user/modelzoo and the data path is /cb/home/user/data, you only need to specify --mount_dirs /cb/home.
To simplify the command-line arguments, you can define defaults for both mount_dirs and python_paths in a YAML file and export its path in an environment variable as follows:
export CEREBRAS_WAFER_SCALE_CLUSTER_DEFAULTS=/path/to/defaults/file.yaml
When this environment variable is set in a Cerebras Model Zoo run, there is no need to pass --mount_dirs and --python_paths on the command line unless you wish to add paths in addition to the defaults.
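As a minimal sketch, assuming the defaults file accepts mount_dirs and python_paths keys that mirror the command-line flags (the file location and directory names below are placeholders):
cat > /path/to/defaults/file.yaml <<'EOF'
# Assumed schema: keys mirror the --mount_dirs and --python_paths flags.
mount_dirs:
  - /path/to/datasets
  - /path/to/modelzoo
  - /path/to/packages
python_paths:
  - /path/to/modelzoo
  - /path/to/packages
EOF
export CEREBRAS_WAFER_SCALE_CLUSTER_DEFAULTS=/path/to/defaults/file.yaml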
4.1 (Optional) Compile your job#
To validate that your model implementation is compatible with the Cerebras Software Platform, you can use the --validate_only flag. This flag allows you to quickly iterate and check compatibility without requiring full model execution.
The following works for both PyTorch and TensorFlow implementations.
python run.py \
CSX \
{pipeline,weight_streaming} \
--params params.yaml \
--num_csx=1 \
--mode {train,eval,eval_all,train_and_eval} \
--mount_dirs {paths to modelzoo and to data} \
--python_paths {paths to modelzoo and other python code if used} \
--validate_only
You can also run a compile_only compilation to create the executables used to run your model on the Cerebras cluster. This compilation takes longer than validate_only (roughly 15 minutes to an hour), depending on the size and complexity of the model.
python run.py \
CSX \
{pipeline,weight_streaming} \
--params params.yaml \
--num_csx=1 \
--model_dir model_dir --mode {train,eval,eval_all,train_and_eval} \
--mount_dirs {paths to modelzoo and to data} \
--python_paths {paths to modelzoo and other python code if used} \
--compile_only
4.2 Execute your job#
The following works for both PyTorch and TensorFlow implementations.
To execute your job, you need to provide the following information:
The target device you would like to execute on. To run on the Cerebras cluster, add CSX as the first positional argument on the command line. These scripts can also be run locally using CPU or GPU.
If the target device is CSX, the next positional argument is the execution strategy, which is either pipeline or weight_streaming. This must be explicitly stated.
Information about the Cerebras cluster where the job will be executed, using the flags --python_paths, --mount_dirs, and optionally --credentials_path and --mgmt_address. Note that python_paths and mount_dirs can be omitted from the command line as long as they are specified in the runconfig section of params.yaml. They should both generally include the path to the directory in which the Cerebras Model Zoo resides.
Finally, the mode of execution {train, eval, eval_all, train_and_eval} and a path to the configuration file must be passed.
python run.py \
CSX \
{pipeline,weight_streaming} \
--params params.yaml \
--num_csx=1 \
--model_dir model_dir \
--mode {train,eval,eval_all,train_and_eval} \
--mount_dirs {paths to modelzoo and to data} \
--python_paths {paths to modelzoo and other python code if used}
Here is an example of a typical output log for a training job:
Transferring weights to server: 100%|██| 1165/1165 [01:00<00:00, 19.33tensors/s]
INFO: Finished sending initial weights
INFO: | Train Device=xla:0 Step=50 Loss=8.31250 Rate=69.37 GlobalRate=69.37
INFO: | Train Device=xla:0 Step=100 Loss=7.25000 Rate=68.41 GlobalRate=68.56
INFO: | Train Device=xla:0 Step=150 Loss=6.53125 Rate=68.31 GlobalRate=68.46
INFO: | Train Device=xla:0 Step=200 Loss=6.53125 Rate=68.54 GlobalRate=68.51
INFO: | Train Device=xla:0 Step=250 Loss=6.12500 Rate=68.84 GlobalRate=68.62
INFO: | Train Device=xla:0 Step=300 Loss=5.53125 Rate=68.74 GlobalRate=68.63
INFO: | Train Device=xla:0 Step=350 Loss=4.81250 Rate=68.01 GlobalRate=68.47
INFO: | Train Device=xla:0 Step=400 Loss=5.37500 Rate=68.44 GlobalRate=68.50
INFO: | Train Device=xla:0 Step=450 Loss=6.43750 Rate=68.43 GlobalRate=68.49
INFO: | Train Device=xla:0 Step=500 Loss=5.09375 Rate=66.71 GlobalRate=68.19
INFO: Training Complete. Completed 60500 sample(s) in 887.2672743797302 seconds.
INFO root:start_utils.py:519 # 1. Start Coordinator on separate process
INFO root:start_utils.py:534 # 2. Begin Run
INFO root:start_utils.py:545 # 3. Start Workers on separate processes
INFO root:start_utils.py:554 # 4. Start Chief on separate processes
INFO root:start_utils.py:564 # 5. Start WS Runtime servers (i.e. ws-srv) on separate processes
INFO root:cs_estimator_app.py:274 Loaded global step 0
INFO root:cs_estimator_app.py:817 Output activation tensors: ['truediv_3_1']
INFO root:cluster_client.py:217 Initiating a new compile wsjob against the cluster server.
INFO root:cluster_client.py:220 Compile job initiated
INFO root:appliance_manager.py:135 Creating a framework GRPC client: localhost:50065, None,
INFO root:appliance_manager.py:359 Compile successfully written to cache directory: cs_10097974384330522877
INFO root:cluster_client.py:243 Initiating a new execute wsjob against the cluster server.
INFO root:cluster_client.py:246 Execute job initiated
INFO root:appliance_manager.py:149 Removing a framework GRPC client
INFO root:cs_estimator_app.py:940 final generation of weights: 9
INFO cerebras_appliance.appliance_client:appliance_client.py:435 Input fn serialized: 80036374657374732e77732e6d696c6573746f6e655f6d6f64656c732e74662e646174610a746f795f696e7075745f666e0a71002e
INFO root:appliance_manager.py:135 Creating a framework GRPC client: localhost:50066, None,
INFO root:appliance_manager.py:282 About to send initial weights
INFO root:tf_appliance_manager.py:85 Dropping tensor: 'good_steps'
INFO root:appliance_manager.py:284 Finished sending initial weights
INFO root:cs_estimator_app.py:482 global step 2: loss = 0.0 (0.37 steps/sec)
INFO root:cs_estimator_app.py:482 global step 4: loss = 0.0 (0.74 steps/sec)
INFO root:cs_estimator_app.py:388 Taking checkpoint at step: 5
INFO root:cs_estimator_app.py:437 saving last set of weights: 9
INFO root:cs_estimator_app.py:482 global step 6: loss = 0.0 (1.06 steps/sec)
INFO root:cs_estimator_app.py:482 global step 8: loss = 0.0 (1.41 steps/sec)
INFO root:cs_estimator_app.py:388 Taking checkpoint at step: 10
INFO root:cs_estimator_app.py:391 Taking final checkpoint
INFO root:cs_estimator_app.py:437 saving last set of weights: 9
INFO root:cs_estimator_app.py:482 global step 10: loss = 0.0 (1.69 steps/sec)
INFO root:cs_estimator_app.py:489 Training complete. Completed 640 sample(s) in 5.9104249477386475 seconds
INFO root:start_utils.py:587 Wait for server completion
INFO root:start_utils.py:599 Servers Completed
Note
For the execution strategy argument in the example commands above, specify pipeline for small to medium models with fewer than 1 billion parameters, and weight_streaming for large models with 1 billion parameters or more.
Note
Cerebras only supports using a single CS-2 when running in eval mode.
5. Explore output files and artifacts#
The model directory (specified by the --model_dir flag) contains all the results and artifacts of the latest run, including:
Checkpoints
Tensorboard event files
yaml
files
Checkpoints#
Checkpoints are stored in <model_dir>/model-ckpt*.
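For example, to list the checkpoints produced by a run whose --model_dir was model_dir:
ls model_dir/model-ckpt*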
Tensorboard event files#
Tensorboard event files are stored in the <model_dir> directory.
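To visualize these events, you can point a locally installed TensorBoard at the model directory, for example:
tensorboard --logdir model_dir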