Running Large Models (Weight Streaming Execution)

Users interact with the Cerebras Wafer-Scale Cluster as if it were a single appliance, so running large models on the Cerebras Wafer-Scale Cluster is as easy as running on a single device. For first-time setup of PyTorch jobs, see PyTorch: Getting Started.

Activate your PyTorch environment

To run PyTorch jobs on the Wafer-Scale Cluster, you must first activate your PyTorch environment on the user node.

Enter the Python environment using the following command:

source venv_cerebras_pt/bin/activate
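
A quick way to confirm the activation took effect is to check which interpreter the shell now resolves (a sketch; the exact path depends on where you created the environment):

```shell
# After sourcing the activate script, "python" should resolve inside
# the virtual environment's bin directory, and the activate script
# exports VIRTUAL_ENV with the environment's path.
command -v python
echo "$VIRTUAL_ENV"
```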

Running the scripts to compile, train, or evaluate your model

The steps to train your model are as follows. We use the GPT-2 model available in the Cerebras Model Zoo Git repository for this example. Check with your sysadmin whether your setup has a local copy of the Model Zoo repository with pre-installed datasets. Otherwise, clone the repository on your user node and follow the instructions in its README files to set up the training datasets.

  1. In the Model Zoo, you can find run.py scripts for the PyTorch models supported in weight streaming execution mode. For the GPT-2 model, navigate to the following directory in your copy of the Model Zoo:

    cd modelzoo/transformers/pytorch/gpt2
    
  2. Within this directory, run the following command. It performs the initial stage of compilation, giving you early feedback on whether your model is compatible with the Cerebras Software Platform.

    python run.py --appliance --params params.yaml --num_csx=1 --model_dir model_dir --validate_only --mode train --credentials_path=<path to tls certificate> --mount_dirs <paths to data> --python_paths <paths to modelzoo and other python code if used>
    

You can skip this step if you are confident in your code, but it is very convenient for fast iteration because it is considerably faster than a full compile.
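
Because the validate, compile, and train invocations differ only in a single flag, it can help to wrap them in a small script. The sketch below only prints the composed command (remove the echo to execute it); the credential, data, and Model Zoo paths are placeholders you must replace for your cluster:

```shell
#!/bin/sh
# Compose the run.py invocation for a given stage so the validate,
# compile, and train steps stay consistent. STAGE is one of
# --validate_only, --compile_only, or empty for a full training run.
STAGE="${1:---validate_only}"
CREDENTIALS="<path to tls certificate>"   # placeholder
DATA_DIR="<paths to data>"                # placeholder
MODELZOO="<paths to modelzoo>"            # placeholder

# Print the command rather than running it; drop "echo" to execute.
echo python run.py --appliance --params params.yaml --num_csx=1 \
    --model_dir model_dir $STAGE --mode train \
    --credentials_path="$CREDENTIALS" \
    --mount_dirs "$DATA_DIR" \
    --python_paths "$MODELZOO"
```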

  3. The next step is to run the full compile. The artifacts from this run are used in the training run.

This compile can take longer, depending on the size and complexity of the model (15 minutes to an hour).

    python run.py --appliance --params params.yaml --num_csx=1 --model_dir model_dir --compile_only --mode train --credentials_path=<path to tls certificate> --mount_dirs <paths to data> --python_paths <paths to modelzoo and other python code if used>


  4. Run the training job.

  • If you are running one CS-2, enter the following:

    python run.py --appliance --params params.yaml --num_csx=1 --model_dir=model_dir --mode train --credentials_path=<path to tls certificate> --mount_dirs <paths to data> --python_paths <paths to modelzoo and other python code if used>
    
  • If you are running on multiple CS-2 systems, enter the following. Note that --num_csx=2 in the code block below refers to the number of CS-2 systems you are using. In this case, you are running a distributed job on two CS-2 systems within the Wafer-Scale Cluster.

    python run.py --appliance --params params.yaml --num_csx=2 --model_dir=model_dir --mode train --credentials_path=<path to tls certificate> --mount_dirs <paths to data> --python_paths <paths to modelzoo and other python code if used>
    
  • The output logs are as follows:

    Transferring weights to server: 100%|██| 1165/1165 [01:00<00:00, 19.33tensors/s]
    INFO:   Finished sending initial weights
    INFO:   | Train Device=xla:0 Step=50 Loss=8.31250 Rate=69.37 GlobalRate=69.37
    INFO:   | Train Device=xla:0 Step=100 Loss=7.25000 Rate=68.41 GlobalRate=68.56
    INFO:   | Train Device=xla:0 Step=150 Loss=6.53125 Rate=68.31 GlobalRate=68.46
    INFO:   | Train Device=xla:0 Step=200 Loss=6.53125 Rate=68.54 GlobalRate=68.51
    INFO:   | Train Device=xla:0 Step=250 Loss=6.12500 Rate=68.84 GlobalRate=68.62
    INFO:   | Train Device=xla:0 Step=300 Loss=5.53125 Rate=68.74 GlobalRate=68.63
    INFO:   | Train Device=xla:0 Step=350 Loss=4.81250 Rate=68.01 GlobalRate=68.47
    INFO:   | Train Device=xla:0 Step=400 Loss=5.37500 Rate=68.44 GlobalRate=68.50
    INFO:   | Train Device=xla:0 Step=450 Loss=6.43750 Rate=68.43 GlobalRate=68.49
    INFO:   | Train Device=xla:0 Step=500 Loss=5.09375 Rate=66.71 GlobalRate=68.19
    INFO:   Training Complete. Completed 60500 sample(s) in 887.2672743797302 seconds.
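
For a quick look at the loss curve without TensorBoard, the Step and Loss values can be pulled out of a saved copy of these logs with standard tools. A sketch (the printf stands in for a captured log file):

```shell
# Extract "<step> <loss>" pairs from training log lines of the form:
#   INFO:   | Train Device=xla:0 Step=50 Loss=8.31250 Rate=...
# Replace the printf with e.g. "cat train.log" for a real run log.
printf 'INFO:   | Train Device=xla:0 Step=50 Loss=8.31250 Rate=69.37 GlobalRate=69.37\n' \
  | grep 'Train Device' \
  | sed -E 's/.*Step=([0-9]+) Loss=([0-9.]+).*/\1 \2/'
# prints: 50 8.31250
```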
    
  5. To run an eval job, run the following command:

    python run.py --appliance --params params.yaml --num_csx=1 --model_dir=model_dir --mode eval --eval_steps=10 --credentials_path=<path to tls certificate> --mount_dirs <paths to data> --python_paths <paths to modelzoo and other python code if used> --checkpoint_path <path to checkpoint to be evaluated>
    

Note

Cerebras only supports one CS-2 for eval mode.

Contents of run.py

For your reference, the contents of run.py are shown in the Cerebras Model Zoo.

Output files and artifacts

The model directory (model_dir) contains all the results and artifacts of the latest run, including:

  • Checkpoints

  • TensorBoard event files

  • A copy of the YAML file for your run

Checkpoints

Checkpoints are stored in <model_dir>/checkpoint_*.mdl.

TensorBoard event files

TensorBoard event files are stored in the <model_dir> directory.