Running PyTorch Models

Users interact with the Cerebras Wafer-Scale Cluster as if it were an appliance: running models of various sizes on the Cerebras Wafer-Scale Cluster is as easy as running on a single device. For first-time user setup for PyTorch jobs, see PyTorch: Getting Started.

Activate your PyTorch environment

To run PyTorch jobs on the Wafer-Scale Cluster, you must first activate your PyTorch environment on the user node.

Enter the Python environment using the following command:

source venv_cerebras_pt/bin/activate
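
To confirm the environment is active, you can run an optional sanity check such as the following; it simply verifies that PyTorch resolves from the virtual environment:

    python -c "import torch; print(torch.__version__)"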

Running the scripts to compile, train, or evaluate your model

The steps to train your model are as follows. We use the GPT-2 model available in the Cerebras Model Zoo git repository for this example. Check with your sysadmin whether your setup has a local copy of the Model Zoo repository available with pre-installed datasets. Otherwise, you can clone the git repository on your user node yourself and follow the instructions in the README files in the repository on how to set up the training datasets.
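
If you need to clone the repository yourself, a command along the following lines works; this assumes you are cloning the public GitHub copy of the Model Zoo:

    git clone https://github.com/Cerebras/modelzoo.git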

  1. In the Model Zoo, you can find run.py scripts for PyTorch models. For the GPT-2 model, navigate to the following directory in your copy of the Model Zoo:

    cd modelzoo/transformers/pytorch/gpt2
    
  2. Within this directory, run the following command, which performs the initial stage of compilation and gives feedback on whether your model is compatible with the Cerebras Software Platform.

    python run.py --appliance --execution_strategy <pipeline or weight_streaming> --params params.yaml --num_csx=1 --validate_only --mode train \
    --model_dir model_dir --credentials_path <path to tls certificate> --mount_dirs <paths to data dir and paths to be mounted> \
    --python_paths <paths to modelzoo and other python code if used> --mgmt_address <management node address for cluster>
    

This step can be skipped if you are confident in your code, but it is very convenient for fast iteration on your code, as it is considerably faster than a full compile.

  3. The next step is to run the full compile. The artifacts from this run are used in the training run. This compile can take longer, depending on the size and complexity of the model (15 minutes to an hour).

    python run.py --appliance --execution_strategy <pipeline or weight_streaming> --params params.yaml --num_csx=1 --compile_only --mode train \
    --model_dir model_dir --credentials_path <path to tls certificate> --mount_dirs <paths to data dir and paths to be mounted> \
    --python_paths <paths to modelzoo and other python code if used> --mgmt_address <management node address for cluster>

  4. This is the training step.

  • If you are running one CS-2, enter the following:

    python run.py --appliance --execution_strategy <pipeline or weight_streaming> --params params.yaml --num_csx=1 --mode train \
    --model_dir model_dir --credentials_path <path to tls certificate> --mount_dirs <paths to data dir and paths to be mounted> \
    --python_paths <paths to modelzoo and other python code if used> --mgmt_address <management node address for cluster>
    
  • If you are running multiple CS-2s, which is allowed only in Weight Streaming execution, enter the following. Note that --num_csx=2 in the code block below refers to the number of CS-2 systems you are using. In this case, you are running a data-parallel job on two CS-2 systems within the Wafer-Scale Cluster.

    python run.py --appliance --execution_strategy <pipeline or weight_streaming> --params params.yaml --num_csx=2 --mode train \
    --model_dir model_dir --credentials_path <path to tls certificate> --mount_dirs <paths to data dir and paths to be mounted> \
    --python_paths <paths to modelzoo and other python code if used> --mgmt_address <management node address for cluster>
    
  • The output logs are as follows:

    Transferring weights to server: 100%|████| 983/983 [00:13<00:00, 72.77tensors/s]
    2023-01-31 14:29:09,453 INFO:   Finished sending initial weights
    2023-01-31 14:35:14,468 INFO:   | Train Device=xla:0, Step=100, Loss=6.60547, Rate=38096.59 samples/sec, GlobalRate=38092.92 samples/sec
    2023-01-31 14:35:14,545 INFO:   | Train Device=xla:0, Step=200, Loss=6.27734, Rate=40148.59 samples/sec, GlobalRate=39732.25 samples/sec
    2023-01-31 14:35:14,619 INFO:   | Train Device=xla:0, Step=300, Loss=5.96484, Rate=41927.08 samples/sec, GlobalRate=40798.72 samples/sec
    2023-01-31 14:35:14,695 INFO:   | Train Device=xla:0, Step=400, Loss=5.92578, Rate=42184.20 samples/sec, GlobalRate=41177.08 samples/sec
    2023-01-31 14:35:14,769 INFO:   | Train Device=xla:0, Step=500, Loss=5.56641, Rate=42517.85 samples/sec, GlobalRate=41480.51 samples/sec
    ......
    2023-01-31 14:35:15,147 INFO:   | Train Device=xla:0, Step=1000, Loss=4.96094, Rate=41571.75 samples/sec, GlobalRate=41921.10 samples/sec
    2023-01-31 14:35:15,148 INFO:   Training Complete. Completed 32000 sample(s) in 0.7639429569244385 seconds.
    2023-01-31 14:35:30,443 INFO:   Monitoring is over without any issue
    
  5. To run an eval job, use the following command:

    python run.py --appliance --execution_strategy <pipeline or weight_streaming> --params params.yaml --num_csx=1 --mode eval --model_dir model_dir \
    --credentials_path <path to tls certificate> --mount_dirs <paths to data and paths to be mounted> --python_paths <paths to modelzoo and other python code if used> \
    --checkpoint_path <path to checkpoint to be evaluated> --mgmt_address <management node address for cluster>
    

Note

For the --execution_strategy argument in the example commands above, as a rule of thumb, specify --execution_strategy=pipeline for small to medium models (fewer than 1 billion parameters) to run in Pipelined execution, and specify --execution_strategy=weight_streaming for large models (1 billion parameters or more) to run in Weight Streaming execution.

Example command to train a 117M GPT-2 model in PyTorch in Pipelined execution:

    python run.py --appliance --execution_strategy pipeline --params params_PT_GPT2_117M.yaml --num_csx=1 --num_workers_per_csx=8 --mode train \
    --model_dir model_dir_PT_GPT2_117M --credentials_path /path/to/certificate --mount_dirs /path/to/data /path/to/modelzoo /path/to/mount \
    --python_paths /path/to/modelzoo /path/to/python/packages --mgmt_address management_node_address_for_cluster

Example command to train a 1.5B GPT-2 model in PyTorch in Weight Streaming execution:

    python run.py --appliance --execution_strategy weight_streaming --params params_PT_GPT2_1p5B.yaml --num_csx=1 --mode train \
    --model_dir model_dir_PT_GPT2_1p5B --credentials_path /path/to/certificate --mount_dirs /path/to/data /path/to/modelzoo /path/to/mount \
    --python_paths /path/to/modelzoo /path/to/python/packages --mgmt_address management_node_address_for_cluster

Note

Cerebras supports only one CS-2 for eval mode and for Pipelined execution.

Contents of run.py

For your reference, the contents of run.py are as shown in the Cerebras Model Zoo.
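
In broad strokes, a Model Zoo run.py is typically a thin wrapper that hands the model and dataloader definitions to a shared run utility, which parses the command-line flags shown above. The following is a minimal sketch of that structure only; the import paths, names, and call signature are assumptions, so consult the actual file in your copy of the Model Zoo:

    # Illustrative sketch only: the import paths, names, and signature below are
    # assumptions about the typical structure, not the actual Model Zoo source.
    from modelzoo.common.pytorch.run_utils import run  # assumed shared entry point

    from data import eval_input_dataloader, train_input_dataloader  # assumed dataloader factories
    from model import Gpt2Model  # assumed model class for this directory


    def main():
        # run() parses the command-line flags used throughout this page (--mode,
        # --params, --num_csx, --credentials_path, ...) and dispatches the
        # compile, train, or eval flow accordingly.
        run(Gpt2Model, train_input_dataloader, eval_input_dataloader)


    if __name__ == "__main__":
        main()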

Output files and artifacts

The model directory (model_dir) contains all the results and artifacts of the latest run, including:

  • Checkpoints

  • TensorBoard event files in train/

  • A copy of the yaml file for your run in train/

  • Performance summary in performance/
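
For example, assuming TensorBoard is installed in your Python environment, you can inspect the training curves by pointing it at the event files in train/:

    tensorboard --logdir model_dir/train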