Running TensorFlow Models

Users interact with the Cerebras Wafer-Scale Cluster as if it were an appliance: running models of various sizes on the cluster is as easy as running them on a single device. For first-time user setup for TensorFlow jobs, see TensorFlow: Getting Started.

Activate your TensorFlow environment

To run TensorFlow jobs on the Wafer-Scale Cluster, first activate your TensorFlow environment on the user node.

Enter the Python environment using the following command:

source venv_cerebras_tf/bin/activate
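
If the environment was set up as described in TensorFlow: Getting Started, TensorFlow should be importable once the environment is active. As an optional sanity check (the version printed depends on your Cerebras software release):

    # Optional: confirm TensorFlow is importable in the activated environment
    python -c "import tensorflow as tf; print(tf.__version__)"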

Running the scripts to compile, train, or evaluate your model

The steps to train your model are as follows. This example uses the GPT-2 model available in the Cerebras Model Zoo git repository. Check with your sysadmin whether your setup has a local copy of the Model Zoo repository with pre-installed datasets. Otherwise, you can clone the git repository on your user node yourself and follow the instructions in its readme files to set up the training datasets.
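
If you need to clone the repository yourself, a typical sequence looks like the following. The URL is an assumption based on the public Cerebras Model Zoo; confirm the correct repository and branch for your software release with your sysadmin.

    # Clone the Cerebras Model Zoo onto your user node (URL assumed; verify for your release)
    git clone https://github.com/Cerebras/modelzoo.git
    cd modelzoo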

  1. In the Model Zoo, you can find run_appliance.py scripts for TensorFlow models. For the GPT-2 model, navigate to the following directory in your copy of the Model Zoo:

    cd modelzoo/transformers/tf/gpt2
    
  2. Within this directory, run the following command. It performs the initial stage of compilation to check whether your model is compatible with the Cerebras Software Platform.

    python run_appliance.py --execution_strategy <pipeline or weight_streaming> --params params.yaml --num_csx=1 --validate_only --mode train
    --model_dir model_dir --credentials_path <path to tls certificate> --mount_dirs <paths to data and paths to be mounted>
    --python_paths <paths to modelzoo and other python code if used> --mgmt_address <management node address for cluster>
    

    You can skip this step if you are confident in your code, but it is convenient for fast iteration because it is considerably faster than a full compile. (See the sketch after this list for one way to keep the validate, compile, and train invocations consistent.)

  3. Next, run the full compile. The artifacts from this run are used in the training run.

    A full compile can take longer, depending on the size and complexity of the model (roughly 15 minutes to an hour).

    python run_appliance.py --execution_strategy <pipeline or weight_streaming> --params params.yaml --num_csx=1 --compile_only --mode train
    --model_dir model_dir --credentials_path <path to tls certificate> --mount_dirs <paths to data and paths to be mounted>
    --python_paths <paths to modelzoo and other python code if used> --mgmt_address <management node address for cluster>
    
  4. Run the training job.

    • If you are running one CS-2, enter the following:

      python run_appliance.py --execution_strategy <pipeline or weight_streaming> --params params.yaml --num_csx=1 --mode train
      --model_dir model_dir --credentials_path <path to tls certificate> --mount_dirs <paths to data and paths to be mounted>
      --python_paths <paths to modelzoo and other python code if used> --mgmt_address <management node address for cluster>
      
    • If you are running on two CS-2 systems, enter the following. Note that --num_csx=2 refers to the number of CS-2 systems you are using; in this case, you are running a distributed job on two CS-2 systems within the Wafer-Scale Cluster.

      python run_appliance.py --execution_strategy <pipeline or weight_streaming> --params params.yaml --num_csx=2 --mode train
      --model_dir model_dir --credentials_path <path to tls certificate> --mount_dirs <paths to data and paths to be mounted>
      --python_paths <paths to modelzoo and other python code if used> --mgmt_address <management node address for cluster>
    

    The output log is as follows:

    Transferring weights to server: 100%|████| 566/566 [00:12<00:00, 44.31tensors/s]
    2023-01-29 20:21:31,937 INFO:root:Finished sending initial weights
    2023-01-29 20:24:11,480 INFO:tensorflow:global step 12: loss = 9.515625 (0.38 steps/sec)
    2023-01-29 20:24:25,356 INFO:tensorflow:global step 24: loss = 9.3828125 (0.53 steps/sec)
    2023-01-29 20:24:39,232 INFO:tensorflow:global step 36: loss = 8.796875 (0.6 steps/sec)
    2023-01-29 20:24:53,108 INFO:tensorflow:global step 48: loss = 8.5625 (0.65 steps/sec)
    2023-01-29 20:25:06,983 INFO:tensorflow:global step 60: loss = 8.3203125 (0.69 steps/sec)
    2023-01-29 20:25:34,723 INFO:tensorflow:global step 72: loss = 8.2890625 (0.63 steps/sec)
    2023-01-29 20:25:48,602 INFO:tensorflow:global step 84: loss = 8.171875 (0.65 steps/sec)
    2023-01-29 20:26:02,482 INFO:tensorflow:global step 96: loss = 8.0859375 (0.67 steps/sec)
    ......
    2023-01-29 20:34:52,323 INFO:tensorflow:global step 480: loss = 6.16015625 (0.71 steps/sec)
    2023-01-29 20:35:20,190 INFO:tensorflow:global step 492: loss = 6.203125 (0.7 steps/sec)
    2023-01-29 20:35:20,200 INFO:tensorflow:global step 500: loss = 6.28125 (0.71 steps/sec)
    2023-01-29 20:35:20,201 INFO:root:Training complete. Completed 2048000 sample(s) in 700.5476334095001 seconds
    2023-01-29 20:35:20,203 INFO:root:Taking final checkpoint at step: 500
    2023-01-29 20:35:35,774 INFO:tensorflow:Saved checkpoint for global step 500 in 15.573367357254028 seconds: /path/to/model_dir/model.ckpt-500
    2023-01-29 20:35:45,914 INFO:root:Monitoring is over without any issue
    
  5. To run an eval job, run the following command:

    python run_appliance.py --execution_strategy <pipeline or weight_streaming> --params params.yaml --num_csx=1 --model_dir model_dir --mode eval
    --credentials_path <path to tls certificate> --checkpoint_path <path to trained checkpoint file> --mount_dirs <paths to data and paths to be mounted>
    --python_paths <paths to modelzoo and other python code if used> --mgmt_address <management node address for cluster>
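
Because the validate (step 2), compile (step 3), and train (step 4) commands above share most of their arguments, one convenient pattern is to collect the common flags in a shell variable and vary only the stage-specific flag. This is only a sketch; all paths and the management address are placeholders that you must replace for your cluster.

    # Common arguments shared by the validate, compile, and train invocations (placeholders)
    COMMON_ARGS="--params params.yaml --num_csx=1 --mode train \
      --model_dir model_dir --credentials_path /path/to/certificate \
      --mount_dirs /path/to/data /path/to/modelzoo \
      --python_paths /path/to/modelzoo \
      --mgmt_address management_node_address_for_cluster"

    # Fast compatibility check, full compile, then training
    python run_appliance.py --execution_strategy pipeline $COMMON_ARGS --validate_only
    python run_appliance.py --execution_strategy pipeline $COMMON_ARGS --compile_only
    python run_appliance.py --execution_strategy pipeline $COMMON_ARGS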
    

Note

For the --execution_strategy argument in the example commands above, specify --execution_strategy=pipeline for small to medium models (fewer than 1 billion parameters) to run in Pipelined execution, and --execution_strategy=weight_streaming for large models (1 billion parameters or more) to run in Weight Streaming execution.

Example command to train a 117M-parameter GPT-2 in TensorFlow in Pipelined execution: python run_appliance.py --execution_strategy pipeline --params params_TF_GPT2_117M.yaml --num_csx=1 --num_workers_per_csx=8 --model_dir model_dir_TF_GPT2_117M --credentials_path /path/to/certificate --mount_dirs /path/to/data /path/to/modelzoo /path/to/mount --python_paths /path/to/modelzoo /path/to/additional/python/packages --mgmt_address management_node_address_for_cluster

Example command to train a 1.5B-parameter GPT-2 in TensorFlow in Weight Streaming execution: python run_appliance.py --execution_strategy weight_streaming --params params_TF_GPT2_1p5B.yaml --num_csx=1 --model_dir model_dir_TF_GPT2_1p5B --credentials_path /path/to/certificate --mount_dirs /path/to/data /path/to/modelzoo /path/to/mount --python_paths /path/to/modelzoo /path/to/additional/python/packages --mgmt_address management_node_address_for_cluster
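
For illustration only, an eval command for the same 117M GPT-2 run might look like the following; the checkpoint name is hypothetical and depends on how many steps you trained: python run_appliance.py --execution_strategy pipeline --params params_TF_GPT2_117M.yaml --num_csx=1 --mode eval --model_dir model_dir_TF_GPT2_117M --checkpoint_path model_dir_TF_GPT2_117M/model.ckpt-500 --credentials_path /path/to/certificate --mount_dirs /path/to/data /path/to/modelzoo /path/to/mount --python_paths /path/to/modelzoo /path/to/additional/python/packages --mgmt_address management_node_address_for_cluster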

Note

Cerebras supports only one CS-2 for eval mode or for Pipelined execution.

Contents of run_appliance.py

For your reference, the contents of run_appliance.py are as shown in the Cerebras Model Zoo.
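
Assuming your copy of run_appliance.py uses a standard Python argument parser (as Model Zoo run scripts generally do), you can list the arguments it accepts directly from the activated environment:

    # Print the full set of supported command-line arguments (assumes a standard argparse-based CLI)
    python run_appliance.py --help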

Output files and artifacts

The model directory (model_dir) contains all the results and artifacts of the latest run, including:

  • Checkpoints

  • TensorBoard event files (see the sketch after this list for viewing them)

  • A copy of the YAML params file for your run
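
To inspect training curves, you can point TensorBoard at the model directory from the user node. The port below is only an example; choose one that is open on your node.

    # Launch TensorBoard against the event files written to model_dir (port is an example)
    tensorboard --logdir model_dir --port 6006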