Running Small to Medium Models (Pipelined Execution)

This section describes how to run small- to medium-sized models on the Cerebras Wafer-Scale cluster with pipelined execution using TensorFlow. For first-time user setup for TensorFlow jobs, see TensorFlow: Getting Started.

Note

We are transitioning away from the Slurm-based workflow for running small to medium models in Pipelined execution on the Wafer-Scale Cluster. During this transition, we use scripts that resemble the Slurm-based workflow but run Kubernetes under the hood. The eventual goal is to launch Pipelined execution jobs with the same workflow that is used for Weight Streaming execution.

Activate your TensorFlow environment

To run TensorFlow jobs on the Wafer-Scale Cluster, first activate your TensorFlow environment on the user node.

Enter the Python environment using the following command:

source venv_cerebras_tf/bin/activate
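As a quick sanity check after activation, the following minimal sketch reports whether the Python interpreter is running inside a virtual environment. It uses only the standard library; the function name active_venv_name is illustrative and is not part of any Cerebras tooling.

```python
import sys
from pathlib import Path


def active_venv_name():
    """Return the directory name of the active virtual environment, or None."""
    # Inside a virtual environment, sys.prefix points at the environment
    # while sys.base_prefix still points at the base Python installation.
    if sys.prefix == sys.base_prefix:
        return None
    return Path(sys.prefix).name


venv_name = active_venv_name()
if venv_name is None:
    print("No virtual environment is active.")
else:
    print(f"Active virtual environment: {venv_name}")
```

If the environment was activated as shown above, the reported name should be venv_cerebras_tf.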

Run Pipelined models on the Cerebras Wafer-Scale Cluster

Get a local copy of Cerebras Model Zoo

This example uses code from the Cerebras Model Zoo git repository. Check with your sysadmin whether your setup has a local copy of the Model Zoo repository with pre-installed datasets. Otherwise, clone the repository on your user node yourself and follow the instructions in the repository's readme files to set up the training datasets.

  1. Copy or clone the Model Zoo repository to your preferred location in your home directory.

    If you don’t have a copy of the Model Zoo pre-configured for your setup, clone the repo from GitHub by running the following command:

    git clone https://github.com/Cerebras/modelzoo.git
    

    Make sure to configure paths to the datasets in your local copy.

  2. Navigate to the model directory.

    cd modelzoo/fc_mnist/tf/
    

Compile on CPU

Cerebras recommends that you first compile your model successfully on a CPU node from the cluster before running it on the CS system.

  • You can run in validate_only mode, which performs a fast, lightweight verification. In this mode, the compilation only runs through the first few stages, up until kernel library matching.

  • After a successful validate_only run, you can run full compilation with compile_only mode.

This section of the quick-start guide shows how to execute these steps on a CPU node.

Tip

The validate_only step is very fast, enabling you to rapidly iterate on your model code. Without needing access to the CS system wafer scale engine, you can determine in this validate_only step if you are using any TensorFlow layer or functionality that is unsupported by either XLA or CGC.

Follow these steps to compile on a CPU. The steps use the FC-MNIST example from the Cerebras Model Zoo git repository.

  1. Run the compilation in validate_only mode.

    csrun_cpu --admin-defaults="/path/to/admin-defaults.yaml" --mount-dirs="/data/ml,/lab/ml" python run.py --mode train --validate_only
    ...
    XLA Extraction Complete
    =============== Starting Cerebras Compilation ===============
    Cerebras compilation completed: 100%|██████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:02s,  1.23s/stages]
    =============== Cerebras Compilation Completed ===============
    

    Note

    The validate_only mode checks the kernel compatibility of your model. When your model passes this mode, run the full compilation with compile_only to generate the CS system executable.

  2. Run the full compilation process in compile_only mode. This step runs the full compilation through all stages of the Cerebras software stack to generate a CS system executable.

    csrun_cpu --admin-defaults="/path/to/admin-defaults.yaml" --mount-dirs="/data/ml,/lab/ml" python run.py --mode train --compile_only
    ...
    XLA Extraction Complete
    =============== Starting Cerebras Compilation ===============
    Cerebras compilation completed: |                    | 17/? [00:18s,  1.09s/stages]
    =============== Cerebras Compilation Completed ===============
    

    When the above compilation is successful, the model is guaranteed to run on the CS system. You can also use compile_only mode to pre-compile many different model configurations offline so you can more fully utilize the allotted CS system cluster time.

    Note

    The compiler detects whether a binary already exists for a particular model config inside the model directory and skips compiling on the fly during training if it detects one.
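The two-stage sequence above can be sketched as a small script that assembles both csrun_cpu commands in the recommended order. This is only an illustration: the helper names csrun_cpu_command and compile_plan are hypothetical, the flag values are copied from the commands above, and the script prints the commands rather than running them.

```python
import shlex

# Values copied from the commands above; adjust them for your own cluster.
ADMIN_DEFAULTS = "/path/to/admin-defaults.yaml"
MOUNT_DIRS = "/data/ml,/lab/ml"


def csrun_cpu_command(stage_flag):
    """Build the csrun_cpu argument list for one compile stage."""
    if stage_flag not in ("--validate_only", "--compile_only"):
        raise ValueError(f"unexpected stage flag: {stage_flag}")
    return [
        "csrun_cpu",
        f"--admin-defaults={ADMIN_DEFAULTS}",
        f"--mount-dirs={MOUNT_DIRS}",
        "python", "run.py", "--mode", "train", stage_flag,
    ]


def compile_plan():
    """Return the fast validation command first, then the full compile."""
    return [
        csrun_cpu_command("--validate_only"),
        csrun_cpu_command("--compile_only"),
    ]


for argv in compile_plan():
    # Printed only. On a user node, subprocess.run(argv, check=True) would
    # execute each stage and stop at the first failure.
    print(shlex.join(argv))
```

Running validate_only first keeps iteration fast, since the full compile is only attempted once the kernel matching check has passed.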

Run the model on the CS system

To run a training job on the CS system, enter the following command. The below csrun_wse command compiles the code if no existing compile artifacts are found in the model directory, and then runs the compiled executable on the CS system.

csrun_wse --admin-defaults="/path/to/admin-defaults.yaml" --mount-dirs="/data/ml,/lab/ml" python run.py --mode=train --params=params.yaml

The command above mounts the directories /data/ml and /lab/ml to the container (in addition to the default mount directories) and then trains the FC-MNIST model on the available CS system.

For the full list of options, run csrun_wse --help.

To run an eval job on the CS system, enter the following command:

csrun_wse --admin-defaults="/path/to/admin-defaults.yaml" --mount-dirs="/data/ml,/lab/ml" python run.py --mode=eval --eval_steps=1000

This command initiates an eval job for 1000 steps on the CS system.

Output files and artifacts

The output files and artifacts include a model directory (model_dir), which contains all the results and artifacts of the latest run, including:

  • Compile directory (cs_<checksum>)

  • performance.json file

  • Checkpoints

  • Tensorboard event files

  • yaml files

Compile dir – The directory containing the cs_<checksum>

The cs_<checksum> directory (also known as the cached compile directory) contains the .elf file, which is used to program the system.

The compilation output indicates whether the compile passed or failed; if it failed, the logs show the stage at which compilation failed.

performance.json file and its parameters

The performance.json file is located at <model_dir>/performance/performance.json. It contains the following information:

  • compile_time - The amount of time that it took to compile the model to generate the Cerebras executable.

  • est_samples_per_sec - The estimated performance in terms of samples per second based on the Cerebras compile. Note that this number is theoretical and actual performance may vary.

  • programming_time - The time taken to prepare the system and load it with the compiled model.

  • samples_per_sec - The actual performance of your run execution; i.e., the number of samples processed on the CS system per second.

  • suspected_input_bottleneck - This is a beta feature. It indicates whether you are input-starved and need more input workers to feed the Cerebras system.

  • total_samples - The total number of samples iterated over during the execution.

  • total_time - The total time it took to complete the total samples. Note that this time does not include programming_time or compile_time. It measures the time from sending the first sample to the CS-X to receiving the last output from the CS-X.
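The fields above can be combined to compare a run against the compiler's estimate. The following is a minimal sketch, not an official tool: the sample values, their types, and the assumption that all time fields share one unit are made up for illustration, and summarize_performance is a hypothetical helper name.

```python
import json

# Hypothetical file contents: the keys come from the list above, but every
# value (and its type and unit) is made up for illustration.
SAMPLE_PERFORMANCE_JSON = """
{
  "compile_time": 18.5,
  "est_samples_per_sec": 120000.0,
  "programming_time": 42.0,
  "samples_per_sec": 90000.0,
  "suspected_input_bottleneck": true,
  "total_samples": 900000,
  "total_time": 10.0
}
"""


def summarize_performance(perf):
    """Derive comparison figures from a parsed performance.json dict."""
    return {
        # Share of the compiler's theoretical estimate that the run achieved.
        "fraction_of_estimate": perf["samples_per_sec"] / perf["est_samples_per_sec"],
        # total_time excludes compile and programming time, so add them back
        # (assumes all three values use the same time unit).
        "end_to_end_time": perf["compile_time"] + perf["programming_time"] + perf["total_time"],
        "suspected_input_bottleneck": perf["suspected_input_bottleneck"],
    }


# For a real run, load <model_dir>/performance/performance.json with
# json.load() instead of parsing the sample string.
perf = json.loads(SAMPLE_PERFORMANCE_JSON)
summary = summarize_performance(perf)
print(summary)
```

A low fraction_of_estimate together with suspected_input_bottleneck set may suggest that the input pipeline, rather than the CS system, is limiting throughput.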

Checkpoints

Checkpoints are stored in <model_dir>; for example, <model_dir>/model-ckpt-0.index, <model_dir>/model-ckpt-0.meta, and <model_dir>/model-ckpt-1.data-00000-of-00001. They are saved with the frequency specified in the runconfig section of the params file. If checkpoint frequency is anything other than 0, we always save the last step’s checkpoint, regardless of whether that step aligns with the checkpoint frequency or not.
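The saving rule can be illustrated with a short sketch that lists the steps at which checkpoints would be written. Several points are assumptions rather than documented behavior: the function and parameter names are hypothetical, a frequency of 0 is treated as "no checkpoints" (which the text implies but does not state), and step 0 is included because the example file names above contain model-ckpt-0.

```python
def saved_checkpoint_steps(max_steps, checkpoint_frequency):
    """List the steps that would get a checkpoint under the rule above."""
    if max_steps < 0 or checkpoint_frequency < 0:
        raise ValueError("max_steps and checkpoint_frequency must be non-negative")
    if checkpoint_frequency == 0:
        # Assumption: a frequency of 0 disables checkpointing altogether.
        return []
    # Steps aligned with the frequency, starting from step 0 (assumption).
    steps = list(range(0, max_steps + 1, checkpoint_frequency))
    # The last step is always saved, even when it is not aligned.
    if steps[-1] != max_steps:
        steps.append(max_steps)
    return steps


print(saved_checkpoint_steps(1000, 300))
print(saved_checkpoint_steps(1000, 250))
print(saved_checkpoint_steps(1000, 0))
```

In the first call, step 1000 is not a multiple of 300 but still appears in the result, which mirrors the "always save the last step" behavior.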

TensorBoard event files

TensorBoard event files are stored in the <model_dir>/train directory.

yaml files content after the run

A yaml file is stored in the <model_dir>/train directory. It records the specifics of the run: model-specific configuration (e.g., dropout, activation_fn), optimizer type and optimizer parameters, input data configuration (such as batch_size and shuffle), and run configuration (such as max_steps, checkpoint_steps, and num_epochs).
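The model directory layout described in this section can be inspected with a short script. The sketch below is illustrative only: summarize_model_dir is a hypothetical helper, the glob patterns are inferred from the names quoted above (cs_<checksum>, model-ckpt-*), and the demo runs against a throwaway directory populated with placeholder files.

```python
import tempfile
from pathlib import Path


def summarize_model_dir(model_dir):
    """Collect the artifacts described above, relative to model_dir."""
    root = Path(model_dir)
    train = root / "train"
    return {
        "compile_dirs": sorted(p.name for p in root.glob("cs_*") if p.is_dir()),
        "performance_json": (root / "performance" / "performance.json").exists(),
        "checkpoint_files": sorted(p.name for p in root.glob("model-ckpt-*")),
        "train_dir_files": sorted(p.name for p in train.iterdir()) if train.is_dir() else [],
    }


# Demo on a temporary directory. All file names below are placeholders
# chosen to match the patterns quoted in the text.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "cs_0123abcd").mkdir()
    (root / "performance").mkdir()
    (root / "performance" / "performance.json").write_text("{}")
    (root / "train").mkdir()
    (root / "train" / "params.yaml").write_text("")
    (root / "model-ckpt-0.index").write_text("")
    demo_summary = summarize_model_dir(root)

for key, value in demo_summary.items():
    print(f"{key}: {value}")
```

To inspect a real run, pass the path of your own model directory to summarize_model_dir instead of the temporary one.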

Train and evaluate on CPU/GPU

To train or evaluate on the CPU, call Python directly. Training on GPU may require generating a new virtual environment configured for your GPU hardware requirements. To set up the environment for GPU, refer to these requirements.

# train on CPU
python run.py --mode train \
--params=params.yaml

# run eval on CPU
python run.py --mode eval \
--eval_steps 1000