Launch your job#

Running jobs in the Cerebras Wafer-Scale cluster is as easy as running jobs on a single device.

Prerequisites#

You have already set up a Cerebras virtual environment and cloned the Cerebras Model Zoo into your environment.

Activate Cerebras virtual environment#

Activate the environment on the user node by issuing the following command:

source venv_cerebras_pt/bin/activate

Your shell prompt should now show that you are in the (venv_cerebras_pt) environment.
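
As a quick sanity check, confirm that python now resolves to the interpreter inside the virtual environment (the exact path depends on where you created venv_cerebras_pt):

which python
# should print something like .../venv_cerebras_pt/bin/python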

Important

Remember to activate your virtual environment before you start running jobs on the Cerebras Wafer-Scale Cluster.

Prepare your datasets#

Each model in the Cerebras Model Zoo contains scripts to prepare your datasets. You can find general guidance in the Data processing and dataloaders section, as well as dataset examples in each model's README file in the Cerebras Model Zoo.

For example, the FC-MNIST model contains a prepare_data.py script that downloads sample data. For language models, you can use one of the Cerebras Model Zoo scripts for data processing. An example can be found in the training and fine-tuning LLMs tutorial.
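
For instance, preparing the FC-MNIST sample data might look like the following; the exact location of prepare_data.py inside your Model Zoo clone is an assumption here and may differ between releases:

# run from an activated Cerebras virtual environment
cd /path/to/modelzoo/<fc-mnist-model-directory>   # adjust to wherever prepare_data.py lives in your clone
python prepare_data.py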

Note

After preparing your data, you need to update the data path in the configuration file to point to the absolute path where your data is stored. Inside the configs/ folder, you’ll find YAML files corresponding to different model sizes. Locate the YAML file for the model size you’re working with and modify the data path to reflect the absolute path to your preprocessed data. This step ensures that the model training or evaluation process can correctly locate and utilize your prepared data.

train_input:
  data_dir: "/absolute/path/to/training/dataset"
  ...
eval_input:
  data_dir: "/absolute/path/to/evaluation/dataset/"

Launch your job#

All models in the Cerebras Model Zoo contain a script called run.py. These scripts have been instrumented to launch the compilation, training, and evaluation of your models on the Cerebras Wafer-Scale cluster.

You will need to specify the following flags:

  • CSX (required): Specifies that the target device for execution is a Cerebras cluster.

  • --params <...> (required): Path to a YAML file containing model/run configuration options.

  • --mode <train,eval,train_and_eval,eval_all> (required): The execution mode: train, eval, train_and_eval, or eval_all.

  • --mount_dirs <...> (required): List of paths to be mounted to the Appliance containers. It should include the parent paths of the Cerebras Model Zoo and of other locations needed by the dataloader, including datasets and code. (Default: pulled from the path defined by the environment variable CEREBRAS_WAFER_SCALE_CLUSTER_DEFAULTS.) For more information, see Cerebras cluster settings.

  • --python_paths <...> (required): List of paths to be exported to PYTHONPATH when starting the workers on the Appliance container. It should include the parent paths of the Cerebras Model Zoo and of the Python packages needed by input workers. (Default: pulled from the path defined by the environment variable CEREBRAS_WAFER_SCALE_CLUSTER_DEFAULTS.) For more information, see Cerebras cluster settings.

  • --compile_only (optional): Compiles the model, including matching to Cerebras kernels and mapping to hardware, but does not execute it on the system. Upon success, compile artifacts are stored inside the Cerebras cluster, under the directory specified by --compile_dir. To start training from a pre-compiled model, use the same --compile_dir used in the compile-only run. Mutually exclusive with --validate_only. (Default: None)

  • --validate_only (optional): Validates that the model can be matched to Cerebras kernels. This is a lightweight compilation; it neither maps to the hardware nor executes on the system. Mutually exclusive with --compile_only. (Default: None)

  • --model_dir <...> (optional): Path to store model checkpoints, TensorBoard event files, etc. (Default: $CWD/model_dir)

  • --compile_dir <...> (optional): Path to store the compile artifacts inside the Cerebras cluster. (Default: None)

  • --num_csx <1,2,4,8,16> (optional): Number of CS-X systems to use in training. (Default: 1)

For a more comprehensive list, issue the following command:

python run.py -h

Validate your job (optional)#

To validate that your model implementation is compatible with the Cerebras software platform, use the --validate_only flag. This flag allows you to quickly iterate and check compatibility without requiring full model execution.

python run.py \
      CSX \
      --params params.yaml \
      --num_csx=1 \
      --mode {train,eval,eval_all,train_and_eval} \
      --mount_dirs {paths to modelzoo and to data} \
      --python_paths {paths to modelzoo and other python code if used} \
      --validate_only

Compile your job (optional)#

Use the --compile_only flag to create the executables needed to run your model on the Cerebras cluster. This compilation takes longer than --validate_only, depending on the size and complexity of the model (15 minutes to an hour).

python run.py \
      CSX \
      --params params.yaml \
      --num_csx=1 \
      --model_dir model_dir \
      --mode {train,eval,eval_all,train_and_eval} \
      --mount_dirs {paths to modelzoo and to data} \
      --python_paths {paths to modelzoo and other python code if used} \
      --compile_only

Note

You can use pre-compiled artifacts obtained with --validate_only and --compile_only to speed up your training or evaluation runs. Use the same --compile_dir during compilation and execution to reuse the precompiled artifacts, as in the sketch below.

Since train and eval modes require different fabric programming in the CS-X system, you will obtain different compile artifacts when running with flags --mode train --compile_only and --mode eval --compile_only.
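
A minimal sketch of the compile-then-train pattern, assuming a training run and placeholder paths, looks like this:

# compile only, storing artifacts under a chosen compile directory
python run.py \
      CSX \
      --params params.yaml \
      --mode train \
      --compile_dir /path/to/compile_dir \
      --mount_dirs {paths to modelzoo and to data} \
      --python_paths {paths to modelzoo and other python code if used} \
      --compile_only

# later, execute the training run, reusing the artifacts by passing the same --compile_dir
python run.py \
      CSX \
      --params params.yaml \
      --mode train \
      --model_dir model_dir \
      --compile_dir /path/to/compile_dir \
      --mount_dirs {paths to modelzoo and to data} \
      --python_paths {paths to modelzoo and other python code if used}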

Execute your job#

To execute your job, you need to provide the following information:

  1. A target device that you would like to execute on. To run the job on the Cerebras cluster, add CSX as the first positional argument in the command line.

  2. Information about the Cerebras cluster where the job will be executed, provided via the --python_paths and --mount_dirs flags.

    Note

    You can specify the python_paths and mount_dirs arguments either on the run.py command line or in the runconfig section of the params.yaml file (a sketch of such a runconfig section appears after the command below).

    If running a model from the Cerebras Model Zoo, both of these arguments should include the path to the parent directory where the Cerebras Model Zoo is located. For example, for the directory structure /path/to/parent/modelzoo, the specified argument should be /path/to/parent. For more information, see Cerebras cluster settings.

  3. The mode of execution {train, eval, eval_all, train_and_eval} and the path to the configuration file, passed via the --mode and --params flags.

python run.py \
      CSX \
      --params params.yaml \
      --num_csx=1 \
      --model_dir model_dir \
      --mode {train,eval,eval_all,train_and_eval} \
      --mount_dirs {paths to modelzoo and to data} \
      --python_paths {paths to modelzoo and other python code if used}
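
Alternatively, as noted above, python_paths and mount_dirs can be placed in the runconfig section of params.yaml instead of on the command line. A minimal sketch, assuming the YAML keys mirror the flag names and using placeholder paths:

runconfig:
  mount_dirs:
    - "/path/to/parent"
    - "/path/to/data"
  python_paths:
    - "/path/to/parent"
  ...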

Here is an example of a typical output log for a training job:

Transferring weights to server: 100%|██| 1165/1165 [01:00<00:00, 19.33tensors/s]
INFO:   Finished sending initial weights
INFO:   | Train Device=CSX, Step=50, Loss=8.31250, Rate=69.37 samples/sec, GlobalRate=69.37 samples/sec
INFO:   | Train Device=CSX, Step=100, Loss=7.25000, Rate=68.41 samples/sec, GlobalRate=68.56 samples/sec
INFO:   | Train Device=CSX, Step=150, Loss=6.53125, Rate=68.31 samples/sec, GlobalRate=68.46 samples/sec
INFO:   | Train Device=CSX, Step=200, Loss=6.53125, Rate=68.54 samples/sec, GlobalRate=68.51 samples/sec
INFO:   | Train Device=CSX, Step=250, Loss=6.12500, Rate=68.84 samples/sec, GlobalRate=68.62 samples/sec
INFO:   | Train Device=CSX, Step=300, Loss=5.53125, Rate=68.74 samples/sec, GlobalRate=68.63 samples/sec
INFO:   | Train Device=CSX, Step=350, Loss=4.81250, Rate=68.01 samples/sec, GlobalRate=68.47 samples/sec
INFO:   | Train Device=CSX, Step=400, Loss=5.37500, Rate=68.44 samples/sec, GlobalRate=68.50 samples/sec
INFO:   | Train Device=CSX, Step=450, Loss=6.43750, Rate=68.43 samples/sec, GlobalRate=68.49 samples/sec
INFO:   | Train Device=CSX, Step=500, Loss=5.09375, Rate=66.71 samples/sec, GlobalRate=68.19 samples/sec
INFO:   Training completed successfully!
INFO:   Processed 60500 sample(s) in 887.2672743797302 seconds.

Note

  • Cerebras only supports using a single CS-X when running in eval mode.

  • To scale to multiple CS-X systems, simply add the --num_csx flag specifying the number of CS-X systems. The global batch size divided by the number of CS-X systems gives the effective batch size per system (see the example after this list).

  • Once you have submitted your job to execute in the Cerebras Wafer-Scale cluster, you can track the progress or kill your job using the csctl tool. You can also monitor the performance using a Grafana dashboard.
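
For example, a four-system training run differs from the single-system command only in the --num_csx value; if params.yaml specifies a global batch size of 256, each CS-X would then process 64 samples per step:

python run.py \
      CSX \
      --params params.yaml \
      --num_csx=4 \
      --model_dir model_dir \
      --mode train \
      --mount_dirs {paths to modelzoo and to data} \
      --python_paths {paths to modelzoo and other python code if used}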

Explore output files and artifacts#

The model directory (specified by the --model_dir flag) contains all the results and artifacts of the most recent run, including:

  • Checkpoints

    Checkpoints are stored in the <model_dir> directory.

  • TensorBoard event files

    TensorBoard event files are stored in the <model_dir> directory. Event files can be visualized using TensorBoard. Here’s an example of how to launch TensorBoard:

    $ tensorboard --logdir <model_dir> --bind_all
    
    TensorBoard 2.2.2 at http://<url-to-user-node>:6006/ (Press CTRL+C to quit)
    
  • YAML files

    YAML files containing configuration parameters used in the run are stored in the <model_dir>/train or <model_dir>/eval directory depending on the execution mode.

  • Run logs

    Stdout from the run is located under <model_dir>/cerebras_logs/latest/run.log. If there are multiple runs, look under the corresponding <model_dir>/cerebras_logs/<train|eval>/<timestamp>/run.log.
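
    To follow a run’s progress from the user node, you can tail this log (model_dir here stands for whatever you passed to --model_dir):

    $ tail -f model_dir/cerebras_logs/latest/run.log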

Cancel your job#

If for any reason you wish to cancel your job, issue the following command:

csctl cancel job <jobid>

What’s next?#

Try out our LLM workflow by following the step-by-step instructional tutorial on Training and fine-tuning a Large Language Model (LLM).