Launch your job#

Running jobs on the Cerebras Wafer-Scale Cluster is as easy as running jobs on a single device. To start, you should have already completed the steps in Setup your environment and Clone Cerebras Model Zoo.

3. Prepare your datasets#

Each model in the Cerebras Model Zoo contains scripts to prepare your datasets.

In this example, the fc-mnist implementation automatically downloads the data if it is not available. You can review the dataset locations in the configuration file inside configs/:

train_input:
  data_dir: "/absolute/path/to/training/dataset"
  ...
eval_input:
  data_dir: "/absolute/path/to/evaluation/dataset/"

4. Launch your job#

All models in the Cerebras Model Zoo contain the script run.py. You will use this script, together with the csrun_cpu and csrun_wse wrappers, to submit your job. The run.py script is instrumented to launch compilation, training, and evaluation of your models on the Cerebras Original Installation.

You will need to specify the following flags (a combined example follows the list):

--params <...> (mandatory)
    Path to a YAML file containing model/run configuration options.

--mode <train,eval> (mandatory)
    Whether to train and/or evaluate.

--compile_only (optional)
    Compile the model all the way through, but do not execute on the system. Mutually exclusive with --validate_only.

--validate_only (optional)
    Validate that the model can be matched to Cerebras kernels, but do not execute on the system. Mutually exclusive with --compile_only.

--model_dir <...> (optional; default: $CWD/model_dir)
    Path to store compilation artifacts, checkpoints, TensorBoard event files, etc.
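
As a sketch of how these flags combine (my_model_dir and ${CS_IP} are placeholder values for your own setup):

# Train, storing all artifacts under my_model_dir/ instead of the default.
csrun_wse python-pt run.py --mode train \
  --params configs/params.yaml \
  --model_dir my_model_dir \
  --cs_ip ${CS_IP}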

4.1 (Optional) Compile your job#

For any compilation, you will use the csrun_cpu wrapper script, which accepts the following flags (a combined example follows the list):

--alloc_node (optional; default: True)
    Reserves a physical CPU node exclusively to execute the command.

--mount_dirs <..,..> (optional)
    Comma-separated paths to mount in addition to the defaults specified in the csrun_cpu script.

--use-sbatch (optional)
    Submit a Slurm batch script to execute the command. The script stays in the queue of pending jobs until resources are allocated.
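
As a hedged sketch of combining these wrapper flags with a compile-only run (the extra mount path is a placeholder, and the exact flag syntax may vary with your installation):

# Compile on a CPU node, mounting an extra data directory into the container.
csrun_cpu --mount_dirs /path/to/extra/data \
  python-pt run.py --mode train \
  --compile_only \
  --params configs/params.yaml \
  --cs_ip ${CS_IP}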

To validate that your model implementation is compatible with the Cerebras Software Platform, you can use a validate_only compilation. This type of compilation allows fast iteration on your code while you develop new models.

csrun_cpu python-pt run.py --mode <train or eval> \
  --validate_only \
  --params configs/params.yaml

You can also use a compile_only compilation to create the executables that run your model on the Cerebras Original Installation. This compilation takes longer than validate_only (15 minutes to an hour, depending on the size and complexity of the model). You need to provide the IP address of the attached CS-2 accelerator.

csrun_cpu python-pt run.py --mode <train or eval> \
  --compile_only \
  --params configs/params.yaml \
  --cs_ip ${CS_IP}

Note

The compiler detects whether a binary already exists for a particular model config in the provided model directory and, if so, skips compiling on the fly during training.

4.2 Execute your job#

To execute your job on the Cerebras Original Installation, you need to provide the IP address of the attached CS-2 accelerator and specify the mode of execution (train or eval).

csrun_wse python-pt run.py --mode <train or eval> \
  --params configs/params.yaml \
  --cs_ip ${CS_IP}

Here is a typical output log for a training job:

INFO:   | Train Device=xla:0 Step=50 Loss=8.31250 Rate=69.37 GlobalRate=69.37
INFO:   | Train Device=xla:0 Step=100 Loss=7.25000 Rate=68.41 GlobalRate=68.56
INFO:   | Train Device=xla:0 Step=150 Loss=6.53125 Rate=68.31 GlobalRate=68.46
INFO:   | Train Device=xla:0 Step=200 Loss=6.53125 Rate=68.54 GlobalRate=68.51
INFO:   | Train Device=xla:0 Step=250 Loss=6.12500 Rate=68.84 GlobalRate=68.62
INFO:   | Train Device=xla:0 Step=300 Loss=5.53125 Rate=68.74 GlobalRate=68.63
INFO:   | Train Device=xla:0 Step=350 Loss=4.81250 Rate=68.01 GlobalRate=68.47
INFO:   | Train Device=xla:0 Step=400 Loss=5.37500 Rate=68.44 GlobalRate=68.50
INFO:   | Train Device=xla:0 Step=450 Loss=6.43750 Rate=68.43 GlobalRate=68.49
INFO:   | Train Device=xla:0 Step=500 Loss=5.09375 Rate=66.71 GlobalRate=68.19
INFO:   Training Complete. Completed 60500 sample(s) in 887.2672743797302 seconds.

Note

The Cerebras Original Installation supports only models in pipelined execution.

5. Explore output files and artifacts#

The model directory (specified by the --model_dir flag) contains all the results and artifacts of the latest run, including:

  • Compile directory (cs_<checksum>)

  • performance.json file

  • Checkpoints

  • TensorBoard event files

  • YAML files
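
For orientation, a listing of the model directory after a training run might look like the following (entries are shown with placeholder names and globs; exact contents vary by model and run):

$ ls model_dir/
cs_<checksum>/  model-ckpt-*  performance/  events.out.tfevents.*  *.yaml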

Compile directory (cs_<checksum>)#

The cs_<checksum> directory (also known as the cached compile directory) contains the .elf files used to program the system.

The compilation output indicates whether the compile passed or failed; if it failed, the logs show at which stage the failure occurred.

performance.json file and its parameters#

The performance directory contains the performance.json file, located at <model_dir>/performance/performance.json. It reports the following fields:

  • compile_time: The amount of time it took to compile the model and generate the Cerebras executable.

  • est_samples_per_sec: The estimated performance in samples per second, based on the Cerebras compile. Note that this number is theoretical; actual performance may vary.

  • programming_time: The time taken to prepare the system and load it with the compiled model.

  • samples_per_sec: The actual performance of your run; i.e., the number of samples processed on the WSE per second.

  • suspected_input_bottleneck: A beta feature. It indicates whether you are input-starved and need more input workers to feed the Cerebras system.

  • total_samples: The total number of samples processed during execution.

  • total_time: The total time it took to complete the total samples.
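
To inspect these fields after a run, you can pretty-print the file (the path assumes the default --model_dir of $CWD/model_dir):

# Pretty-print the reported metrics.
python -m json.tool model_dir/performance/performance.json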

Checkpoints#

Checkpoints are stored in <model_dir>/model-ckpt*.
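
For example, to list the saved checkpoints (assuming the default model directory name):

ls model_dir/model-ckpt*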

TensorBoard event files#

TensorBoard event files are stored in the <model_dir> directory.
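
To visualize the training curves, point TensorBoard at the model directory (this assumes TensorBoard is installed in your Python environment):

# Serve the event files on TensorBoard's default port (6006).
tensorboard --logdir model_dir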