Launch your job#
Running jobs in the Cerebras Wafer-Scale Cluster is as easy as running jobs on a single device. To start, you should already have set up your environment and cloned the Cerebras Model Zoo (see Setup your environment and Clone Cerebras Model Zoo).
3. Prepare your datasets#
Each of the models in the Cerebras Model Zoo contains scripts to prepare your datasets.
In this example, the fc-mnist implementation automatically downloads the data if it is not available. You can review the location of the dataset in the configuration file inside configs/:
train_input:
    data_dir: "/absolute/path/to/training/dataset"
    ...
eval_input:
    data_dir: "/absolute/path/to/evaluation/dataset"
4. Launch your job#
All models in the Cerebras Model Zoo contain the script `run.py`. You will use this script and the wrappers `csrun_cpu` and `csrun_wse` to submit your job. The `run.py` script is instrumented to launch compilation, training, and evaluation of your models in the Cerebras Original Installation.

You will need to specify the following flags:
| Flag | Mandatory | Description | Default Value |
|---|---|---|---|
| `--params` | Yes | Path to a YAML file containing model/run configuration options. | |
| `--mode` | Yes | Whether to train and/or evaluate. | |
| `--compile_only` | No | Compile the model all the way, but don't execute on the system. Mutually exclusive with `--validate_only`. | |
| `--validate_only` | No | Validate that the model can be matched to Cerebras kernels, but don't execute on the system. Mutually exclusive with `--compile_only`. | |
| `--model_dir` | No | Path to store compilation artifacts, checkpoints, TensorBoard event files, etc. | `$CWD/model_dir` |
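For instance, a training launch that writes artifacts to a custom model directory might look like the following sketch (the model directory name is arbitrary and ${CS_IP} is a placeholder for your system's IP address):

# Train on the CS system, storing artifacts under ./my_model_dir.
csrun_wse python run.py --mode train \
    --params configs/params.yaml \
    --model_dir my_model_dir \
    --cs_ip ${CS_IP}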
4.1 (Optional) Compile your job#
For any compilation, you will use the script `csrun_cpu`, which is instrumented with the following flags:
| Flag | Mandatory | Description | Default Value |
|---|---|---|---|
| `--alloc_node` | No | Reserves a physical CPU node exclusively to execute the command. | True |
| `--mount_dirs` | No | Comma-separated paths to mount in addition to the defaults specified in the `csrun_cpu` script. | |
| `--use_sbatch` | No | Submit a slurm batch script to execute the command. The script will stay in the queue of pending jobs until resources are allocated. | |
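For example, a validation compile launched on a dedicated CPU node with an extra directory mounted might look like this sketch (the flag spellings follow the table above; the mount path is a placeholder):

# Reserve a CPU node and mount an additional data directory for the compile.
csrun_cpu --alloc_node=True --mount_dirs=/path/to/extra/data \
    python run.py --mode train --validate_only --params configs/params.yaml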
To validate that your model implementation is compatible with the Cerebras Software Platform, you can use a `validate_only` compilation. This type of compilation allows fast iteration on your code while developing new models.
PyTorch:

csrun_cpu python-pt run.py --mode <train or eval> \
    --validate_only \
    --params configs/params.yaml

TensorFlow:

csrun_cpu python run.py --mode <train or eval> \
    --validate_only \
    --params configs/params.yaml
You can also use a `compile_only` compilation to create the executables that run your model in the Cerebras Original Installation. This compilation takes longer than `validate_only` (15 minutes to an hour, depending on the size and complexity of the model). You need to provide the IP address of the CS-2 attached accelerator.
PyTorch:

csrun_cpu python-pt run.py --mode <train or eval> \
    --compile_only \
    --params configs/params.yaml \
    --cs_ip ${CS_IP}

TensorFlow:

csrun_cpu python run.py --mode <train or eval> \
    --compile_only \
    --params configs/params.yaml \
    --cs_ip ${CS_IP}
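These commands assume CS_IP is set in your shell environment, for example (the address below is a placeholder for your CS-2 system's IP):

export CS_IP=192.168.1.1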
Note
The compiler detects whether a binary already exists for a particular model config in the provided model directory and skips compiling on the fly during training if it detects one.
4.2 Execute your job#
To execute your job in the Cerebras Original Installation, you need to provide the IP address of the CS-2 attached accelerator and specify the mode of execution (train or eval).
PyTorch:

csrun_wse python-pt run.py --mode <train or eval> \
    --params configs/params.yaml \
    --cs_ip ${CS_IP}

TensorFlow:

csrun_wse python run.py --mode <train or eval> \
    --params configs/params.yaml \
    --cs_ip ${CS_IP}
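For example, once a training run has produced checkpoints, the same command with --mode eval evaluates them (TensorFlow variant shown; a sketch, assuming checkpoints are in the default model directory):

csrun_wse python run.py --mode eval \
    --params configs/params.yaml \
    --cs_ip ${CS_IP}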
Here are examples of typical output logs for a training job:
PyTorch:

INFO: | Train Device=xla:0 Step=50 Loss=8.31250 Rate=69.37 GlobalRate=69.37
INFO: | Train Device=xla:0 Step=100 Loss=7.25000 Rate=68.41 GlobalRate=68.56
INFO: | Train Device=xla:0 Step=150 Loss=6.53125 Rate=68.31 GlobalRate=68.46
INFO: | Train Device=xla:0 Step=200 Loss=6.53125 Rate=68.54 GlobalRate=68.51
INFO: | Train Device=xla:0 Step=250 Loss=6.12500 Rate=68.84 GlobalRate=68.62
INFO: | Train Device=xla:0 Step=300 Loss=5.53125 Rate=68.74 GlobalRate=68.63
INFO: | Train Device=xla:0 Step=350 Loss=4.81250 Rate=68.01 GlobalRate=68.47
INFO: | Train Device=xla:0 Step=400 Loss=5.37500 Rate=68.44 GlobalRate=68.50
INFO: | Train Device=xla:0 Step=450 Loss=6.43750 Rate=68.43 GlobalRate=68.49
INFO: | Train Device=xla:0 Step=500 Loss=5.09375 Rate=66.71 GlobalRate=68.19
INFO: Training Complete. Completed 60500 sample(s) in 887.2672743797302 seconds.
TensorFlow:

srun: job 5834 queued and waiting for resources
srun: job 5834 has been allocated resources
...
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Saving checkpoints for 0 into model_dir/model.ckpt.
INFO:tensorflow:Programming CS system fabric. This may take a couple of minutes - please do not interrupt.
INFO:tensorflow:Fabric programmed
INFO:tensorflow:Coordinator fully up. Waiting for Streaming (using 0.97% out of 301600 cores on the fabric)
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
...
INFO:tensorflow:Training finished with 25600000 samples in 187.465 seconds, 136558.69 samples / second
INFO:tensorflow:Saving checkpoints for 100000 into model_dir/model.ckpt.
INFO:tensorflow:global step 100000: loss = 1.901388168334961e-05 (532.0 steps/sec)
INFO:tensorflow:Loss for final step: 1.9e-05.
Note
The Cerebras Original Installation supports only models in pipelined execution.
5. Explore output files and artifacts#
The model directory (as specified by the `--model_dir` flag) contains all the results and artifacts of the latest run, including:

- Compile directory (`cs_<checksum>`)
- `performance.json` file
- Checkpoints
- TensorBoard event files
- `yaml` files
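A quick way to survey these artifacts is to list the model directory after a run (a sketch; model_dir is the default directory name and the contents shown are illustrative):

ls model_dir/
# e.g.: cs_<checksum>/  performance/  model-ckpt-0.index  events.out.tfevents.*  params.yaml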
Compile dir – The directory containing the cs_<checksum>#

The `cs_<checksum>` dir (also known as the cached compile directory) contains the `.elf` file, which is used to program the system.

The output of compilation indicates whether the compile passed or failed; if it failed, the logs show at which stage compilation failed.
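To check whether a cached compile already exists for reuse (per the note in section 4.1), you can look for this directory inside your model directory (a sketch; the checksum is generated per model configuration):

ls model_dir/cs_*/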
performance.json file and its parameters#

There is a performance directory that should contain the file `<model_dir>/performance/performance.json`. It contains the information listed below:
- `compile_time`: The amount of time it took to compile the model and generate the Cerebras executable.
- `est_samples_per_sec`: The estimated performance, in samples per second, based on the Cerebras compile. Note that this number is theoretical; actual performance may vary.
- `programming_time`: The time taken to prepare the system and load it with the compiled model.
- `samples_per_sec`: The actual performance of your run; i.e., the number of samples processed on the WSE per second.
- `suspected_input_bottleneck`: This is a beta feature. It indicates whether you are input-starved and need more input workers to feed the Cerebras system.
- `total_samples`: The total gross samples that were iterated during the execution.
- `total_time`: The total time it took to complete the total samples.
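A quick way to inspect these values after a run (a minimal sketch, assuming the default model directory name):

python -m json.tool model_dir/performance/performance.json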
Checkpoints#
Checkpoints are stored in `<model_dir>/model-ckpt*`.
TensorBoard event files#

TensorBoard event files are stored in the `<model_dir>` directory.
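To visualize them, point TensorBoard at the model directory (a sketch; the port is arbitrary):

tensorboard --logdir model_dir --port 6006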