Integration with Slurm#

Overview#

Cerebras has streamlined its appliance workflow by integrating Slurm, allowing for a well-orchestrated division of responsibilities between Slurm and Kubernetes (k8s) for resource management. This integration introduces “surrogate jobs” to enhance job tracking and management.

Users can submit their run.py script using two different Slurm commands, salloc and sbatch, each catering to different needs:

Using salloc#

When the user submits their job using salloc, the command requests and obtains the necessary resource allocation from Slurm.

Once the allocation is acquired, the run.py script is executed.

Importantly, relevant surrogate jobs are automatically created as part of the workflow when run.py is invoked through salloc. These surrogate jobs help with resource tracking and accounting.

Using sbatch#

Alternatively, the user can submit their job through sbatch, which invokes a bash script.

This bash script, in turn, calls the run.py script. Similar to the salloc method, the relevant surrogate jobs are also automatically generated when run.py is invoked through sbatch. These surrogate jobs serve the same purpose of aiding in resource tracking and accounting.

It’s important to note that the surrogate jobs are a crucial part of the Cerebras appliance workflow and are automatically created when run.py is submitted via either salloc or sbatch. However, these surrogate jobs will not be generated if run.py is submitted using the srun command.
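
For illustration, a submission of the following form goes through srun directly, so no surrogate job steps are created for it. This is a minimal sketch; the allocation options and the run.py arguments are placeholders:

  # run.py still executes, but no wsjob-* surrogate job steps will appear
  # in Slurm accounting tools such as sacct.
  srun --cpus-per-task 40 --ntasks 1 \
    python run.py CSX --params <params.yaml> --mode train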

Surrogate jobs#

A surrogate job, in the context of the Cerebras Wafer-Scale cluster and Slurm integration, is essentially a job step within the Slurm cluster that represents the execution of an appliance job on the Cerebras Wafer-Scale cluster. Here are key characteristics and behaviors of surrogate jobs:

  1. Representation of Appliance Jobs:

    A surrogate job serves as a representation or surrogate for an appliance job that is running on the Cerebras cluster.

  2. Creation and Termination:

    Surrogate jobs are automatically created in the Slurm cluster when the corresponding appliance job enters the “running” state on the Cerebras cluster, and they terminate when the appliance job finishes.

  3. Completion States:

    A surrogate job can conclude with one of two states:

    • COMPLETED: This state indicates that the associated appliance job completed successfully without errors.

    • FAILED: This state signifies that the associated appliance job encountered issues or errors during execution and did not complete successfully.

  4. Naming Convention:

    Jobs that require the allocation of CS-Xs, which are Cerebras-specific hardware resources, may have a naming suffix of “-csxN,” where “N” represents the number of allocated CS-Xs. This naming convention helps identify the resource allocation associated with the appliance job.

The runtime of a surrogate job should match the runtime of its appliance job.
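
Because surrogate jobs are ordinary Slurm job steps, their state and runtime can be inspected with standard Slurm accounting tools. A minimal sketch, assuming a Slurm job ID of 65 (replace it with the ID reported by sbatch or salloc):

  # List the job and its steps; the wsjob-* entries are the surrogate job steps.
  # The Elapsed value of a surrogate step tracks the runtime of its appliance job.
  sacct -j 65 --format="JobID,JobName%34,State,Elapsed,ExitCode"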

Submit jobs with sbatch#

As an example, we will train GPT2 small with Weight Streaming using the PyTorch implementation in Cerebras Model Zoo.

Once you Clone Cerebras Model Zoo, you will find the following contents in the folder <parent_dir_modelzoo>/modelzoo/modelzoo/models/nlp/gpt2/:

configs/
  params_gpt2_small.yaml
input/
data.py
gpt2_model.py
model.py
run.py
...

Here is an example of an sbatch.sh script that invokes the run.py located at <parent_dir_modelzoo>/modelzoo/modelzoo/models/nlp/gpt2/. Assume that you have Set up a Cerebras virtual environment at $HOME/venv_cerebras_pt. The run.py script will start a compile appliance job and a train appliance job for a PyTorch GPT2 small test:

#!/bin/bash

#SBATCH --job-name=gpt2-small-test
#SBATCH --nodes=1
#SBATCH --tasks=1
#SBATCH --cpus-per-task 40
#SBATCH --output <parent_dir_modelzoo>/modelzoo/modelzoo/models/nlp/gpt2/run-%j.out

source $HOME/venv_cerebras_pt/bin/activate
cd <parent_dir_modelzoo>/modelzoo/modelzoo/models/nlp/gpt2
python \
  run.py CSX \
    --params configs/params_gpt2_small.yaml \
    --num_csx 1 \
    --model_dir model_dir/ \
    --num_wgt_servers 2 \
    --mode train

Submit this script to Slurm with the command sbatch sbatch.sh. While the script is running, the sacct command shows the surrogate jobs starting up:

sacct --format="JobID,JobName%34,Account,State,CPUTime,AllocCPUS,AllocNodes,ExitCode"
       JobID                        JobName    Account      State    CPUTime  AllocCPUS AllocNodes ExitCode
------------ ---------------------------------- ---------- ---------- ---------- ---------- ---------- --------
65                              gpt2-small-test               RUNNING   01:09:20         40          1      0:0
65.batch                                  batch               RUNNING   01:09:20         40          1      0:0
65.0               wsjob-vpd7iavtknrtvgonhe7mog             COMPLETED   00:02:40         40          1      0:0
65.1          wsjob-z3pbtbl8sgmfqtivpazrtn-csx1               RUNNING   00:02:40         40          1      0:0

In this example,

  • The gpt2-small-test job is the Slurm-specific top-level job.

  • The batch job step is a Slurm-specific job step.

  • The surrogate job step wsjob-vpd7iavtknrtvgonhe7mog was created for the compile appliance job and has completed. Since it is a compile job, no CS-Xs are required, so there is no -csxN suffix.

  • The surrogate job step wsjob-z3pbtbl8sgmfqtivpazrtn-csx1 was created for the train appliance job and is still running. The suffix -csx1 indicates that 1 CS-X is allocated to the job.

Submit jobs with salloc#

The above run.py can also be invoked using salloc:

salloc --cpus-per-task 40 --tasks 1

source $HOME/venv_cerebras_pt/bin/activate
cd <parent_dir_modelzoo>/modelzoo/modelzoo/models/nlp/gpt2
python \
  run.py CSX \
    --params configs/params_gpt2_small.yaml \
    --num_csx 1 \
    --model_dir model_dir/ \
    --num_wgt_servers 2 \
    --mode train

While this is running, the sacct command shows the corresponding surrogate jobs:

sacct --format="JobID,JobName%34,Account,State,CPUTime,AllocCPUS,AllocNodes,ExitCode"
        JobID                        JobName    Account      State    CPUTime  AllocCPUS AllocNodes ExitCode
 ------------ ---------------------------------- ---------- ---------- ---------- ---------- ---------- --------
 68                                  interactive               RUNNING   00:03:30          2          1      0:0
 68.0               wsjob-lbrjprpjuj2dfsbfsebdq8             COMPLETED   00:00:08          2          1      0:0
 68.1          wsjob-dazjdtytvfn4njtcbchsik-csx1               RUNNING   00:00:06          2          1      0:0

In this example,

  • The interactive job is the Slurm-specific top-level job.

  • The surrogate job step wsjob-lbrjprpjuj2dfsbfsebdq8 was created for the compile appliance job and has completed.

  • The surrogate job step wsjob-dazjdtytvfn4njtcbchsik-csx1 was created for the train appliance job and is still running.
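
When the training run finishes, the allocation obtained with salloc stays active until the interactive shell is closed (or the allocation reaches its time limit); for example:

  # Leave the shell spawned by salloc to release the allocated resources.
  exit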

Time limits#

We support two types of time limits:

Time limit in Slurm#

This time limit defines the runtime limit for the sbatch or salloc job. It includes the time when the underlying appliance jobs are waiting in the appliance queue. An example of enabling this time limit is as follows.

  #SBATCH --job-name=gpt2-small-test
  #SBATCH --nodes=1
  #SBATCH --tasks=1
  #SBATCH --cpus-per-task 40
  #SBATCH --output <parent_dir_modelzoo>/modelzoo/modelzoo/models/nlp/gpt2/run-%j.out
  #SBATCH --time=0:60
  #SBATCH --signal=TERM@30

This sets a 60-second time limit for the sbatch job; the --signal=TERM@30 directive asks Slurm to send SIGTERM to the job 30 seconds before the limit is reached.
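
The same kind of limit can be applied to an interactive submission by passing --time to salloc. A minimal sketch, reusing the allocation options from the earlier example:

  # Request a 60-second wall-clock limit for the interactive allocation.
  salloc --cpus-per-task 40 --tasks 1 --time=0:60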

Time limit in the appliance#

This time limit defines the runtime limit for all appliance jobs in a run.py invocation. It does not count the time when the appliance jobs are waiting in the appliance queue. The limit can be specified through the run.py command-line argument --job_time_sec, as in the following example.

  source $HOME/venv_cerebras_pt/bin/activate
  cd <parent_dir_modelzoo>/modelzoo/modelzoo/models/nlp/gpt2
  python run.py \
      CSX \
      --params configs/params_gpt2_small.yaml \
      --num_csx 1 \
      --model_dir model_dir/ \
      --num_wgt_servers 2 \
      --mode train \
      --job_time_sec 60

This sets a 60-second time limit on the appliance jobs for this run.py invocation.
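
The two limits are independent and can be combined in a single submission. A minimal sketch that merges the directives above; the one-hour values are placeholders:

  #!/bin/bash
  #SBATCH --job-name=gpt2-small-test
  #SBATCH --nodes=1
  #SBATCH --tasks=1
  #SBATCH --cpus-per-task 40
  #SBATCH --time=1:00:00
  #SBATCH --signal=TERM@30

  source $HOME/venv_cerebras_pt/bin/activate
  cd <parent_dir_modelzoo>/modelzoo/modelzoo/models/nlp/gpt2
  python run.py \
      CSX \
      --params configs/params_gpt2_small.yaml \
      --num_csx 1 \
      --model_dir model_dir/ \
      --num_wgt_servers 2 \
      --mode train \
      --job_time_sec 3600

Here the Slurm limit (--time) bounds the whole submission, including any time the appliance jobs spend in the appliance queue, while --job_time_sec bounds only the runtime of the appliance jobs themselves.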

Conclusion#

The integration of Slurm with the Cerebras appliance workflow marks a significant advancement in resource management and job scheduling for deep learning tasks. By utilizing surrogate jobs, users can efficiently monitor and manage their workloads on the Cerebras Wafer-Scale Engine, ensuring that resources are optimally allocated and utilized. Whether through salloc or sbatch, the workflow allows for precise tracking and control over job execution, enhancing the overall efficiency and predictability of the training process. This integration not only streamlines the job submission process but also provides a robust framework for scaling and managing complex machine learning workloads in a high-performance computing environment.