Integration with Slurm#

In release 1.8.0, Cerebras introduced a streamlined integration with Slurm in its appliance workflow. The integration divides responsibilities between Slurm and Kubernetes (k8s) to manage resources efficiently: appliance jobs running on the Cerebras cluster are mirrored as Slurm job steps, which we call surrogate jobs.

Here’s an overview of this integration:

In the Cerebras appliance workflow, users can submit their run.py script through two different Slurm commands: salloc and sbatch. Each of these commands serves a distinct purpose:

  • Using salloc

    When the user submits their job using salloc, Slurm first grants the requested resource allocation and then the run.py script is executed. Importantly, the relevant surrogate jobs are automatically created as part of the workflow when run.py is invoked through salloc. These surrogate jobs help with resource tracking and accounting.

  • Using sbatch

    Alternatively, the user can submit their job through sbatch, which invokes a Bash script. This Bash script, in turn, calls the run.py script. Similar to the salloc method, the relevant surrogate jobs are also automatically generated when run.py is invoked through sbatch. These surrogate jobs serve the same purpose of aiding in resource tracking and accounting.

It’s important to note that the surrogate jobs are a crucial part of the Cerebras appliance workflow and are automatically created when run.py is submitted via either salloc or sbatch. However, these surrogate jobs will not be generated if run.py is submitted using the srun command.

Surrogate jobs#

A surrogate job, in the context of the Cerebras Wafer-Scale cluster and Slurm integration, is essentially a job step within the Slurm cluster that represents the execution of an appliance job on the Cerebras Wafer-Scale cluster. Here are key characteristics and behaviors of surrogate jobs:

  1. Representation of Appliance Jobs:

    A surrogate job serves as a representation or surrogate for an appliance job that is running on the Cerebras cluster.

  2. Creation and Termination:

    Surrogate jobs are automatically created in the Slurm cluster when the corresponding appliance job enters the “running” state on the Cerebras cluster, and they terminate when that appliance job finishes.

  3. Completion States:

    A surrogate job can conclude with one of two states:

    • COMPLETED: This state indicates that the associated appliance job completed successfully without errors.

    • FAILED: This state signifies that the associated appliance job encountered issues or errors during execution and did not complete successfully.

  4. Naming Convention:

    Jobs that require the allocation of CS-2s, which are Cerebras-specific hardware resources, may have a naming suffix of “-csxN,” where “N” represents the number of allocated CS-2s. This naming convention helps identify the resource allocation associated with the appliance job.

The runtime of a surrogate job should match the runtime of its appliance job.
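
For example, the Elapsed field in sacct shows how long each surrogate job step has been running, which should correspond to the runtime of its appliance job. A minimal sketch, assuming a hypothetical top-level Slurm job ID of 65 (substitute your own):

sacct -j 65 --format="JobID,JobName%34,State,Elapsed"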

Submit jobs with sbatch#

As an example, we will train GPT2 small with Weight Streaming using the PyTorch implementation in Cerebras Model Zoo.

Once you have cloned the Cerebras Model Zoo repository, you will find the following contents in the folder <parent_dir_modelzoo>/modelzoo/modelzoo/transformers/pytorch/gpt2/:

configs/
  params_gpt2_small.yaml
input/
data.py
gpt2_model.py
model.py
run.py
...

Here is an example of an sbatch.sh script which invokes the run.py located at <parent_dir_modelzoo>/modelzoo/modelzoo/transformers/pytorch/gpt2/. Assume that you have set up a Cerebras virtual environment at $HOME/venv_cerebras_pt. The run.py will start a compile and a train appliance job for a PyTorch GPT2 small test:

#!/bin/bash

#SBATCH --job-name=gpt2-small-test
#SBATCH --nodes=1
#SBATCH --tasks=1
#SBATCH --cpus-per-task 40
#SBATCH --output <parent_dir_modelzoo>/modelzoo/modelzoo/transformers/pytorch/gpt2/run-%j.out

source $HOME/venv_cerebras_pt/bin/activate
cd <parent_dir_modelzoo>/modelzoo/modelzoo/transformers/pytorch/gpt2
python \
  run.py CSX \
    --params configs/params_gpt2_small.yaml \
    --num_csx 1 \
    --model_dir model_dir/ \
    --num_wgt_servers 2 \
    --mode train

Invoke this script with the command sbatch sbatch.sh. While the script is running, the sacct command shows the surrogate jobs starting up:

sacct --format="JobID,JobName%34,Account,State,CPUTime,AllocCPUS,AllocNodes,ExitCode"
       JobID                        JobName    Account      State    CPUTime  AllocCPUS AllocNodes ExitCode
------------ ---------------------------------- ---------- ---------- ---------- ---------- ---------- --------
65                              gpt2-small-test               RUNNING   01:09:20         40          1      0:0
65.batch                                  batch               RUNNING   01:09:20         40          1      0:0
65.0               wsjob-vpd7iavtknrtvgonhe7mog             COMPLETED   00:02:40         40          1      0:0
65.1          wsjob-z3pbtbl8sgmfqtivpazrtn-csx1               RUNNING   00:02:40         40          1      0:0

In this example,

  • The gpt2-small-test job is the Slurm-specific top-level job.

  • The batch job step is a Slurm-specific job step.

  • The surrogate job step wsjob-vpd7iavtknrtvgonhe7mog was created for the compile appliance job and has completed. Because a compile job does not require CS-2s, there is no -csxN suffix.

  • The surrogate job step wsjob-z3pbtbl8sgmfqtivpazrtn-csx1 was created for the train appliance job and is still running. The suffix -csx1 indicates that 1 CS-2 is allocated to the job.
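
To follow the surrogate job steps of a single submission as they progress, you can restrict sacct to that job ID and refresh it periodically. A minimal sketch, assuming the job ID 65 from the output above (watch is a standard Linux utility, not part of Slurm):

watch -n 10 'sacct -j 65 --format="JobID,JobName%34,State,Elapsed,ExitCode"'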

Submit jobs with salloc#

The above run.py can also be invoked using salloc:

salloc --cpus-per-task 40 --tasks 1

source $HOME/venv_cerebras_pt/bin/activate
cd <parent_dir_modelzoo>/modelzoo/modelzoo/transformers/pytorch/gpt2
python \
  run.py CSX \
    --params configs/params_gpt2_small.yaml \
    --num_csx 1 \
    --model_dir model_dir/ \
    --num_wgt_servers 2 \
    --mode train

While this is running, the sacct command shows the corresponding surrogate jobs:

sacct --format="JobID,JobName%34,Account,State,CPUTime,AllocCPUS,AllocNodes,ExitCode"
        JobID                        JobName    Account      State    CPUTime  AllocCPUS AllocNodes ExitCode
 ------------ ---------------------------------- ---------- ---------- ---------- ---------- ---------- --------
 68                                  interactive               RUNNING   00:03:30          2          1      0:0
 68.0               wsjob-lbrjprpjuj2dfsbfsebdq8             COMPLETED   00:00:08          2          1      0:0
 68.1          wsjob-dazjdtytvfn4njtcbchsik-csx1               RUNNING   00:00:06          2          1      0:0

In this example,

  • The interactive job is the Slurm-specific top-level job.

  • The surrogate job step wsjob-lbrjprpjuj2dfsbfsebdq8 was created for the compile appliance job and has completed.

  • The surrogate job step wsjob-dazjdtytvfn4njtcbchsik-csx1 was created for the train appliance job and is still running.
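
Once run.py completes, leaving the shell that salloc opened releases the allocation. This assumes the default salloc behavior of spawning an interactive shell on the node where it was run:

exit   # or Ctrl-D; releases the salloc allocation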

Time limits#

We support two types of time limits:

Time limit in Slurm#

This time limit defines the runtime limit for the sbatch or salloc job. It includes the time the underlying appliance jobs spend waiting in the appliance queue. An example of enabling this time limit is as follows.

  #SBATCH --job-name=gpt2-small-test
  #SBATCH --nodes=1
  #SBATCH --tasks=1
  #SBATCH --cpus-per-task 40
  #SBATCH --output <parent_dir_modelzoo>/modelzoo/modelzoo/transformers/pytorch/gpt2/run-%j.out
  #SBATCH --time=0:60
  #SBATCH --signal=TERM@30

This sets a 60-second time limit for the sbatch job. With --signal=TERM@30, Slurm also sends the job a SIGTERM 30 seconds before the limit is reached so that it can shut down cleanly.
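
If the job runs past this limit, Slurm terminates it and the top-level job typically ends in the TIMEOUT state. A quick way to confirm this after the fact, assuming a hypothetical job ID of 65:

  sacct -j 65 --format="JobID,JobName%34,State,Elapsed,Timelimit"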

Time limit in the appliance#

This time limit defines the runtime limit for all appliance jobs started by a run.py invocation. It does not count the time the appliance jobs spend waiting in the appliance queue. The limit can be specified through the run.py command-line argument --job_time_sec, as in the following example.

  source $HOME/venv_cerebras_pt/bin/activate
  cd <parent_dir_modelzoo>/modelzoo/modelzoo/transformers/pytorch/gpt2
  python run.py \
      CSX \
      --params configs/params_gpt2_small.yaml \
      --num_csx 1 \
      --model_dir model_dir/ \
      --num_wgt_servers 2 \
      --mode train \
      --job_time_sec 60

This sets a 60-second timeout on the appliance jobs for this run.py invocation.
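
Both limits can be combined in a single sbatch script, so that the overall Slurm job (whose clock includes appliance queue time) and the appliance jobs themselves are each bounded. Below is a minimal sketch assembled from the examples above; the time values (30 minutes for Slurm, 1200 seconds for the appliance) are placeholders, not recommendations:

  #!/bin/bash
  #SBATCH --job-name=gpt2-small-test
  #SBATCH --nodes=1
  #SBATCH --tasks=1
  #SBATCH --cpus-per-task 40
  #SBATCH --output <parent_dir_modelzoo>/modelzoo/modelzoo/transformers/pytorch/gpt2/run-%j.out
  # Slurm limit of 30 minutes; this clock also covers time spent in the appliance queue
  #SBATCH --time=30:00
  #SBATCH --signal=TERM@30

  source $HOME/venv_cerebras_pt/bin/activate
  cd <parent_dir_modelzoo>/modelzoo/modelzoo/transformers/pytorch/gpt2
  # --job_time_sec bounds only the appliance jobs' runtime, not their queue time
  python run.py \
      CSX \
      --params configs/params_gpt2_small.yaml \
      --num_csx 1 \
      --model_dir model_dir/ \
      --num_wgt_servers 2 \
      --mode train \
      --job_time_sec 1200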