This section describes how to use the csrun_wse script for training, evaluation, and prediction. It applies only to pipelined models, for both the Slurm/Singularity workflow and the Kubernetes workflow.
Slurm wrapper scripts (such as csrun_cpu) may be customized for your particular environment by your sysadmins and may look different from what is shown below. Check whether your sysadmin's local documentation is available and whether there are any special instructions for your CS-2.
> csrun_wse --help

Usage: csrun_wse [--help] [--mount-dirs] [--total-nodes] [--tasks-per-node]
                 [--cpus-per-task] [--single-task-nodes] [--use-sbatch]
                 command_for_cs_execution
...
Runs the given <command_for_cs_execution> command on the CS system.
The following applies:
The specific type of the execution task, i.e., training, evaluation, or prediction, is specified in the <command_for_cs_execution> command (for example, via the --mode flag of run.py).
The input pipeline is run on a Cerebras server cluster with multiple workers, coordinated by Slurm.
Unless the optional arguments for Slurm configuration are specified, this script uses the following default values:

total-nodes: $DEF_NODES
tasks-per-node: $DEF_TASKS
cpus-per-task: $DEF_CPUS
We recommend that you first configure the csrun_cpu script by setting the Slurm variables before running this csrun_wse script. See The csrun_cpu Script.
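As an illustration, the defaults above might be defined near the top of your site's csrun_cpu wrapper. The variable values below are hypothetical; your sysadmin sets the real ones for your cluster:

# Hypothetical excerpt from a site-customized csrun_cpu wrapper.
# Placeholder values; your sysadmin chooses them for your cluster.
DEF_NODES=1     # default --total-nodes
DEF_TASKS=8     # default --tasks-per-node
DEF_CPUS=16     # default --cpus-per-task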
command_for_cs_execution: A Python command to initiate a task (train, eval, predict) that will execute on the CS system.
--mount-dirs: (Optional) String of comma-separated paths to mount, in addition to the standard paths listed in csrun_cpu. Default is an empty string, i.e., only the paths listed in csrun_cpu are mounted.
--total-nodes: (Optional) Number of nodes to execute with. Default is as listed above.
--tasks-per-node: (Optional) Number of tasks per node to execute with. Default is as listed above.
--cpus-per-task: (Optional) Number of CPUs per task to execute with. Default is as listed above. Applies only to the Slurm workflow.
--single-task-nodes: (Optional) Number of nodes, among the total nodes, that will run only a single task. Default is 0, indicating that all nodes will have multiple tasks running on them.
--use-sbatch: (Optional) Adding this flag will submit a batch script to Slurm to execute <command_for_cs_execution>. sbatch will immediately exit after submitting the script. The script will stay in the Slurm queue of pending jobs until resources are allocated. Applies only to the Slurm workflow.
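For example, the following invocation (with hypothetical values) reserves one of the four nodes for a single task, while the remaining three nodes each run six tasks:

> csrun_wse --total-nodes=4 --tasks-per-node=6 --single-task-nodes=1 \
      python run.py --mode=train --cs_ip=0.0.0.0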
> csrun_wse --total-nodes=3 \
      --tasks-per-node=5 \
      --cpus-per-task=16 \
      python run.py --mode=train \
      --cs_ip=0.0.0.0
The above csrun_wse command executes the Python command:
python run.py --mode=train --cs_ip=0.0.0.0
which initiates model training on the CS system at the given cs_ip address. As specified in the command line options for Slurm, 3 nodes with 5 workers each, and 16 CPUs assigned per worker, are used for this training task.
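In total, the Slurm options above request 3 × 5 = 15 tasks, i.e., 15 × 16 = 240 CPUs across the three nodes.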
> csrun_wse --mount-dirs="/data/ml,/lab/ml" \
      --use-sbatch \
      python run.py --mode=eval \
      --eval_steps=1000 \
      --cs_ip=0.0.0.0
The above csrun_wse command mounts "/data/ml" and "/lab/ml" in addition to the default mount directories, and launches a batch script with the Python command:
python run.py --mode=eval --eval_steps=1000 --cs_ip=0.0.0.0
which initiates model evaluation on the CS system at the given cs_ip address. The default Slurm settings are used.
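Because --use-sbatch returns as soon as the batch script is submitted, you can monitor the queued job with standard Slurm tools, for example (the job ID below is illustrative):

> squeue -u $USER           # list your pending and running jobs
> scontrol show job 12345   # inspect a specific job's allocation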
Checkpoints and logs#
The checkpoints, logs, and event files will be stored in the model_dir. If the directory exists, then the weights are loaded from the checkpoint file. This is the same model_dir that is passed to the CerebrasEstimator.
When training with the CerebrasEstimator, by default a checkpoint is always taken at the beginning and end of training.
If you wish to take checkpoints more frequently, use save_checkpoints_steps in the CSRunConfig. Refer to the Setting the runtime configuration section.
Loss: The loss is stored for TensorBoard based on save_summary_steps. You can set the default value in the CSRunConfig.
Similarly, TensorFlow logging is output based on log_step_count_steps. You can set the default value in the CSRunConfig.
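A minimal sketch of setting these intervals together, assuming CSRunConfig accepts the standard tf.estimator.RunConfig keyword arguments (the import path below is an assumption and varies by release):

# Hypothetical import path; check your Cerebras release for the actual module.
from cerebras.tf.run_config import CSRunConfig

config = CSRunConfig(
    cs_ip="0.0.0.0",              # CS system address, as in the examples above
    save_checkpoints_steps=1000,  # checkpoint every 1,000 steps
    save_summary_steps=100,       # write TensorBoard summaries every 100 steps
    log_step_count_steps=100,     # log the global step rate every 100 steps
)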
Debug in single-task mode#
When you run a command such as:
> csrun_wse --total-nodes=2 --tasks-per-node=6 --cpus-per-task=12 \
      python run.py --mode train --params configs/your-params-file.yaml \
      --model_dir your-model-dir --cs_ip 10.255.253.0
it will run a total of 12 tasks on 2 nodes, with 6 tasks per node. However, the chief task and the worker tasks are not synchronized, which makes debugging difficult.
For example, a worker task might be streaming data into the CS system while the chief task is receiving data from the CS system. Debugging this scenario requires stopping and starting the worker and chief tasks at the same time, so that the CS system output can be examined together with its corresponding input. Because these tasks run asynchronously, starting and stopping them synchronously is difficult.
The single-task mode helps with debugging by running both the chief and the workers on a single node as a single task. This means that you only need to start and stop this single task to examine both the data going into and coming out of the CS system. The single-task mode can be used with training, evaluation, or prediction jobs.
The single-task mode is intended for debugging purposes only.
Using single-task mode#
To enable the single-task mode, set the following two command line arguments to 1: --total-nodes=1 and --tasks-per-node=1. See below.
> csrun_wse --total-nodes=1 --tasks-per-node=1 --cpus-per-task=12 \
      python run.py --mode train --params configs/your-params-file.yaml \
      --model_dir your-model-dir --cs_ip 10.255.253.0
With the above settings, the training job is run on the CS system in a single-task debug mode.
The following is an example command showing how to execute run.py to submit an evaluation job to a CS system cluster. This example uses the Slurm variables that are passed as command line argument values.
> csrun_wse --total-nodes=2 --tasks-per-node=6 --cpus-per-task=12 \
      python run.py --mode eval --params configs/your-params-file.yaml \
      --model_dir your-model-dir --cs_ip 10.255.253.0
Two evaluation modes are supported:
The following is an example command showing how to execute run.py to perform prediction with your neural network, using a CS system cluster. This example uses the Slurm variables from the csrun_cpu script, provided by your system administrator. See The csrun_cpu Script.
> csrun_wse python run.py --mode predict \
      --params configs/your-params-file.yaml --cs_ip 10.255.253.0
The results of the inference run are saved as follows:

If the CS system is used, then in a file named:

If the CS system is not used and a CPU or GPU is used instead, then the inference is run using the TensorFlow Estimator and the prediction results are stored in a file named:
The default behavior of csrun_cpu is to execute the command with srun: Slurm will allocate resources, and csrun_cpu will exit once the Slurm job is finished. By using the --use-sbatch flag, csrun_cpu instead submits to Slurm a batch script that executes the command. sbatch will immediately exit after submitting the script, and the script will stay in the Slurm queue of pending jobs until resources are allocated.
The command used will be stored as the file CS_<date>.log, and the standard output and standard error will be stored as
To properly schedule training jobs on the CS system using csrun_wse, you should define the environment variables GRES_RESOURCE and GRES_NODE:

GRES_RESOURCE corresponds to the generic resource identifying the CS system in the Slurm configuration.

GRES_NODE corresponds to the dedicated CPU node that manages the CS system.
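For example (the values below are placeholders; use the names defined in your site's Slurm configuration):

> export GRES_RESOURCE=cs2          # generic resource (GRES) name for the CS system
> export GRES_NODE=cs2-mgmt-node    # CPU node dedicated to managing the CS system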