The csrun_wse Script

This section describes how to use the csrun_wse script for training, evaluation, and prediction.

Important

Follow the Cerebras Command Line Pattern to use the csrun_wse Bash script correctly.

See also

Also see Train, Eval and Predict.

> csrun_wse --help
Usage: csrun_wse [--help] [--mount-dirs] [--total-nodes] [--tasks-per-node] [--cpus-per-task] [--single-task-nodes] [--use-sbatch] command_for_cs_execution
...
...
...

Description

Runs the given <command_for_cs_execution> command on the CS system.

The following applies:

  • The type of execution task (training, evaluation, or prediction) is specified in the <command_for_cs_execution> command.

  • The input pipeline is run on a Cerebras server cluster with multiple workers, coordinated by Slurm.

  • Unless the optional arguments for Slurm configuration are specified, this script uses the following default values:

    Default values

    total-nodes: $DEF_NODES
    tasks-per-node: $DEF_TASKS
    cpus-per-task: $DEF_CPUS
    

Tip

We recommend that you first configure the csrun_cpu script by setting the Slurm variables before running this csrun_wse script. See Configuring csrun_cpu.

Arguments

  • command_for_cs_execution: A Python command to initiate a task (train, eval, predict) that will execute on the CS system.

  • --mount-dirs: (Optional) String of comma-separated paths to mount, in addition to the standard paths listed in csrun_cpu. Default is an empty string, i.e., only paths listed in csrun_cpu are mounted.

  • --total-nodes: (Optional) Number of nodes to execute with. Passed to Slurm. Default is as listed above.

  • --tasks-per-node: (Optional) Number of tasks per node to execute with. Passed to Slurm. Default is as listed above.

  • --cpus-per-task: (Optional) Number of CPUs per task to execute with. Passed to Slurm. Default is as listed above.

  • --single-task-nodes: (Optional) Number of nodes, among the total nodes, that will only run a single task. Default is 0, indicating that all nodes will have multiple tasks running on them.

  • --use-sbatch: (Optional) Adding this flag submits a batch script to Slurm to execute <command_for_cs_execution>. sbatch exits immediately after submitting the script. The script stays in the Slurm queue of pending jobs until resources are allocated.
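
As an illustration of how the arguments above combine, the following is a minimal Python sketch that assembles a csrun_wse command line. The helper name build_csrun_wse_command and its keyword parameters are hypothetical and are not part of the Cerebras tooling. Only the flags listed above are emitted, and any option left unset is omitted so that the defaults apply.

```python
import shlex


def build_csrun_wse_command(command, mount_dirs=None, total_nodes=None,
                            tasks_per_node=None, cpus_per_task=None,
                            single_task_nodes=None, use_sbatch=False):
    """Return the argument list for a csrun_wse call (hypothetical helper)."""
    argv = ["csrun_wse"]
    if mount_dirs:
        # --mount-dirs takes one comma-separated string of paths.
        argv.append("--mount-dirs=" + ",".join(mount_dirs))
    slurm_options = (
        ("--total-nodes", total_nodes),
        ("--tasks-per-node", tasks_per_node),
        ("--cpus-per-task", cpus_per_task),
        ("--single-task-nodes", single_task_nodes),
    )
    for flag, value in slurm_options:
        if value is not None:  # unset options fall back to the script defaults
            argv.append(f"{flag}={value}")
    if use_sbatch:
        argv.append("--use-sbatch")
    # The command to run on the CS system always comes last.
    argv.extend(command)
    return argv


argv = build_csrun_wse_command(
    ["python", "run.py", "--mode=train", "--cs_ip=0.0.0.0"],
    total_nodes=3, tasks_per_node=5, cpus_per_task=16,
)
print(shlex.join(argv))
```

The resulting list can be passed to subprocess.run on a machine where csrun_wse is on the PATH. Printing it with shlex.join first is a convenient way to review the command before submitting it.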

Examples

csrun_wse --total-nodes=3 \
        --tasks-per-node=5 \
        --cpus-per-task=16 \
        python run.py --mode=train \
        --cs_ip=0.0.0.0

The above csrun_wse command executes the Python command:

python run.py --mode=train --cs_ip=0.0.0.0

which initiates model training on the CS system at the given cs_ip address. As specified in the command line options for Slurm, 3 nodes with 5 workers each, and 16 CPUs assigned per worker, are used for this training task.

csrun_wse --mount-dirs="/data/ml,/lab/ml" \
        --use-sbatch \
        python run.py --mode=eval \
        --eval_steps=1000 \
        --cs_ip=0.0.0.0

The above csrun_wse command mounts “/data/ml” and “/lab/ml” in addition to the default mount directories, and launches a batch script with the Python command:

python run.py --mode=eval --eval_steps=1000 --cs_ip=0.0.0.0

which initiates model evaluation on the CS system at the given cs_ip address. The default Slurm settings are used.

csrun_wse --total-nodes=5 \
        --single-task-nodes=2 \
        --tasks-per-node=4 \
        python-ws run.py --mode=train \
        --cs_ip=0.0.0.0

The above csrun_wse command executes the Python command:

python-ws run.py --mode=train --cs_ip=0.0.0.0

which initiates the model training on the CS system in the weight streaming mode at the given cs_ip address. A total of 5 nodes are used, the first 2 of which run only a single task. The remaining 3 nodes run 4 tasks per node. Thus, a total of 14 tasks are associated with this job.
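
The task count in this example follows from simple arithmetic, sketched below in Python. The function name total_tasks is hypothetical. It assumes, as described above, that each single-task node runs exactly one task and every other node runs tasks-per-node tasks.

```python
def total_tasks(total_nodes, tasks_per_node, single_task_nodes=0):
    """Number of Slurm tasks implied by the node settings (illustrative only)."""
    if not 0 <= single_task_nodes <= total_nodes:
        raise ValueError("single_task_nodes must be between 0 and total_nodes")
    # Nodes that are not single-task nodes each run tasks_per_node tasks.
    multi_task_nodes = total_nodes - single_task_nodes
    return single_task_nodes + multi_task_nodes * tasks_per_node


# 2 single-task nodes plus 3 nodes with 4 tasks each.
print(total_tasks(total_nodes=5, tasks_per_node=4, single_task_nodes=2))  # 14
```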

Checkpoints and logs

The checkpoints, logs, and event files are stored in the model_dir. This is the same model_dir that is passed to the tf.estimator. If the directory already exists, the weights are loaded from the checkpoint file.

When training with the CerebrasEstimator, by default a checkpoint is always taken at the beginning and end of training.

Tip

If you wish to take checkpoints more frequently, use save_checkpoints_steps in the CSRunConfig. Refer to the Setting the runtime configuration section.

Loss: The loss is stored for TensorBoard based on save_summary_steps. You can set the default value in the CSRunConfig.

Similarly, TensorFlow logging is output based on log_step_count_steps. You can set the default value in the CSRunConfig.

Debug in single-task mode

When you run a command such as:

> csrun_wse --total-nodes=2 --tasks-per-node=6 --cpus-per-task=12 \
            python run.py --mode train --params configs/your-params-file.yaml \
            --model_dir your-model-dir  --cs_ip 10.255.253.0

it runs a total of 12 tasks on 2 nodes, with 6 tasks per node. However, the chief task and the worker tasks are not synchronized, which makes debugging difficult.

For example, a worker task might be streaming data into the CS system while the chief task is receiving data from the CS system. Debugging in this scenario requires stopping and starting both the worker task and the chief task at the same time, so that the CS system output can be examined together with its corresponding input. Because these tasks run asynchronously, starting and stopping them in step is difficult.

The single-task mode helps debugging by running both the chief and the worker on a single node as a single task. This means that you only need to start and stop this single task to examine the data going into and coming out of the CS system. The single-task mode can be used with training, evaluation, or prediction jobs.

Important

The single-task mode is intended for debugging purposes only.

Using single-task mode

To use the single-task mode, set the following two command line arguments to 1: --total-nodes=1 and --tasks-per-node=1. See below.

> csrun_wse --total-nodes=1 --tasks-per-node=1 --cpus-per-task=12 \
            python run.py --mode train --params configs/your-params-file.yaml \
            --model_dir your-model-dir  --cs_ip 10.255.253.0

With the above settings, the training job is run on the CS system in a single-task debug mode.

Evaluate

The following is an example command showing how to execute run.py to submit an evaluation job to a CS system cluster. This example passes the Slurm settings as command line arguments.

See the Arguments section of run.py.

> csrun_wse --total-nodes=2 --tasks-per-node=6 --cpus-per-task=12 \
    python run.py --mode eval --params configs/your-params-file.yaml \
    --model_dir your-model-dir --cs_ip 10.255.253.0

Two evaluation modes are supported:

  • eval

  • eval_all

Predict

The following is an example command showing how to execute run.py to perform prediction with your neural network, using a CS system cluster. This example uses the Slurm variables from the csrun_cpu script, provided by the system administrator. See Configuring csrun_cpu.

csrun_wse python run.py --mode predict --params configs/your-params-file.yaml --cs_ip 10.255.253.0

Prediction results

The results of the inference run are saved as follows:

  • If the CS system is used, the results are stored in a file named predictions_cs_{est_config.task_id}.npz in your model_dir directory.

  • If the CS system is not used and a CPU or GPU is used instead, the inference is run using the TensorFlow Estimator and the prediction results are stored in a file named predictions_tf_{est_config.task_id}.npz in your model_dir directory.
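
The naming rule above can be sketched in Python as follows. The helpers prediction_file and load_predictions are hypothetical and not part of the Cerebras software, and the task_id value of 0 is only an assumed example. The names of the arrays stored inside the .npz file depend on your model, so the loader returns whatever names it finds.

```python
from pathlib import Path


def prediction_file(model_dir, task_id, on_cs_system=True):
    """Build the expected path of the prediction results file (illustrative)."""
    prefix = "predictions_cs" if on_cs_system else "predictions_tf"
    return Path(model_dir) / f"{prefix}_{task_id}.npz"


def load_predictions(path):
    """Load every array in the .npz file into a dict keyed by array name."""
    import numpy as np  # imported here so the path helper works without NumPy

    with np.load(path) as archive:
        return {name: archive[name] for name in archive.files}


# task_id=0 is an assumed example value for est_config.task_id.
path = prediction_file("your-model-dir", task_id=0)
print(path.name)  # predictions_cs_0.npz
```

Once a prediction run has finished, load_predictions(path) returns the saved arrays for inspection.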

Sbatch mode

By default, csrun_wse uses srun. With srun, Slurm allocates resources and csrun_wse exits once the Slurm job is finished. With the --use-sbatch flag, csrun_wse instead submits a batch script to Slurm that executes <command_for_cs_execution> using sbatch. sbatch exits immediately after submitting the script. The script stays in the Slurm queue of pending jobs until resources are allocated.

The command used is stored in the file CS_<date>.log, and the standard output and standard error are stored in CS_<date>_<slurm_job_id>.out.
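
When many batch jobs have been submitted, it can help to recover the Slurm job ID from an output file name. The following Python sketch does this under two assumptions that the text above does not confirm: that the job ID is purely numeric, and that it is the last underscore-separated field before the .out extension. The format of <date> is not specified here, so the pattern accepts any text for it. The function name and the sample file name are hypothetical.

```python
import re

# CS_<date>_<slurm_job_id>.out, with <date> left unconstrained on purpose.
OUT_FILE_PATTERN = re.compile(r"^CS_(?P<date>.+)_(?P<job_id>\d+)\.out$")


def parse_sbatch_output_name(filename):
    """Return (date, job_id) for an sbatch output file name, or None."""
    match = OUT_FILE_PATTERN.match(filename)
    if match is None:
        return None
    return match.group("date"), match.group("job_id")


# The date format in this sample name is made up for illustration.
print(parse_sbatch_output_name("CS_2022-01-01_12345.out"))
```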

To properly schedule training jobs in the CS system using csrun_wse, define the environment variables GRES_NODE or GRES_RESOURCE inside csrun_cpu. GRES_RESOURCE corresponds to the generic resource identifying the CS system in the Slurm configuration. GRES_NODE corresponds to the dedicated CPU node used to manage the CS system.