.. _csrun-wse:

The ``csrun_wse`` Script
========================

This section describes how to use the ``csrun_wse`` script for training, evaluation, and prediction.

.. note::
    This applies only to pipeline models, for both the Slurm/Singularity workflow and the Kubernetes workflow.

.. seealso::
    Also see :ref:`train-eval-predict`.

.. note::
    Slurm wrapper scripts (``csrun_wse`` and ``csrun_cpu``) may be customized for your particular environment by your sysadmins and may look different from what is shown below. Check whether your sysadmin's local documentation is available and whether there are any special instructions for your CS-2.

.. code-block:: bash

    > csrun_wse --help
    Usage: csrun_wse [--help]
                     [--mount-dirs]
                     [--total-nodes]
                     [--tasks-per-node]
                     [--cpus-per-task]
                     [--single-task-nodes]
                     [--use-sbatch]
                     command_for_cs_execution
    ...
    ...
    ...

Description
-----------

Runs the given ``command_for_cs_execution`` command on the CS system. The following applies:

- The specific type of the execution task, i.e., training, evaluation, or prediction, is specified in the ``command_for_cs_execution`` command.
- The input pipeline is run on a Cerebras server cluster with multiple workers, coordinated by Slurm.
- Unless the optional arguments for Slurm configuration are specified, this script uses the following default values:

  .. admonition:: Default values

      .. code-block:: bash

          total-nodes: $DEF_NODES
          tasks-per-node: $DEF_TASKS
          cpus-per-task: $DEF_CPUS

.. tip::
    We recommend that you first configure the ``csrun_cpu`` script by setting the Slurm variables before running this ``csrun_wse`` script. See :ref:`config-csrun-cpu`.

Arguments
~~~~~~~~~

``command_for_cs_execution``: A Python command to initiate a task (train, eval, predict) that will execute on the CS system.

``--mount-dirs``: (Optional) String of comma-separated paths to mount, in addition to the standard paths listed in ``csrun_cpu``. Default is an empty string, i.e., only paths listed in ``csrun_cpu`` are mounted.

``--total-nodes``: (Optional) Number of nodes to execute with.
Default is as listed above.

``--tasks-per-node``: (Optional) Number of tasks per node to execute with. Default is as listed above.

``--cpus-per-task``: (Optional) Number of CPUs per task to execute with. Default is as listed above. Applies only to the Slurm workflow.

``--single-task-nodes``: (Optional) Number of nodes, among the total nodes, that will run only a single task. Default is 0, indicating that all nodes will have multiple tasks running on them.

``--use-sbatch``: (Optional) Adding this flag submits a batch script to Slurm to execute ``command_for_cs_execution``. ``sbatch`` will exit immediately after submitting the script. The script will stay on the Slurm queue of pending jobs until resources are allocated. Applies only to the Slurm workflow.

Examples
--------

.. code-block:: bash

    csrun_wse --total-nodes=3 \
              --tasks-per-node=5 \
              --cpus-per-task=16 \
              python run.py --mode=train \
                            --cs_ip=0.0.0.0

The above ``csrun_wse`` command executes the Python command ``python run.py --mode=train --cs_ip=0.0.0.0``, which initiates model training on the CS system at the given ``cs_ip`` address. As specified in the command line options for Slurm, 3 nodes with 5 workers each, and 16 CPUs assigned per worker, are used for this training task.

.. code-block:: bash

    csrun_wse --mount-dirs="/data/ml,/lab/ml" \
              --use-sbatch \
              python run.py --mode=eval \
                            --eval_steps=1000 \
                            --cs_ip=0.0.0.0

The above ``csrun_wse`` command mounts ``/data/ml`` and ``/lab/ml`` in addition to the default mount directories, and launches a batch script with the Python command ``python run.py --mode=eval --eval_steps=1000 --cs_ip=0.0.0.0``, which initiates model evaluation on the CS system at the given ``cs_ip`` address. The default Slurm settings are used.
.. code-block:: bash

    csrun_wse --total-nodes=5 \
              --single-task-nodes=2 \
              --tasks-per-node=4 \
              python-ws run.py --mode=train \
                               --cs_ip=0.0.0.0

The above ``csrun_wse`` command executes the Python command ``python-ws run.py --mode=train --cs_ip=0.0.0.0``, which initiates model training on the CS system in weight streaming mode at the given ``cs_ip`` address. A total of 5 nodes are used, the first 2 of which run only a single task. The remaining 3 nodes run 4 tasks per node. Thus, a total of 14 tasks are associated with this job.

Checkpoints and logs
--------------------

The checkpoints, logs, and event files are stored in the ``model_dir``. If the directory exists, then the weights are loaded from the checkpoint file. This is the same as the ``model_dir`` passed to the ``tf.estimator``. When training with the ``CerebrasEstimator``, by default a checkpoint is always taken at the beginning and end of training.

.. tip::
    If you wish to take checkpoints more frequently, use ``save_checkpoints_steps`` in the ``CSRunConfig``. Refer to the :ref:`cs-est-config` section.

**Loss**: The loss is stored for TensorBoard based on ``save_summary_steps``. You can set the default value in the ``CSRunConfig``. Similarly, TensorFlow logging is output based on ``log_step_count_steps``. You can also set its default value in the ``CSRunConfig``.

.. _debug-single-task:

Debug in single-task mode
~~~~~~~~~~~~~~~~~~~~~~~~~

When you run a command such as:

.. code-block:: bash

    > csrun_wse --total-nodes=2 --tasks-per-node=6 --cpus-per-task=12 \
        python run.py --mode train --params configs/your-params-file.yaml \
        --model_dir your-model-dir --cs_ip 10.255.253.0

it runs a total of 12 tasks on 2 nodes, with 6 tasks per node. However, the chief task and the worker tasks are not synchronized, which makes debugging difficult. For example, a worker task might be streaming data into the CS system while the chief task is receiving data from the CS system.
To debug in this scenario, you must stop and start both the worker tasks and the chief task at the same time, to examine the CS system output together with its corresponding CS system input. Because these tasks run asynchronously, starting and stopping them synchronously is difficult.

The single-task mode helps debugging by running both the chief and the workers on a single node as a single task. This means that you only need to start and stop this single task to examine both the data into and out of the CS system. The single-task mode can be used with training, evaluation, or prediction jobs.

.. important::
    The single-task mode is intended for debugging purposes only.

Using single-task mode
^^^^^^^^^^^^^^^^^^^^^^

To set the single-task mode, set the following two command line arguments to 1, as in: ``--total-nodes=1`` and ``--tasks-per-node=1``. See below.

.. code-block:: bash

    > csrun_wse --total-nodes=1 --tasks-per-node=1 --cpus-per-task=12 \
        python run.py --mode train --params configs/your-params-file.yaml \
        --model_dir your-model-dir --cs_ip 10.255.253.0

With the above settings, the training job is run on the CS system in single-task debug mode.

Evaluate
~~~~~~~~

The following is an example command showing how to execute ``run.py`` to submit an evaluation job to a CS system cluster. This example uses the Slurm variables that are passed as command line argument values. See the :ref:`args-run-py` section of ``run.py``.

.. code-block:: bash

    > csrun_wse --total-nodes=2 --tasks-per-node=6 --cpus-per-task=12 \
        python run.py --mode eval --params configs/your-params-file.yaml \
        --model_dir your-model-dir --cs_ip 10.255.253.0

Two evaluation modes are supported:

- ``eval``
- ``eval_all``

.. seealso::
    :ref:`cerebras-ml-workflow`.

Predict
~~~~~~~

The following is an example command showing how to execute ``run.py`` to perform prediction with your neural network, using a CS system cluster.
This example uses the Slurm variables from the ``csrun_cpu`` script, provided by your system administrator. See :ref:`config-csrun-cpu`.

.. code-block:: bash

    csrun_wse python run.py --mode predict \
        --params configs/your-params-file.yaml \
        --cs_ip 10.255.253.0

Prediction results
^^^^^^^^^^^^^^^^^^

The results of the inference run are saved as follows:

- If the CS system is used, then in a file named ``predictions_cs_{est_config.task_id}.npz`` in your ``model_dir`` directory.
- If the CS system is not used and instead a CPU or GPU is used, then the inference is run using the TensorFlow Estimator, and the prediction results are stored in a file named ``predictions_tf_{est_config.task_id}.npz`` in your ``model_dir`` directory.

Sbatch mode
-----------

The default behavior of ``csrun_cpu`` uses ``srun``. With ``srun``, Slurm allocates resources and ``csrun_cpu`` exits once the Slurm job is finished. By using the flag ``--use-sbatch``, ``csrun_cpu`` submits to Slurm a batch script that executes the ``command_for_cs_execution`` command using ``sbatch``. ``sbatch`` will exit immediately after submitting the script. The script will stay on the Slurm queue of pending jobs until resources are allocated. The command used will be stored in the file ``CS_.log``, and the standard output and standard error will be stored in ``CS__.out``.

To properly schedule training jobs on the CS system using ``csrun_wse``, define the environment variables ``GRES_NODE`` or ``GRES_RESOURCE`` inside ``csrun_cpu``. ``GRES_RESOURCE`` corresponds to the generic resource identifying the CS system in the Slurm configuration. ``GRES_NODE`` corresponds to the dedicated CPU node that manages the CS system.
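As a minimal sketch, these variables are typically exported near the top of ``csrun_cpu``. The resource and node names below are hypothetical; the actual values depend on your site's Slurm configuration, so confirm them with your sysadmin:

.. code-block:: bash

    # Hypothetical values -- consult your site's slurm.conf for the actual
    # generic resource (GRES) name and the management node's hostname.
    export GRES_RESOURCE="cs2"        # generic resource identifying the CS system
    export GRES_NODE="cs2-mgmt-node"  # dedicated CPU node that manages the CS system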