.. _cs-pytorch-pl-k8s:

Running Small to Medium Models (Pipelined Execution)
====================================================

This section describes how to run small- to medium-sized models on the Cerebras Wafer-Scale cluster with pipelined execution using PyTorch. For first-time user setup for PyTorch jobs, see :ref:`cs-pytorch-qs`.

.. Note:: We are transitioning from the Slurm-based workflow to run small to medium models in Pipelined execution on the Wafer-Scale Cluster. During this transition, we use scripts that are similar to the Slurm-based workflow but run Kubernetes under the hood. The eventual goal is to use the same workflow as in Weight Streaming execution to launch jobs in Pipelined execution, too.

Activate your PyTorch environment
---------------------------------

To run PyTorch jobs on the Wafer-Scale Cluster, you must first activate your PyTorch environment on the user node. Enter the Python environment using the following command:

.. code-block:: bash

    source venv_cerebras_pt/bin/activate

Run pipelined models on the Cerebras Wafer-Scale Cluster
--------------------------------------------------------

Get a local copy of Cerebras Model Zoo
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

We use code available in the `Cerebras Model Zoo git repository <https://github.com/Cerebras/modelzoo>`_ for this example. Check with your sysadmin whether your setup has a local copy of the Model Zoo repository available with pre-installed datasets. Otherwise, you can clone this git repository on your user node yourself and follow the instructions in the README files in the repository on how to set up training datasets.

1. Copy or clone the Model Zoo repository to your preferred location in your home directory. If you don't have a copy of the Model Zoo pre-configured for your setup, clone the repo from GitHub by running the following command:

   .. code-block:: bash

       git clone https://github.com/Cerebras/modelzoo

   Make sure to configure paths to the datasets in your local copy.

2.
Navigate to the model directory.

   .. code-block:: bash

       cd modelzoo/fc_mnist/pytorch/

Compile on CPU
~~~~~~~~~~~~~~

Cerebras recommends that you first compile your model successfully on a CPU node from the cluster before running it on the CS system.

- You can run in ``validate_only`` mode, which performs a fast, lightweight verification. In this mode, compilation runs only through the first few stages, up until kernel library matching.

- After a successful ``validate_only`` run, you can run full compilation with ``compile_only`` mode.

This section of the Getting Started Guide shows how to execute these steps on a CPU node.

.. Tip:: The ``validate_only`` step is very fast, enabling you to rapidly iterate on your model code. Without needing access to the CS system wafer-scale engine, you can determine in this ``validate_only`` step if you are using any PyTorch layer or functionality that is unsupported by either XLA or CGC.

Follow these steps to compile on a CPU (uses the FC-MNIST example from the Model Zoo git repository):

1. Run the compilation in ``validate_only`` mode. This performs the initial stages of compilation to get feedback about whether your model can be lowered. This process should run considerably faster than a full compile.

   .. code-block:: bash

       csrun_cpu --admin-defaults="/path/to/admin-defaults.yaml" --mount-dirs="/data/ml,/lab/ml" python-pt run.py --mode train --validate_only

2. Run the full compilation process in ``compile_only`` mode. This step runs the full compilation through all stages of the Cerebras software stack to generate a CS system executable.

   .. code-block:: bash

       csrun_cpu --admin-defaults="/path/to/admin-defaults.yaml" --mount-dirs="/data/ml,/lab/ml" python-pt run.py --mode train --compile_only

   When the above compilation is successful, the model is guaranteed to run on the CS system.
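Because ``validate_only`` is so much faster than a full compile, it is natural to script the two stages so that ``compile_only`` runs only after validation succeeds. The sketch below is a hypothetical wrapper, not part of the Cerebras tooling; ``base_cmd`` stands in for your ``csrun_cpu ... run.py --mode train`` invocation.

```python
import subprocess


def run_stage(args):
    """Run one compile stage as a subprocess; True on exit code 0."""
    return subprocess.run(args).returncode == 0


def validate_then_compile(base_cmd):
    """Gate the slow compile_only stage behind a validate_only pass.

    base_cmd is the full command up to, but excluding, the
    --validate_only / --compile_only flag.
    """
    if not run_stage(base_cmd + ["--validate_only"]):
        return "validate_failed"
    if not run_stage(base_cmd + ["--compile_only"]):
        return "compile_failed"
    return "ok"
```

Running pre-compilations this way for several model configurations lets you queue up known-good executables before your CS system time begins.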
You can also use ``compile_only`` mode to pre-compile many different model configurations offline, so you can make fuller use of your allotted CS system cluster time.

.. Tip:: The compiler detects whether a binary already exists for a particular model config and, if it detects one, skips compiling on the fly during training.

Run the model on the CS system
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

To run a training job on the CS system, enter the following command:

.. code-block:: bash

    csrun_wse --admin-defaults="/path/to/admin-defaults.yaml" --mount-dirs="/data/ml,/lab/ml" python run.py --mode=train --params=params.yaml

The command above mounts the directories ``/data/ml`` and ``/lab/ml`` to the container (in addition to the default mount directories) and then trains the FC-MNIST model on the CS system.

To run an eval job on the CS system, enter the following command:

.. code-block:: bash

    csrun_wse --admin-defaults="/path/to/admin-defaults.yaml" --mount-dirs="/data/ml,/lab/ml" python-pt run.py --mode=eval --eval_steps=1000

You can view the exact options using ``csrun_wse --help``.

Output files and artifacts
~~~~~~~~~~~~~~~~~~~~~~~~~~

The output files and artifacts include a model directory (``model_dir``) that contains all the results and artifacts of the latest run, including:

- Compile directory (``cs_``)
- ``performance.json`` file
- Checkpoints
- Tensorboard event files
- A copy of the ``yaml`` file for the run

Compile dir – The directory containing the ``cs_``
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The compilation artifacts produced during and after compilation are stored in the ``cs_`` directory. Compilation logs and intermediate outputs are helpful for debugging compilation issues. The ``xla_service.log`` file should contain information about the status of compilation and whether it passed or failed. In case of failure, an error message and stack trace are printed in ``xla_service.log``.
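The artifact layout described above can be inspected programmatically after a run. The sketch below is illustrative only; the glob patterns are assumptions based on the artifact names listed in this section and may differ between releases.

```python
from pathlib import Path


def summarize_model_dir(model_dir):
    """Group the run artifacts found under model_dir by kind.

    Patterns mirror the artifact names described in this section:
    compile dirs starting with cs_, checkpoint_*.mdl files, and
    Tensorboard event files and the yaml copy under train/.
    """
    root = Path(model_dir)
    return {
        "compile_dirs": sorted(p.name for p in root.glob("cs_*") if p.is_dir()),
        "checkpoints": sorted(p.name for p in root.glob("checkpoint_*.mdl")),
        "event_files": sorted(p.name for p in root.glob("train/events.*")),
        "yaml_copies": sorted(p.name for p in root.glob("train/*.yaml")),
    }
```

A quick scan like this makes it easy to confirm that a run produced the expected checkpoints and event files before tearing down the job.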
``performance.json`` file and its parameters
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The performance directory should contain the ``performance.json`` file, located at ``performance/performance.json``. It contains the information listed below:

- ``compile_time`` - The amount of time that it took to compile the model to generate the Cerebras executable.
- ``est_samples_per_sec`` - The estimated performance in terms of samples per second based on the Cerebras compile. Note that this number is theoretical and actual performance may vary.
- ``programming_time`` - The time taken to prepare the system and load it with the compiled model.
- ``samples_per_sec`` - The actual performance of your run execution.
- ``suspected_input_bottleneck`` - This is a beta feature. It indicates whether you are input-starved and need more input workers to feed the Cerebras system.
- ``total_samples`` - The total gross samples that were iterated during the execution.
- ``total_time`` - The total time it took to complete the total samples. Note that this time does not include ``programming_time`` or ``compile_time``. It measures the time from sending the first sample to the CS-X to receiving the last output from the CS-X.

Checkpoints
^^^^^^^^^^^

Checkpoints are stored as ``checkpoint_*.mdl`` files in the model directory. They are saved with the frequency specified in the ``runconfig`` section of the params file. If the checkpoint frequency is anything other than ``0``, we always save the last step's checkpoint, regardless of whether that step aligns with the checkpoint frequency or not.

Tensorboard event files
^^^^^^^^^^^^^^^^^^^^^^^

Tensorboard event files are stored in the ``train/`` directory.

``yaml`` files content after the run
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The ``yaml`` file is stored in the train directory.
This ``yaml`` file contains information about the specifics of the run, such as the model-specific configuration (e.g., ``dropout``, ``activation_fn``), optimizer type and optimizer parameters, input data configuration (such as ``batch_size`` and shuffle), and run configuration (such as ``max_steps``, ``checkpoint_steps``, and ``num_epochs``).

Train and evaluate on CPU/GPU
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

To train on the CPU directly, since we are already within a Python environment, you can call Python directly. Training on a GPU may require generating a new virtual environment configured for your GPU hardware requirements. To set up the environment for a GPU, refer to the GPU requirements listed in the Cerebras Model Zoo repository.

.. code-block:: bash

    python run.py --mode train --params=params.yaml
    python run.py --mode eval --params=params.yaml
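After any of these runs, the ``performance.json`` fields described earlier can also be checked programmatically. The sketch below is a minimal, hypothetical helper (the file path and derived field name are illustrative): it derives throughput from ``total_samples`` and ``total_time``, which is meaningful because ``total_time`` excludes ``programming_time`` and ``compile_time``.

```python
import json


def load_performance(path):
    """Load performance.json and derive throughput from its fields.

    Since total_time covers only the span from the first sample sent
    to the last output received, total_samples / total_time gives an
    approximation of the reported samples_per_sec.
    """
    with open(path) as f:
        perf = json.load(f)
    perf["derived_samples_per_sec"] = perf["total_samples"] / perf["total_time"]
    return perf
```

Comparing ``derived_samples_per_sec`` against the reported ``samples_per_sec`` and ``est_samples_per_sec`` is a quick sanity check on a run's throughput.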