.. _cs-pytorch-qs:

PyTorch Quickstart
==================

This quickstart is a step-by-step guide to compile a PyTorch FC-MNIST model (already ported to Cerebras) targeting your CS system.

Prerequisites
-------------

.. attention::
    Go over this :ref:`checklist-before-you-start` before you proceed.

.. _pt-qs-compile-model:

Compile the model
-----------------

1. Log in to your CS system cluster.

2. Clone the reference samples repository to your preferred location in your home directory.

   .. code-block:: bash

       git clone https://github.com/Cerebras/cerebras_reference_implementations.git

   The reference samples directory contains the following PyTorch model examples:

   - A `PyTorch version of FC-MNIST `_.
   - The `PyTorch versions of BERT Base and BERT Large `_.

   In this quickstart we use the FC-MNIST model. Navigate to the ``fc_mnist`` model directory.

   .. code-block:: bash

       cd cerebras_reference_implementations/fc_mnist/pytorch/

3. Compile the model targeting the CS system. The ``csrun_cpu`` command below compiles the code in ``train`` mode for the CS system. Note that this step only compiles the code and does not run training on the CS system.

   .. code-block:: bash

       csrun_cpu python-pt run.py --mode train \
           --compile_only \
           --params configs/ \
           --cs_ip :

   .. note::
       The parameters can also be set in the ``params.yaml`` file.

Train on GPU
------------

To train on a GPU, run:

.. code-block:: bash

    python run.py --mode train --params configs/

.. _pt-qs-train-on-cs:

Train on CS system
------------------

1. Execute the ``csrun_wse`` command to run the training on the CS system. See the command format below:

   .. attention::
       For PyTorch models only, the ``cs_ip`` flag must include both the IP address and the port number of the CS system. The IP address alone, for example ``--cs_ip 192.168.1.1``, is not sufficient. You must also include the port number, for example: ``--cs_ip 192.168.1.1:9000``.

   .. code-block:: bash

       csrun_wse python-pt run.py --mode train \
           --cs_ip \
           --params configs/

Output files and artifacts
--------------------------

The output files and artifacts include:

* Model directory (``model_dir``) - Contains all of the results and artifacts of the latest run, including:

  + Compile directory (``cs_``)
  + ``performance.json`` file
  + Checkpoints
  + Tensorboard event files
  + ``yaml`` files

Model directory and its structure
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The model directory (``model_dir``) contains all of the results and artifacts of the latest run. Inside ``model_dir``, the following subdirectories are present.

Compile dir - The directory containing the ``cs_``
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The compilation artifacts produced during and after compilation are stored in the ``/cs_`` directory. Compilation logs and intermediate outputs are helpful for debugging compilation issues. The ``xla_service.log`` file should contain information about the status of the compilation, and whether it passed or failed. In case of failure, it should print an error message and stack trace in ``xla_service.log``.

``performance.json`` file and its parameters
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

There is a performance directory that should contain the ``performance.json`` file: ``/performance/performance.json``. It contains the information listed below:

* ``compile_time`` - The amount of time it took to compile the model and generate the Cerebras executable.
* ``est_samples_per_sec`` - The estimated performance in terms of samples per second, based on the Cerebras compile. Note that this number is theoretical and actual performance may vary.
* ``programming_time`` - The time taken to prepare the system and load it with the compiled model.
* ``samples_per_sec`` - The actual performance of your run execution.
* ``suspected_input_bottleneck`` - This is a beta feature.
  It indicates whether you are input-starved and need more input workers to feed the Cerebras system.
* ``total_samples`` - The total number of samples iterated during the execution.
* ``total_time`` - The total time it took to complete the total samples.

Checkpoints
~~~~~~~~~~~

Checkpoints are stored as ``/checkpoint_*.mdl``. They are saved with the frequency specified in the ``runconfig`` file.

Tensorboard event files
~~~~~~~~~~~~~~~~~~~~~~~

Tensorboard event files are stored in the ``/train/`` directory.

``yaml`` files content after the run
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The ``yaml`` file is stored in the train directory. This yaml file contains information about the specifics of the run, such as the model-specific configuration (for example, ``dropout`` and ``activation_fn``), the optimizer type and optimizer parameters, the input data configuration (such as ``batch_size`` and shuffle), and the run configuration (such as ``max_steps``, ``checkpoint_steps``, and ``num_epochs``).
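As a quick sanity check after a run, the ``performance.json`` fields described earlier can be summarized with a short script. This is a minimal sketch: the ``summarize_performance`` helper and the field values shown are illustrative, and the exact location of ``performance.json`` under your model directory may differ.

.. code-block:: python

    import json
    from pathlib import Path

    def summarize_performance(perf: dict) -> dict:
        """Summarize key fields of a performance.json payload."""
        return {
            "compile_time": perf["compile_time"],
            "est_samples_per_sec": perf["est_samples_per_sec"],
            "samples_per_sec": perf["samples_per_sec"],
            # samples_per_sec should roughly equal total_samples / total_time.
            "derived_samples_per_sec": perf["total_samples"] / perf["total_time"],
        }

    # Illustrative values only; for a real run, read the file from your
    # model directory, e.g.:
    # perf = json.loads(Path("model_dir/performance/performance.json").read_text())
    perf = {
        "compile_time": 1200.0,
        "est_samples_per_sec": 9000.0,
        "programming_time": 90.0,
        "samples_per_sec": 8500.0,
        "suspected_input_bottleneck": False,
        "total_samples": 1_000_000,
        "total_time": 117.6,
    }

    print(summarize_performance(perf))

Comparing ``samples_per_sec`` against ``est_samples_per_sec`` and the derived ``total_samples / total_time`` value is a simple way to spot an input-starved run.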
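The ``checkpoint_*.mdl`` naming pattern above can be used to locate the most recent checkpoint programmatically. A minimal sketch, assuming checkpoints land directly under your model directory with a numeric step suffix (an assumption about the naming scheme; adjust the glob to your layout):

.. code-block:: python

    from pathlib import Path
    from typing import Optional

    def latest_checkpoint(model_dir: str) -> Optional[Path]:
        """Return the checkpoint_*.mdl file with the highest step, or None."""
        def step(p: Path) -> int:
            # checkpoint_2000.mdl -> 2000; unparsable suffixes sort first.
            try:
                return int(p.stem.split("_")[-1])
            except ValueError:
                return -1
        return max(Path(model_dir).glob("checkpoint_*.mdl"), key=step, default=None)

Sorting on the numeric step rather than lexicographically matters once runs pass five-digit steps: ``checkpoint_10000.mdl`` must sort after ``checkpoint_2000.mdl``.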
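The ``yaml`` file described above typically groups these settings into sections. A hypothetical fragment illustrating the shape, where the section names and all values are illustrative rather than taken from the FC-MNIST config:

.. code-block:: yaml

    model:
        dropout: 0.1
        activation_fn: relu

    optimizer:
        optimizer_type: SGD
        learning_rate: 0.001

    train_input:
        batch_size: 256
        shuffle: true

    runconfig:
        max_steps: 10000
        checkpoint_steps: 1000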