.. _weight-streaming-quickstart:

Weight Streaming Quickstart
===========================

Weight streaming (WS) is one of Cerebras' execution modes, ideal for training extreme-scale models. To learn more about the differences between pipelined mode and WS mode, visit :ref:`cerebras-execution-modes`.

This page provides details on the following two topics:

1. Changing the execution mode (system admins only)
2. Training a model in WS mode (all users)

Section 1: Changing the execution mode (system admins only)
-----------------------------------------------------------

This step can be performed by system administrators only. If you want the execution mode changed to Weight Streaming, contact your system admin. A system administrator with access can log in to the CS-2 system and change the execution mode.

To change the execution mode, follow these steps:

**Step 1**: Log in to the CS system.

**Step 2**: Check the current ``execmode`` using the command below:

.. code-block:: bash

    cs> config execmode show

The following message appears:

.. code-block:: bash

    Configured Execution Mode : PIPELINED

**Step 3**: To change ``execmode``, transition the system to the ``STANDBY`` state using the command below:

.. code-block:: bash

    cs> system standby

The following message appears:

.. code-block:: bash

    This puts the system in standby. Do you want to proceed?

Select ``yes``.

**Step 4**: Change ``execmode`` to Weight Streaming using the command below:

.. code-block:: bash

    cs> config execmode setup

Select ``Weight Streaming``. The following message appears:

.. code-block:: bash

    Selected execution mode configuration:
    ✔ Weight Streaming

**Step 5**: Activate the system:

.. code-block:: bash

    cs> system activate

The system reboots and is now activated in Weight Streaming mode.

Section 2: Training a model in WS mode (all users)
--------------------------------------------------

After the system admin updates the system to WS mode, follow the steps below to run a training job on a single CS-2.
In CSoft R1.4.0, TF implementations of GPT-2, GPT-3XL (1.3B params), and GPT-J (6B params) are supported in WS mode on a single CS-2 with an existing support cluster via compatibility mode. You can access a reference implementation of GPT-J in TF as an example in the `Cerebras Reference Implementations <https://github.com/Cerebras/cerebras_reference_implementations>`_ repo.

**Step 1**: Clone the Reference Implementations repository

To clone the Cerebras Reference Implementations repository, use the following commands:

.. code-block:: bash

    git clone https://github.com/Cerebras/cerebras_reference_implementations.git

.. code-block:: bash

    cd cerebras_reference_implementations/gptj

**Step 2**: Run the model on a CS system

Here, we use the wrapper script ``csrun_wse`` to compile and execute the code on the CS-2 system. See :ref:`csrun-wse` for more information.

.. code-block:: bash

    csrun_wse --total-nodes 14 \
        --tasks-per-node 8 \
        --cpus-per-task 16 \
        --single-task-nodes 2 \
        --cyclic \
        python-ws run.py --model_dir model_dir \
            --cs_ip \
            --params configs/params_continuous_pretraining.yaml \
            --mode train \
            --max_steps

The above command trains the GPT-J model for ``--max_steps`` by executing on the CS system at the IP address specified in the ``--cs_ip`` flag. Note that for Weight Streaming, you use ``python-ws``. Weight Streaming execution requires at least two single-task CPU nodes, specified here using ``--single-task-nodes 2``.

When the command executes, you will see output similar to the following:

.. code-block:: bash

    srun: job ... queued and waiting for resources
    srun: job ... has been allocated resources
    INFO:tensorflow:Checkpoints and summaries will be saved in: model_dir
    ...
    INFO:tensorflow:Running the TF Client
    INFO:tensorflow:Calling model_fn.
    ...
    INFO:tensorflow:Done calling model_fn.
    ...
    INFO:tensorflow:Completed weight initialization on CPU in: ... seconds
    ...
    INFO:tensorflow:Calling model_fn.
    ...
    INFO:tensorflow:Loading CPU pre-initialized weights took ... seconds
    INFO:tensorflow:Saving checkpoint at global step 0
    ...
    ...: I tensorflow/compiler/jit/xla_compilation_cache.cc:241] Compiled cluster using XLA!  This line is logged at most once for the lifetime of the process.
    INFO:tensorflow:global_step = 1, loss = ...
    INFO:tensorflow:global_step = 2, loss = ...
    ...
    INFO:tensorflow:global_step = 10, loss = ...
    ...
    INFO:tensorflow:Saving checkpoint at global step ...
    INFO:tensorflow:Training finished with ... samples in ... seconds, ... samples/second.
    INFO:tensorflow:Loss for final step: ...

.. note::

    Compilation of large-scale models in WS mode typically takes a very long time and can run over an hour. Reducing compile time is an active effort at Cerebras.

.. note::

    Offline compilation is not supported in WS mode in CSoft R1.4. The ``--validation-only`` and ``--compile-only`` flags are not supported, and the model is recompiled with every execution.

Output files and artifacts
--------------------------

The output files and artifacts include:

* Model directory (``model_dir``) - Contains all of the results and artifacts of the latest run, including:

  + Compile directory (``tfcs_``)
  + ``performance.json`` file
  + Checkpoints
  + Tensorboard event files
  + ``yaml`` files

Model directory and its structure
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The model directory (``model_dir``) contains all of the results and artifacts of the latest run. If you go into the ``model_dir`` directory, the following subdirectories are present.

Compile dir - The directory containing the ``tfcs_``
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Compilation artifacts generated during and after compilation are stored in the ``tfcs_`` directory. Compilation logs and intermediate outputs are helpful for debugging compilation issues.

``performance.json`` file and its parameters
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

There is a ``performance`` directory that should contain the ``performance.json`` file.
This file contains the information listed below:

* ``compile_time`` - The amount of time it took to compile the model and generate the Cerebras executable.
* ``est_samples_per_sec`` - The estimated performance, in samples per second, based on the Cerebras compile. Note that this number is theoretical; actual performance may vary.
* ``programming_time`` - The time taken to prepare the system and load it with the compiled model.
* ``samples_per_sec`` - The actual performance of your run execution.
* ``suspected_input_bottleneck`` - This is a beta feature. It indicates whether you are input-starved and need more input workers to feed the Cerebras system.
* ``total_samples`` - The total gross samples that were iterated during the execution.
* ``total_time`` - The total time it took to complete the total samples.

Checkpoints
~~~~~~~~~~~

Checkpoints are stored in ``/model-ckpt*``.

Tensorboard event files
~~~~~~~~~~~~~~~~~~~~~~~

Tensorboard event files are stored in the ``train`` directory.

``yaml`` files content after the run
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The ``yaml`` file is stored in the ``train`` directory. This yaml file contains information about the specifics of the run, such as the model-specific configuration (e.g., ``dropout``, ``activation_fn``), the optimizer type and optimizer parameters, the input data configuration (such as ``batch_size`` and shuffle), and the run configuration (such as ``max_steps``, ``checkpoint_steps``, and ``num_epochs``).
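The ``performance.json`` metrics above can be inspected programmatically. The sketch below is a minimal example of reading and sanity-checking them; the JSON contents are hypothetical and the values are illustrative only, not output from a real run.

.. code-block:: python

    import json

    # Hypothetical contents of a performance.json file; the field names
    # follow the list above, but the values are illustrative only.
    perf_json = """
    {
      "compile_time": 3600.0,
      "est_samples_per_sec": 120.0,
      "programming_time": 300.0,
      "samples_per_sec": 110.0,
      "suspected_input_bottleneck": false,
      "total_samples": 110000,
      "total_time": 1000.0
    }
    """

    perf = json.loads(perf_json)

    # samples_per_sec should be roughly total_samples / total_time.
    measured = perf["total_samples"] / perf["total_time"]
    print(f"measured throughput: {measured:.1f} samples/sec")
    print(f"reported throughput: {perf['samples_per_sec']:.1f} samples/sec")
    if perf["suspected_input_bottleneck"]:
        print("input pipeline may be starving the system; consider more input workers")

Comparing ``samples_per_sec`` against ``est_samples_per_sec`` in this way gives a quick read on how close a run came to the compile-time estimate.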
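The saved ``yaml`` file can likewise be loaded to recover the exact configuration a run used. The sketch below assumes PyYAML is installed; the section names and values are a hypothetical excerpt, not the actual schema of any particular model's params file.

.. code-block:: python

    import yaml  # PyYAML; assumed available

    # Hypothetical excerpt of a saved run yaml; the keys mirror the
    # categories described above, but the layout is illustrative only.
    params_yaml = """
    model:
      dropout: 0.1
      activation_fn: gelu
    optimizer:
      optimizer_type: adamw
    train_input:
      batch_size: 32
      shuffle: true
    runconfig:
      max_steps: 1000
      checkpoint_steps: 100
    """

    params = yaml.safe_load(params_yaml)
    print("batch_size:", params["train_input"]["batch_size"])
    print("max_steps:", params["runconfig"]["max_steps"])

Loading the saved yaml rather than the original config file guarantees you are looking at the values the run actually used, including any defaults filled in at launch time.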