.. _cs-pytorch-pl-ws-unified-appliance:

Running PyTorch Models
======================

Users interact with the Cerebras Wafer-Scale Cluster as if it were an appliance: running models of various sizes on the Cerebras Wafer-Scale Cluster is as easy as running on a single device. For first-time user setup for PyTorch jobs, see :ref:`cs-pytorch-qs`.

Activate your PyTorch environment
---------------------------------

To run PyTorch jobs on the Wafer-Scale Cluster, you must first activate your PyTorch environment on the user node. Enter the Python environment using the following command:

.. code-block:: bash

    source venv_cerebras_pt/bin/activate

Running the scripts to compile, train, or evaluate your model
-------------------------------------------------------------

The steps to train your model are as follows. We use the GPT-2 model available in the `Cerebras Model Zoo git repository <https://github.com/Cerebras/modelzoo>`_ for this example. Check with your sysadmin whether your setup has a local copy of the Model Zoo repository available with pre-installed datasets. Otherwise, you can clone the git repository on your user node yourself and follow the instructions in the README files in the repository on how to set up the training datasets.
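If you clone the repository yourself, the steps look like the following minimal sketch (the GitHub URL is an assumption; substitute the repository location your sysadmin provides if it differs):

.. code-block:: bash

    # Clone the Cerebras Model Zoo onto the user node. The URL below is
    # assumed to be the public GitHub repository; use your site's copy
    # if one is provided.
    git clone https://github.com/Cerebras/modelzoo.git
    cd modelzoo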
1. In the Model Zoo, you can find ``run.py`` scripts for PyTorch models. For the GPT-2 model, navigate to the following directory in your copy of the Model Zoo:

   .. code-block:: bash

       cd modelzoo/transformers/pytorch/gpt2

2. Within this directory, run the following command to perform the initial stage of compilation, which gives quick feedback on whether your model is compatible with the Cerebras Software Platform:

   .. code-block:: bash

       python run.py --appliance --execution_strategy <pipeline or weight_streaming> --params params.yaml --num_csx=1 --validate_only --mode train --model_dir model_dir --credentials_path <path to certificate> --mount_dirs <paths to mount> --python_paths <paths to modelzoo and other python code> --mgmt_address <management node address for cluster>

   This step can be skipped if you are confident in your code; it is, however, very convenient for fast iteration on your code because it is considerably faster than a full compile.

3. The next step is to run the full compile. The artifacts from this run are used in the training run. This compile can take longer, depending on the size and complexity of the model (15 minutes to an hour).

   .. code-block:: bash

       python run.py --appliance --execution_strategy <pipeline or weight_streaming> --params params.yaml --num_csx=1 --compile_only --mode train --model_dir model_dir --credentials_path <path to certificate> --mount_dirs <paths to mount> --python_paths <paths to modelzoo and other python code> --mgmt_address <management node address for cluster>

4. This is the training step.

   - If you are running on one CS-2, enter the following:

     .. code-block:: bash

         python run.py --appliance --execution_strategy <pipeline or weight_streaming> --params params.yaml --num_csx=1 --mode train --model_dir model_dir --credentials_path <path to certificate> --mount_dirs <paths to mount> --python_paths <paths to modelzoo and other python code> --mgmt_address <management node address for cluster>

   - If you are running on multiple CS-2 systems, which is only allowed in Weight Streaming execution, enter the following. Note that ``--num_csx=2`` in the code block below refers to the number of CS-2 systems you are using. In this case, you are running a data-parallel job on two CS-2 systems within the Wafer-Scale Cluster.

     .. code-block:: bash

         python run.py --appliance --execution_strategy weight_streaming --params params.yaml --num_csx=2 --mode train --model_dir model_dir --credentials_path <path to certificate> --mount_dirs <paths to mount> --python_paths <paths to modelzoo and other python code> --mgmt_address <management node address for cluster>

   - The output logs are as follows:

     .. code-block:: bash

         Transferring weights to server: 100%|████| 983/983 [00:13<00:00, 72.77tensors/s]
         2023-01-31 14:29:09,453 INFO: Finished sending initial weights
         2023-01-31 14:35:14,468 INFO: | Train Device=xla:0, Step=100, Loss=6.60547, Rate=38096.59 samples/sec, GlobalRate=38092.92 samples/sec
         2023-01-31 14:35:14,545 INFO: | Train Device=xla:0, Step=200, Loss=6.27734, Rate=40148.59 samples/sec, GlobalRate=39732.25 samples/sec
         2023-01-31 14:35:14,619 INFO: | Train Device=xla:0, Step=300, Loss=5.96484, Rate=41927.08 samples/sec, GlobalRate=40798.72 samples/sec
         2023-01-31 14:35:14,695 INFO: | Train Device=xla:0, Step=400, Loss=5.92578, Rate=42184.20 samples/sec, GlobalRate=41177.08 samples/sec
         2023-01-31 14:35:14,769 INFO: | Train Device=xla:0, Step=500, Loss=5.56641, Rate=42517.85 samples/sec, GlobalRate=41480.51 samples/sec
         ......
         2023-01-31 14:35:15,147 INFO: | Train Device=xla:0, Step=1000, Loss=4.96094, Rate=41571.75 samples/sec, GlobalRate=41921.10 samples/sec
         2023-01-31 14:35:15,148 INFO: Training Complete. Completed 32000 sample(s) in 0.7639429569244385 seconds.
         2023-01-31 14:35:30,443 INFO: Monitoring is over without any issue

5. To run an eval job, run the following command:

   .. code-block:: bash

       python run.py --appliance --execution_strategy <pipeline or weight_streaming> --params params.yaml --num_csx=1 --mode eval --model_dir model_dir --credentials_path <path to certificate> --mount_dirs <paths to mount> --python_paths <paths to modelzoo and other python code> --checkpoint_path <path to checkpoint> --mgmt_address <management node address for cluster>

.. Note:: For the ``--execution_strategy`` argument in the example commands above, as a rule of thumb, specify ``--execution_strategy=pipeline`` for small to medium models with fewer than 1 billion parameters to run in Pipelined execution, and specify ``--execution_strategy=weight_streaming`` for large models with 1 billion or more parameters to run in Weight Streaming execution.

   Example command to train a 117M-parameter GPT-2 model in PyTorch in Pipelined execution:

   .. code-block:: bash

       python run.py --appliance --execution_strategy pipeline --params params_PT_GPT2_117M.yaml --num_csx=1 --num_workers_per_csx=8 --mode train --model_dir model_dir_PT_GPT2_117M --credentials_path /path/to/certificate --mount_dirs /path/to/data /path/to/modelzoo /path/to/mount --python_paths /path/to/modelzoo /path/to/python/packages --mgmt_address management_node_address_for_cluster

   Example command to train a 1.5B-parameter GPT-2 model in PyTorch in Weight Streaming execution:

   .. code-block:: bash

       python run.py --appliance --execution_strategy weight_streaming --params params_PT_GPT2_1p5B.yaml --num_csx=1 --mode train --model_dir model_dir_PT_GPT2_1p5B --credentials_path /path/to/certificate --mount_dirs /path/to/data /path/to/modelzoo /path/to/mount --python_paths /path/to/modelzoo /path/to/python/packages --mgmt_address management_node_address_for_cluster

.. Note:: Cerebras supports only one CS-2 for eval mode or for Pipelined execution.

Contents of ``run.py``
~~~~~~~~~~~~~~~~~~~~~~

For your reference, the contents of ``run.py`` are as shown in the `Cerebras Model Zoo <https://github.com/Cerebras/modelzoo>`_.

Output files and artifacts
~~~~~~~~~~~~~~~~~~~~~~~~~~

The model directory (``model_dir``) contains all the results and artifacts of the latest run, including:

- Checkpoints
- TensorBoard event files in ``train/``
- A copy of the ``yaml`` file for your run in ``train/``
- Performance summary in ``performance/``
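To inspect the training metrics recorded in those event files, you can point TensorBoard at the ``train/`` directory. A minimal sketch, assuming TensorBoard is installed in your Python environment and that your run wrote to ``model_dir`` as in the examples above:

.. code-block:: bash

    # Serve the TensorBoard event files written during training.
    # --bind_all exposes the dashboard beyond localhost; adjust the
    # port to one that is open on your user node.
    tensorboard --logdir model_dir/train --port 6006 --bind_all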