.. _cs-pytorch-ws-appliance-mode:

Running Large Models (Weight Streaming Execution)
=================================================

Users interact with the Cerebras Wafer-Scale Cluster as if it were an appliance, meaning running large models on the Cerebras Wafer-Scale Cluster is as easy as running on a single device. For first-time user setup for PyTorch jobs, see :ref:`cs-pytorch-qs`.

Activate your PyTorch environment
---------------------------------

To run PyTorch jobs on the Wafer-Scale Cluster, you must first activate your PyTorch environment on the user node. Enter the Python environment using the following command:

.. code-block:: bash

   source venv_cerebras_pt/bin/activate

Running the scripts to compile, train, or evaluate your model
-------------------------------------------------------------

The steps to train your model are as follows. We use the `GPT-2 model `_ available in the `Cerebras Model Zoo git repository `_ for this example. Check with your sysadmin whether your setup has a local copy of the Model Zoo repository available with pre-installed datasets. Otherwise, you can clone this git repository on your user node yourself and follow the instructions in the ReadMe files in the repository on how to set up training datasets.

1. In the Model Zoo, you can find ``run.py`` scripts for PyTorch models supported with the weight streaming execution mode. For the GPT-2 model, navigate to the following directory in your copy of the Model Zoo:

   .. code-block:: bash

      cd modelzoo/transformers/pytorch/gpt2

2. Within this directory, run the following command, which performs the initial stage of compilation to get feedback about whether your model is compatible with the Cerebras Software Platform:

   .. code-block:: bash

      python run.py --appliance --params params.yaml --num_csx=1 --model_dir model_dir --validate_only --mode train --credentials_path= --mount_dirs --python_paths

   This step can be skipped if you are confident in your code.
   However, it is very convenient for fast iteration on your code, as it is considerably faster than a full compile.

3. The next step is to run the full compile. The artifacts from this run are used in the training run. This compile can take longer, depending on the size and complexity of the model (15 minutes to an hour):

   .. code-block:: bash

      python run.py --appliance --params params.yaml --num_csx=1 --model_dir model_dir --compile_only --mode train --credentials_path= --mount_dirs --python_paths

4. This is the training step.

   - If you are running one CS-2, enter the following:

     .. code-block:: bash

        python run.py --appliance --params params.yaml --num_csx=1 --model_dir=model_dir --mode train --credentials_path= --mount_dirs --python_paths

   - If you are running multiple CS-2s, enter the following. Note that ``num_csx=2`` in the below code block refers to the number of CS-2 systems you are using. In this case, you are running a distributed job on two CS-2 systems within the Wafer-Scale Cluster.

     .. code-block:: bash

        python run.py --appliance --params params.yaml --num_csx=2 --model_dir=model_dir --mode train --credentials_path= --mount_dirs --python_paths

   - The output logs are as follows:

     .. code-block:: bash

        Transferring weights to server: 100%|██| 1165/1165 [01:00<00:00, 19.33tensors/s]
        INFO:   Finished sending initial weights
        INFO:   | Train Device=xla:0 Step=50 Loss=8.31250 Rate=69.37 GlobalRate=69.37
        INFO:   | Train Device=xla:0 Step=100 Loss=7.25000 Rate=68.41 GlobalRate=68.56
        INFO:   | Train Device=xla:0 Step=150 Loss=6.53125 Rate=68.31 GlobalRate=68.46
        INFO:   | Train Device=xla:0 Step=200 Loss=6.53125 Rate=68.54 GlobalRate=68.51
        INFO:   | Train Device=xla:0 Step=250 Loss=6.12500 Rate=68.84 GlobalRate=68.62
        INFO:   | Train Device=xla:0 Step=300 Loss=5.53125 Rate=68.74 GlobalRate=68.63
        INFO:   | Train Device=xla:0 Step=350 Loss=4.81250 Rate=68.01 GlobalRate=68.47
        INFO:   | Train Device=xla:0 Step=400 Loss=5.37500 Rate=68.44 GlobalRate=68.50
        INFO:   | Train Device=xla:0 Step=450 Loss=6.43750 Rate=68.43 GlobalRate=68.49
        INFO:   | Train Device=xla:0 Step=500 Loss=5.09375 Rate=66.71 GlobalRate=68.19
        INFO:   Training Complete. Completed 60500 sample(s) in 887.2672743797302 seconds.

5. To run an eval job, run the following command:

   .. code-block:: bash

      python run.py --appliance --params params.yaml --num_csx=1 --model_dir=model_dir --mode eval --eval_steps=10 --credentials_path= --mount_dirs --python_paths --checkpoint_path

.. note::

   Cerebras only supports one CS-2 for eval mode.

Contents of ``run.py``
~~~~~~~~~~~~~~~~~~~~~~

For your reference, the contents of ``run.py`` are as shown in the `Cerebras Model Zoo `_.

Output files and artifacts
~~~~~~~~~~~~~~~~~~~~~~~~~~

The output files and artifacts of the model directory (``model_dir``) contain all the results and artifacts of the latest run, including:

- Checkpoints
- Tensorboard event files
- A copy of the ``yaml`` file for your run

Checkpoints
***********

Checkpoints are stored in ``model_dir/checkpoint_*.mdl``.

Tensorboard event files
***********************

Tensorboard event files are stored in the ``model_dir`` directory.
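As a sanity check when reading training logs like the one above, the final ``GlobalRate`` is simply total samples divided by wall-clock seconds, so it should agree with the summary line printed at the end of training. A minimal sketch of that cross-check, using plain Python with no Cerebras dependencies (the log line below is copied from the example output; the regular expression is an illustrative assumption, not part of the Cerebras tooling):

```python
import re

# Summary line from the example training log above.
summary = (
    "INFO:   Training Complete. "
    "Completed 60500 sample(s) in 887.2672743797302 seconds."
)

# Pull out the sample count and elapsed time.
m = re.search(r"Completed (\d+) sample\(s\) in ([\d.]+) seconds", summary)
samples = int(m.group(1))
seconds = float(m.group(2))

# Average throughput in samples/sec; this should match the last
# GlobalRate reported in the step logs (68.19 at Step=500).
global_rate = samples / seconds
print(f"{global_rate:.2f} samples/sec")
```

Here 60500 samples over roughly 887.27 seconds gives about 68.19 samples/sec, consistent with the ``GlobalRate=68.19`` in the final step log, whereas ``Rate`` is the instantaneous throughput and fluctuates from step to step.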