.. _cs-tensorflow-pl-ws-unified-appliance:

Running TensorFlow Models
=========================

Users interact with the Cerebras Wafer-Scale Cluster as if it were an appliance, meaning that running models of various sizes on the Cerebras Wafer-Scale Cluster is as easy as running on a single device. For first-time user setup for TensorFlow jobs, see :ref:`cs-tf-quickstart`.

Activate your TensorFlow environment
------------------------------------

To run TensorFlow jobs on the Wafer-Scale Cluster, you first need to activate your TensorFlow environment on the user node. Enter the Python environment using the following command:

.. code-block:: bash

   source venv_cerebras_tf/bin/activate

Running the scripts to compile, train, or evaluate your model
--------------------------------------------------------------

The steps to train your model are as follows. We use the GPT-2 model, available in the `Cerebras Model Zoo git repository <https://github.com/Cerebras/modelzoo>`_, for this example. Check with your sysadmin whether your setup has a local copy of the Model Zoo repository available with pre-installed datasets. Otherwise, clone the git repository on your user node and follow the instructions in the readme files in the repository to set up the training datasets.

1. In the Model Zoo, you can find ``run_appliance.py`` scripts for TensorFlow models. For the GPT-2 model, navigate to the following directory in your copy of the Model Zoo:

   .. code-block:: bash

      cd modelzoo/transformers/tf/gpt2

2. Within this directory, run the following command, which performs the initial stage of compilation to get feedback about whether your model is compatible with the Cerebras Software Platform.

   .. code-block:: bash

      python run_appliance.py --execution_strategy <pipeline|weight_streaming> --params params.yaml --num_csx=1 --validate_only --mode train --model_dir model_dir --credentials_path <path to TLS certificate> --mount_dirs <paths to mount> --python_paths <paths to Python code> --mgmt_address <management node address>

   This step can be skipped if you are confident in your code, but it is very convenient for fast iteration on your code because it is considerably faster than a full compile.

3. The next step is to run the full compile. The artifacts from this run are used in the training run. This compile can take longer, depending on the size and complexity of the model (15 minutes to an hour).

   .. code-block:: bash

      python run_appliance.py --execution_strategy <pipeline|weight_streaming> --params params.yaml --num_csx=1 --compile_only --mode train --model_dir model_dir --credentials_path <path to TLS certificate> --mount_dirs <paths to mount> --python_paths <paths to Python code> --mgmt_address <management node address>

4. This is the training step.

   - If you are running on one CS-2 system, enter the following:

     .. code-block:: bash

        python run_appliance.py --execution_strategy <pipeline|weight_streaming> --params params.yaml --num_csx=1 --mode train --model_dir model_dir --credentials_path <path to TLS certificate> --mount_dirs <paths to mount> --python_paths <paths to Python code> --mgmt_address <management node address>

   - If you are running on multiple CS-2 systems, note that ``--num_csx=2`` in the code block below refers to the number of CS-2 systems you are using. In this case, you are running a distributed job on two CS-2 systems within the Wafer-Scale Cluster.

     .. code-block:: bash

        python run_appliance.py --execution_strategy <pipeline|weight_streaming> --params params.yaml --num_csx=2 --mode train --model_dir model_dir --credentials_path <path to TLS certificate> --mount_dirs <paths to mount> --python_paths <paths to Python code> --mgmt_address <management node address>

   The output log is as follows:
   .. code-block:: bash

      Transferring weights to server: 100%|████| 566/566 [00:12<00:00, 44.31tensors/s]
      2023-01-29 20:21:31,937 INFO:root:Finished sending initial weights
      2023-01-29 20:24:11,480 INFO:tensorflow:global step 12: loss = 9.515625 (0.38 steps/sec)
      2023-01-29 20:24:25,356 INFO:tensorflow:global step 24: loss = 9.3828125 (0.53 steps/sec)
      2023-01-29 20:24:39,232 INFO:tensorflow:global step 36: loss = 8.796875 (0.6 steps/sec)
      2023-01-29 20:24:53,108 INFO:tensorflow:global step 48: loss = 8.5625 (0.65 steps/sec)
      2023-01-29 20:25:06,983 INFO:tensorflow:global step 60: loss = 8.3203125 (0.69 steps/sec)
      2023-01-29 20:25:34,723 INFO:tensorflow:global step 72: loss = 8.2890625 (0.63 steps/sec)
      2023-01-29 20:25:48,602 INFO:tensorflow:global step 84: loss = 8.171875 (0.65 steps/sec)
      2023-01-29 20:26:02,482 INFO:tensorflow:global step 96: loss = 8.0859375 (0.67 steps/sec)
      ......
      2023-01-29 20:34:52,323 INFO:tensorflow:global step 480: loss = 6.16015625 (0.71 steps/sec)
      2023-01-29 20:35:20,190 INFO:tensorflow:global step 492: loss = 6.203125 (0.7 steps/sec)
      2023-01-29 20:35:20,200 INFO:tensorflow:global step 500: loss = 6.28125 (0.71 steps/sec)
      2023-01-29 20:35:20,201 INFO:root:Training complete. Completed 2048000 sample(s) in 700.5476334095001 seconds
      2023-01-29 20:35:20,203 INFO:root:Taking final checkpoint at step: 500
      2023-01-29 20:35:35,774 INFO:tensorflow:Saved checkpoint for global step 500 in 15.573367357254028 seconds: /path/to/model_dir/model.ckpt-500
      2023-01-29 20:35:45,914 INFO:root:Monitoring is over without any issue

5. To run an eval job, run the following command:

   .. code-block:: bash

      python run_appliance.py --execution_strategy <pipeline|weight_streaming> --params params.yaml --num_csx=1 --model_dir model_dir --mode eval --credentials_path <path to TLS certificate> --checkpoint_path <path to checkpoint> --mount_dirs <paths to mount> --python_paths <paths to Python code> --mgmt_address <management node address>

.. Note::

   For the ``--execution_strategy`` argument in the example commands above, specify ``--execution_strategy=pipeline`` for small to medium models with <1 billion parameters to run in Pipelined execution, and specify ``--execution_strategy=weight_streaming`` for large models with >= 1 billion parameters to run in Weight Streaming execution.

   Example command to train a 117M GPT-2 in TensorFlow in Pipelined execution:

   ``python run_appliance.py --execution_strategy pipeline --params params_TF_GPT2_117M.yaml --num_csx=1 --num_workers_per_csx=8 --model_dir model_dir_TF_GPT2_117M --credentials_path /path/to/certificate --mount_dirs /path/to/data /path/to/modelzoo /path/to/mount --python_paths /path/to/modelzoo /path/to/additional/python/packages --mgmt_address management_node_address_for_cluster``

   Example command to train a 1.5B GPT-2 in TensorFlow in Weight Streaming execution:

   ``python run_appliance.py --execution_strategy weight_streaming --params params_TF_GPT2_1p5B.yaml --num_csx=1 --model_dir model_dir_TF_GPT2_1p5B --credentials_path /path/to/certificate --mount_dirs /path/to/data /path/to/modelzoo /path/to/mount --python_paths /path/to/modelzoo /path/to/additional/python/packages --mgmt_address management_node_address_for_cluster``

.. Note::

   Cerebras only supports one CS-2 system for eval mode or for Pipelined execution.

Contents of ``run_appliance.py``
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

For your reference, the contents of ``run_appliance.py`` are shown in the `Cerebras Model Zoo <https://github.com/Cerebras/modelzoo>`_.
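Because the compile, train, and eval commands above share the same cluster connection flags, it can be convenient to factor those flags into a small wrapper script. The sketch below is only an illustration and is not part of the Model Zoo; the script name ``run_gpt2.sh`` and every path and address in it are placeholders to replace with values for your cluster.

.. code-block:: bash

   #!/usr/bin/env bash
   # run_gpt2.sh -- hypothetical wrapper around run_appliance.py that factors
   # out the connection flags shared by the compile, train, and eval commands
   # shown above. All paths and the management address are placeholders.
   set -euo pipefail

   MODE="$1"   # train or eval
   shift       # any extra flags (e.g. --validate_only, --compile_only) pass through

   python run_appliance.py \
     --execution_strategy weight_streaming \
     --params params.yaml \
     --num_csx=1 \
     --mode "$MODE" \
     --model_dir model_dir \
     --credentials_path /path/to/certificate \
     --mount_dirs /path/to/data /path/to/modelzoo \
     --python_paths /path/to/modelzoo \
     --mgmt_address management_node_address_for_cluster \
     "$@"

For example, ``./run_gpt2.sh train --validate_only`` runs the quick compatibility check, while ``./run_gpt2.sh train`` launches the training job itself.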
Output files and artifacts
~~~~~~~~~~~~~~~~~~~~~~~~~~

The model directory (``model_dir``) contains all the results and artifacts of the latest run, including:

- Checkpoints
- TensorBoard event files
- A copy of the ``yaml`` file for your run
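For example, a quick way to inspect these artifacts from the user node is to list the directory and point TensorBoard at the event files. This is a generic sketch, not a Cerebras-specific command; it assumes the ``tensorboard`` command is available in your Python environment and uses the ``model_dir`` name from the commands above.

.. code-block:: bash

   # List the checkpoints, event files, and copied yaml file from the latest run.
   ls model_dir

   # View the recorded training curves; open http://localhost:6006 afterwards.
   tensorboard --logdir model_dir --port 6006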