.. _cs-pytorch-pl-slurm-singularity:

Pipeline Slurm/Singularity Workflow
===================================

This reference guide covers the original workflow, which uses Slurm as the orchestrating software running on our CPU nodes to mediate communication between the CS system and the original Cerebras Support-Cluster. Note that this is no longer the recommended workflow: Slurm/Singularity is supported only on Cerebras' legacy clusters, while the latest Wafer-Scale Clusters support only a Kubernetes-based workflow. Refer to :ref:`cs-pytorch-pl-K8s` for more information.

This is a step-by-step guide to compiling a PyTorch FC-MNIST model (already ported to Cerebras) targeting your CS system.

Prerequisites
-------------

Go over the `Checklist Before You Quick Start`_ before you proceed.

Compile the model
-----------------

To compile the model, perform the following steps.

1. Log in to your CS system cluster.

2. Clone the reference samples repository to your preferred location in your home directory, using the following command:

   .. code-block:: bash

       git clone https://github.com/Cerebras/modelzoo

   The reference samples directory contains the following PyTorch model examples:

   + A PyTorch version of FC-MNIST.
   + PyTorch versions of BERT Base and BERT Large.

   In this quick start, we use the FC-MNIST model. Navigate to the ``fc_mnist`` model directory using the following command:

   .. code-block:: bash

       cd cerebras/modelzoo/fc_mnist/pytorch

3. Compile the model targeting the CS system. The ``csrun_cpu`` command below compiles the code in ``train`` mode for the CS system. Note that this step only compiles the code and does not run training on the CS system.

   .. code-block:: bash

       csrun_cpu python-pt run.py --mode train \
           --compile_only \
           --params configs/<params-file.yaml> \
           --cs_ip <IP-address>:<port>

   .. note::

      The parameters can also be set in the ``params.yaml`` file.

Train on GPU
------------

To train on a GPU, run the following command:

.. code-block:: bash

    python run.py --mode train --params configs/<params-file.yaml>

Train on CS system
------------------

Execute the ``csrun_wse`` command to run the training on the CS system. See the command format below.

.. note::

   For PyTorch models only, the ``cs_ip`` flag must include both the IP address and the port number of the CS system. The IP address alone, for example ``--cs_ip 192.168.1.1``, is not sufficient. You must also include the port number, for example: ``--cs_ip 192.168.1.1:9000``.

.. code-block:: bash

    csrun_wse python-pt run.py --mode train \
        --cs_ip <IP-address>:<port> \
        --params configs/<params-file.yaml>

Output files and artifacts
--------------------------

The output files and artifacts include a model directory (``model_dir``) that contains all the results and artifacts of the latest run, including:

- Compile directory (``cs_<checksum>``)
- ``performance.json`` file
- Checkpoints
- Tensorboard event files
- ``yaml`` files

Compile dir – The directory containing the ``cs_<checksum>``
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The compilation artifacts produced during and after compilation are stored in the ``<model_dir>/cs_<checksum>`` directory. Compilation logs and intermediate outputs are helpful for debugging compilation issues. The ``xla_service.log`` should contain information about the status of compilation and whether it passed or failed. In case of failure, it should print an error message and stacktrace in ``xla_service.log``.
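As a quick sanity check after compilation, you can scan ``xla_service.log`` for failures. The commands below are an illustrative sketch that assumes the default ``model_dir`` layout described above; adjust the paths to match your run.

.. code-block:: bash

    # Illustrative paths only: the cs_<checksum> directory name depends on your
    # compile, and <model_dir> is the model directory of your run.
    ls <model_dir>/cs_*/                                      # list the compile artifacts
    grep -iE "error|fail" <model_dir>/cs_*/xla_service.log    # surface any compile failures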
``performance.json`` file and its parameters
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The performance directory should contain the performance file at ``<model_dir>/performance/performance.json``. This file contains the information listed below:

- ``compile_time`` - The amount of time it took to compile the model and generate the Cerebras executable.
- ``est_samples_per_sec`` - The estimated performance in terms of samples per second, based on the Cerebras compile. Note that this number is theoretical and actual performance may vary.
- ``programming_time`` - The time taken to prepare the system and load it with the compiled model.
- ``samples_per_sec`` - The actual performance of your run execution.
- ``suspected_input_bottleneck`` - This is a beta feature. It indicates whether you are input-starved and need more input workers to feed the Cerebras system.
- ``total_samples`` - The total gross samples that were iterated during the execution.
- ``total_time`` - The total time it took to complete the total samples.

Checkpoints
~~~~~~~~~~~

Checkpoints are stored in ``<model_dir>/checkpoint_*.mdl``. They are saved with the frequency specified in the ``runconfig`` file.

Tensorboard event files
~~~~~~~~~~~~~~~~~~~~~~~

Tensorboard event files are stored in the ``<model_dir>/train/`` directory.

``yaml`` files content after the run
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The ``yaml`` file is stored in the train directory. This ``yaml`` file contains the specifics of the run: model-specific configuration (for example, ``dropout`` and ``activation_fn``), optimizer type and optimizer parameters, input data configuration (such as ``batch_size`` and shuffle), and run configuration (such as ``max_steps``, ``checkpoint_steps``, and ``num_epochs``).
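For example, to confirm which run settings were actually applied, you can inspect the ``yaml`` file written to the train directory. This is an illustrative sketch; the exact file name depends on your run.

.. code-block:: bash

    # Illustrative paths only; substitute the model directory of your run.
    cat <model_dir>/train/*.yaml                                      # full run configuration
    grep -E "max_steps|checkpoint_steps|batch_size" <model_dir>/train/*.yaml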