.. _cs-pytorch-qs:

PyTorch Quickstart
==================

This quickstart is a step-by-step guide to compile a PyTorch FC-MNIST model (already ported to Cerebras) targeting your CS system.

Prerequisites
-------------

.. attention::
    Go over this :ref:`checklist-before-you-start` before you proceed.

.. _pt-qs-compile-model:

Compile the model
-----------------

1. Log in to your CS system cluster.

2. Clone the reference samples repository to your preferred location in your home directory.

   .. code-block:: bash

       git clone https://github.com/Cerebras/cerebras_reference_implementations.git

   The reference samples directory contains the following PyTorch model examples:

   - A `PyTorch version of FC-MNIST `_.
   - The `PyTorch versions of BERT Base and BERT Large `_.

   In this quickstart we use the FC-MNIST model. Navigate to the ``fc_mnist`` model directory.

   .. code-block:: bash

       cd cerebras_reference_implementations/fc_mnist/pytorch/

3. Compile the model targeting the CS system. The ``csrun_cpu`` command below compiles the code in ``train`` mode for the CS system. Note that this step only compiles the code and does not run training on the CS system.

   .. code-block:: bash

       csrun_cpu python-pt run.py --mode train \
           --compile_only \
           --params configs/ \
           --cs_ip :

   .. note::
       The parameters can also be set in the ``params.yaml`` file.

Train on GPU
------------

To train on a GPU, run:

.. code-block:: bash

    python run.py --mode train --params configs/

.. _pt-qs-train-on-cs:

Train on CS system
------------------

1. Execute the ``csrun_wse`` command to run the training on the CS system. See the command format below:

   .. attention::
       For PyTorch models only, the ``cs_ip`` flag must include both the IP address and the port number of the CS system. The IP address alone, for example ``--cs_ip 192.168.1.1``, is not sufficient. You must also include the port number, for example: ``--cs_ip 192.168.1.1:9000``.

   .. code-block:: bash

       csrun_wse python-pt run.py --mode train \
           --cs_ip \
           --params configs/

Output files and artifacts
--------------------------

The output files and artifacts include:

* Model directory (``model_dir``) - Contains all of the results and artifacts of the latest run, including:

  + Compile directory (``cs_``)
  + ``performance.json`` file
  + Checkpoints
  + Tensorboard event files
  + ``yaml`` files

Model directory and its structure
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The model directory (``model_dir``) contains all of the results and artifacts of the latest run. Inside ``model_dir``, the following subdirectories are present.

Compile dir - The directory containing the ``cs_``
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The compilation artifacts produced during and after compilation are stored in the ``/cs_`` directory. Compilation logs and intermediate outputs are helpful for debugging compilation issues. The ``xla_service.log`` file should contain information about the status of the compilation, and whether it passed or failed. In case of failure, it should print an error message and stack trace in ``xla_service.log``.

``performance.json`` file and its parameters
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

There is a performance directory that should contain the ``performance.json`` file: ``/performance/performance.json``. It contains the information listed below:

* ``compile_time`` - The amount of time it took to compile the model and generate the Cerebras executable.
* ``est_samples_per_sec`` - The estimated performance in terms of samples per second, based on the Cerebras compile. Note that this number is theoretical and actual performance may vary.
* ``programming_time`` - The time taken to prepare the system and load it with the compiled model.
* ``samples_per_sec`` - The actual performance of your run execution.
* ``suspected_input_bottleneck`` - This is a beta feature.
  It indicates whether you are input-starved and need more input workers to feed the Cerebras system.
* ``total_samples`` - The total number of samples iterated during the execution.
* ``total_time`` - The total time it took to complete the total samples.

Checkpoints
~~~~~~~~~~~

Checkpoints are stored as ``/checkpoint_*.mdl``. They are saved with the frequency specified in the ``runconfig`` file.

Tensorboard event files
~~~~~~~~~~~~~~~~~~~~~~~

Tensorboard event files are stored in the ``/train/`` directory.

``yaml`` files content after the run
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The ``yaml`` file is stored in the train directory. This yaml file contains information about the specifics of the run, such as the model-specific configuration (for example, ``dropout`` and ``activation_fn``), the optimizer type and optimizer parameters, the input data configuration (such as ``batch_size`` and shuffle), and the run configuration (such as ``max_steps``, ``checkpoint_steps``, and ``num_epochs``).
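As a quick sanity check after a run, the ``performance.json`` fields described earlier can be summarized with a short script. This is a minimal sketch: the ``summarize_performance`` helper and the field values shown are illustrative, and the exact location of ``performance.json`` under your model directory may differ.

.. code-block:: python

    import json
    from pathlib import Path

    def summarize_performance(perf: dict) -> dict:
        """Summarize key fields of a performance.json payload."""
        return {
            "compile_time": perf["compile_time"],
            "est_samples_per_sec": perf["est_samples_per_sec"],
            "samples_per_sec": perf["samples_per_sec"],
            # samples_per_sec should roughly equal total_samples / total_time.
            "derived_samples_per_sec": perf["total_samples"] / perf["total_time"],
        }

    # Illustrative values only; for a real run, read the file from your
    # model directory, e.g.:
    # perf = json.loads(Path("model_dir/performance/performance.json").read_text())
    perf = {
        "compile_time": 1200.0,
        "est_samples_per_sec": 9000.0,
        "programming_time": 90.0,
        "samples_per_sec": 8500.0,
        "suspected_input_bottleneck": False,
        "total_samples": 1_000_000,
        "total_time": 117.6,
    }

    print(summarize_performance(perf))

Comparing ``samples_per_sec`` against ``est_samples_per_sec`` and the derived ``total_samples / total_time`` value is a simple way to spot an input-starved run.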
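The ``checkpoint_*.mdl`` naming pattern above can be used to locate the most recent checkpoint programmatically. A minimal sketch, assuming checkpoints land directly under your model directory with a numeric step suffix (an assumption about the naming scheme; adjust the glob to your layout):

.. code-block:: python

    from pathlib import Path
    from typing import Optional

    def latest_checkpoint(model_dir: str) -> Optional[Path]:
        """Return the checkpoint_*.mdl file with the highest step, or None."""
        def step(p: Path) -> int:
            # checkpoint_2000.mdl -> 2000; unparsable suffixes sort first.
            try:
                return int(p.stem.split("_")[-1])
            except ValueError:
                return -1
        return max(Path(model_dir).glob("checkpoint_*.mdl"), key=step, default=None)

Sorting on the numeric step rather than lexicographically matters once runs pass five-digit steps: ``checkpoint_10000.mdl`` must sort after ``checkpoint_2000.mdl``.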
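The ``yaml`` file described above typically groups these settings into sections. A hypothetical fragment illustrating the shape, where the section names and all values are illustrative rather than taken from the FC-MNIST config:

.. code-block:: yaml

    model:
        dropout: 0.1
        activation_fn: relu

    optimizer:
        optimizer_type: SGD
        learning_rate: 0.001

    train_input:
        batch_size: 256
        shuffle: true

    runconfig:
        max_steps: 10000
        checkpoint_steps: 1000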