PyTorch Quickstart¶

This quickstart is a step-by-step guide to compiling a PyTorch FC-MNIST model (already ported to Cerebras) for your CS system, and then training it.

Prerequisites¶

Attention

Review the Checklist Before You Quickstart before you proceed.

Compile the model¶

Log in to your CS system cluster.
Clone the reference samples repository to your preferred location in your home directory.
git clone https://github.com/Cerebras/cerebras_reference_implementations.git
In the reference samples directory you will see the following PyTorch model examples:
A PyTorch version of FC-MNIST.

The PyTorch versions of BERT Base and BERT Large.

This quickstart uses the FC-MNIST model. Navigate to the fc_mnist model directory:

cd cerebras_reference_implementations/fc_mnist/pytorch/
Compile the model targeting the CS system.
The csrun_cpu command below compiles the code in train mode for the CS system. This step only compiles the code; it does not run training on the CS system.
csrun_cpu python-pt run.py --mode train \
    --compile_only \
    --params configs/<name-of-the-params-file.yaml> \
    --cs_ip <specify your CS_IP>:<port>
Note

The parameters can also be set in the params.yaml file.

Train on GPU¶

To train on a GPU, run:

python run.py --mode train --params configs/<name-of-the-params-file.yaml>

Train on CS system¶

Execute the csrun_wse command to run the training on the CS system. See the command format below:
Attention

For PyTorch models only, the cs_ip flag must include both the IP address and the port number of the CS system. The IP address alone, for example --cs_ip 192.168.1.1, is not sufficient. You must also include the port number, for example --cs_ip 192.168.1.1:9000.
csrun_wse python-pt run.py --mode train \
    --cs_ip <IP:port-number> \
    --params configs/<name-of-the-params-file.yaml>
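Because a missing port is easy to overlook, it can help to check the value before launching a run. The following is a minimal sketch, not part of the Cerebras tooling: parse_cs_ip is a hypothetical helper name, and it assumes the CS system is addressed by a literal IP address rather than a hostname.

```python
import ipaddress
from typing import Tuple


def parse_cs_ip(value: str) -> Tuple[str, int]:
    """Split an '<IP>:<port>' string and check that both parts are present.

    Hypothetical helper for illustration; assumes a literal IP address.
    """
    host, sep, port = value.rpartition(":")
    if not sep or not host or not port:
        raise ValueError(f"expected <IP>:<port>, got {value!r}")
    ipaddress.ip_address(host)  # raises ValueError for a malformed address
    port_number = int(port)  # raises ValueError for a non-numeric port
    if not 1 <= port_number <= 65535:
        raise ValueError(f"port out of range: {port_number}")
    return host, port_number
```

With this sketch, parse_cs_ip("192.168.1.1:9000") returns the address and port, while parse_cs_ip("192.168.1.1") raises ValueError because the port is missing.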

Output files and artifacts¶

The output files and artifacts include:

Model directory (model_dir) - Contains all of the results and artifacts of the latest run, including:

Compile directory (cs_<checksum>)

performance.json file

Checkpoints

Tensorboard event files

yaml files

Model directory and its structure¶

The model directory (model_dir) contains all of the results and artifacts of the latest run. Inside model_dir you will find the following subdirectories and files.

Compile dir - The directory containing the `cs_<checksum>`¶

Artifacts produced during and after compilation are stored in the <model_dir>/cs_<checksum> directory.

Compilation logs and intermediate outputs are helpful for debugging compilation issues.

The xla_service.log file reports the status of the compilation and whether it passed or failed. If the compilation fails, the error message and stack trace are written to xla_service.log.
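When a compile fails, the end of xla_service.log is usually the first place to look. The sketch below locates the log under each cs_<checksum> directory and returns its last lines. The helper names find_compile_logs and tail are hypothetical, and the exact location of the log inside the compile directory is not stated here, so the search is recursive.

```python
from pathlib import Path
from typing import List


def find_compile_logs(model_dir: str, log_name: str = "xla_service.log") -> List[Path]:
    """Return every log_name file found under <model_dir>/cs_* directories."""
    logs: List[Path] = []
    for compile_dir in sorted(Path(model_dir).glob("cs_*")):
        if compile_dir.is_dir():
            # Search recursively: the log's depth inside the dir is an assumption.
            logs.extend(sorted(compile_dir.rglob(log_name)))
    return logs


def tail(path: Path, max_lines: int = 20) -> List[str]:
    """Return the last max_lines lines of a text file."""
    lines = path.read_text(errors="replace").splitlines()
    return lines[-max_lines:]
```

For example, printing tail(log) for each log returned by find_compile_logs("model_dir") shows the most recent status or error output of each compile.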

`Performance.json` file and its parameters¶

The <model_dir>/performance/ directory contains the performance.json file, which reports the following:

compile_time - The amount of time that it took to compile the model to generate the Cerebras executable.

est_samples_per_sec - The estimated performance in terms of samples per second based on the Cerebras compile. Note that this number is theoretical and actual performance may vary.

programming_time - The time taken to prepare the system and load it with the compiled model.

samples_per_sec - The actual performance of your run execution.

suspected_input_bottleneck - This is a beta feature. It indicates whether you are input-starved and need more input workers to feed the Cerebras system.

total_samples - The total gross samples that were iterated during the execution.

total_time - The total time it took to complete the total samples.
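The fields above can be read programmatically to compare measured and estimated throughput. This is a minimal sketch: summarize_performance is a hypothetical helper, and it assumes that samples_per_sec and est_samples_per_sec are stored as plain JSON numbers, which is not confirmed here.

```python
import json
from pathlib import Path


def summarize_performance(model_dir: str) -> dict:
    """Read <model_dir>/performance/performance.json and pick out key fields."""
    perf_path = Path(model_dir) / "performance" / "performance.json"
    with perf_path.open() as f:
        perf = json.load(f)

    summary = {
        "samples_per_sec": perf.get("samples_per_sec"),
        "est_samples_per_sec": perf.get("est_samples_per_sec"),
        "suspected_input_bottleneck": perf.get("suspected_input_bottleneck"),
    }
    actual = summary["samples_per_sec"]
    estimated = summary["est_samples_per_sec"]
    # Add the ratio only when both values are numeric and the estimate is nonzero.
    if isinstance(actual, (int, float)) and isinstance(estimated, (int, float)) and estimated:
        summary["actual_vs_estimated"] = actual / estimated
    return summary
```

A ratio well below 1.0, together with a set suspected_input_bottleneck flag, would suggest looking at the input pipeline before anything else.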

Checkpoints¶

Checkpoints are stored in <model_dir>/checkpoint_*.mdl. They are saved with the frequency specified in the runconfig file.
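To find the most recent checkpoint of a run, the file pattern above can be globbed directly. In this sketch, latest_checkpoint is a hypothetical helper. It orders files by modification time because the meaning of the part of the file name matched by * is not specified here.

```python
from pathlib import Path
from typing import Optional


def latest_checkpoint(model_dir: str) -> Optional[Path]:
    """Return the most recently modified <model_dir>/checkpoint_*.mdl, or None."""
    checkpoints = list(Path(model_dir).glob("checkpoint_*.mdl"))
    if not checkpoints:
        return None
    # Use modification time rather than parsing the file name.
    return max(checkpoints, key=lambda p: p.stat().st_mtime)
```

Calling latest_checkpoint("model_dir") returns None for a run that has not yet saved a checkpoint.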

Tensorboard event files¶

TensorBoard event files are stored in the <model_dir>/train/ directory.

`yaml` files content after the run¶

The yaml file is stored in the train directory. It records the specifics of the run: model-specific configuration (e.g., dropout, activation_fn), optimizer type and optimizer parameters, input data configuration (such as batch_size and shuffle), and run configuration (such as max_steps, checkpoint_steps, and num_epochs).

Software Documentation (Version 1.6.0)
