TensorFlow Quickstart¶

If you are new to Cerebras, begin with this quickstart. It familiarizes you, at a high level, with the Cerebras system workflow before you move on to in-depth development.

If you already explored this quickstart

Skip to Cerebras Command Line Pattern for the commands you need to compile on a CPU node and execute on the Cerebras system.

If you are ready to develop

For an in-depth development guide using TensorFlow for Cerebras, skip to: Workflow for TensorFlow on CS.

This quickstart provides step-by-step instructions to:

Clone the reference samples GitHub repository. This repository contains the neural network models that are validated on the Cerebras system.
Compile a simple, fully-connected MNIST (FC-MNIST) model on a CPU node in your Cerebras cluster. This step is recommended before you run the model on the Cerebras system.
Train and evaluate the model on a CPU node. This approach is recommended for your development workflow because it gives you better control when debugging your model before you run it on the Cerebras system. Note that this is practical only for a tiny model like FC-MNIST; any other model would take a significant amount of time.
Finally, run the model training job directly on your CS system. In this approach, compilation is also performed automatically on the CS system before training starts.
Check the output files and generated artifacts for the training run.

Prerequisites¶

Attention

Go over this Checklist Before You Quickstart before you proceed.

Clone the reference samples repository¶

Log in to your CS system cluster.
Clone the reference samples repository to your preferred location in your home directory.
git clone https://github.com/Cerebras/cerebras_reference_implementations.git

In the reference samples directory you will see a few models for PyTorch and TensorFlow. This quickstart uses the FC-MNIST model.

Navigate to the fc_mnist model directory.

cd cerebras_reference_implementations/fc_mnist/tf/

Compile on CPU¶

We recommend that you first compile your model successfully on a support cluster CPU node before running it on the CS system.

First, run in validate_only mode, which performs a fast, lightweight verification. In this mode, compilation runs only through the first few stages, up to kernel library matching.
After a successful validate_only run, you can run full compilation with compile_only mode.

This section of the quickstart shows how to execute these steps on a CPU node.

Tip

The validate_only step is very fast, enabling you to rapidly iterate on your model code. Without needing access to the CS system wafer scale engine, you can determine in this validate_only step if you are using any TensorFlow layer or functionality that is unsupported by either XLA or CGC.

Follow these steps:

Navigate to the model directory.

cd cerebras_reference_implementations/fc_mnist/tf/

Run the compilation in validate_only mode.

csrun_cpu python run.py --mode train --validate_only
      ...
      XLA Extraction Complete
      =============== Starting Cerebras Compilation ===============
      Cerebras compilation completed: 100%|██████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:02s,  1.23s/stages]
      =============== Cerebras Compilation Completed ===============

Note

The validate_only mode checks the kernel compatibility of your model. When your model passes this mode, run the full compilation with compile_only to generate the CS system executable.

Run the full compilation process in compile_only mode.

This step runs the full compilation through all stages of the Cerebras software stack to generate a CS system executable.

csrun_cpu python run.py --mode train --compile_only --cs_ip <specify your CS_IP>
...
XLA Extraction Complete
=============== Starting Cerebras Compilation ===============
Cerebras compilation completed: |                    | 17/? [00:18s,  1.09s/stages]
=============== Cerebras Compilation Completed ===============

When the above compilation is successful, the model is guaranteed to run on the CS system. You can also use this mode to pre-compile many different model configurations offline, so that you can more fully utilize your allotted CS system cluster time.

Note

The compiler detects whether a binary already exists for a particular model configuration and, if it finds one, skips on-the-fly compilation during training.
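The two-stage flow described in this section (a fast validate_only check, then a full compile_only run) can be scripted. The following is a minimal sketch, not part of the reference samples: the helper names `build_compile_commands` and `run_staged_compile` are hypothetical, and it assumes that csrun_cpu exits with a nonzero status when a stage fails. The command lines themselves are the ones shown above.

```python
import subprocess
from typing import Callable, List, Sequence


def build_compile_commands(cs_ip: str, mode: str = "train") -> List[List[str]]:
    """Return the validate_only command followed by the compile_only command."""
    base = ["csrun_cpu", "python", "run.py", "--mode", mode]
    return [
        base + ["--validate_only"],
        base + ["--compile_only", "--cs_ip", cs_ip],
    ]


def run_staged_compile(
    cs_ip: str,
    runner: Callable[[Sequence[str]], int] = subprocess.call,
) -> bool:
    """Run the fast check first; start the full compile only if it passes.

    Assumption: a nonzero exit status from csrun_cpu means the stage failed.
    """
    for command in build_compile_commands(cs_ip):
        print("Running:", " ".join(command))
        if runner(command) != 0:
            # Stop at the first failing stage so the slower step is skipped.
            return False
    return True
```

Called from the fc_mnist/tf directory as `run_staged_compile("<specify your CS_IP>")`, this would run both stages in order. The `runner` argument exists only so the logic can be exercised without a cluster.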

Train and evaluate on CPU¶

To train and evaluate on a CPU node, follow these steps:

Navigate to the model directory.

cd cerebras_reference_implementations/fc_mnist/tf/

Train and evaluate the model on the CPU.

# train on CPU
csrun_cpu python run.py --mode train

# run eval on CPU
csrun_cpu python run.py --mode eval  --eval_steps 1000

Run the model on the CS system¶

Run the model on the CS system.

The csrun_wse command below compiles the code if no existing compile artifacts are found, and then runs the compiled executable on the CS system.

csrun_wse python run.py --mode train \
    --cs_ip <YOUR-CS1-IP-ADDRESS> \
    --max_steps 100000

Note

The max_steps and other parameters such as save_checkpoints_steps can also be set in the params.yaml file.

The above command trains the FC-MNIST model for 100,000 steps on the CS system at the IP address specified in the --cs_ip flag. When the command executes, you will see output similar to the following:

srun: job 5834 queued and waiting for resources
srun: job 5834 has been allocated resources
...
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Saving checkpoints for 0 into model_dir/model.ckpt.
INFO:tensorflow:Programming CS system fabric. This may take a couple of minutes - please do not interrupt.
INFO:tensorflow:Fabric programmed
INFO:tensorflow:Coordinator fully up. Waiting for Streaming (using 0.97% out of 301600 cores on the fabric)
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
...
INFO:tensorflow:Training finished with 25600000 samples in 187.465 seconds, 136558.69 samples / second
INFO:tensorflow:Saving checkpoints for 100000 into model_dir/model.ckpt.
INFO:tensorflow:global step 100000: loss = 1.901388168334961e-05 (532.0 steps/sec)
INFO:tensorflow:global step 100000: loss = 1.901388168334961e-05 (532.0 steps/sec)
INFO:tensorflow:Loss for final step: 1.9e-05.
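The final summary line in the log can be checked with simple arithmetic: throughput is the sample count divided by the elapsed time. The sketch below parses that one line (copied from the output above) and recomputes the figure. The function name `parse_training_summary` is hypothetical, and the regular expression assumes the line keeps exactly the format shown here.

```python
import re

# Summary line copied from the sample output above.
LOG_LINE = (
    "INFO:tensorflow:Training finished with 25600000 samples "
    "in 187.465 seconds, 136558.69 samples / second"
)


def parse_training_summary(line: str):
    """Extract (samples, seconds, reported_rate) from the summary line."""
    match = re.search(
        r"Training finished with (\d+) samples in ([\d.]+) seconds, "
        r"([\d.]+) samples / second",
        line,
    )
    if match is None:
        raise ValueError("not a training summary line")
    return int(match.group(1)), float(match.group(2)), float(match.group(3))


samples, seconds, reported_rate = parse_training_summary(LOG_LINE)
computed_rate = samples / seconds
# 100000 is the --max_steps value passed to csrun_wse above.
samples_per_step = samples / 100000

print(f"computed throughput: {computed_rate:.2f} samples/second")
print(f"reported throughput: {reported_rate:.2f} samples/second")
print(f"samples per step: {samples_per_step:.0f}")
```

The recomputed rate agrees with the reported one to within about one sample per second; the small gap comes from the rounded elapsed time in the log. The samples-per-step value of 256 is an inference from these numbers, not a batch size stated in this quickstart.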

Output files and artifacts¶

The output files and artifacts include:

Model directory (model_dir) - Contains all of the results and artifacts of the latest run, including:

Compile directory (cs_<checksum>)

performance.json file

Checkpoints

TensorBoard event files

yaml files

Model directory and its structure¶

The Model directory (model_dir) contains all of the results and artifacts of the latest run. If you go into the model_dir directory, the following subdirectories are present.

Compile dir - The directory containing the `cs_<checksum>`¶

The cs_<checksum> directory (also known as the cached compile directory) contains the .elf file, which is used to program the system.

Output of compilation - indicates whether the compile passed or failed; if failed, then the logs show at which stage compilation failed.
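To see which cached compiles are present after a run, you can list the cs_<checksum> directories and the .elf files they hold. This is a minimal sketch under stated assumptions: the helper `find_compile_artifacts` is hypothetical, the directories are assumed to sit directly under model_dir with names starting with `cs_`, and the search is recursive because the layout inside each compile directory is not described here.

```python
from pathlib import Path
from typing import Dict, List


def find_compile_artifacts(model_dir: str) -> Dict[str, List[str]]:
    """Map each cs_<checksum> directory name to the .elf files inside it."""
    artifacts: Dict[str, List[str]] = {}
    for entry in sorted(Path(model_dir).glob("cs_*")):
        if entry.is_dir():
            # Search recursively: the internal layout is an assumption.
            artifacts[entry.name] = sorted(
                str(path.relative_to(entry)) for path in entry.rglob("*.elf")
            )
    return artifacts
```

An empty list for a directory means no .elf file was found there, which may indicate that the compile did not finish.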

`performance.json` file and its parameters¶

The performance directory should contain the performance.json file, at <model_dir>/performance/performance.json. This file contains the following information:

compile_time - The amount of time that it took to compile the model to generate the Cerebras executable.

est_samples_per_sec - The estimated performance in terms of samples per second based on the Cerebras compile. Note that this number is theoretical and actual performance may vary.

programming_time - The time taken to prepare the system and load the compiled model onto it.

samples_per_sec - The actual performance of your run execution.

suspected_input_bottleneck - This is a beta feature. It indicates whether you are input-starved and need more input workers to feed the Cerebras system.

total_samples - The total gross samples that were iterated during the execution.

total_time - The total time it took to complete the total samples.
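The fields above make it easy to compare the compiler's estimate with the measured throughput. The sketch below assumes that performance.json is a flat JSON object whose keys match the names listed here and whose throughput values are numeric; neither detail is confirmed by this page. The function `summarize_performance` is a hypothetical helper.

```python
import json
from pathlib import Path


def summarize_performance(model_dir: str) -> dict:
    """Compare est_samples_per_sec with samples_per_sec for one run.

    Assumption: performance.json is a flat object with numeric values.
    """
    path = Path(model_dir) / "performance" / "performance.json"
    with path.open() as handle:
        perf = json.load(handle)

    estimated = float(perf["est_samples_per_sec"])
    actual = float(perf["samples_per_sec"])
    return {
        "estimated": estimated,
        "actual": actual,
        # Fraction of the theoretical estimate that the run achieved.
        "fraction_of_estimate": actual / estimated if estimated else None,
        "suspected_input_bottleneck": perf.get("suspected_input_bottleneck"),
    }
```

A low fraction together with a set suspected_input_bottleneck flag would point toward the input pipeline rather than the compiled model.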

Checkpoints¶

Checkpoints are stored in <model_dir>; for example, <model_dir>/model.ckpt-0.index, <model_dir>/model.ckpt-0.meta and <model_dir>/model.ckpt-1.data-00000-of-00001. They are saved with the frequency specified in the runconfig file.
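To find the most recent checkpoint without loading TensorFlow, you can read the step number from the .index file names. This is a minimal sketch: `latest_checkpoint_step` is a hypothetical helper, and the pattern assumes names of the form `model.ckpt-<step>.index` (it also accepts `model-ckpt-<step>.index`).

```python
import re
from pathlib import Path
from typing import Optional

# Assumed naming: model.ckpt-<step>.index or model-ckpt-<step>.index.
CHECKPOINT_INDEX = re.compile(r"^model[.-]ckpt-(\d+)\.index$")


def latest_checkpoint_step(model_dir: str) -> Optional[int]:
    """Return the highest checkpoint step in model_dir, or None if none exist."""
    steps = []
    for entry in Path(model_dir).iterdir():
        match = CHECKPOINT_INDEX.match(entry.name)
        if match:
            steps.append(int(match.group(1)))
    return max(steps) if steps else None
```

For the training run shown earlier, which saves checkpoints at steps 0 and 100000, this would report the final step once both have been written.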

TensorBoard event files¶

TensorBoard event files are also stored in <model_dir>.

`yaml` files content after the run¶

The yaml file is stored in the train directory. It records the specifics of the run: model-specific configuration (e.g., dropout, activation_fn); optimizer type and optimizer parameters; input data configuration, such as batch_size and shuffle; and run configuration, such as max_steps, checkpoint_steps, and num_epochs.

Software Documentation (Version 1.5.0)
