Weight Streaming Appliance Workflow

Weight Streaming (WS) is one of the Cerebras execution modes and is suited to training extreme-scale models (>1B parameters). To learn more about the differences between pipelined mode and WS mode, see Cerebras Execution Modes.

Note

From CSoft R1.6 onwards, Cerebras has migrated weight streaming support to the appliance model, which is based on a Kubernetes (K8s) workflow and requires a Cerebras Wafer-Scale Cluster to train large-scale models (>1B parameters) in weight streaming mode.

To learn more about our Wafer-Scale Cluster capabilities, contact Cerebras Support by sending an email to support@cerebras.net.

Appliance mode workflow

The appliance mode is a workflow that simplifies running large models across the Wafer-Scale Cluster as if it were a single device.

Perform the following steps to run the appliance:

  1. Ensure that the admin setup is complete. Confirm this with your sysadmin.

  2. Follow the first-time user setup procedure.

  3. Run the scripts to train or evaluate your model.

Admin setup

Your admin should have completed the following setup:

  • Kubernetes is set up.

  • Cluster management is already running on the appliance and is ready to interact with the user node.

  • The TLS certificate is generated and its location is known.

  • Python 3.7 is available.

  • The path to the Cerebras packages is available.

First-time user setup

Before you use appliance mode for the first time, set up your environment for weight streaming execution as shown below.

Note

Make sure that you have the TLS certificate from your sysadmin. It is needed for communication between the user node and the Wafer-Scale Cluster. Your admin will have shared the path to this file during setup.

  1. Set up the Python virtual environment using Python 3.7. Create the environment named venv_appliance using the following command:

    python3.7 -m venv venv_appliance
    
  2. Identify the packages you need. Three main packages are available: the cerebras_appliance software package, the cerebras_tensorflow package if you use TensorFlow, and the cerebras_pytorch package if you use PyTorch.

  3. Enter the following commands on the user node, in this order so that the appliance wheel is installed first (this example installs the TensorFlow package):

    source venv_appliance/bin/activate
    pip install <path_to_wheels>/cerebras_appliance-<Cerebras release version>_<date>_<build>_<commit>-py3-none-any.whl --find-links=<path_to_wheels>
    pip install <path_to_wheels>/cerebras_tensorflow-<Cerebras release version>_<date>_<build>_<commit>-py3-none-any.whl --find-links=<path_to_wheels>
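
After the install completes, a quick sanity check of the virtual environment can catch a wrong interpreter or a missing wheel before you submit a job. The following is a minimal sketch, not part of the Cerebras tooling. It assumes that the importable module names match the wheel names (cerebras_appliance, cerebras_tensorflow), which the source does not confirm; the function name check_environment is hypothetical.

```python
import importlib.util
import sys

# Assumption: the importable module names match the wheel names.
ASSUMED_MODULES = ["cerebras_appliance", "cerebras_tensorflow"]


def check_environment(modules=ASSUMED_MODULES, version_info=sys.version_info):
    """Return a list of problems found; an empty list means all checks passed."""
    problems = []
    # The setup above creates the virtual environment with Python 3.7.
    if tuple(version_info[:2]) != (3, 7):
        problems.append(
            "expected Python 3.7, found {}.{}".format(version_info[0], version_info[1])
        )
    for name in modules:
        # find_spec returns None when a top-level module cannot be located.
        if importlib.util.find_spec(name) is None:
            problems.append("module not importable: {}".format(name))
    return problems


if __name__ == "__main__":
    for problem in check_environment():
        print("WARNING:", problem)
```

Run the script inside the activated venv_appliance environment; no output means both checks passed under the stated assumptions.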
    

Running the scripts to compile, train, or evaluate your model

The steps to compile, train, and evaluate your model are as follows.

  1. Activate the Python virtual environment using the following command:

    source venv_appliance/bin/activate
    

In the Model Zoo, you can find run-appliance.py for TensorFlow models supported in weight streaming appliance mode. For example, for the GPT-2 model, navigate to the following directory in the Model Zoo:

    cd modelzoo/transformers/tf/gpt2

  2. Within this directory, run the following command, which performs the initial stage of compilation and reports whether your model can be lowered.

    python run-appliance.py --params params.yaml --num_csx=1 --model_dir model_dir --validate_only --mode train --credentials_path=<path to tls certificate> --mount_dirs <paths to data> --python_paths <paths to modelzoo and other python code if used>
    

This validation step should be considerably faster than a full compile.

  3. Run the full compile. The artifacts from this run are used in the training run.

    python run-appliance.py --params params.yaml --num_csx=1 --model_dir model_dir --compile_only --mode train --credentials_path=<path to tls certificate> --mount_dirs <paths to data> --python_paths <paths to modelzoo and other python code if used>
    

The full compile takes longer: from 30 minutes to two hours, depending on the size and complexity of the model.

  4. Run the training job.

  • If you are running one CS-2, enter the following:

    python run-appliance.py --params params.yaml --num_csx=1 --model_dir=model_dir --num_steps=10 --mode train --credentials_path=<path to tls certificate> --mount_dirs <paths to data> --python_paths <paths to modelzoo and other python code if used>
    
  • If you are running multiple CS-2s, enter the following. The --num_csx=2 flag sets the number of CS-2 systems to use; in this case, you are running on two.

    python run-appliance.py --params params.yaml --num_csx=2 --model_dir=model_dir --num_steps=10 --mode train --credentials_path=<path to tls certificate> --mount_dirs <paths to data> --python_paths <paths to modelzoo and other python code if used>
    

The output log is as follows:

INFO     root:start_utils.py:519 # 1. Start Coordinator on separate process
INFO     root:start_utils.py:534 # 2. Begin Run
INFO     root:start_utils.py:545 # 3. Start Workers on separate processes
INFO     root:start_utils.py:554 # 4. Start Chief on separate processes
INFO     root:start_utils.py:564 # 5. Start WS Runtime servers (i.e. ws-srv) on separate processes
INFO     root:cs_estimator_app.py:274 Loaded global step 0
INFO     root:cs_estimator_app.py:817 Output activation tensors: ['truediv_3_1']
INFO     root:cluster_client.py:217 Initiating a new compile wsjob against the cluster server.
INFO     root:cluster_client.py:220 Compile job initiated
INFO     root:appliance_manager.py:135 Creating a framework GRPC client: localhost:50065, None,
INFO     root:appliance_manager.py:359 Compile successfully written to cache directory: cs_10097974384330522877
INFO     root:cluster_client.py:243 Initiating a new execute wsjob against the cluster server.
INFO     root:cluster_client.py:246 Execute job initiated
INFO     root:appliance_manager.py:149 Removing a framework GRPC client
INFO     root:cs_estimator_app.py:940 final generation of weights: 9
INFO     cerebras_appliance.appliance_client:appliance_client.py:435 Input fn serialized: 80036374657374732e77732e6d696c6573746f6e655f6d6f64656c732e74662e646174610a746f795f696e7075745f666e0a71002e
INFO     root:appliance_manager.py:135 Creating a framework GRPC client: localhost:50066, None,
INFO     root:appliance_manager.py:282 About to send initial weights
INFO     root:tf_appliance_manager.py:85 Dropping tensor: 'good_steps'
INFO     root:appliance_manager.py:284 Finished sending initial weights
INFO     root:cs_estimator_app.py:482 global step 2: loss = 0.0 (0.37 steps/sec)
INFO     root:cs_estimator_app.py:482 global step 4: loss = 0.0 (0.74 steps/sec)
INFO     root:cs_estimator_app.py:388 Taking checkpoint at step: 5
INFO     root:cs_estimator_app.py:437 saving last set of weights: 9
INFO     root:cs_estimator_app.py:482 global step 6: loss = 0.0 (1.06 steps/sec)
INFO     root:cs_estimator_app.py:482 global step 8: loss = 0.0 (1.41 steps/sec)
INFO     root:cs_estimator_app.py:388 Taking checkpoint at step: 10
INFO     root:cs_estimator_app.py:391 Taking final checkpoint
INFO     root:cs_estimator_app.py:437 saving last set of weights: 9
INFO     root:cs_estimator_app.py:482 global step 10: loss = 0.0 (1.69 steps/sec)
INFO     root:cs_estimator_app.py:489 Training complete. Completed 640 sample(s) in 5.9104249477386475 seconds
INFO     root:start_utils.py:587 Wait for server completion
INFO     root:start_utils.py:599 Servers Completed
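
The "global step" lines in the log above carry the step number, loss, and throughput, so they can be extracted for quick monitoring. The following is a minimal sketch that assumes the log format is exactly as shown in this example output; the format may differ between releases, and the name parse_progress is hypothetical.

```python
import re

# Matches lines such as:
#   "... global step 2: loss = 0.0 (0.37 steps/sec)"
# It does not match "Loaded global step 0", which has no loss field.
STEP_RE = re.compile(
    r"global step (?P<step>\d+): loss = (?P<loss>\S+) \((?P<rate>[\d.]+) steps/sec\)"
)


def parse_progress(lines):
    """Return a list of (step, loss, steps_per_sec) tuples found in log lines."""
    records = []
    for line in lines:
        match = STEP_RE.search(line)
        if match:
            records.append(
                (
                    int(match.group("step")),
                    float(match.group("loss")),
                    float(match.group("rate")),
                )
            )
    return records


if __name__ == "__main__":
    import sys

    # Usage: pipe or redirect a saved log into this script.
    for step, loss, rate in parse_progress(sys.stdin):
        print("step={} loss={} steps/sec={}".format(step, loss, rate))
```
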

  5. To run an eval job, run the following command:

    python run-appliance.py --params params.yaml --num_csx=1 --model_dir=model_dir --mode eval --eval_steps=10 --credentials_path=<path to tls certificate> --mount_dirs <paths to data> --python_paths <paths to modelzoo and other python code if used>
    

Note

Cerebras only supports one CS-2 for eval mode.
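
The validate, compile, train, and eval invocations above share most of their flags, so a small wrapper can reduce copy-and-paste errors. The following is a minimal sketch that uses only the flags shown in this section. The function build_command is hypothetical, and it assumes that multiple --mount_dirs and --python_paths values are passed as space-separated arguments, which the source does not state explicitly.

```python
import shlex


def build_command(mode, credentials_path, mount_dirs, python_paths,
                  params="params.yaml", model_dir="model_dir", num_csx=1,
                  stage=None, num_steps=None, eval_steps=None):
    """Build the run-appliance.py argument list for one of the steps above."""
    if mode not in ("train", "eval"):
        raise ValueError("mode must be 'train' or 'eval'")
    if mode == "eval" and num_csx != 1:
        # The note above states that eval mode supports only one CS-2.
        raise ValueError("eval mode supports only one CS-2")
    if stage not in (None, "validate_only", "compile_only"):
        raise ValueError("stage must be None, 'validate_only', or 'compile_only'")

    cmd = [
        "python", "run-appliance.py",
        "--params", params,
        "--num_csx={}".format(num_csx),
        "--model_dir={}".format(model_dir),
    ]
    if stage is not None:
        cmd.append("--" + stage)
    if num_steps is not None:
        cmd.append("--num_steps={}".format(num_steps))
    if eval_steps is not None:
        cmd.append("--eval_steps={}".format(eval_steps))
    cmd += ["--mode", mode, "--credentials_path={}".format(credentials_path)]
    # Assumption: several paths are given as separate space-separated arguments.
    cmd += ["--mount_dirs"] + list(mount_dirs)
    cmd += ["--python_paths"] + list(python_paths)
    return cmd


if __name__ == "__main__":
    # The paths below are placeholders; substitute the values from your sysadmin.
    example = build_command("train", "/path/to/tls.crt", ["/path/to/data"],
                            ["/path/to/modelzoo"], num_csx=2, num_steps=10)
    print(" ".join(shlex.quote(arg) for arg in example))
```

The returned list can be passed directly to subprocess.run from inside the Model Zoo model directory; printing it first makes it easy to compare against the commands in this section.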

Contents of run-appliance.py

For your reference, the contents of run-appliance.py are available in the Cerebras Model Zoo.

Output files and artifacts

The model directory (model_dir) contains all the results and artifacts of the latest run, including:

  • Checkpoints

  • TensorBoard event files

  • YAML files

Checkpoints

Checkpoints are stored in <model_dir>/model-ckpt*.
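
To see which checkpoint files a run produced, you can list everything that matches the pattern above. The following is a minimal sketch; the function name list_checkpoints is hypothetical, and ordering by modification time is an assumption made here because the source does not describe the suffix after model-ckpt. A single checkpoint may consist of several files.

```python
import glob
import os


def list_checkpoints(model_dir):
    """Return paths matching <model_dir>/model-ckpt*, oldest first by mtime."""
    paths = glob.glob(os.path.join(model_dir, "model-ckpt*"))
    # Sort by modification time so the most recently written file is last.
    return sorted(paths, key=os.path.getmtime)


if __name__ == "__main__":
    for path in list_checkpoints("model_dir"):
        print(path)
```
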

TensorBoard event files

TensorBoard event files are stored in the <model_dir> directory.