Model Studio Launchpad quickstart#

On the Cerebras Model Studio Launchpad, you can train or fine-tune your model. You can also run validation on your training and fine-tuning results.

Once you have access to the server and have prepared your data, you set up and run your experiments through the Launchpad shell. You will be in the shell when you ssh into the node.

Welcome to Launchpad. Type help or ? to list commands.

>

Get around the CLI#

help#

help shows all available commands. Alternatively, you can use ? instead of help.

> help

Documented commands (type help <topic>):
========================================
add  eval  exit  experiment  help  history  list  run  status  view

> ?

Documented commands (type help <topic>):
========================================
add  eval  exit  experiment  help  history  list  run  status  view

You can also run help <cmd> to learn more about what <cmd> does. For example, to learn more about what help itself does, run help help:

> help help
List available commands with "help" or detailed help with "help cmd".

Beyond the help command, every command has a -h option that shows the details of the arguments it accepts.

exit#

exit leaves the Launchpad shell and logs you out of the node.

> exit -h
usage: exit [-h]

Exits the shell

Note

You can exit the shell while an experiment is running. The experiment continues to run even after you exit, and your user history will be available when you log back into the shell.

history#

history tracks all the commands you have issued, similar to the history command in any standard shell.

> history -h
usage: history [-h] [--jobs]

Print history of commands/runs launched on Wafer-Scale Cluster

optional arguments:
  -h, --help  show this help message and exit
  --jobs      List all completed jobs.

As an example, the output after running the commands ?, list, exit, help, and experiment add is:

> history
0: ?
1: list
2: exit
3: help
4: experiment add

Commands are listed from earliest to latest, limited to the last 1000 commands run by the user.

history also provides a list of completed jobs (experiments) via the --jobs flag.

> history --jobs
+----------+-------------+------------+------------+----------------------------+----------------------------+--------------------+
| Job ID   | Job State   | Job Mode   | Model      | Start Time                 | End Time                   | Latest Status      |
+==========+=============+============+============+============================+============================+====================+
| 1f42de95 | Succeeded   | train      | GPT-3 1.3B | 2023-03-29 00:17:55.813985 | 2023-03-29 00:31:46.179743 | Training succeeded |
+----------+-------------+------------+------------+----------------------------+----------------------------+--------------------+
| a4399488 | Succeeded   | train      | GPT-3 1.3B | 2023-03-29 08:09:51.950483 | 2023-03-29 08:49:28.399129 | Training succeeded |
+----------+-------------+------------+------------+----------------------------+----------------------------+--------------------+
| da64ff8b | Succeeded   | train      | GPT-3 1.3B | 2023-03-29 08:30:03.181184 | 2023-03-29 13:37:08.164565 | Training succeeded |
+----------+-------------+------------+------------+----------------------------+----------------------------+--------------------+

Explore model#

list#

list prints the selected model, the available datasets and checkpoints, and the hyperparameters you can set in your experiments.

> list -h
usage: list [-h]

Lists selected model, corresponding checkpoints, available datasets, and
parameters that can be set.

optional arguments:
  -h, --help  show this help message and exit

For example, this is the output for the GPT-3 1.3B model:

> list
Model name: GPT-3-1.3B

Available datasets:
    - pile_2048

Available checkpoints:
    - ID: 4, Timestamp: 2023-03-29 00:12:25.398312, global_step: 0
    - ID: 5, Timestamp: 2023-03-29 00:31:36.099650, global_step: 100
    - ID: 9, Timestamp: 2023-03-29 13:36:47.741818, global_step: 10100

Hyperparameters:
    model:
        dropout_rate:
            constraints:
            - Number must be in range [0.0, 1.0]
            - Type of the value must be one of [float]
            default: 0.0
            description: Dropout rate to use.
            required: false
    optimizer: Refer to documentation for a full list of available optimizers and learning
        rate schedulers.
    runconfig:
        checkpoint:
            constraints:
            - Type of the value must be one of [int]
            description: The checkpoint ID to load initial weights from.
            required: true
        checkpoint_steps:
            constraints:
            - Number must be in range [1, inf]
            - Type of the value must be one of [int]
            description: Step frequency at which to take checkpoints. Last checkpoint will
                always be taken.
            required: true
        log_steps:
            constraints:
            - Number must be in range [1, inf]
            - Type of the value must be one of [int]
            description: Step frequency at which to log results to stdout and TensorBoard.
            required: true
        num_steps:
            constraints:
            - Number must be in range [1, inf]
            - Type of the value must be one of [int]
            description: Total number of steps (i.e., batches) to run.
            required: false
        seed:
            constraints:
            - Type of the value must be one of [int]
            description: Set the seed for random number generators.
            required: false
    train_input:
        batch_size:
            constraints:
            - Number must be in range [1, inf]
            - Type of the value must be one of [int]
            description: Batch size to train with.
            required: true
        dataset:
            constraints:
            - Type of the value must be one of [str]
            description: Name of the dataset to use for creating samples.
            required: true
        shuffle:
            constraints:
            - Type of the value must be one of [bool]
            description: Whether to shuffle input data.
            required: true

view#

view gives you access to another shell where you can browse the PyTorch model and dataloader code, from Cerebras Modelzoo, that is used for running the experiments.

> view -h
usage: view [-h]

View modelzoo to examine model and dataloader code

optional arguments:
  -h, --help  show this help message and exit

This shell is essentially the same as the bash shell you are familiar with on a terminal. To return to the Launchpad shell, type exit.
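
For instance, a hedged sketch of a session (the bash prompt, the path, and the commands are illustrative; output is omitted):

> view
$ ls models/    # browse the Modelzoo model and dataloader source (path illustrative)
$ exit          # return to the Launchpad shell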

Add datasets#

add#

add registers datasets with the shell so they can be used for experiments. It assumes the data directories are already structured correctly. For more information on how to prepare the data, see Creating HDF5 dataset for GPT Models.

> add -h
usage: add [-h] {dataset} ...

Add a new item to the database.

positional arguments:
  {dataset}
    dataset   Add a new dataset to registry of available datasets.

optional arguments:
  -h, --help  show this help message and exit

> add dataset -h
usage: add dataset [-h] --name NAME --paths PATHS [PATHS ...]

Add a new dataset to registry of available datasets.

optional arguments:
  -h, --help            show this help message and exit
  --name NAME           Unique name of the dataset
  --paths PATHS [PATHS ...]
                        List of data directories for this dataset.
>

Here’s an example of adding a dataset:

> add dataset --name owt_dataset --paths /datasets/language/owt_512k_msl2k/
Verifying dataset ...

  0%|          | 0/8 [00:00<?, ?it/s]
 12%|█▎        | 1/8 [00:06<00:48,  6.97s/it]
 75%|███████▌  | 6/8 [00:07<00:01,  1.12it/s]
100%|██████████| 8/8 [00:07<00:00,  1.11it/s]

Dataset `owt_dataset` was successfully added

Set up experiment#

experiment#

experiment lets you manage experiments for the selected model. You can add experiments by changing the parameters shown by the list command, view queued experiments, and remove experiments.

> experiment -h
usage: experiment [-h] {add,view,delete} ...

Manage experiments to run.

positional arguments:
  {add,view,delete}
    add              Add a new experiment
    view             View existing experiments
    delete           Delete an experiment

optional arguments:
  -h, --help         show this help message and exit

To parameterize your experiment, run experiment add. This opens the parameters in a vim editor, where you set the model parameters of your choice (the options shown by list), including:

  • Dataset

  • Number of training steps

  • Checkpoint to start training from. If no checkpoint is set, training starts from scratch.

> experiment add -h
usage: experiment add [-h] [--base_index BASE_INDEX | --default]

optional arguments:
  -h, --help            show this help message and exit
  --base_index BASE_INDEX
                        Base the new experiment's spec off of an existing
                        experiment. If not provided, the last spec created (or
                        default if no spec was created) will be used as the
                        base.
  --default             Base the new experiment's spec off of the default
                        values. If not provided, the last spec created (or
                        default if no spec was created) will be used as the
                        base.

For example, when working with the GPT-3 1.3B model, executing experiment add opens the vim editor with:

model:
    dropout_rate: 0.0
optimizer:
    amsgrad: false
    beta1: 0.9
    beta2: 0.95
    correct_bias: true
    eps: 1.0e-08
    learning_rate:
    -   cycle: false
        decay_steps: 1500
        end_learning_rate: 0.0001
        initial_learning_rate: 0.0
        scheduler: Linear
    -   decay_steps: 1050000
        end_learning_rate: 1.0e-05
        initial_learning_rate: 0.0001
        scheduler: CosineDecay
    -   learning_rate: 1.0e-05
        scheduler: Constant
    max_gradient_norm: 1.0
    optimizer_type: AdamW
    weight_decay_rate: 0.1
runconfig:
    checkpoint: 5
    checkpoint_steps: 5000
    log_steps: 500
    num_steps: 10000
    seed: 1
train_input:
    batch_size: 121
    dataset: pile_2048
    max_sequence_length: 2048
    shuffle: true
    vocab_size: 50527

#
# Refer to the schema below for a list of available options
#
# Model name: GPT-3-1.3B
#
# Available datasets:
#     - pile_2048
#
# Available checkpoints:
#     - ID: 4, Timestamp: 2023-03-29 00:12:25.398312, global_step: 0

To view queued experiments, use experiment view:

> experiment view -h
usage: experiment view [-h] [--index INDEX]

optional arguments:
  -h, --help     show this help message and exit
  --index INDEX  Index of the experiment to view. If not provided, shows all.

For example, the experiment added above looks like this:

> experiment view
Experiment 0
-------------------------------------------------------
model:
    dropout_rate: 0.0
optimizer:
    amsgrad: false
    beta1: 0.9
    beta2: 0.95
    correct_bias: true
    eps: 1.0e-08
    learning_rate:
    -   cycle: false
        decay_steps: 1500
        end_learning_rate: 0.0001
        initial_learning_rate: 0.0
        scheduler: Linear
    -   decay_steps: 1050000
        end_learning_rate: 1.0e-05
        initial_learning_rate: 0.0001
        scheduler: CosineDecay
    -   learning_rate: 1.0e-05
        scheduler: Constant
    max_gradient_norm: 1.0
    optimizer_type: AdamW
    weight_decay_rate: 0.1
runconfig:
    checkpoint: 5
    checkpoint_steps: 5000
    log_steps: 500
    num_steps: 10000
    seed: 1
train_input:
    batch_size: 121
    dataset: pile_2048
    max_sequence_length: 2048
    shuffle: true
    vocab_size: 50527

To delete experiments, use experiment delete:

> experiment delete -h
usage: experiment delete [-h] --index INDEX

optional arguments:
  -h, --help     show this help message and exit
  --index INDEX  The experiment index to delete.

Deleting the experiment added above looks like:

> experiment delete --index 0

>

Launch training job#

run#

Use run to launch training experiments. You can either:

  • launch the last added experiment, by not specifying any arguments, or

  • launch the last n experiments you added, using the -n flag.

> run -h
usage: run [-h] [-n N]

optional arguments:
  -h, --help  show this help message and exit
  -n N        Run the last `n` experiments. If not provided, it will run the
              last available experiment.

An example of running a job:

> run
Created training job 4634442e for experiment 0
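
If you have queued multiple experiments, you can launch several at once with -n. A hedged sketch, extrapolating the output format above (the job IDs are placeholders, not real output):

> run -n 2
Created training job <job_id> for experiment 0
Created training job <job_id> for experiment 1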

Launch validation job#

eval#

Use eval to run validation jobs on the checkpoints generated during your experiments. There are two ways to specify which checkpoints to run eval on:

  • Specify the dataset and checkpoint IDs using the --dataset and --checkpoints flags.

  • Specify the dataset and training job ID using the --dataset and --train_job flags. Eval runs on every checkpoint generated by that training job.

The job runs on one epoch of the dataset per checkpoint.

> eval -h
usage: eval [-h] --dataset DATASET
            (--checkpoints CHECKPOINTS [CHECKPOINTS ...] | --train_job TRAIN_JOB)

Run evaluation on various checkpoints.

optional arguments:
  -h, --help            show this help message and exit
  --dataset DATASET     Name of the dataset to run evaluation on.
  --checkpoints CHECKPOINTS [CHECKPOINTS ...]
                        List of checkpoint IDs to run evaluation on.
  --train_job TRAIN_JOB
                        Run evaluation on all checkpoints generated from this
                        job.
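
For example, you could evaluate specific checkpoints, or every checkpoint from a training job. A hedged sketch (the checkpoint IDs and job ID are reused from the earlier examples for illustration; output omitted):

> eval --dataset pile_2048 --checkpoints 5 9

> eval --dataset pile_2048 --train_job 4634442e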

Monitor your experiments#

status#

To view the status of a job, or to observe the results either in the terminal or on TensorBoard, use the status command. With status you can:

  • see which experiments are queued and running.

  • cancel queued and running jobs.

  • obtain a TensorBoard link to see the experiment results.

  • obtain information on your reservation, such as the number of tokens that have been used.

> status -h
usage: status [-h] {view,cancel} ...

Prints the status of in-progress jobs

positional arguments:
  {view,cancel}
    view         View job info
    cancel       Cancel a job

optional arguments:
  -h, --help     show this help message and exit

status view also gives further information for a specific experiment, such as:

  • The parameters used.

  • The losses and summaries.

  • The checkpoints generated.

  • The performance seen.

> status view -h
usage: status view [-h] --job_id JOB_ID
                   [--summaries | --hyperparams | --checkpoints]

optional arguments:
  -h, --help       show this help message and exit
  --job_id JOB_ID  Filter by the given job id
  --summaries      Display loss summaries
  --hyperparams    Display hyperparameters used in this job
  --checkpoints    Display all checkpoints collected from this run
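
For example, a hedged sketch of inspecting the loss summaries for the job launched earlier (the job ID is reused from the run example for illustration; output omitted):

> status view --job_id 4634442e --summaries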

status cancel cancels a specific job given its job ID:

> status cancel -h
usage: status cancel [-h] --job_id JOB_ID

optional arguments:
  -h, --help       show this help message and exit
  --job_id JOB_ID  Filter by the given job id
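
For example, a hedged sketch of cancelling that same job (job ID illustrative):

> status cancel --job_id 4634442e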

Note

To obtain the history of completed jobs, use history --jobs.