Model Studio Launchpad quickstart#
On the Cerebras Model Studio Launchpad you can train or fine-tune your model, and you can run validation on your training and fine-tuning results as well.
Once you have access to the server and have prepared your data, you set up and run your experiments through the Launchpad shell. You enter the shell when you SSH into the node.
Welcome to Launchpad. Type help or ? to list commands.
>
Get around the CLI#
help#
help will show all available commands. Alternatively, you can use ? instead of help.
> help
Documented commands (type help <topic>):
========================================
add eval exit experiment help history list run status view
> ?
Documented commands (type help <topic>):
========================================
add eval exit experiment help history list run status view
You can also do help <cmd> to learn more about what <cmd> does. For example, to learn more about what help does, you can do help help:
> help help
List available commands with "help" or detailed help with "help cmd".
Beyond the help command, all the commands have a -h option which shows the details of the arguments they accept.
exit#
exit is used to leave the Launchpad shell and will log the user out of their node.
> exit -h
usage: exit [-h]
Exits the shell
Note
You can exit the shell while an experiment is running; the experiment will continue to run after you exit. Your command history will be available when you re-enter the shell.
history#
history allows you to track all the commands you have issued, much as in any standard shell.
> history -h
usage: history [-h] [--jobs]
Print history of commands/runs launched on Wafer-Scale Cluster
optional arguments:
-h, --help show this help message and exit
--jobs List all completed jobs.
As an example, the output after running the commands ?, list, exit, help, and experiment add is:
> history
0: ?
1: list
2: exit
3: help
4: experiment add
The commands are listed from earliest to latest. The history is limited to the last 1000 commands run by the user.
history also provides a list of completed jobs (experiments) with the --jobs flag.
> history --jobs
+----------+-------------+------------+------------+----------------------------+----------------------------+--------------------+
| Job ID | Job State | Job Mode | Model | Start Time | End Time | Latest Status |
+==========+=============+============+============+============================+============================+====================+
| 1f42de95 | Succeeded | train | GPT-3 1.3B | 2023-03-29 00:17:55.813985 | 2023-03-29 00:31:46.179743 | Training succeeded |
+----------+-------------+------------+------------+----------------------------+----------------------------+--------------------+
| a4399488 | Succeeded | train | GPT-3 1.3B | 2023-03-29 08:09:51.950483 | 2023-03-29 08:49:28.399129 | Training succeeded |
+----------+-------------+------------+------------+----------------------------+----------------------------+--------------------+
| da64ff8b | Succeeded | train | GPT-3 1.3B | 2023-03-29 08:30:03.181184 | 2023-03-29 13:37:08.164565 | Training succeeded |
+----------+-------------+------------+------------+----------------------------+----------------------------+--------------------+
Explore model#
list#
list prints out the selected model, the available datasets and checkpoints, and the hyperparameters that the user can set in their experiments.
> list -h
usage: list [-h]
Lists selected model, corresponding checkpoints, available datasets, and
parameters that can be set.
optional arguments:
-h, --help show this help message and exit
For example, this is the output for the GPT-3 1.3B model:
> list
Model name: GPT-3-1.3B
Available datasets:
- pile_2048
Available checkpoints:
- ID: 4, Timestamp: 2023-03-29 00:12:25.398312, global_step: 0
- ID: 5, Timestamp: 2023-03-29 00:31:36.099650, global_step: 100
- ID: 9, Timestamp: 2023-03-29 13:36:47.741818, global_step: 10100
Hyperparameters:
  model:
    dropout_rate:
      constraints:
      - Number must be in range [0.0, 1.0]
      - Type of the value must be one of [float]
      default: 0.0
      description: Dropout rate to use.
      required: false
  optimizer: Refer to documentation for a full list of available optimizers and learning
    rate schedulers.
  runconfig:
    checkpoint:
      constraints:
      - Type of the value must be one of [int]
      description: The checkpoint ID to load initial weights from.
      required: true
    checkpoint_steps:
      constraints:
      - Number must be in range [1, inf]
      - Type of the value must be one of [int]
      description: Step frequency at which to take checkpoints. Last checkpoint will
        always be taken.
      required: true
    log_steps:
      constraints:
      - Number must be in range [1, inf]
      - Type of the value must be one of [int]
      description: Step frequency at which to log results to stdout and TensorBoard.
      required: true
    num_steps:
      constraints:
      - Number must be in range [1, inf]
      - Type of the value must be one of [int]
      description: Total number of steps (i.e., batches) to run.
      required: false
    seed:
      constraints:
      - Type of the value must be one of [int]
      description: Set the seed for random number generators.
      required: false
  train_input:
    batch_size:
      constraints:
      - Number must be in range [1, inf]
      - Type of the value must be one of [int]
      description: Batch size to train with.
      required: true
    dataset:
      constraints:
      - Type of the value must be one of [str]
      description: Name of the dataset to use for creating samples.
      required: true
    shuffle:
      constraints:
      - Type of the value must be one of [bool]
      description: Whether to shuffle input data.
      required: true
view#
view gives you access to another shell where you can browse the PyTorch model and dataloader code - from the Cerebras Modelzoo - that is used for running the experiments.
> view -h
usage: view [-h]
View modelzoo to examine model and dataloader code
optional arguments:
-h, --help show this help message and exit
The shell is essentially the same as the bash shell you are familiar with on a terminal. To return to the Launchpad shell, type exit.
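For example, a brief session might look like the following, using standard bash commands to browse the code (the inner prompt and commands shown here are illustrative):
> view
$ ls
$ exit
>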
Add datasets#
add#
add is used to add datasets to the shell so they can be used for experiments. It assumes the data directories are already structured correctly. For more information on how to prepare the data, see Creating HDF5 dataset for GPT Models.
> add -h
usage: add [-h] {dataset} ...
Add a new item to the database.
positional arguments:
{dataset}
dataset Add a new dataset to registry of available datasets.
optional arguments:
-h, --help show this help message and exit
> add dataset -h
usage: add dataset [-h] --name NAME --paths PATHS [PATHS ...]
Add a new dataset to registry of available datasets.
optional arguments:
-h, --help show this help message and exit
--name NAME Unique name of the dataset
--paths PATHS [PATHS ...]
List of data directories for this dataset.
>
Here’s an example of adding a dataset:
> add dataset --name owt_dataset --paths /datasets/language/owt_512k_msl2k/
Verifying dataset ...
0%| | 0/8 [00:00<?, ?it/s]
12%|█▎ | 1/8 [00:06<00:48, 6.97s/it]
75%|███████▌ | 6/8 [00:07<00:01, 1.12it/s]
100%|██████████| 8/8 [00:07<00:00, 1.11it/s]
Dataset `owt_dataset` was successfully added
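Because --paths accepts a list, you can also register several data directories under one dataset name; the name and paths below are illustrative:
> add dataset --name combined_dataset --paths /datasets/language/part_a/ /datasets/language/part_b/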
Setup experiment#
experiment#
experiment allows you to manage your experiments for the selected model. You can add experiments by changing the parameters specified in the list command. You can also view queued experiments and delete them.
> experiment -h
usage: experiment [-h] {add,view,delete} ...
Manage experiments to run.
positional arguments:
{add,view,delete}
add Add a new experiment
view View existing experiments
delete Delete an experiment
optional arguments:
-h, --help show this help message and exit
To parametrize your experiment, run experiment add. This opens the parameters in a vim editor, in which you set the model parameters of your choice (the options shown by list), including:
* Dataset
* Number of training steps
* Checkpoint to start training from. If the checkpoint is not set, training will start from scratch.
> experiment add -h
usage: experiment add [-h] [--base_index BASE_INDEX | --default]
optional arguments:
-h, --help show this help message and exit
--base_index BASE_INDEX
Base the new experiment's spec off of an existing experiment. If not provided, the last spec created (or the default if no spec was created) will be used as the base.
--default Base the new experiment's spec off of the default values. If not provided, the last spec created (or the default if no spec was created) will be used as the base.
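For instance, to base a new spec off of experiment 0, or to start from the default values (the index here is illustrative):
> experiment add --base_index 0
> experiment add --default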
For example, when working with the GPT-3 1.3B model, executing experiment add will open the vim editor with:
model:
  dropout_rate: 0.0
optimizer:
  amsgrad: false
  beta1: 0.9
  beta2: 0.95
  correct_bias: true
  eps: 1.0e-08
  learning_rate:
  - cycle: false
    decay_steps: 1500
    end_learning_rate: 0.0001
    initial_learning_rate: 0.0
    scheduler: Linear
  - decay_steps: 1050000
    end_learning_rate: 1.0e-05
    initial_learning_rate: 0.0001
    scheduler: CosineDecay
  - learning_rate: 1.0e-05
    scheduler: Constant
  max_gradient_norm: 1.0
  optimizer_type: AdamW
  weight_decay_rate: 0.1
runconfig:
  checkpoint: 5
  checkpoint_steps: 5000
  log_steps: 500
  num_steps: 10000
  seed: 1
train_input:
  batch_size: 121
  dataset: pile_2048
  max_sequence_length: 2048
  shuffle: true
  vocab_size: 50527
#
# Refer to the schema below for a list of available options
#
# Model name: GPT-3-1.3B
#
# Available datasets:
# - pile_2048
#
# Available checkpoints:
# - ID: 4, Timestamp: 2023-03-29 00:12:25.398312, global_step: 0
To view queued experiments, use experiment view:
> experiment view -h
usage: experiment view [-h] [--index INDEX]
optional arguments:
-h, --help show this help message and exit
--index INDEX Index of the experiment to view. If not provided, shows all.
For example, the experiment added above looks as follows:
> experiment view
Experiment 0
-------------------------------------------------------
model:
  dropout_rate: 0.0
optimizer:
  amsgrad: false
  beta1: 0.9
  beta2: 0.95
  correct_bias: true
  eps: 1.0e-08
  learning_rate:
  - cycle: false
    decay_steps: 1500
    end_learning_rate: 0.0001
    initial_learning_rate: 0.0
    scheduler: Linear
  - decay_steps: 1050000
    end_learning_rate: 1.0e-05
    initial_learning_rate: 0.0001
    scheduler: CosineDecay
  - learning_rate: 1.0e-05
    scheduler: Constant
  max_gradient_norm: 1.0
  optimizer_type: AdamW
  weight_decay_rate: 0.1
runconfig:
  checkpoint: 5
  checkpoint_steps: 5000
  log_steps: 500
  num_steps: 10000
  seed: 1
train_input:
  batch_size: 121
  dataset: pile_2048
  max_sequence_length: 2048
  shuffle: true
  vocab_size: 50527
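To inspect a single experiment, pass its index. For the experiment above:
> experiment view --index 0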
To delete experiments, use experiment delete:
> experiment delete -h
usage: experiment delete [-h] --index INDEX
optional arguments:
-h, --help show this help message and exit
--index INDEX The experiment index to delete.
Deleting the experiment added above looks like:
> experiment delete --index 0
>
Launch training job#
run#
Use run to launch training experiments. You can either:
* launch the last added experiment, by not specifying any arguments, or
* launch the last n experiments you added, using the -n flag.
> run -h
usage: run [-h] [-n N]
optional arguments:
-h, --help show this help message and exit
-n N Run the last `n` experiments. If not provided, it will run the
last available experiment.
An example of running a job would look like:
> run
Created training job 4634442e for experiment 0
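To launch the two most recently added experiments instead (the count here is illustrative):
> run -n 2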
Launch validation job#
eval#
Use eval to run validation jobs on top of the checkpoints generated during your experiments.
There are two ways to specify which checkpoints to run eval on:
* Specify the dataset and checkpoint IDs using the --dataset and --checkpoints flags.
* Specify the dataset and a job ID using the --dataset and --train_job flags. eval will run on every checkpoint generated by that train job.
The job will run on 1 epoch of the dataset per checkpoint.
> eval -h
usage: eval [-h] --dataset DATASET
(--checkpoints CHECKPOINTS [CHECKPOINTS ...] | --train_job TRAIN_JOB)
Run evaluation on various checkpoints.
optional arguments:
-h, --help show this help message and exit
--dataset DATASET Name of the dataset to run evaluation on.
--checkpoints CHECKPOINTS [CHECKPOINTS ...]
List of checkpoint IDs to run evaluation on.
--train_job TRAIN_JOB
Run evaluation on all checkpoints generated from this
job.
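For example, using the dataset and checkpoint IDs shown earlier in this guide, the two styles look like this:
> eval --dataset pile_2048 --checkpoints 4 5
> eval --dataset pile_2048 --train_job 1f42de95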
Monitor your experiments#
status#
To view the status of a job, or to observe the results either on the terminal or on TensorBoard, use the status command. With status you can:
* see which experiments are queued and running.
* cancel queued and running jobs.
* obtain a TensorBoard link to see the experiment results.
* obtain information on your reservation, such as the number of tokens that have been used.
> status -h
usage: status [-h] {view,cancel} ...
Prints the status of in-progress jobs
positional arguments:
{view,cancel}
view View job info
cancel Cancel a job
optional arguments:
-h, --help show this help message and exit
status view also gives further information for a specific experiment, such as:
* The parameters used.
* The losses and summaries.
* The checkpoints generated.
* The performance seen.
> status view -h
usage: status view [-h] --job_id JOB_ID
[--summaries | --hyperparams | --checkpoints]
optional arguments:
-h, --help show this help message and exit
--job_id JOB_ID Filter by the given job id
--summaries Display loss summaries
--hyperparams Display hyperparameters used in this job
--checkpoints Display all checkpoints collected from this run
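For example, to display the loss summaries for the training job launched earlier:
> status view --job_id 4634442e --summaries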
status cancel cancels a specific job given its job ID:
> status cancel -h
usage: status cancel [-h] --job_id JOB_ID
optional arguments:
-h, --help show this help message and exit
--job_id JOB_ID Filter by the given job id
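For example, to cancel the training job launched earlier:
> status cancel --job_id 4634442e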
Note
To obtain the history of completed jobs, use history --jobs.