Evaluate your model during training#
The run.py script in the Cerebras Model Zoo can be used to evaluate your models (forward pass only). The script provides three types of evaluation, selected with the --mode flag:
eval: Evaluates a specific checkpoint. If you don't provide one, the latest checkpoint is used.
eval_all: Evaluates all the checkpoints inside a model directory once the model has been trained.
train_and_eval: Evaluates a model at a fixed frequency during training. This is convenient for identifying issues early in long training runs.
When evaluating a model with run.py, the latest saved checkpoint is used by default. If no checkpoint exists, the weights are initialized as specified in the YAML file, and the model is evaluated using these weights. To evaluate a previously trained model, make sure that its checkpoints are available in the model_dir, or provide the checkpoint explicitly.
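The checkpoint selection described above can be sketched as follows. This is only an illustration under stated assumptions, not the logic run.py actually uses: the function name resolve_checkpoint, its checkpoint_path argument, and the file naming scheme checkpoint_<step>.mdl are all hypothetical.

```python
import re
from pathlib import Path
from typing import Optional

# Hypothetical naming scheme, assumed for illustration only:
# checkpoints are files named "checkpoint_<step>.mdl" inside model_dir.
CHECKPOINT_PATTERN = re.compile(r"^checkpoint_(\d+)\.mdl$")


def resolve_checkpoint(model_dir: str,
                       checkpoint_path: Optional[str] = None) -> Optional[Path]:
    """Return the checkpoint to evaluate, or None if no checkpoint exists."""
    # An explicitly provided checkpoint takes precedence.
    if checkpoint_path is not None:
        return Path(checkpoint_path)

    # Otherwise collect (step, path) pairs for every checkpoint in model_dir.
    candidates = []
    directory = Path(model_dir)
    if directory.is_dir():
        for entry in directory.iterdir():
            match = CHECKPOINT_PATTERN.match(entry.name)
            if match and entry.is_file():
                candidates.append((int(match.group(1)), entry))

    if not candidates:
        # No checkpoint found: evaluation would fall back to freshly
        # initialized weights, as configured in the YAML file.
        return None

    # The latest checkpoint is the one with the highest step number.
    return max(candidates, key=lambda item: item[0])[1]
```

A return value of None corresponds to the case where the weights are initialized from the YAML configuration before evaluation.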
Eval All Mode#
This feature allows users to run evaluation on multiple model checkpoints within a provided model_dir. This permits users to evaluate models that have already been trained, though train_and_eval (Train and Eval Mode) may be more suitable for evaluating a model throughout a training run.
How to use this feature#
Provide eval_all as the argument for the --mode flag, and specify the directory containing the model checkpoints with the --model_dir flag:
(venv_cerebras_pt) $ python run.py --mode=eval_all --model_dir=<path> ...<rest of the args>
Train and Eval Mode#
This feature allows users to evaluate a model at intervals throughout a long training run. This helps identify issues with the model early, rather than after the training run finishes.
How to use this feature#
Within the runconfig portion of the config yaml:
Either num_epochs or num_steps must be defined (but not both).
If num_epochs is defined, the model trains for num_epochs epochs, with each epoch followed by an evaluation.
If num_steps is defined, the user must also define eval_frequency in the config. num_steps governs the total number of steps the model will train for, and eval_frequency indicates how many steps pass between evaluations.
For example, if num_steps is 100 and eval_frequency is 20, the model will train for 100 steps and will be evaluated after 20, 40, 60, 80, and 100 steps.
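The constraints and the evaluation schedule above can be checked with a small sketch. The helper names validate_runconfig and evaluation_steps are hypothetical and not part of the Model Zoo, the key name num_epochs is assumed, and the behavior when num_steps is not a multiple of eval_frequency (here, no final partial-interval evaluation) is also an assumption.

```python
from typing import Any, Dict, List


def validate_runconfig(runconfig: Dict[str, Any]) -> None:
    """Check the train_and_eval constraints on a runconfig dictionary."""
    has_epochs = runconfig.get("num_epochs") is not None
    has_steps = runconfig.get("num_steps") is not None
    # Exactly one of num_epochs / num_steps must be set.
    if has_epochs == has_steps:
        raise ValueError("Define exactly one of num_epochs or num_steps.")
    # Step-based training also needs an evaluation frequency.
    if has_steps and runconfig.get("eval_frequency") is None:
        raise ValueError("eval_frequency is required when num_steps is defined.")


def evaluation_steps(num_steps: int, eval_frequency: int) -> List[int]:
    """Return the training steps after which an evaluation would run."""
    if eval_frequency <= 0:
        raise ValueError("eval_frequency must be a positive integer.")
    # Every multiple of eval_frequency up to and including num_steps.
    return list(range(eval_frequency, num_steps + 1, eval_frequency))


runconfig = {"num_steps": 100, "eval_frequency": 20}
validate_runconfig(runconfig)
print(evaluation_steps(runconfig["num_steps"], runconfig["eval_frequency"]))
# prints [20, 40, 60, 80, 100]
```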
When running your model, set the --mode flag to train_and_eval:
(venv_cerebras_pt) $ python run.py --mode=train_and_eval --model_dir=<path> --params=<config_path> ...<rest of the args>
The train and eval modes require different fabric programming in the CS-2 system. Therefore, using train_and_eval mode on the Cerebras Wafer-Scale cluster incurs additional overhead each time training is stopped to perform evaluation. When possible, we recommend using eval_all mode instead.