Evaluate your model during training

Eval All Mode

This feature runs evaluation on multiple model checkpoints within a provided model_dir, which lets users evaluate models that have already been trained. To evaluate a model throughout a training run, train_and_eval (see Train and Eval Mode) may be more suitable.

How to use this feature

Pass eval_all as the argument to the --mode flag, and specify the directory containing the model checkpoints with the --model_dir flag.

Example:

python run.py --mode=eval_all --model_dir=<path> ...<rest of the args>

Train and Eval Mode

This feature allows users to evaluate models throughout long training runs. This helps identify issues with a model early, rather than after the training run finishes.

How to use this feature

Within the runconfig section of the config YAML:

Either num_epochs or num_steps must be defined, but not both.

If num_epochs is defined, train_and_eval trains for num_epochs epochs, and each epoch is followed by an evaluation.

If num_steps is defined, the user must also define eval_frequency in the config. num_steps governs the total number of steps the model will train for and eval_frequency indicates how many steps will pass between each evaluation.

For example, if num_steps is 100 and eval_frequency is 20, the model will train for 100 steps and will be evaluated after 20, 40, 60, 80, and 100 steps.
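The rules above can be sketched in a few lines of Python. This is only an illustration of the documented behavior, not the code that run.py uses: the helper names validate_runconfig and eval_steps are hypothetical, and the runconfig is assumed to be available as a plain dictionary. The source does not say what happens when eval_frequency does not evenly divide num_steps, so the sketch lists only the multiples of eval_frequency up to num_steps.

```python
from typing import List


def validate_runconfig(runconfig: dict) -> None:
    """Hypothetical helper: check the num_epochs / num_steps rules described above."""
    has_epochs = runconfig.get("num_epochs") is not None
    has_steps = runconfig.get("num_steps") is not None
    # Exactly one of the two keys must be defined.
    if has_epochs == has_steps:
        raise ValueError("Define exactly one of num_epochs or num_steps.")
    # eval_frequency is required only in the num_steps case.
    if has_steps and runconfig.get("eval_frequency") is None:
        raise ValueError("eval_frequency is required when num_steps is defined.")


def eval_steps(num_steps: int, eval_frequency: int) -> List[int]:
    """Hypothetical helper: step counts after which an evaluation would run."""
    if num_steps <= 0 or eval_frequency <= 0:
        raise ValueError("num_steps and eval_frequency must be positive.")
    # Multiples of eval_frequency, up to and including num_steps.
    return list(range(eval_frequency, num_steps + 1, eval_frequency))


runconfig = {"num_steps": 100, "eval_frequency": 20}
validate_runconfig(runconfig)
print(eval_steps(runconfig["num_steps"], runconfig["eval_frequency"]))
# [20, 40, 60, 80, 100]
```

A runconfig that defines both num_epochs and num_steps, or neither, is rejected by this sketch, mirroring the "but not both" rule.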

When running your model, set the --mode flag to train_and_eval.

Example:

python run.py --mode=train_and_eval --model_dir=<path> --params=<config_path> ...<rest of the args>
