The run.py Template#

Whether you compile on a CPU node using the csrun_cpu script or run on the Cerebras system using the csrun_wse script, you must pass the full Python command, along with its command-line arguments, as an argument to the script.

This section presents an example Python template run.py with a detailed description of the supported options and flags. Note that this applies to the pipeline workflow only, not to weight streaming.

Note

The run.py described below is an example template only. If you are developing in TensorFlow, you can organize your code in whichever way best suits you, as long as you use CerebrasEstimator. If you are developing in PyTorch, you can use the run function that is common to all implementations in the Cerebras Model Zoo git repository. You can find more about the run function in the adapting-pytorch-to-cs document.

For TensorFlow code, see the following diagram, which shows a simplified run.py example using the CerebrasEstimator and the Cerebras Graph Compiler (CGC) flow.

../_images/run-py-cerebrasestimator-cgc.jpg

Fig. 2 Cerebras Graph Compiler for the CS system#

Syntax#

python run.py -p/--params <PATH-TO-YAML-FILE> \
              -o/--model_dir <PATH-TO-MODEL-DIR> \
              --cs_ip <CS-SYSTEM-IP-ADDRESS> \
              --steps <INTEGER> \
              --max_steps <INTEGER> \
              --eval_steps <INTEGER> \
              -m/--mode <MODE-TO-RUN> \
              --validate_only \
              --compile_only \
              --device <DEVICE-TO-RUN-ON> \
              --checkpoint_path <PATH-TO-CHECKPOINT>

where:

Arguments#

params#

  • -p, --params: Required. String. Path to the YAML file that contains the model parameters.

    For example: --params configs/params_bert_base_msl512.yaml.

model_dir#

  • -o, --model_dir: Optional. String. The location where your model and all of its outputs, such as checkpoints and event files, are stored. If the directory exists, the weights are loaded from the checkpoint file in it. This is the same model_dir that is passed to the tf.estimator. Default value is the current directory of execution. See also tf.estimator.Estimator.

cs_ip#

  • --cs_ip: Optional. The IP address of the CS system. Format should be IP-ADDRESS:PORT. Default value is None. This option is ignored on GPU.

steps#

  • --steps: Optional. Integer. The number of steps to run in train mode. Runs for the specified number of steps, whether starting from the beginning or continuing from a checkpoint. Default value is None.

max_steps#

  • --max_steps: Optional. Integer. The total number of steps to run in train mode, or for the training portion of the train_and_eval mode. If the run continues from a checkpoint that was saved at or after max_steps, the run stops immediately. If the run continues from a checkpoint that was saved before max_steps, it runs only for the steps remaining between the checkpoint and max_steps; for example, continuing from a checkpoint saved at step 80,000 with --max_steps 100000 trains for another 20,000 steps. Default value is None.

eval_steps#

  • --eval_steps: Integer. Optional on the GPU, but required when running on the CS system. The total number of steps to run in eval or eval_all mode, or for the evaluation portion of the train_and_eval mode. Runs once for the specified number of steps.

mode#

  • -m, --mode: Required. String. Sets the mode for your neural network. Allowed choices are:

    • train: In this mode the compiler compiles the model and runs training on the CS system. If cs_ip is not specified, training runs on the CPU or GPU instead.

    • eval: In this mode evaluation runs on the CPU or GPU. For some neural network models, this mode is experimentally supported on the CS system.

    • eval_all: In this mode evaluation runs on the CPU or GPU for all available checkpoints. This mode is not yet supported on the CS system.

    • train_and_eval: In this mode training and evaluation both run on the CPU or GPU.

    • predict: In this mode prediction (inference) runs on the CPU or GPU. For some neural network models, this mode is experimentally supported on the CS system.

compilation flags#

  • --validate_only: Optional. In this mode the compiler stops after the kernel matching phase. Available in the Pipelined execution mode for TensorFlow.

  • --compile_only: Optional. In this mode the compiler continues past kernel matching and runs the compilation to completion. If the compilation succeeds, it generates the CS system bitstream. Available in the Pipelined execution mode (see the example below).
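
For example, you can first check kernel matching and then run a full compile on a CPU node, without occupying the CS system. The commands below are a sketch only: they reuse the example YAML file shown earlier in this section, and the compile_dir model directory is a placeholder.

# Stop after the kernel matching phase
csrun_cpu python run.py --params configs/params_bert_base_msl512.yaml \
                        --mode train --validate_only

# Run the full compile and generate the CS system executable
csrun_cpu python run.py --params configs/params_bert_base_msl512.yaml \
                        --mode train --compile_only --model_dir compile_dir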

device#

  • --device: Optional. String. Use this option only to specify a GPU device. The compiler compiles on the CPU and runs on the GPU device specified by this setting.

    For example, --device /gpu:0 runs the model on GPU 0.

checkpoint_path#

  • --checkpoint_path: Optional. String. The weights are initialized from the checkpoint specified with this option. Default value is None. A complete example invocation is shown below this list. If this option is used and the model_dir already contains a checkpoint, then the compiler alerts you to do one of the following:

    • Provide an empty model_dir, so that the weights are initialized from the checkpoint passed to this checkpoint_path option, or

    • Remove the checkpoint_path option, so that the weights are initialized from the latest checkpoint in the model_dir.
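
Putting these options together, a training run on the CS system might look like the following sketch. The IP address, port, step count, and the train_dir and checkpoint paths are placeholders; the full Python command is passed as an argument to the csrun_wse script, as described at the beginning of this section.

csrun_wse python run.py --params configs/params_bert_base_msl512.yaml \
                        --mode train \
                        --cs_ip 10.255.253.0:9000 \
                        --max_steps 100000 \
                        --model_dir train_dir \
                        --checkpoint_path pretrained/model.ckpt-900000

Because --checkpoint_path is given, train_dir is assumed here to be a new, empty model directory.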

Example: BERT run.py#

Shown below is an example run.py for the BERT model.

# Example run.py script for BERT

# Copyright 2021 Cerebras Systems.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import argparse
import os
import sys

import tensorflow as tf

# Relative path imports
sys.path.append(os.path.join(os.path.dirname(__file__), '../../..'))
from common_zoo.estimator.tf.cs_estimator import CerebrasEstimator
from common_zoo.estimator.tf.run_config import CSRunConfig
from common_zoo.run_utils import (
    check_env,
    create_warm_start_settings,
    get_csconfig,
    get_csrunconfig_dict,
    is_cs,
    save_params,
    save_predictions,
    update_params_from_args,
)
from transformers.bert.tf.data import eval_input_fn, train_input_fn
from transformers.bert.tf.model import model_fn
from transformers.bert.tf.utils import get_params


def create_arg_parser(default_model_dir):
    """
    Create parser for command line args.

    :param str default_model_dir: default value for the model_dir
    :returns: ArgumentParser
    """
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "-p",
        "--params",
        required=True,
        help="Path to .yaml file with model parameters",
    )
    parser.add_argument(
        "-o",
        "--model_dir",
        default=default_model_dir,
        help="Model directory where checkpoints will be written. "
        + "If directory exists, weights are loaded from the checkpoint file.",
    )
    parser.add_argument(
        "--cs_ip",
        default=None,
        help="CS system IP address, defaults to None. Ignored on GPU.",
    )
    parser.add_argument(
        "--steps",
        type=int,
        default=None,
        help=(
            "Number of steps to run mode train."
            + " Runs repeatedly for the specified number."
        ),
    )
    parser.add_argument(
        "--max_steps",
        type=int,
        default=None,
        help=(
            "Number of total steps to run mode train or for defining training"
            + " configuration for train_and_eval. Runs incrementally till"
            + " the specified number."
        ),
    )
    parser.add_argument(
        "--eval_steps",
        type=int,
        default=None,
        help=(
            "Number of total steps to run mode eval, eval_all or for defining"
            + " eval configuration for train_and_eval. Runs once for"
            + " the specified number."
        ),
    )
    parser.add_argument(
        "-m",
        "--mode",
        required=True,
        choices=["train", "eval", "eval_all", "train_and_eval", "predict",],
        help=(
            "Can train, eval, eval_all, train_and_eval, or predict."
            + "  Train, eval, and predict will compile and train if on CS system,"
            + "  and just run locally (CPU/GPU) if not on CS system."
            + "  train_and_eval will run locally."
            + "  Eval_all will run eval locally for all available checkpoints."
        ),
    )
    parser.add_argument(
        "--validate_only",
        action="store_true",
        help="Compile model up to kernel matching.",
    )
    parser.add_argument(
        "--compile_only",
        action="store_true",
        help="Compile model completely, generating compiled executables.",
    )
    parser.add_argument(
        "--device",
        default=None,
        help="Force model to run on a specific device (e.g., --device /gpu:0)",
    )
    parser.add_argument(
        "--checkpoint_path",
        default=None,
        help="Checkpoint to initialize weights from.",
    )

    return parser


def validate_runtime_params(params):
    # check validate_only/compile_only
    assert not (
        params["validate_only"] and params["compile_only"]
    ), "Please only use one of validate_only and compile_only."
    if params["validate_only"] or params["compile_only"]:
        assert params["mode"] in [
            "train",
            "eval",
            "predict",
        ], "Can only validate/compile model in train, eval, or predict mode."

    # check for gpu optimization flags
    if (
        params["mode"] not in ["compile_only", "validate_only"]
        and not is_cs(params)
        and not params["enable_gpu_optimizations"]
    ):
        tf.compat.v1.logging.warn(
            "Set enable_gpu_optimizations to True in training params "
            "to improve GPU performance."
        )


def run(
    args,
    params,
    model_fn,
    train_input_fn=None,
    eval_input_fn=None,
    predict_input_fn=None,
    output_layer_name=None,
):
    """
    Set up estimator and run based on mode

    :params dict params: dict to handle all parameters
    :params tf.estimator.EstimatorSpec model_fn: Model function to run with
    :params tf.data.Dataset train_input_fn: Dataset to train with
    :params tf.data.Dataset eval_input_fn: Dataset to validate against
    :params tf.data.Dataset predict_input_fn: Dataset to run inference on
    :params str output_layer_name: name of the output layer to be excluded
        from weight initialization when performing fine-tuning.
    """
    # update and validate runtime params
    runconfig_params = params["runconfig"]
    update_params_from_args(args, runconfig_params)
    validate_runtime_params(runconfig_params)
    # save params for reproducibility
    save_params(params, model_dir=runconfig_params["model_dir"])

    # get cs-specific configs
    cs_config = get_csconfig(params.get("csconfig", dict()))
    # get runtime configurations
    use_cs = is_cs(runconfig_params)
    csrunconfig_dict = get_csrunconfig_dict(runconfig_params)

    stack_params = dict()
    if (
        use_cs
        or runconfig_params["validate_only"]
        or runconfig_params["compile_only"]
    ):
        from cerebras.pb.stack.full_pb2 import FullConfig

        config = FullConfig()
        if params['train_input']['max_sequence_length'] <= 128:
            config.matching.kernel.no_dcache_spill_splits = True
        stack_params['config'] = config

    # prep cs1 run environment, run config and estimator
    check_env(runconfig_params)
    est_config = CSRunConfig(
        cs_ip=runconfig_params["cs_ip"],
        cs_config=cs_config,
        stack_params=stack_params,
        **csrunconfig_dict,
    )
    warm_start_settings = create_warm_start_settings(
        runconfig_params, exclude_string=output_layer_name
    )
    est = CerebrasEstimator(
        model_fn=model_fn,
        model_dir=runconfig_params["model_dir"],
        config=est_config,
        params=params,
        warm_start_from=warm_start_settings,
    )

    # execute based on mode
    if runconfig_params["validate_only"] or runconfig_params["compile_only"]:
        if runconfig_params["mode"] == "train":
            input_fn = train_input_fn
            mode = tf.estimator.ModeKeys.TRAIN
        elif runconfig_params["mode"] == "eval":
            input_fn = eval_input_fn
            mode = tf.estimator.ModeKeys.EVAL
        else:
            input_fn = predict_input_fn
            mode = tf.estimator.ModeKeys.PREDICT
        est.compile(
            input_fn, validate_only=runconfig_params["validate_only"], mode=mode
        )
    elif runconfig_params["mode"] == "train":
        est.train(
            input_fn=train_input_fn,
            steps=runconfig_params["steps"],
            max_steps=runconfig_params["max_steps"],
            use_cs=use_cs,
        )
    elif runconfig_params["mode"] == "eval":
        est.evaluate(
            input_fn=eval_input_fn,
            checkpoint_path=runconfig_params["checkpoint_path"],
            steps=runconfig_params["eval_steps"],
            use_cs=use_cs,
        )
    elif runconfig_params["mode"] == "eval_all":
        ckpt_list = tf.train.get_checkpoint_state(
            runconfig_params["model_dir"]
        ).all_model_checkpoint_paths
        for ckpt in ckpt_list:
            est.evaluate(
                eval_input_fn,
                checkpoint_path=ckpt,
                steps=runconfig_params["eval_steps"],
                use_cs=use_cs,
            )
    elif runconfig_params["mode"] == "train_and_eval":
        train_spec = tf.estimator.TrainSpec(
            input_fn=train_input_fn, max_steps=runconfig_params["max_steps"]
        )
        eval_spec = tf.estimator.EvalSpec(
            input_fn=eval_input_fn,
            steps=runconfig_params["eval_steps"],
            throttle_secs=runconfig_params["throttle_secs"],
        )
        tf.estimator.train_and_evaluate(est, train_spec, eval_spec)
    elif runconfig_params["mode"] == "predict":
        sys_name = "cs" if use_cs else "tf"
        file_to_save = f"predictions_{sys_name}_{est_config.task_id}.npz"
        predictions = est.predict(
            input_fn=predict_input_fn,
            checkpoint_path=runconfig_params["checkpoint_path"],
            num_samples=runconfig_params["predict_steps"],
            use_cs=use_cs,
        )
        save_predictions(
            model_dir=runconfig_params["model_dir"],
            outputs=predictions,
            name=file_to_save,
        )


def main():
    """
    Main function
    """
    default_model_dir = os.path.join(
        os.path.dirname(os.path.abspath(__file__)), "model_dir"
    )
    parser = create_arg_parser(default_model_dir)
    args = parser.parse_args(sys.argv[1:])
    params = get_params(args.params)
    run(
        args=args,
        params=params,
        model_fn=model_fn,
        train_input_fn=train_input_fn,
        eval_input_fn=eval_input_fn,
    )


if __name__ == "__main__":
    tf.compat.v1.logging.set_verbosity(tf.compat.v1.logging.INFO)
    main()
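
After training, the checkpoints saved in the model directory can be evaluated locally with the eval_all mode, which runs on the CPU or GPU rather than on the CS system. The command below is illustrative only; the step count and the train_dir model directory are placeholders.

python run.py --params configs/params_bert_base_msl512.yaml \
              --mode eval_all \
              --eval_steps 1000 \
              --model_dir train_dir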