.. _run-py-template: The run.py Template =================== Whether you compile on a CPU node using the ``csrun_cpu`` script, or run on the Cerebras system using the ``csrun_wse`` script, you must pass a full Python command along with its command line arguments, as an argument to the script. This section presents an example Python template ``run.py`` with a detailed description of the supported options and flags. Note that this applies to the pipeline workflow, not for weight streaming. .. note:: The ``run.py`` described below is an example template only. If you are developing in **TensorFlow**, you can organize your code whichever way that best suits you, as long as you use ``CerebrasEstimator``. If you are developing in **PyTorch** you can use the ``run`` function, common to all implementations in the Cerebras Model Zoo git repository. You can find more about the ``run`` function in the :ref:`adapting-pytorch-to-cs` document. For TensorFlow code, see the following diagram showing a simplified ``run.py`` example using the ``CerebrasEstimator`` and the CGC flow. .. figure:: ../images/run-py-cerebrasestimator-cgc.jpg :align: center :width: 900 px Cerebras Graph Compiler for the CS system Syntax ------ .. code-block:: bash python run.py -p --params \ -o --model_dir \ --cs_ip \ --steps \ --max_steps \ -m --mode \ --compile_only \ --device \ --checkpoint_path \ where: .. _args-run-py: Arguments --------- params ~~~~~~ - ``-p --params``: Required. *String*. Path to the YAML file that contains the model parameters. For example: ``--params configs/params_bert_base_msl512.yaml``. model_dir ~~~~~~~~~ - ``-o --model_dir``: Optional. *String*. The location where your model and all the outputs such as checkpoints and event files are stored. If the directory exists, then the weights are loaded from the checkpoint file. Same as the ``model_dir`` passed to the ``tf.estimator``. Default value is current directory of execution. See also `tf.estimator.Estimator `__. cs_ip ~~~~~ - ``--cs_ip``: Optional. The IP address of the CS system. Format should be ``IP-ADDRESS:PORT``. Default value is ``None``. This option is ignored on GPU. steps ~~~~~ - ``--steps``: Optional. *Integer*. The number of steps to run the ``train`` mode. Runs for the specified number of steps either while starting from the beginning or while continuing from a checkpoint. Default value is ``None``. max_steps ~~~~~~~~~ - ``--max_steps``: Optional. *Integer*. The total number of steps to run the ``train`` mode, or for training in the ``train_and_eval`` mode. If the run is continuing from a checkpoint that was made at or after ``max_steps``, then the run will stop. If the run is continuing from a checkpoint that was made at less than ``max_steps``, then it will run for the number of steps remaining between the checkpoint and the ``max_steps``. Default value is ``None``. eval_steps ~~~~~~~~~~ - ``--eval_steps``: *Integer*. Optional on the GPU but required when running on the CS system. The total number of steps to run ``eval`` or ``eval_all`` modes, or for evaluation in ``train_and_eval`` mode. Runs once for the specified number. mode ~~~~ - ``-m --mode``: Required. *String*. Set the mode for your neural network. Allowed choices are: - ``train``: In this mode the compiler will compile, and run the training on the CS system. If ``cs_ip`` is not specified, then runs the training on the CPU or GPU. - ``eval``: In this mode the evaluation will run on CPU or GPU. For some neural network models, this is experimentally supported on the CS system. - ``eval_all``: In this mode the evaluation will run on CPU or GPU for all the available checkpoints. This mode is not yet supported on the CS system. - ``train_and_eval``: In this mode the training and evaluation will be run on CPU or GPU. - ``predict``: In this mode the prediction (inference) will run on CPU or GPU. For some neural network models, this is experimentally supported on the CS system. compilation flags ~~~~~~~~~~~~~~~~~ - ``--validate_only``: In this mode the compiler will stop after the kernel matching phase. Available in Pipelined execution mode for TensorFlow. - ``--compile_only``: In this mode the compiler will continue after the kernel matching until it finishes. If the compile is successful it will generate the CS system bitstream. Available in Pipelined execution mode. device ~~~~~~ - ``--device``: Optional. *String*. This option should be used only to specify a GPU device. The compiler will compile on CPU, and will run on the GPU device specified in this setting. For example, ``--device /gpu:0`` will run on the GPU 0. checkpoint_path ~~~~~~~~~~~~~~~ - ``--checkpoint_path``: Optional. *String*. The weights are initialized from the checkpoint specified with this option. Default value is ``None``. If this option is used with an initial checkpoint, and if the ``model_dir`` already contains a checkpoint, then the compiler will alert you to: - Either provide an empty ``model_dir`` so the weights are initialized using the value provided to this ``checkpoint_path`` option, or - Remove this option ``checkpoint_path`` in order to initialize weights from the ``model_dir``. .. _bert-run-py-script: Example: BERT run.py -------------------- Shown below is the ``run.py`` example code of the BERT model. .. code-block:: python :linenos: # Example run.py script for BERT # Copyright 2021 Cerebras Systems. # # Licensed under the Apache License, Version 2.0 (the "License"); # you may not use this file except in compliance with the License. # You may obtain a copy of the License at # # http://www.apache.org/licenses/LICENSE-2.0 # # Unless required by applicable law or agreed to in writing, software # distributed under the License is distributed on an "AS IS" BASIS, # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. # See the License for the specific language governing permissions and # limitations under the License. import argparse import os import sys import tensorflow as tf # Relative path imports sys.path.append(os.path.join(os.path.dirname(__file__), '../../..')) from common_zoo.estimator.tf.cs_estimator import CerebrasEstimator from common_zoo.estimator.tf.run_config import CSRunConfig from common_zoo.run_utils import ( check_env, create_warm_start_settings, get_csconfig, get_csrunconfig_dict, is_cs, save_params, save_predictions, update_params_from_args, ) from transformers.bert.tf.data import eval_input_fn, train_input_fn from transformers.bert.tf.model import model_fn from transformers.bert.tf.utils import get_params def create_arg_parser(default_model_dir): """ Create parser for command line args. :param str default_model_dir: default value for the model_dir :returns: ArgumentParser """ parser = argparse.ArgumentParser() parser.add_argument( "-p", "--params", required=True, help="Path to .yaml file with model parameters", ) parser.add_argument( "-o", "--model_dir", default=default_model_dir, help="Model directory where checkpoints will be written. " + "If directory exists, weights are loaded from the checkpoint file.", ) parser.add_argument( "--cs_ip", default=None, help="CS system IP address, defaults to None. Ignored on GPU.", ) parser.add_argument( "--steps", type=int, default=None, help=( "Number of steps to run mode train." + " Runs repeatedly for the specified number." ), ) parser.add_argument( "--max_steps", type=int, default=None, help=( "Number of total steps to run mode train or for defining training" + " configuration for train_and_eval. Runs incrementally till" + " the specified number." ), ) parser.add_argument( "--eval_steps", type=int, default=None, help=( "Number of total steps to run mode eval, eval_all or for defining" + " eval configuration for train_and_eval. Runs once for" + " the specified number." ), ) parser.add_argument( "-m", "--mode", required=True, choices=["train", "eval", "eval_all", "train_and_eval", "predict",], help=( "Can train, eval, eval_all, train_and_eval, or predict." + " Train, eval, and predict will compile and train if on CS system," + " and just run locally (CPU/GPU) if not on CS system." + " train_and_eval will run locally." + " Eval_all will run eval locally for all available checkpoints." ), ) parser.add_argument( "--validate_only", action="store_true", help="Compile model up to kernel matching.", ) parser.add_argument( "--compile_only", action="store_true", help="Compile model completely, generating compiled executables.", ) parser.add_argument( "--device", default=None, help="Force model to run on a specific device (e.g., --device /gpu:0)", ) parser.add_argument( "--checkpoint_path", default=None, help="Checkpoint to initialize weights from.", ) return parser def validate_runtime_params(params): # check validate_only/compile_only assert not ( params["validate_only"] and params["compile_only"] ), "Please only use one of validate_only and compile_only." if params["validate_only"] or params["compile_only"]: assert params["mode"] in [ "train", "eval", "predict", ], "Can only validate/compile model in train, eval, or predict mode." # check for gpu optimization flags if ( params["mode"] not in ["compile_only", "validate_only"] and not is_cs(params) and not params["enable_gpu_optimizations"] ): tf.compat.v1.logging.warn( "Set enable_gpu_optimizations to True in training params " "to improve GPU performance." ) def run( args, params, model_fn, train_input_fn=None, eval_input_fn=None, predict_input_fn=None, output_layer_name=None, ): """ Set up estimator and run based on mode :params dict params: dict to handle all parameters :params tf.estimator.EstimatorSpec model_fn: Model function to run with :params tf.data.Dataset train_input_fn: Dataset to train with :params tf.data.Dataset eval_input_fn: Dataset to validate against :params tf.data.Dataset predict_input_fn: Dataset to run inference on :params str output_layer_name: name of the output layer to be excluded from weight initialization when performing fine-tuning. """ # update and validate runtime params runconfig_params = params["runconfig"] update_params_from_args(args, runconfig_params) validate_runtime_params(runconfig_params) # save params for reproducibility save_params(params, model_dir=runconfig_params["model_dir"]) # get cs-specific configs cs_config = get_csconfig(params.get("csconfig", dict())) # get runtime configurations use_cs = is_cs(runconfig_params) csrunconfig_dict = get_csrunconfig_dict(runconfig_params) stack_params = dict() if ( use_cs or runconfig_params["validate_only"] or runconfig_params["compile_only"] ): from cerebras.pb.stack.full_pb2 import FullConfig config = FullConfig() if params['train_input']['max_sequence_length'] <= 128: config.matching.kernel.no_dcache_spill_splits = True stack_params['config'] = config # prep cs1 run environment, run config and estimator check_env(runconfig_params) est_config = CSRunConfig( cs_ip=runconfig_params["cs_ip"], cs_config=cs_config, stack_params=stack_params, **csrunconfig_dict, ) warm_start_settings = create_warm_start_settings( runconfig_params, exclude_string=output_layer_name ) est = CerebrasEstimator( model_fn=model_fn, model_dir=runconfig_params["model_dir"], config=est_config, params=params, warm_start_from=warm_start_settings, ) # execute based on mode if runconfig_params["validate_only"] or runconfig_params["compile_only"]: if runconfig_params["mode"] == "train": input_fn = train_input_fn mode = tf.estimator.ModeKeys.TRAIN elif runconfig_params["mode"] == "eval": input_fn = eval_input_fn mode = tf.estimator.ModeKeys.EVAL else: input_fn = predict_input_fn mode = tf.estimator.ModeKeys.PREDICT est.compile( input_fn, validate_only=runconfig_params["validate_only"], mode=mode ) elif runconfig_params["mode"] == "train": est.train( input_fn=train_input_fn, steps=runconfig_params["steps"], max_steps=runconfig_params["max_steps"], use_cs=use_cs, ) elif runconfig_params["mode"] == "eval": est.evaluate( input_fn=eval_input_fn, checkpoint_path=runconfig_params["checkpoint_path"], steps=runconfig_params["eval_steps"], use_cs=use_cs, ) elif runconfig_params["mode"] == "eval_all": ckpt_list = tf.train.get_checkpoint_state( runconfig_params["model_dir"] ).all_model_checkpoint_paths for ckpt in ckpt_list: est.evaluate( eval_input_fn, checkpoint_path=ckpt, steps=runconfig_params["eval_steps"], use_cs=use_cs, ) elif runconfig_params["mode"] == "train_and_eval": train_spec = tf.estimator.TrainSpec( input_fn=train_input_fn, max_steps=runconfig_params["max_steps"] ) eval_spec = tf.estimator.EvalSpec( input_fn=eval_input_fn, steps=runconfig_params["eval_steps"], throttle_secs=runconfig_params["throttle_secs"], ) tf.estimator.train_and_evaluate(est, train_spec, eval_spec) elif runconfig_params["mode"] == "predict": sys_name = "cs" if use_cs else "tf" file_to_save = f"predictions_{sys_name}_{est_config.task_id}.npz" predictions = est.predict( input_fn=predict_input_fn, checkpoint_path=runconfig_params["checkpoint_path"], num_samples=runconfig_params["predict_steps"], use_cs=use_cs, ) save_predictions( model_dir=runconfig_params["model_dir"], outputs=predictions, name=file_to_save, ) def main(): """ Main function """ default_model_dir = os.path.join( os.path.dirname(os.path.abspath(__file__)), "model_dir" ) parser = create_arg_parser(default_model_dir) args = parser.parse_args(sys.argv[1:]) params = get_params(args.params) run( args=args, params=params, model_fn=model_fn, train_input_fn=train_input_fn, eval_input_fn=eval_input_fn, ) if __name__ == "__main__": tf.compat.v1.logging.set_verbosity(tf.compat.v1.logging.INFO) main()