.. _limitations-of-cerebras-estimator:

Limitations of the CerebrasEstimator
====================================

Though ``CerebrasEstimator`` inherits from the TensorFlow Estimator, it does not yet support the full breadth of features provided by the TensorFlow Estimator. The differences and limitations are described below.

.. important::

    All the feature limitations listed below, such as the lack of support for user hooks, apply *only* when training on the CS system. The ``CerebrasEstimator`` supports all TensorFlow Estimator features when training on GPU or CPU.

Model function limitations
--------------------------

Hooks
    The ``CerebrasEstimator`` does not currently support user-defined hooks, which provide a way to "hook into" certain points of the ``CerebrasEstimator`` execution. If such user-defined hooks are present, the ``CerebrasEstimator`` will fail with an error about accessing a tensor that is not initialized.

Eval_metrics
    The ``CerebrasEstimator`` does not currently support the TensorFlow Metrics API when running on the CS system. This means that during a training run on the CS system, you cannot run ``eval_metrics`` operations such as accuracy. If you would like to use ``eval_metrics`` for debugging, the ``CerebrasEstimator`` supports their usage on CPU or GPU.

    If the parameter ``eval_metric_ops`` is set in the ``EstimatorSpec`` returned by the model function, then running the Estimator will produce the following error: ``[Unsupported TensorFlow] Detected unsupported eval_metric_ops in CS system training run``.

.. _cs-estimator-input-function:

Input function differences
--------------------------

Dataset repeating
~~~~~~~~~~~~~~~~~

Instead of requiring you to specify the number of epochs you would like to train for, the Estimator requires that you:

1. Explicitly set the **number of steps** you want to train for in the Estimator ``train`` function, and
2. Use the default parameter of the ``repeat`` function (``count=None``) provided by the Dataset API to ensure that the ``input_fn`` keeps providing samples to the CS system until the number of training steps set in the ``train`` function is complete. See the code example below:

.. code-block:: python

    dataset = dataset.shuffle(1000).repeat().batch(batch_size, drop_remainder=True)

Multiple input workers
~~~~~~~~~~~~~~~~~~~~~~

To utilize the full computational capabilities of the CS system, multiple input workers are used to send the training data, simultaneously from each input worker, to the CS system.

.. note::

    This means that each worker node must shuffle its data differently.

In the simplest setup, the same input data is replicated across every input worker. Because the dataset is large, the ``CerebrasEstimator`` can approximate distributed training with dataset shards by ensuring that each worker shuffles its data differently. In other words, make sure that you are not providing a deterministic random seed.

Return Dataset
~~~~~~~~~~~~~~

The ``CerebrasEstimator`` requires that your input function returns a Dataset (``tf.data.Dataset``). Each element of the Dataset must be structured to consist of features and labels.

.. note::

    Features must be a tensor. Labels can be a tensor or ``None``.

If the input function does not return a dataset, the ``CerebrasEstimator`` will error out with the following error: ``[Unsupported TensorFlow] Input function must return a tf.data.Dataset``.

Single dictionary input
~~~~~~~~~~~~~~~~~~~~~~~

The input function in ``CerebrasEstimator`` takes only a single dictionary parameter, ``params``, as input. This dictionary is passed in through the Estimator constructor.

Input function limitations
--------------------------

Drop remainder to enforce fixed batch size
    The ``CerebrasEstimator`` requires that your input function outputs batches of a fixed ``batch_size`` across **all steps**.
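Taken together, the input-function requirements above can be sketched as follows. This is a minimal illustration, not Cerebras reference code: the synthetic in-memory data, the feature shape, and the ``"batch_size"`` key in ``params`` are assumptions made for the example.

```python
import tensorflow as tf

def input_fn(params):
    """Illustrative input function: accepts only the ``params`` dict and
    returns a ``tf.data.Dataset`` whose elements are (features, labels)."""
    batch_size = params["batch_size"]

    # Synthetic in-memory data, purely for illustration.
    features = tf.random.uniform([1000, 8])
    labels = tf.random.uniform([1000], maxval=2, dtype=tf.int32)

    dataset = tf.data.Dataset.from_tensor_slices((features, labels))
    # shuffle() without a seed: each input worker shuffles differently.
    # repeat() with the default count=None: keep providing samples until
    # the step count passed to train() is reached.
    # drop_remainder=True: every batch has exactly batch_size elements.
    dataset = dataset.shuffle(1000).repeat().batch(
        batch_size, drop_remainder=True)
    return dataset
```

Because ``repeat()`` is unbounded and ``drop_remainder=True`` is set, every batch the function emits has the same fixed shape, which is what the CS system requires.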
To enforce this, you must set the ``drop_remainder`` parameter provided by the Dataset API to ``True`` when batching the Dataset. See the TensorFlow documentation for ``tf.data.Dataset.batch``.

.. code-block:: python

    dataset = dataset.shuffle(1000).repeat().batch(batch_size, drop_remainder=True)

If you do not provide a fixed ``batch_size``, the ``CerebrasEstimator`` will error out with the following error: ``[Unsupported TensorFlow] Inconsistent batch sizes detected. To ensure a fixed batch size across all steps, set `drop_remainder=True` when batching your Dataset in the input function.``

Config differences
------------------

Lower bound on save_checkpoint_steps
    Because the CS system trains faster than alternative systems, saving checkpoints too frequently can have a significant impact on overall performance.

TF Env Config
    This environment variable (see section 1 under Configuration) **must be specified** while training on the CS system. A default for this is already provided in our example scripts. Ensure that it is set during training.

Parameters not supported
    a. ``save_checkpoint_secs``
    b. ``train_distribute``
    c. ``device_fn``
    d. ``protocol``
    e. ``eval_distribute``
    f. ``experimental_distribute``
    g. ``experimental_max_worker_delay_secs``

Compilation differences
-----------------------

Like most high-performance compute devices, the CS system requires that an application be compiled before execution. In a typical training run, this is handled automatically by the ``CerebrasEstimator``. However, because this process can take many minutes, increasing with the complexity of your model, Cerebras makes available a standalone ``CerebrasEstimator.compile()`` function. This function allows you to quickly validate your model code and perform a full precompile without connecting to the CS system. Note, however, that when you compile your model on one CS system, you cannot run the compiled model on another CS system.
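As a rough sketch, a precompile-then-train workflow might look like the following. This is pseudocode, runnable only inside the Cerebras software stack, and the constructor arguments shown are assumptions; consult your Cerebras software documentation for the exact ``CerebrasEstimator`` and ``compile()`` signatures.

.. code-block:: python

    # Hypothetical sketch -- constructor arguments are illustrative.
    est = CerebrasEstimator(model_fn=model_fn, params=params)

    # Validate the model code and precompile without connecting
    # to the CS system.
    est.compile(input_fn)

    # Later, on the same CS system, train for an explicit number of
    # steps (not epochs), as required by the CerebrasEstimator.
    est.train(input_fn=input_fn, steps=100000)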