Limitations of the CerebrasEstimator

Though CerebrasEstimator inherits from the TensorFlow Estimator, it does not yet support the full breadth of features provided by TensorFlow Estimator. The differences and limitations are described below.

Important

All the feature limitations listed below, such as lack of support for user hooks, apply only when training on the CS system. The CerebrasEstimator supports all TensorFlow Estimator features when training on GPU or CPU.

Model function limitations

Hooks

The CerebrasEstimator currently does not support user-defined hooks, which provide a way to ‘hook into’ certain points of the CerebrasEstimator execution. If user-defined hooks are present, the CerebrasEstimator fails with an error about accessing a tensor that is not initialized.

Eval_metrics

When running on the CS system, the CerebrasEstimator does not currently support the TensorFlow Metrics API. This means that during a training run on the CS system, you cannot run eval_metrics operations such as accuracy.

If you would like to use eval_metrics for debugging, the CerebrasEstimator supports the usage of eval_metrics on CPU or GPU.

If the parameter eval_metric_ops is set in the EstimatorSpec returned by the model function, then running the CerebrasEstimator on the CS system produces the following error:

[Unsupported TensorFlow] Detected unsupported eval_metric_ops in CS system training run.

Input function differences

Dataset repeating

Instead of requiring you to specify the number of epochs you would like to train for, the Estimator requires that you:

  1. Explicitly set the number of steps you want to train for in the Estimator train function, and

  2. Use the default argument of the repeat function (count=None) provided by the Dataset API to ensure that the input_fn keeps providing samples to the CS system until the number of training steps set in the train function is complete. See the code example below:

    dataset = dataset.shuffle(1000).repeat().batch(batch_size, drop_remainder=True)
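Because training length is expressed in steps rather than epochs, an epoch count has to be converted by hand. The helper below is a minimal sketch of that arithmetic, assuming the pipeline shown above (repeat() before batch(), so every step consumes exactly batch_size samples and batches may straddle epoch boundaries). The function name steps_for_epochs and the example sizes are hypothetical and are not part of the CerebrasEstimator API.

```python
def steps_for_epochs(num_samples, batch_size, epochs):
    """Return the number of training steps that covers `epochs` passes over the data.

    Assumes repeat() is applied before batch(), so each step consumes exactly
    `batch_size` samples regardless of epoch boundaries.
    """
    if num_samples <= 0 or batch_size <= 0 or epochs <= 0:
        raise ValueError("num_samples, batch_size and epochs must be positive")
    total_samples = num_samples * epochs
    # Integer ceiling division: round up so the tail of the last epoch is covered.
    return (total_samples + batch_size - 1) // batch_size


# Hypothetical sizes: 50,000 samples, batches of 256, 3 epochs.
num_steps = steps_for_epochs(num_samples=50_000, batch_size=256, epochs=3)
print(num_steps)  # 586
```

The resulting value is what you would pass as the number of steps to the Estimator train function.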

Multiple input workers

To utilize the full computational capabilities of the CS system, multiple input workers send training data to the CS system simultaneously.

Note

This means that each worker node must shuffle its data differently.

In the simplest setup, the same input data is replicated across every input worker. Because the dataset is large, the CerebrasEstimator can approximate distributed training with dataset shards by ensuring that each worker shuffles its data differently.

In other words, make sure that you do not provide a fixed random seed when shuffling, because a fixed seed makes every worker produce the same order.
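The effect of a fixed seed can be illustrated without TensorFlow. The sketch below uses Python's random module as a stand-in for the shuffle step of the Dataset API; the helper name worker_order, the worker count of 4, and the seed value 42 are all hypothetical. It only shows why identical seeds give identical sample orders on every worker.

```python
import random


def worker_order(sample_ids, seed=None):
    """Return the order in which one input worker would serve the samples.

    With seed=None the generator is seeded from a system source, so each
    worker gets its own ordering; a fixed seed reproduces the same ordering.
    """
    rng = random.Random(seed)
    order = list(sample_ids)
    rng.shuffle(order)
    return order


sample_ids = range(1000)

# Fixed seed: all four workers serve the samples in exactly the same order.
fixed = [worker_order(sample_ids, seed=42) for _ in range(4)]
print(all(order == fixed[0] for order in fixed))  # True

# No seed: each worker draws its own ordering of the same replicated data.
unseeded = [worker_order(sample_ids) for _ in range(4)]
print(len({tuple(order) for order in unseeded}), "distinct orderings")
```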

Return Dataset

The CerebrasEstimator requires that your input function returns a Dataset (tf.data.Dataset). Each element of the Dataset must be structured to consist of features and labels. See <link to features and labels discussion>

Note

Features must be a tensor. Labels can be a tensor or None.

If the input function does not return a Dataset, the CerebrasEstimator fails with the following error:

[Unsupported TensorFlow] Input function must return a tf.data.Dataset.

Single dictionary input

The input function in CerebrasEstimator takes only a single dictionary parameter, params, as input. This dictionary is passed in through the Estimator constructor.

Input function limitations

Drop remainder to enforce fixed batch size

The CerebrasEstimator requires that your input function outputs batches of a fixed batch_size across all steps. To enforce this, you must set the drop_remainder parameter provided by the Dataset API to True when batching the Dataset. See TensorFlow documentation for batch.

    dataset = dataset.shuffle(1000).repeat().batch(batch_size, drop_remainder=True)

If you do not provide a fixed batch_size, the CerebrasEstimator fails with the following error:

[Unsupported TensorFlow] Inconsistent batch sizes detected. To ensure a fixed batch size across all steps, set `drop_remainder=True` when batching your Dataset in the input function.
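To see where an inconsistent batch size comes from, the following sketch mimics the batching semantics of the Dataset API in plain Python. The batch helper here is a hypothetical stand-in written for illustration, not the TensorFlow implementation: when the number of samples is not a multiple of batch_size, the final batch is smaller unless drop_remainder is set.

```python
def batch(samples, batch_size, drop_remainder=False):
    """Group samples into lists of batch_size, mimicking Dataset batching."""
    batches = []
    for start in range(0, len(samples), batch_size):
        chunk = samples[start:start + batch_size]
        if drop_remainder and len(chunk) < batch_size:
            break  # discard the final, smaller batch
        batches.append(chunk)
    return batches


samples = list(range(10))

# Without drop_remainder the last batch has 2 elements instead of 4.
print([len(b) for b in batch(samples, 4)])                       # [4, 4, 2]
# With drop_remainder every batch has exactly 4 elements.
print([len(b) for b in batch(samples, 4, drop_remainder=True)])  # [4, 4]
```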

Config differences

Lower bound on save_checkpoints_steps

Because the CS system trains faster than alternative systems, saving checkpoints too frequently can have a significant overall performance impact.
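A rough way to reason about this impact is to estimate what share of wall-clock time goes to writing checkpoints for a given save interval. The sketch below is only a back-of-the-envelope model: it assumes that training pauses while a checkpoint is written, and the timings (10 ms per step, 5 s per checkpoint) are hypothetical placeholders, not measurements of the CS system. The function name checkpoint_overhead is likewise an assumption for illustration.

```python
def checkpoint_overhead(step_time_s, checkpoint_time_s, save_every_n_steps):
    """Return the fraction of wall-clock time spent writing checkpoints.

    Assumes training is blocked for checkpoint_time_s on every save.
    """
    train_time_s = step_time_s * save_every_n_steps
    return checkpoint_time_s / (train_time_s + checkpoint_time_s)


# Hypothetical timings: 10 ms per training step, 5 s to write one checkpoint.
for interval in (100, 1_000, 10_000):
    share = checkpoint_overhead(0.01, 5.0, interval)
    print(f"save every {interval:>6} steps: {share:.0%} of time spent checkpointing")
```

Under this model, the faster each training step is, the larger the save interval needs to be to keep the checkpointing share small.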

TF Env Config

This environment variable (see section 1 under Configuration) must be specified when training on the CS system. A default is already provided in our example scripts. Ensure that it is called during training.

Parameters not supported

  1. save_checkpoints_secs

  2. train_distribute

  3. device_fn

  4. protocol

  5. eval_distribute

  6. experimental_distribute

  7. experimental_max_worker_delay_secs
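Before launching a run on the CS system, it can help to check a set of config arguments against the list above. The helper below is a hypothetical pre-flight check written for illustration, not something provided by the CerebrasEstimator; the parameter names are spelled as in TensorFlow's RunConfig, and the example arguments are made up.

```python
# Names from the list above, spelled as in TensorFlow's RunConfig.
UNSUPPORTED_CONFIG_PARAMS = frozenset({
    "save_checkpoints_secs",
    "train_distribute",
    "device_fn",
    "protocol",
    "eval_distribute",
    "experimental_distribute",
    "experimental_max_worker_delay_secs",
})


def find_unsupported(config_kwargs):
    """Return the sorted names in config_kwargs that are set but unsupported."""
    return sorted(
        name
        for name, value in config_kwargs.items()
        if name in UNSUPPORTED_CONFIG_PARAMS and value is not None
    )


# Hypothetical config arguments for a training run.
config_kwargs = {
    "model_dir": "./model_dir",
    "save_checkpoints_steps": 1000,
    "save_checkpoints_secs": 600,
    "train_distribute": None,  # left unset, so not reported
}
print(find_unsupported(config_kwargs))  # ['save_checkpoints_secs']
```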

Compilation differences

Like most high performance compute devices, the CS system requires application compilation before execution. In a typical training run, this is handled automatically by CerebrasEstimator.

However, because this process can take many minutes, and longer as the complexity of your model increases, Cerebras makes available a standalone CerebrasEstimator.compile() function. This function allows you to quickly validate your model code and perform full batched precompiles without connecting to the CS system. Note, however, that when you compile your model on one CS system, you cannot run this compiled model on another CS system.