Input Function Report#

When you compile your TensorFlow model for training on the Cerebras system, the compiler automatically analyzes your input function and generates a detailed log. This log identifies any missing functions and recommends parameter values to enhance training performance on the CS system. This section describes how to interpret this log.

Important

The analyzer runs automatically even when you only precompile your network, with either the validate_only or the compile_only option. You do not need to run your network on the CS system to run the analyzer.

The analyzer will display its recommendations in the output log of the compiler run, usually displayed on stdout.

Example analyzer output on stdout#

The following shows an example of the analyzer output, displayed as a part of the compiler output log on the terminal (or stdout).

Hint

In the compiler log output on stdout, each analyzer output statement will begin with the text string [input_fn].

INFO:tensorflow:Running analyze_input_fn_compile
WARNING:root:[input_fn] - interleave(): in ParallelInterleaveDatasetV3, cycle_length is not being set to CS_AUTOTUNE. Currently, it is set to 64. If determinism is not required, Using CS_AUTOTUNE is likely to improve performance unless you are deliberately using a fine-tuned value.e.g. dataset = dataset.interleave(map_func, cycle_length=cerebras.tf.tools.analyze_input_fn.CS_AUTOTUNE)
WARNING:root:Tensorflow recommends that most dataset input pipelines end with a call to prefetch, but ShuffleDataset used in input_fn after prefetch(). Unless this is a careful design choice, consider calling prefetch last
WARNING:root:[input_fn] - interleave(): in ParallelInterleaveDatasetV3_1, cycle_length is not being set to CS_AUTOTUNE. Currently, it is set to 64. If determinism is not required, Using CS_AUTOTUNE is likely to improve performance unless you are deliberately using a fine-tuned value.e.g. dataset = dataset.interleave(map_func, cycle_length=cerebras.tf.tools.analyze_input_fn.CS_AUTOTUNE)
WARNING:root:Tensorflow recommends that most dataset input pipelines end with a call to prefetch, but RepeatDataset used in input_fn after prefetch(). Unless this is a careful design choice, consider calling prefetch last
WARNING:root:Tensorflow recommends that most dataset input pipelines end with a call to prefetch, but BatchDatasetV2 used in input_fn after prefetch(). Unless this is a careful design choice, consider calling prefetch last
WARNING:root:Map is called prior to Batch. Consider reverting the order and performing the map function in a batched fashion to increase the performance of the input function
WARNING:root:[input_fn] - flat_map(): use map() instead of flat_map() to improve performance and parallelize reads. If you are not calling flat_map directly, check if you are using: from_generator, TextLineDataset, TFRecordDataset, or FixedLenthRecordDataset. If so, set num_parallel_reads to > 1 or cerebras.tf.tools.analyze_input_fn.CS_AUTOTUNE, and map() will be used automatically.

Analyzer recommendations#

The analyzer output contains detailed recommendations. This section describes these recommendations and how to interpret them.

Interleave#

  • If num_parallel_calls is set to CS_AUTOTUNE, with a statement such as num_parallel_calls = CS_AUTOTUNE, then the compiler automatically sets num_parallel_calls to the number of threads available to the worker or job running the input pipeline.

  • If num_parallel_calls is not set to CS_AUTOTUNE then:

    • If you have not set num_parallel_calls at all, the analyzer recommends that you set it to CS_AUTOTUNE.

    • If you have set num_parallel_calls to a value other than CS_AUTOTUNE, then the analyzer displays the following message:

      “num_parallel_calls is not being set to CS_AUTOTUNE. Currently, it is being set to {value}. Using CS_AUTOTUNE is likely to improve performance unless you are deliberately using a fine-tuned value”.

  • If cycle_length is set to CS_AUTOTUNE, with a statement such as cycle_length = CS_AUTOTUNE, then the compiler automatically sets cycle_length to the number of threads available to the worker or job running the input pipeline (see the sketch after this list).

  • If cycle_length is not set to CS_AUTOTUNE then:

    • If you have not set cycle_length at all, the analyzer recommends that you set it to CS_AUTOTUNE.

    • If you have set cycle_length to a value other than CS_AUTOTUNE, then the analyzer displays the following message:

      “cycle_length is not being set to CS_AUTOTUNE. Currently, it is being set to {value}. Using CS_AUTOTUNE is likely to improve performance unless you are deliberately using a fine-tuned value”.
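
For example, a minimal sketch applying both recommendations (the file pattern is hypothetical, and CS_AUTOTUNE is imported from the module referenced in the analyzer messages above):

import tensorflow as tf
from cerebras.tf.tools.analyze_input_fn import CS_AUTOTUNE

# Hypothetical file pattern; substitute your own input files.
dataset = tf.data.Dataset.list_files("/path/to/data/*.tfrecord")

# CS_AUTOTUNE lets the compiler choose both values based on the
# threads available to the input workers.
dataset = dataset.interleave(
    tf.data.TFRecordDataset,
    cycle_length=CS_AUTOTUNE,
    num_parallel_calls=CS_AUTOTUNE,
)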

Map#

  • Checks if the method flat_map() is used. If it is, recommends using map() instead of flat_map() to improve performance and parallelize reads. If you are not calling flat_map() directly, check whether you are using from_generator, TextLineDataset, TFRecordDataset, or FixedLengthRecordDataset. If you are using any of these, set num_parallel_reads to a value greater than 1 or to CS_AUTOTUNE, and map() will be used automatically.

  • Recommends setting num_parallel_calls to CS_AUTOTUNE as in:

    dataset = dataset.map(extract_fn, num_parallel_calls=CS_AUTOTUNE)

  • Checks if map() is called before batch(). If so, suggests reversing the order and performing the map function in a batched fashion, which is more performant (see the sketch below).
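
For example, a minimal sketch of the batched map pattern (extract_fn and batch_size are placeholders; extract_fn must be written to operate on whole batches rather than single elements):

# Batch first, then map: extract_fn processes a whole batch of
# records per call instead of one element at a time.
dataset = dataset.batch(batch_size, drop_remainder=True)
dataset = dataset.map(extract_fn, num_parallel_calls=CS_AUTOTUNE)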

ListFiles#

  • Ensure a deterministic seed is used, especially if you are sharding afterwards (see the sketch below).

  • If shuffle=False (note that this is not the default), states that not shuffling here may require a very large shuffle buffer later in the pipeline (a minor point).
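
A minimal sketch (the file pattern and seed value are hypothetical):

# A fixed, deterministic seed keeps the file order identical across
# workers, which matters when the dataset is sharded afterwards.
dataset = tf.data.Dataset.list_files(
    "/path/to/data/*.tfrecord", shuffle=True, seed=42
)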

Prefetch#

  • Recommends that the input pipeline end with a call to prefetch(). As the warnings in the example output above show, if a transformation such as shuffle(), repeat(), or batch() appears after prefetch(), the analyzer suggests calling prefetch() last unless the current ordering is a careful design choice.
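
A one-line sketch (using TensorFlow's tf.data.experimental.AUTOTUNE for the buffer size is an assumption here; the analyzer warnings above concern only the ordering):

# Keep prefetch() as the final transformation in the pipeline.
dataset = dataset.prefetch(tf.data.experimental.AUTOTUNE)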

Shard#

  • If the data is being pulled from multiple files, suggests that you shard the data among the workers.

  • Sharding is generally best done early in the pipeline (see also Summary of best practices). It applies only when reading from multiple files, and sharding later in the pipeline has diminishing returns.

  • Follow this: num_shards = min(num_files, num_workers)

  • Ensure sharding is done prior to any randomized ops (such as shuffle(), or map() with deterministic set to False). A sketch follows this list.
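
A minimal sketch (num_files, num_workers, and worker_index are placeholders for values supplied by your cluster setup):

# Shard early, before any randomized ops, so that each worker
# reads a disjoint subset of the input files.
num_shards = min(num_files, num_workers)
dataset = dataset.shard(num_shards, worker_index)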

Shuffle#

  • Ensure this is done prior to batch.

  • Recommends setting the shuffle buffer_size = CS_AUTOTUNE.

  • Checks if the seed parameter in the dataset.shuffle() is set to None.

    • If not set to None, recommends setting the seed to None, as in: dataset = dataset.shuffle(buffer_size, seed=None). This is recommended so that each worker does not send the same batch of data, which could cause overfitting. The analyzer also displays the current value of seed. (See the sketch after this list.)
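
A minimal sketch combining these recommendations (batch_size is a placeholder):

# Shuffle before batch(); seed=None so that each worker draws a
# different sample order and no two workers send identical batches.
dataset = dataset.shuffle(buffer_size=CS_AUTOTUNE, seed=None)
dataset = dataset.batch(batch_size, drop_remainder=True)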

Batch#

  • Checks if the input_fn uses the function batch(). If the batch() function is not found, then displays the following message:

    “Input function is not using batch(). Setting this is required for static graph compilation. Make use of the batch(), as, for example: dataset = dataset.batch(batch_size, drop_remainder=True)”.

  • Checks if drop_remainder is set to True. If not set to True, then recommends the following:

    Setting drop_remainder to True is required to provide a static graph to the CS system. Set it, for example, as: dataset = dataset.batch(batch_size, drop_remainder=True).

  • Calling shuffle(), then repeat(), then batch() is slightly more performant than shuffle(), batch(), repeat(), although a batch could then straddle epoch boundaries (see the sketch below).
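
A minimal sketch of the more performant ordering (batch_size is a placeholder):

# shuffle -> repeat -> batch: slightly faster, but a batch may then
# mix samples from the end of one epoch and the start of the next.
dataset = dataset.shuffle(buffer_size=CS_AUTOTUNE, seed=None)
dataset = dataset.repeat()
dataset = dataset.batch(batch_size, drop_remainder=True)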

Repeat#

  • Checks if the input_fn uses the function repeat(). If the repeat() function is not found, then displays the following message:

    “Input function is not using repeat(). This is required during CS system training for the many parallel input workers to send enough samples to the system. Make use of the repeat(), as, for example: dataset = dataset.repeat(). Note: Make sure this is specified for training only (not validation)”.

  • Checks if the count parameter of dataset.repeat() is set to its default value of None. If not, recommends setting count to None, which is required for the CS system to correctly stream data. The analyzer also displays the current value of count. (See the sketch below.)
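
A minimal sketch (is_training is a hypothetical flag; the key point is that repeat() is applied for training only, with count left at its default of None):

# Repeat indefinitely so the parallel input workers can keep
# streaming samples to the CS system during training.
if is_training:
    dataset = dataset.repeat()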

FixedLengthRecordDataset#

For tf.data.FixedLengthRecordDataset:

  • Recommends setting num_parallel_reads to CS_AUTOTUNE.

  • Recommends setting buffer_size to CS_AUTOTUNE.

TFRecordDataset#

For tf.data.TFRecordDataset:

  • Recommends setting num_parallel_reads to CS_AUTOTUNE.

  • Recommends setting buffer_size to CS_AUTOTUNE.

TextLineDataset#

For tf.data.TextLineDataset:

  • Recommends setting num_parallel_reads to CS_AUTOTUNE.

  • Recommends setting buffer_size to CS_AUTOTUNE (see the sketch after this list).
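
A minimal sketch for TFRecordDataset (filenames is a placeholder; the same two keyword arguments also apply to FixedLengthRecordDataset and TextLineDataset):

dataset = tf.data.TFRecordDataset(
    filenames,
    buffer_size=CS_AUTOTUNE,
    num_parallel_reads=CS_AUTOTUNE,
)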

Input function requirements#

The input function must not only adhere to the above analyzer recommendations but also satisfy the following requirements:

  • The input function must return a tf.data.Dataset object.

  • The returned Dataset object must consist of features, which must be a tensor, and labels, which can be a tensor or None.

  • The input function should accept only one dictionary input, params, which will be passed through to the Estimator constructor (a sketch of a conforming input function follows below).

See also Limitations of the CerebrasEstimator.
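
A minimal sketch of an input function meeting these requirements (the params keys and parse_fn are hypothetical):

import tensorflow as tf
from cerebras.tf.tools.analyze_input_fn import CS_AUTOTUNE

def input_fn(params):
    # params is the single dictionary argument.
    dataset = tf.data.TFRecordDataset(
        params["filenames"], num_parallel_reads=CS_AUTOTUNE
    )
    # parse_fn must return (features, labels): features a tensor,
    # labels a tensor or None.
    dataset = dataset.map(parse_fn, num_parallel_calls=CS_AUTOTUNE)
    dataset = dataset.shuffle(buffer_size=CS_AUTOTUNE, seed=None)
    dataset = dataset.repeat()  # training only
    dataset = dataset.batch(params["batch_size"], drop_remainder=True)
    return dataset.prefetch(1)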

Manually using the analyzer#

To manually use the analyze_input_fn_compile tool in your Python code, follow these steps:

  1. Import and call the tool:

from cerebras.tf.tools.analyze_input_fn_compile import analyze_input_fn_compile

...

dataset = input_fn(params)
analyze_input_fn_compile(dataset)

  2. Make sure you are either running this code within the Singularity container or, if running through Slurm, with only one worker on one worker node.

Signature#

analyze_input_fn_compile(dataset, hard_check=False)

Example#

from cerebras.tf.tools.analyze_input_fn import analyze_input_fn_compile

dataset = input_fn(params)
analyze_input_fn_compile(dataset)

where:

  • input_fn is your input function.

  • params is a Python dictionary of parameters.

  • dataset is a Dataset object returned by the input_fn(params).

Parameters#

dataset

Input. A Dataset object returned by the input_fn(params).

hard_check

Input. Boolean. Default value: False. If set to False, any error is logged and execution does not stop; if set to True, errors halt execution instead. When the analyzer is used in the CS system training workflow, this is set to False.

Note

You can use this analyzer as a standalone tool or include it in your CS system training workflow, prior to the training.