.. _input-function-report:

Input Function Report
=====================

When you compile your TensorFlow model for training on the Cerebras system, the compiler automatically analyzes your input function and generates a detailed log. This log identifies any missing functions and recommends parameter values that can improve training performance on the CS system. This section describes how to interpret that log.

.. important::

    The analyzer runs automatically even when you only precompile your network, with either the ``validate_only`` or the ``compile_only`` option. You do not need to run your network on the CS system to run the analyzer. The analyzer displays its recommendations in the output log of the compiler run, usually on stdout.

Example analyzer output on stdout
---------------------------------

The following shows an example of the analyzer output, displayed as a part of the compiler output log on the terminal (or stdout).

.. hint::

    In the compiler log output on stdout, each analyzer output statement begins with the text string ``[input_fn]``.

.. code-block:: text

    INFO:tensorflow:Running analyze_input_fn_compile
    WARNING:root:[input_fn] - interleave(): in ParallelInterleaveDatasetV3, cycle_length is not being set to CS_AUTOTUNE. Currently, it is set to 64. If determinism is not required, Using CS_AUTOTUNE is likely to improve performance unless you are deliberately using a fine-tuned value.e.g. dataset = dataset.interleave(map_func, cycle_length=cerebras.tf.tools.analyze_input_fn.CS_AUTOTUNE)
    WARNING:root:Tensorflow recommends that most dataset input pipelines end with a call to prefetch, but ShuffleDataset used in input_fn after prefetch(). Unless this is a careful design choice, consider calling prefetch last
    WARNING:root:[input_fn] - interleave(): in ParallelInterleaveDatasetV3_1, cycle_length is not being set to CS_AUTOTUNE. Currently, it is set to 64. If determinism is not required, Using CS_AUTOTUNE is likely to improve performance unless you are deliberately using a fine-tuned value.e.g. dataset = dataset.interleave(map_func, cycle_length=cerebras.tf.tools.analyze_input_fn.CS_AUTOTUNE)
    WARNING:root:Tensorflow recommends that most dataset input pipelines end with a call to prefetch, but RepeatDataset used in input_fn after prefetch(). Unless this is a careful design choice, consider calling prefetch last
    WARNING:root:Tensorflow recommends that most dataset input pipelines end with a call to prefetch, but BatchDatasetV2 used in input_fn after prefetch(). Unless this is a careful design choice, consider calling prefetch last
    WARNING:root:Map is called prior to Batch. Consider reverting the order and performing the map function in a batched fashion to increase the performance of the input function
    WARNING:root:[input_fn] - flat_map(): use map() instead of flat_map() to improve performance and parallelize reads. If you are not calling flat_map directly, check if you are using: from_generator, TextLineDataset, TFRecordDataset, or FixedLenthRecordDataset. If so, set num_parallel_reads to > 1 or cerebras.tf.tools.analyze_input_fn.CS_AUTOTUNE, and map() will be used automatically.
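For instance, the ``interleave()`` and prefetch-ordering warnings above could be resolved as in the following minimal sketch. The file pattern and batch size are hypothetical placeholders; ``CS_AUTOTUNE`` is imported from the module named in the log messages.

.. code-block:: Python

    import tensorflow as tf
    from cerebras.tf.tools.analyze_input_fn import CS_AUTOTUNE

    # Hypothetical file pattern and batch size, for illustration only.
    dataset = tf.data.Dataset.list_files("train-*.tfrecord", shuffle=True, seed=42)
    dataset = dataset.interleave(
        tf.data.TFRecordDataset,
        cycle_length=CS_AUTOTUNE,          # instead of a hard-coded 64
        num_parallel_calls=CS_AUTOTUNE,
    )
    dataset = dataset.shuffle(buffer_size=CS_AUTOTUNE, seed=None)
    dataset = dataset.repeat()
    dataset = dataset.batch(16, drop_remainder=True)
    dataset = dataset.prefetch(buffer_size=CS_AUTOTUNE)   # prefetch() as the last op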
Analyzer recommendations
------------------------

The analyzer output contains detailed recommendations. This section describes these recommendations and how to interpret them. See also:

- :ref:`cerebras-autotune`.
- :ref:`limitations-of-cerebras-estimator` for a contextual description of some of these recommendations.

Interleave
~~~~~~~~~~

- If ``num_parallel_calls`` is set to ``CS_AUTOTUNE`` with the statement ``num_parallel_calls = CS_AUTOTUNE``, then the compiler automatically sets ``num_parallel_calls`` to the number of threads available to the worker or job running the input pipeline.
- If ``num_parallel_calls`` is not set to ``CS_AUTOTUNE``, then:

  - If you have not set ``num_parallel_calls`` at all, the analyzer recommends that you set it to ``CS_AUTOTUNE``.
  - If you have set ``num_parallel_calls`` to a value other than ``CS_AUTOTUNE``, the analyzer displays the following message: "``num_parallel_calls`` is not being set to ``CS_AUTOTUNE``. Currently, it is being set to {value}. Using ``CS_AUTOTUNE`` is likely to improve performance unless you are deliberately using a fine-tuned value".

- If ``cycle_length`` is set to ``CS_AUTOTUNE`` with the statement ``cycle_length = CS_AUTOTUNE``, then the compiler automatically sets ``cycle_length`` to the number of threads available to the worker or job running the input pipeline.
- If ``cycle_length`` is not set to ``CS_AUTOTUNE``, then:

  - If you have not set ``cycle_length`` at all, the analyzer recommends that you set it to ``CS_AUTOTUNE``.
  - If you have set ``cycle_length`` to a value other than ``CS_AUTOTUNE``, the analyzer displays the following message: "``cycle_length`` is not being set to ``CS_AUTOTUNE``. Currently, it is being set to {value}. Using ``CS_AUTOTUNE`` is likely to improve performance unless you are deliberately using a fine-tuned value".

Map
~~~

- Checks if the method ``flat_map()`` is used. If it is, recommends using ``map()`` instead of ``flat_map()`` to improve performance and parallelize reads. If you are not calling ``flat_map()`` directly, check if you are using ``from_generator``, ``TextLineDataset``, ``TFRecordDataset``, or ``FixedLengthRecordDataset``. If you are using any of these, set ``num_parallel_reads`` to > 1 or to ``CS_AUTOTUNE``, and ``map()`` will be used automatically.
- Recommends setting ``num_parallel_calls`` to ``CS_AUTOTUNE``, as in: ``dataset = dataset.map(extract_fn, num_parallel_calls=CS_AUTOTUNE)``.
- Checks if ``map()`` is called before ``batch()``. If it is, suggests vectorizing the op and calling it after ``batch()``, which is more performant.

ListFiles
~~~~~~~~~

- Ensure a deterministic seed is used, especially if you are sharding afterwards.
- If ``shuffle=False`` (note that this is not the default), states that not shuffling here may require a very large shuffle buffer later in the pipeline (a minor consideration).

Prefetch
~~~~~~~~

- Checks if the ``input_fn`` uses the function ``prefetch()``. If the ``prefetch()`` function is not found, displays the following message: "Input function is not using ``prefetch()``. Using this is likely to improve the performance. Make use of the ``prefetch()``, as, for example: ``dataset = dataset.prefetch(buffer_size=CS_AUTOTUNE)``".
- Make sure that you are prefetching. If you are not, the analyzer suggests that you do so.
- Make sure ``prefetch()`` is the last op called. This follows TensorFlow's guidance that the best practice for input pipeline performance is to insert a ``tf.data.Dataset.prefetch`` transformation at the end of your ``tf.data`` pipeline.

Shard
~~~~~

- If the data is being pulled from multiple files, suggests that you shard the data amongst the number of workers. This applies only when multiple files are used.
- Sharding is generally best done early in the pipeline; doing it later has diminishing returns (see also :ref:`multi-worker-input-best-practices`).
- Follow this rule: ``num_shards = min(num_files, num_workers)``.
- Ensure sharding is done prior to any randomized ops (such as ``shuffle``, or ``map`` with ``deterministic=False``), as in the sketch below.
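The following is a minimal sketch of sharding done early, under the rule above. The file pattern, worker count, and shard index are hypothetical placeholders that would normally come from your worker configuration.

.. code-block:: Python

    import tensorflow as tf

    # Hypothetical values; in practice these come from your worker configuration.
    file_pattern = "train-*.tfrecord"
    num_workers, shard_index = 4, 0

    # A deterministic seed keeps the file order identical across workers
    # (see the ListFiles recommendation above).
    dataset = tf.data.Dataset.list_files(file_pattern, shuffle=True, seed=42)

    num_files = len(tf.io.gfile.glob(file_pattern))
    num_shards = min(num_files, num_workers)

    # Shard early, before any randomized ops such as shuffle().
    dataset = dataset.shard(num_shards, shard_index)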
Shuffle
~~~~~~~

- Ensure shuffling is done prior to ``batch()``.
- Recommends setting the shuffle ``buffer_size`` to ``CS_AUTOTUNE``.
- Checks if the ``seed`` parameter in ``dataset.shuffle()`` is set to ``None``. If it is not, recommends setting ``seed`` to ``None``, as in: ``dataset = dataset.shuffle(buffer_size, seed=None)``. This is recommended so that the workers do not all send the same batches of data, which can cause overfitting. Also displays the current value of ``seed``.

Batch
~~~~~

- Checks if the ``input_fn`` uses the function ``batch()``. If the ``batch()`` function is not found, displays the following message: "Input function is not using ``batch()``. Setting this is required for static graph compilation. Make use of the ``batch()``, as, for example: ``dataset = dataset.batch(batch_size, drop_remainder=True)``".
- Checks if ``drop_remainder`` is set to ``True``. If it is not, recommends the following: setting it to ``True`` is required to provide a static graph to the CS system. Set it, for example, as: ``dataset = dataset.batch(batch_size, drop_remainder=True)``.
- ``shuffle() → repeat() → batch()`` is slightly more performant than ``shuffle() → batch() → repeat()``, although a batch could straddle epoch boundaries.

Repeat
~~~~~~

- Checks if the ``input_fn`` uses the function ``repeat()``. If the ``repeat()`` function is not found, displays the following message: "Input function is not using ``repeat()``. This is required during CS system training for the many parallel input workers to send enough samples to the system. Make use of the ``repeat()``, as, for example: ``dataset = dataset.repeat()``. Note: Make sure this is specified for training only (not validation)".
- Checks if the ``count`` parameter for ``dataset.repeat()`` is set to its default value of ``None``. If it is not, recommends that ``count`` must be ``None`` (the default value) for the CS system to correctly stream data. Also displays the current value of ``count``.

FixedLengthRecordDataset
~~~~~~~~~~~~~~~~~~~~~~~~

For ``tf.data.FixedLengthRecordDataset``:

- Recommends setting ``num_parallel_reads`` to ``CS_AUTOTUNE``.
- Recommends setting ``buffer_size`` to ``CS_AUTOTUNE``.

TFRecordDataset
~~~~~~~~~~~~~~~

For ``tf.data.TFRecordDataset``:

- Recommends setting ``num_parallel_reads`` to ``CS_AUTOTUNE``.
- Recommends setting ``buffer_size`` to ``CS_AUTOTUNE``.

TextLineDataset
~~~~~~~~~~~~~~~

For ``tf.data.TextLineDataset``:

- Recommends setting ``num_parallel_reads`` to ``CS_AUTOTUNE``.
- Recommends setting ``buffer_size`` to ``CS_AUTOTUNE``.

Input function requirements
---------------------------

In addition to following the above analyzer recommendations, the input function must satisfy the following requirements:

- The input function must return a ``tf.data.Dataset`` object.
- The returned ``Dataset`` object must consist of features, which must be a tensor, and labels, which can be a tensor or ``None``.
- The input function should accept only one dictionary input for ``params``, which will be passed through to the Estimator constructor.

See also :ref:`limitations-of-cerebras-estimator`. A minimal input function satisfying these requirements is sketched below.
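In the following sketch, the file list, feature specification, and ``parse_batch`` helper are hypothetical placeholders; only the overall shape, one ``params`` dictionary in and one ``tf.data.Dataset`` of ``(features, labels)`` out, is prescribed by this section.

.. code-block:: Python

    import tensorflow as tf
    from cerebras.tf.tools.analyze_input_fn import CS_AUTOTUNE

    def parse_batch(records):
        """Hypothetical vectorized parser: returns (features, labels)."""
        spec = {
            "features": tf.io.FixedLenFeature([128], tf.float32),
            "label": tf.io.FixedLenFeature([], tf.int64),
        }
        example = tf.io.parse_example(records, spec)
        return example["features"], example["label"]

    def input_fn(params):
        # `params` is the single dictionary passed through the Estimator.
        dataset = tf.data.TFRecordDataset(
            params["train_files"],
            num_parallel_reads=CS_AUTOTUNE,   # parallel reads instead of flat_map()
            buffer_size=CS_AUTOTUNE,
        )
        dataset = dataset.shuffle(buffer_size=CS_AUTOTUNE, seed=None)
        dataset = dataset.repeat()            # count=None; training only
        dataset = dataset.batch(params["batch_size"], drop_remainder=True)
        dataset = dataset.map(parse_batch, num_parallel_calls=CS_AUTOTUNE)
        return dataset.prefetch(buffer_size=CS_AUTOTUNE)   # prefetch last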
Manually using the analyzer
---------------------------

To manually use the ``analyze_input_fn_compile`` tool in your Python code, follow these steps:

1. Import and call the tool:

   .. code-block:: Python

       from cerebras.tf.tools.analyze_input_fn_compile import analyze_input_fn_compile

       ...

       dataset = input_fn(params)
       analyze_input_fn_compile(dataset)

2. Make sure you run this code either within the Singularity container or, if running through Slurm, with only one worker on one worker node.

Signature
~~~~~~~~~

``analyze_input_fn_compile(dataset, hard_check=False)``

Example
~~~~~~~

.. code-block:: Python

    from cerebras.tf.tools.analyze_input_fn_compile import analyze_input_fn_compile

    dataset = input_fn(params)
    analyze_input_fn_compile(dataset)

where:

- ``input_fn`` is your input function.
- ``params`` is a Python dictionary of parameters.
- ``dataset`` is the Dataset object returned by ``input_fn(params)``.

Parameters
~~~~~~~~~~

dataset
    *Input*. A Dataset object returned by ``input_fn(params)``.

hard_check
    *Input*. Boolean. Default value: ``False``. If set to ``False``, any error will be logged and execution will not stop. When the analyzer is used in the CS system training workflow, this is set to ``False``.

.. note::

    You can use this analyzer as a standalone tool or include it in your CS system training workflow, prior to the training.
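For example, a standalone check might look like the following sketch. The module path and ``params`` values are hypothetical, and it assumes, based on the ``hard_check`` description above, that ``hard_check=True`` stops execution at the first failed check rather than merely logging it.

.. code-block:: Python

    from cerebras.tf.tools.analyze_input_fn_compile import analyze_input_fn_compile

    from my_model.data import input_fn   # hypothetical module providing input_fn

    # Hypothetical parameter dictionary.
    params = {"train_files": "train-*.tfrecord", "batch_size": 16}

    dataset = input_fn(params)
    # With hard_check=True, errors presumably halt execution (the inverse of
    # the documented hard_check=False behavior).
    analyze_input_fn_compile(dataset, hard_check=True)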