Submodules

module

Processors for handling TF records data format for BERT


Bases: abc.ABC

Creates dataset from pre-compiled TF records using the map_fn provided in the child class.

All files in the data_dir(s) matching ‘.tfrecord’ will be read.

  • data_dir – a path or list of paths for data to be gathered from

  • batch_size (int) – batch size for the dataset

  • shuffle (bool) – whether the data should be shuffled

  • shuffle_seed (int) – seed to use for shuffling

  • shuffle_buffer (int) – buffer size for the shuffle operation

  • repeat (bool) – whether the dataset should be repeated

  • use_multiple_workers (bool) – if True, dataset will be sharded

  • n_parallel_reads (int) – for call to

  • map_before_batch (bool) – if True, mapping will happen before batching.

  • skip_steps (int) – Number of steps (batches) to skip from the start of the dataset after batching.
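The file-gathering behaviour described above (a single path or a list of paths, with only TF record files picked up) can be sketched roughly as follows. This is only an illustration using the standard library, not the class's actual implementation: the helper name `find_tfrecord_files` is hypothetical, and the glob pattern `*.tfrecord*` is an assumption about what "matching ‘.tfrecord’" means.

```python
import glob
import os
import tempfile


def find_tfrecord_files(data_dir, pattern="*.tfrecord*"):
    """Collect record files from one directory or a list of directories.

    Hypothetical helper: the name and the glob pattern are assumptions,
    not part of the documented API.
    """
    # Accept either a single path or a list of paths, as data_dir does.
    dirs = [data_dir] if isinstance(data_dir, str) else list(data_dir)
    files = []
    for directory in dirs:
        files.extend(glob.glob(os.path.join(directory, pattern)))
    # Sort so the file order is deterministic across runs.
    return sorted(files)


# Small demonstration on a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("a.tfrecord", "b.tfrecords", "notes.txt"):
        open(os.path.join(tmp, name), "w").close()
    found = [os.path.basename(p) for p in find_tfrecord_files(tmp)]
```

In this sketch `found` contains only the two record files; `notes.txt` is ignored because it does not match the assumed pattern.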

__init__(data_dir, batch_size, shuffle=True, shuffle_seed=None, shuffle_buffer=None, repeat=True, use_multiple_workers=False, n_parallel_reads=4, map_before_batch=False, skip_steps=0)

create_tf_dataset(is_training=True, input_context=None)

Create tf dataset.

  • is_training (bool) – Specifies whether the data is for training

  • input_context (dict) – Given by distributed strategy for training


tf dataset

abstract map_fn(raw_record)

Parses a batch of serialized examples into data formatted correctly for the model being used.

module

Functions for performing standard transformations on datasets.

(dataset, element_length_func, bucket_boundaries, batch_size, padded_shapes=None, padding_values=None, no_padding=False, drop_remainder=False)

Batch the dataset such that samples within a batch have similar values of features[key]. TensorFlow has a native implementation of bucketing starting in version 2.6. This function is intended to provide a subset of the functionality of the TensorFlow version until tf >= 2.6 is supported in the toolchain. See the TensorFlow documentation for a detailed interface description.

  • dataset – an unbatched dataset.

  • element_length_func – a function that takes a dataset element and returns its length.

  • bucket_boundaries (list) – a list of boundaries between buckets. Expected to have length one less than the total number of buckets.

  • batch_size (int) – the batch size for the resulting data. Note that this differs from the TensorFlow interface: a static batch size is required, so the batch size cannot vary by bucket.

  • padded_shapes – Possibly nested structure of tuples and dicts describing the desired padding for each dataset element. Only required if no_padding = False.

  • padding_values – A possibly nested structure of tuples and dicts describing the desired padding values for each dataset element. Only required if no_padding = False.

  • no_padding (bool) – Whether or not to pad samples before batching them.

  • drop_remainder (bool) – Whether or not to drop the final incomplete batch if the number of dataset elements is not a multiple of the batch size.


A dataset of bucketed batched data.
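The bucketing idea can be illustrated without TensorFlow. The sketch below is not the function documented above: `bucketed_batches` is a hypothetical name, padding is left out entirely, and it assumes the TensorFlow convention that a length equal to a boundary falls into the upper bucket.

```python
from bisect import bisect_right
from collections import defaultdict


def bucketed_batches(elements, element_length_func, bucket_boundaries,
                     batch_size, drop_remainder=False):
    """Yield fixed-size batches whose elements have similar lengths.

    Hypothetical, padding-free sketch of bucketing by length.
    """
    buckets = defaultdict(list)
    for element in elements:
        # N boundaries define N + 1 buckets; a length equal to a
        # boundary is assumed to belong to the upper bucket.
        bucket_id = bisect_right(bucket_boundaries, element_length_func(element))
        buckets[bucket_id].append(element)
        if len(buckets[bucket_id]) == batch_size:
            # The bucket is full: emit it as one batch and start over.
            yield buckets.pop(bucket_id)
    if not drop_remainder:
        # Emit the leftover, incomplete batches in bucket order.
        for bucket_id in sorted(buckets):
            if buckets[bucket_id]:
                yield buckets[bucket_id]


sequences = [[1], [1, 2, 3, 4], [2, 3], [5, 6, 7], [8], [9, 9, 9, 9, 9]]
batches = list(bucketed_batches(sequences, len, [3], batch_size=2))
full_only = list(bucketed_batches(sequences, len, [3], batch_size=2,
                                  drop_remainder=True))
```

With a single boundary at 3 there are two buckets, which matches the requirement that bucket_boundaries has one entry fewer than the number of buckets. Short sequences are batched together, as are long ones, and `drop_remainder=True` discards the two incomplete batches.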

Returns a bytes_list from a string / byte.

Returns a float_list from a float / double.

Returns an int64_list from a bool / enum / int / uint.

(feature_shape: typing.Union[typing.Dict[str, typing.Union[typing.List[int], typing.Tuple[int, ...]]], typing.List[int], typing.Tuple[int, ...]], label_shape: typing.Union[typing.List[int], typing.Tuple[int, ...]], batch_size: int, total_samples: int, feature_high: typing.Union[float, typing.Dict[str, float]] = 1, feature_low: typing.Union[float, typing.Dict[str, float]] = 0, label_high: float = 1, label_low: float = 0, random_seed: int = 1202, feature_type: typing.Union[numpy.dtype, typing.Dict[str, numpy.dtype]] = numpy.float32, label_type: numpy.dtype = numpy.float32, feature_function=<function <lambda>>) → Tuple[Callable[[Dict[str, Any]],], Dict[str, Any]]

Creates a callable for generating a TF dataset from random numpy data.

  • feature_shape – If tuple or list, then shape of a feature. If dictionary, then feature will be a dictionary with the same keys. Values in this input param dictionary must be tuples or list, and determine the size of the corresponding value in the feature dict.

  • label_shape – Shape of a label as tuple or list.

  • batch_size – Batch size of data.

  • total_samples – Total number of samples of data.

  • feature_high – Upper bound value for features. If feature_shape is a dict, then it can be a dict with the same keys.

  • feature_low – Lower bound value for features. If feature_shape is a dict, then it can be a dict with the same keys.

  • label_high – Upper bound value for labels.

  • label_low – Lower bound value for labels.

  • random_seed – RNG seed to be used.

  • feature_type – Data type of features. Supported types are np.float16, np.float32, np.int32. If feature_shape is a dict, then it can be a dict with the same keys.

  • label_type – Data type of labels. Supported types are np.float16, np.float32, np.int32.

  • feature_function – Function to apply on top of generated features.
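The way several parameters above accept either a single value or a dict keyed like feature_shape can be sketched in plain Python. This is a simplified illustration, not the documented function: `random_features` is a hypothetical name, it produces nested lists rather than numpy arrays, and it ignores labels, batching, and data types.

```python
import random


def random_features(feature_shape, total_samples, feature_low=0.0,
                    feature_high=1.0, random_seed=1202):
    """Generate random features as nested lists (hypothetical sketch)."""
    rng = random.Random(random_seed)

    def per_key(setting, key):
        # A dict setting is looked up by key; a scalar applies to every key.
        return setting[key] if isinstance(setting, dict) else setting

    def sample(shape, low, high):
        # Recursively build a nested list with the requested shape.
        if not shape:
            return rng.uniform(low, high)
        return [sample(tuple(shape[1:]), low, high) for _ in range(shape[0])]

    if isinstance(feature_shape, dict):
        # Dict shape: each sample is a dict with the same keys.
        return [
            {
                key: sample(tuple(shape),
                            per_key(feature_low, key),
                            per_key(feature_high, key))
                for key, shape in feature_shape.items()
            }
            for _ in range(total_samples)
        ]
    # Tuple or list shape: each sample is a single nested list.
    return [sample(tuple(feature_shape), feature_low, feature_high)
            for _ in range(total_samples)]


data = random_features({"ids": (2,), "mask": [2, 3]}, total_samples=4,
                       feature_high={"ids": 10.0, "mask": 1.0})
again = random_features({"ids": (2,), "mask": [2, 3]}, total_samples=4,
                        feature_high={"ids": 10.0, "mask": 1.0})
```

Here the upper bound is given per key while the lower bound stays a scalar shared by both keys, and using the same seed reproduces the same data.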


An input_fn callable and params (dict) to be passed into the input_fn.

(dataset, map_fn, batch_size, is_training, shuffle, post_batch_map_fn=None, shuffle_buffer=None, repeat=True, seed=None, map_before_batch=False, batch_fn=None, num_parallel_calls=…, skip_steps=0)

Apply standard transformations to a dataset:
  • shuffle -> batch -> map -> repeat if map_before_batch is False

  • shuffle -> map -> batch -> repeat if map_before_batch is True

Batching before mapping is generally faster and the preferred method due to vectorization of map fn.

Note: Mapping before batching may be required if parsing TF records that contain FixedLenSequenceFeature examples (rather than FixedLenFeature)

  • dataset – Dataset to apply transformations to

  • map_fn (func) – Mapping function to be applied to the data, after batching unless map_before_batch is True

  • batch_size (int) – Batch size for model training

  • shuffle (bool) – If True, then shuffle the dataset

  • shuffle_buffer (int) – Size of shuffle buffer to sample data from

  • repeat (bool) – If True, repeat the dataset

  • seed (int) – Seed to use for shuffle randomizer or None

  • map_before_batch (bool) – if True, mapping will happen before batching.

  • num_parallel_calls (tf.Tensor) – Number of batches to compute asynchronously in parallel. By default, the number of parallel calls is set dynamically based on available resources.

  • skip_steps (int) – Number of steps (batches) to skip from the start of the dataset after batching.


tf dataset
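The two orderings described above can be illustrated on a plain Python list. This is a framework-free sketch, not the documented function: `transform` is a hypothetical name, repeat is omitted because the input is finite, and dropping the final incomplete batch is an assumption made to keep the batch size static.

```python
import random


def transform(samples, map_fn, batch_size, shuffle=True, seed=None,
              map_before_batch=False, skip_steps=0):
    """Apply shuffle, batch, and map in the configured order (sketch)."""
    samples = list(samples)
    if shuffle:
        random.Random(seed).shuffle(samples)
    if map_before_batch:
        # map_fn receives one sample at a time.
        samples = [map_fn(sample) for sample in samples]
    # Keep full batches only (assumption: static batch size).
    batches = [samples[i:i + batch_size]
               for i in range(0, len(samples) - batch_size + 1, batch_size)]
    if not map_before_batch:
        # map_fn receives a whole batch, which is what allows vectorization.
        batches = [map_fn(batch) for batch in batches]
    # Skipping happens after batching, so it counts batches, not samples.
    return batches[skip_steps:]


per_batch = transform(range(5), lambda batch: [x * 2 for x in batch],
                      batch_size=2, shuffle=False)
per_sample = transform(range(5), lambda x: x * 2,
                       batch_size=2, shuffle=False, map_before_batch=True)
skipped = transform(range(5), lambda batch: [x * 2 for x in batch],
                    batch_size=2, shuffle=False, skip_steps=1)
```

Both orderings give the same result here; the difference is that the mapping function is written for a whole batch in one case and for a single sample in the other.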

Module contents