Submodules
module
Processors for handling TF records data format for BERT
- class
Creates dataset from pre-compiled TF records using the map_fn provided in the child class.
All files in the data_dir(s) matching ‘.tfrecord’ will be read.
- Parameters
data_dir – a path or list of paths for data to be gathered from
batch_size (int) – batch size for the dataset
shuffle (bool) – whether the data should be shuffled
shuffle_seed (int) – seed to use for shuffling
shuffle_buffer (int) – buffer size for call to
repeat (bool) – whether the dataset should be repeated
use_multiple_workers (bool) – if True, dataset will be sharded
n_parallel_reads (int) – for call to
map_before_batch (bool) – if True, mapping will happen before batching.
skip_steps (int) – Number of steps (batches) to skip at the start of the dataset; the skip is applied after batching.
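The child-class pattern described above can be sketched in plain Python. This is only an illustration of the contract, not the library's code: the class names `RecordProcessorSketch` and `JsonProcessor`, the JSON stand-in for serialized TF records, and the `input_ids` / `label` keys are all hypothetical. The sketch assumes the default order, where records are batched first and `map_fn` then receives a whole batch.

```python
import abc
import json


class RecordProcessorSketch(abc.ABC):
    """Hypothetical stand-in for the base class (not the library's code)."""

    def __init__(self, records, batch_size):
        self.records = list(records)
        self.batch_size = batch_size

    @abc.abstractmethod
    def map_fn(self, raw_records):
        """Parse a batch of serialized records into model inputs."""

    def create_dataset(self):
        # Batch first, then let the child class parse each batch.
        batches = [
            self.records[i:i + self.batch_size]
            for i in range(0, len(self.records), self.batch_size)
        ]
        return [self.map_fn(batch) for batch in batches]


class JsonProcessor(RecordProcessorSketch):
    """Child class: the only thing it must supply is map_fn."""

    def map_fn(self, raw_records):
        parsed = [json.loads(record) for record in raw_records]
        return {
            "input_ids": [p["input_ids"] for p in parsed],
            "labels": [p["label"] for p in parsed],
        }


# JSON strings stand in for serialized examples.
serialized = [
    json.dumps({"input_ids": [i, i + 1], "label": i % 2}) for i in range(4)
]
dataset = JsonProcessor(serialized, batch_size=2).create_dataset()
```

Because `map_fn` is abstract, the base class itself cannot be instantiated; only a child that defines the parsing logic can.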
- __init__(data_dir, batch_size, shuffle=True, shuffle_seed=None, shuffle_buffer=None, repeat=True, use_multiple_workers=False, n_parallel_reads=4, map_before_batch=False, skip_steps=0)
- batch_fn(dataset)
- create_tf_dataset(is_training=True, input_context=None)
Create tf dataset.
- Parameters
is_training (bool) – Specifies whether the data is for training
input_context (dict) – Given by distributed strategy for training
- Returns
tf dataset
- abstract map_fn(raw_record)
Parses a batch of serialized examples into data formatted correctly for the model being used.
module
Function for performing standard transformations on datasets.
-, element_length_func, bucket_boundaries, batch_size, padded_shapes=None, padding_values=None, no_padding=False, drop_remainder=False)
Batch the dataset such that samples within a batch have similar values of features[key]. TensorFlow has a native implementation of bucketing starting in version 2.6. This function is intended to provide a subset of the functionality of the TensorFlow version until tf >= 2.6 is supported in the toolchain. See the TensorFlow documentation for a detailed interface description.
- Parameters
dataset – an unbatched dataset.
element_length_func – a function that takes a dataset element and returns its length.
bucket_boundaries (list) – a list of boundaries between buckets. Expected to have length one less than the total number of buckets.
batch_size (int) – the batch size for the resulting data. Note that this differs from the TensorFlow interface: a static batch size is required, so the batch size cannot vary by bucket.
padded_shapes – Possibly nested structure of tuples and dicts describing the desired padding for each dataset element. Only required if no_padding = False.
padding_values – A possibly nested structure of tuples and dicts describing the desired padding values for each dataset element. Only required if no_padding = False.
no_padding (bool) – Whether or not to pad samples before batching them.
drop_remainder (bool) – Whether or not to drop the final incomplete batch if the number of dataset elements is not a multiple of the batch size.
- Returns
a dataset of bucketed batched data.
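The bucketing behaviour described above can be sketched in plain Python, without TensorFlow. This is an illustration under stated assumptions, not the library's implementation: the name `bucketed_batches` and the single `pad_value` argument are hypothetical, sequences are plain lists, and each batch is padded to its own longest sequence rather than to `padded_shapes`. Bucket `i` holds elements whose length falls between boundary `i - 1` (inclusive) and boundary `i` (exclusive), and a batch is emitted whenever a bucket reaches the static batch size.

```python
import bisect


def bucketed_batches(elements, element_length_func, bucket_boundaries,
                     batch_size, pad_value=0, drop_remainder=False):
    """Group sequences of similar length into fixed-size, padded batches."""
    # There is one more bucket than there are boundaries.
    buckets = [[] for _ in range(len(bucket_boundaries) + 1)]
    for element in elements:
        # bisect_right counts the boundaries <= length: that is the bucket index.
        index = bisect.bisect_right(bucket_boundaries,
                                    element_length_func(element))
        buckets[index].append(element)
        if len(buckets[index]) == batch_size:
            yield _pad(buckets[index], pad_value)
            buckets[index] = []
    if not drop_remainder:
        # Emit whatever is left in each bucket as an incomplete batch.
        for bucket in buckets:
            if bucket:
                yield _pad(bucket, pad_value)


def _pad(batch, pad_value):
    """Pad every sequence in the batch to the longest one in that batch."""
    width = max(len(seq) for seq in batch)
    return [list(seq) + [pad_value] * (width - len(seq)) for seq in batch]


sequences = [[1], [2, 2, 2, 2], [3, 3], [4, 4, 4, 4, 4], [5, 5, 5]]
batches = list(
    bucketed_batches(sequences, len, bucket_boundaries=[3], batch_size=2)
)
full_only = list(
    bucketed_batches(sequences, len, bucket_boundaries=[3], batch_size=2,
                     drop_remainder=True)
)
```

With one boundary at 3 there are two buckets, so short sequences are never padded up to the length of the long ones; with `drop_remainder=True` the leftover single-element batch is discarded.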
Returns a bytes_list from a string / byte.
Returns a float_list from a float / double.
Returns an int64_list from a bool / enum / int / uint.
- typing.Union[typing.Dict[str, typing.Union[typing.List[int], typing.Tuple[int, ...]]], typing.List[int], typing.Tuple[int, ...]], label_shape: typing.Union[typing.List[int], typing.Tuple[int, ...]], batch_size: int, total_samples: int, feature_high: typing.Union[float, typing.Dict[str, float]] = 1, feature_low: typing.Union[float, typing.Dict[str, float]] = 0, label_high: float = 1, label_low: float = 0, random_seed: int = 1202, feature_type: typing.Union[numpy.dtype, typing.Dict[str, numpy.dtype]] = numpy.float32, label_type: numpy.dtype = numpy.float32, feature_function=<function <lambda>>) Tuple[Callable[[Dict[str, Any]],], Dict[str, Any]]
Creates a callable for generating a TF dataset from random numpy data.
- Parameters
feature_shape – If a tuple or list, the shape of a feature. If a dictionary, the feature will be a dictionary with the same keys. Values in this dictionary must be tuples or lists, and each determines the shape of the corresponding value in the feature dict.
label_shape – Shape of a label as tuple or list.
batch_size – Batch size of data.
total_samples – Total number of samples of data.
feature_high – Upper bound value for features. If feature_shape is a dict, then it can be a dict with the same keys.
feature_low – Lower bound value for features. If feature_shape is a dict, then it can be a dict with the same keys.
label_high – Upper bound value for labels.
label_low – Lower bound value for labels.
random_seed – RNG seed to be used.
feature_type – Data type of features. Supported types are np.float16, np.float32, np.int32. If feature_shape is a dict, then it can be a dict with the same keys.
label_type – Data type of labels. Supported types are np.float16, np.float32, np.int32.
feature_function – Function to apply on top of generated features.
- Returns
An input_fn callable and params (dict) to be passed into the input_fn.
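The way the scalar-or-dict parameters combine can be sketched with NumPy alone. This is a sketch under assumptions, not the documented function: the name `make_random_batches` and the feature keys `input_ids` and `input_mask` are hypothetical, the sketch yields NumPy batches directly instead of returning an input_fn and params, it omits feature_function, and it assumes integer types are drawn from the half-open range [low, high).

```python
import numpy as np


def make_random_batches(feature_shape, label_shape, batch_size, total_samples,
                        feature_high=1, feature_low=0,
                        label_high=1, label_low=0, random_seed=1202,
                        feature_type=np.float32, label_type=np.float32):
    """Yield (features, labels) batches of random data (illustrative only)."""
    rng = np.random.default_rng(random_seed)

    def sample(shape, low, high, dtype):
        size = (total_samples, *shape)
        if np.issubdtype(dtype, np.integer):
            # Assumption: integers are drawn from [low, high).
            return rng.integers(low, high, size=size).astype(dtype)
        return rng.uniform(low, high, size=size).astype(dtype)

    def pick(setting, key):
        # A dict setting is looked up per key; a scalar applies to every key.
        return setting[key] if isinstance(setting, dict) else setting

    if isinstance(feature_shape, dict):
        features = {
            key: sample(shape, pick(feature_low, key),
                        pick(feature_high, key), pick(feature_type, key))
            for key, shape in feature_shape.items()
        }
    else:
        features = sample(feature_shape, feature_low, feature_high,
                          feature_type)
    labels = sample(label_shape, label_low, label_high, label_type)

    for start in range(0, total_samples, batch_size):
        window = slice(start, start + batch_size)
        if isinstance(features, dict):
            yield {k: v[window] for k, v in features.items()}, labels[window]
        else:
            yield features[window], labels[window]


batches = list(make_random_batches(
    feature_shape={"input_ids": (4,), "input_mask": (4,)},
    label_shape=(1,),
    batch_size=2,
    total_samples=6,
    feature_high={"input_ids": 100, "input_mask": 2},  # per-key upper bounds
    feature_type=np.int32,                             # scalar: applies to all
))
```

Here feature_high is given per key while feature_type is a single value shared by both features, which mirrors the "dict with the same keys" option described in the parameter list.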
-, map_fn, batch_size, is_training, shuffle, post_batch_map_fn=None, shuffle_buffer=None, repeat=True, seed=None, map_before_batch=False, batch_fn=None,, skip_steps=0)
- Apply standard transformations to a dataset:
shuffle -> batch -> map -> repeat if map_before_batch is False
shuffle -> map -> batch -> repeat if map_before_batch is True
Batching before mapping is generally faster and is the preferred order, because map_fn is then applied to whole batches (vectorized) rather than to single elements.
Note: Mapping before batching may be required when parsing TF records that contain FixedLenSequenceFeature examples (rather than FixedLenFeature).
- Parameters
dataset – Dataset to apply transformations to
map_fn (func) – Mapping function applied to the data: after batching by default, or before batching if map_before_batch is True
batch_size (int) – Batch size for model training
shuffle (bool) – If True, then shuffle the dataset
shuffle_buffer (int) – Size of shuffle buffer to sample data from
repeat (bool) – If True, repeat the dataset
seed (int) – Seed to use for shuffle randomizer or None
map_before_batch (bool) – if True, mapping will happen before batching.
num_parallel_calls (tf.Tensor) – the number of batches to compute asynchronously in parallel. By default, the number of parallel calls is set dynamically based on available resources.
skip_steps (int) – Number of steps (batches) to skip at the start of the dataset; the skip is applied after batching.
- Returns
tf dataset
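The two orderings above can be sketched in plain Python to show what map_fn receives in each case. This is not the library's code: the name `transform` is hypothetical, lists and iterators stand in for a tf dataset, and the sketch assumes that skip_steps is applied before the dataset is repeated.

```python
import itertools
import random


def transform(elements, map_fn, batch_size, shuffle=True, seed=None,
              repeat=False, map_before_batch=False, skip_steps=0):
    """Shuffle, batch, map, skip and repeat in the documented order."""
    items = list(elements)
    if shuffle:
        random.Random(seed).shuffle(items)

    def chunk(seq):
        return [seq[i:i + batch_size] for i in range(0, len(seq), batch_size)]

    if map_before_batch:
        # shuffle -> map -> batch: map_fn sees one element at a time.
        batches = chunk([map_fn(item) for item in items])
    else:
        # shuffle -> batch -> map: map_fn sees a whole batch at a time.
        batches = [map_fn(batch) for batch in chunk(items)]

    # Assumption: skipped steps are dropped once, before repeating.
    batches = batches[skip_steps:]
    return itertools.cycle(batches) if repeat else iter(batches)


def double_batch(batch):
    return [2 * value for value in batch]


def double_one(value):
    return 2 * value


batch_first = list(transform(range(6), double_batch, 2, shuffle=False))
map_first = list(transform(range(6), double_one, 2, shuffle=False,
                           map_before_batch=True))
skipped = list(transform(range(6), double_batch, 2, shuffle=False,
                         skip_steps=1))
repeated = list(itertools.islice(
    transform(range(4), double_batch, 2, shuffle=False, repeat=True), 3))
```

Both orders produce the same batches here; the difference is that `double_batch` is called three times on lists while `double_one` is called six times on single values, which is why batching first tends to be faster when map_fn can be vectorized.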