modelzoo.transformers.pytorch.gpt2.input.GptTextDataProcessor.GptTextDataProcessor

class modelzoo.transformers.pytorch.gpt2.input.GptTextDataProcessor.GptTextDataProcessor

Bases: torch.utils.data.IterableDataset

A text dataset processor for GPT pre-training. Performs on-the-fly processing of data from text.

Functionality includes:

  • Reading data from text documents

  • Creating input sequences and masks, and autoregressive LM labels

Parameters

params (dict) – Dictionary of training input parameters used to create the dataset.

Expects the following fields:

  • “metadata_files” (str or list of str): A path, or a list of paths, each pointing to a metadata file. A metadata file lists the file paths of cleaned flat-text documents, one path per line. The cleaned files have one paragraph per line, separated by an empty line.

  • “vocab_file” (str): Vocabulary file used to build the tokenizer.

  • “encoder_file” (str): Encoder file that maps word-pieces to token IDs for tokenization.

  • “max_sequence_length” (int): Maximum length of the sequence to generate.

  • “short_sequence_prob” (int): Probability of a short sequence. Defaults to 0.

  • “overlap_size” (int): Size of the overlap when forming sequences from buffered token IDs in a sliding-window fashion. Defaults to None, which sets the overlap to max_sequence_length/4.

  • “batch_size” (int): Batch size.

  • “shuffle” (bool): Flag to enable data shuffling.

  • “shuffle_seed” (int): Shuffle seed.

  • “num_workers” (int): How many subprocesses to use for data loading.

  • “drop_last” (bool): If True and the dataset size is not divisible by the batch size, the last incomplete batch will be dropped.

  • “prefetch_factor” (int): Number of samples loaded in advance by each worker.

  • “persistent_workers” (bool): If True, the data loader will not shut down the worker processes after a dataset has been consumed once.

  • “add_special_tokens” (bool): Flag to add BOS and EOS tokens.

  • “eos_token” (str): EOS token.

  • “pad_token” (str): PAD token.
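To make the “overlap_size” behaviour more concrete, here is a minimal sketch of a params dict and of sliding-window sequence formation. It is an illustration under stated assumptions, not the class's actual implementation: the file paths and values in example_params are hypothetical, the helpers resolve_overlap and sliding_windows are invented for this example, and integer division is assumed for the max_sequence_length/4 default.

```python
from typing import Iterator, List, Optional

# Hypothetical values; only the keys come from the parameter list above.
example_params = {
    "metadata_files": ["./metadata/train.txt"],  # hypothetical path
    "vocab_file": "./vocab.bpe",                 # hypothetical path
    "encoder_file": "./encoder.json",            # hypothetical path
    "max_sequence_length": 128,
    "short_sequence_prob": 0,
    "overlap_size": None,  # None falls back to max_sequence_length/4
    "batch_size": 8,
    "shuffle": True,
    "shuffle_seed": 0,
}


def resolve_overlap(max_sequence_length: int, overlap_size: Optional[int]) -> int:
    """Return the overlap to use (assumes integer division for the default)."""
    if overlap_size is None:
        return max_sequence_length // 4
    if not 0 <= overlap_size < max_sequence_length:
        raise ValueError("overlap_size must be in [0, max_sequence_length)")
    return overlap_size


def sliding_windows(
    token_ids: List[int],
    max_sequence_length: int,
    overlap_size: Optional[int] = None,
) -> Iterator[List[int]]:
    """Yield windows of up to max_sequence_length tokens; consecutive
    windows share `overlap` tokens."""
    overlap = resolve_overlap(max_sequence_length, overlap_size)
    stride = max_sequence_length - overlap
    start = 0
    while start < len(token_ids):
        yield token_ids[start:start + max_sequence_length]
        if start + max_sequence_length >= len(token_ids):
            break  # the buffer is exhausted
        start += stride


# Ten buffered token IDs, windows of 4, default overlap of 4 // 4 = 1.
windows = list(sliding_windows(list(range(10)), max_sequence_length=4))
```

With these inputs each window starts on the last token of the previous one. How the real processor handles a trailing partial window, padding, or short sequences is not specified here and is left out of the sketch.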

Methods

create_dataloader

Classmethod to create the dataloader object.

__call__(*args: Any, **kwargs: Any) → Any

Call self as a function.

__init__(params)

static __new__(cls, *args: Any, **kwargs: Any) → Any

create_dataloader()

Classmethod to create the dataloader object.
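As a rough illustration of how the loader-related fields in params could map onto torch.utils.data.DataLoader arguments, consider the sketch below. The helper dataloader_kwargs is hypothetical and is not part of modelzoo; the fallback values mirror DataLoader's own defaults, which are not necessarily the defaults this class uses. Two DataLoader constraints shape it: prefetch_factor and persistent_workers are only meaningful when num_workers > 0, and shuffle=True is rejected for an IterableDataset, so any shuffling would have to happen inside the dataset itself.

```python
from typing import Any, Dict


def dataloader_kwargs(params: Dict[str, Any]) -> Dict[str, Any]:
    """Build keyword arguments for torch.utils.data.DataLoader from params.

    Hypothetical helper: fallback values follow DataLoader's defaults,
    not necessarily those of GptTextDataProcessor.
    """
    num_workers = params.get("num_workers", 0)
    kwargs: Dict[str, Any] = {
        "batch_size": params.get("batch_size", 1),
        "num_workers": num_workers,
        "drop_last": params.get("drop_last", False),
    }
    # Worker-only options: DataLoader raises if they are set with no workers.
    if num_workers > 0:
        kwargs["prefetch_factor"] = params.get("prefetch_factor", 2)
        kwargs["persistent_workers"] = params.get("persistent_workers", False)
    # "shuffle" is deliberately not forwarded: DataLoader does not accept
    # shuffle=True together with an IterableDataset.
    return kwargs


single_process = dataloader_kwargs(
    {"batch_size": 8, "num_workers": 0, "prefetch_factor": 4, "shuffle": True}
)
multi_process = dataloader_kwargs(
    {"batch_size": 8, "num_workers": 2, "prefetch_factor": 4, "persistent_workers": True}
)
```

The resulting dict could then be passed as DataLoader(dataset, **kwargs), where dataset is an instance of an IterableDataset subclass such as this one.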