modelzoo.transformers.pytorch.gpt2.input.InferenceDataProcessor.InferenceDataProcessor#

class modelzoo.transformers.pytorch.gpt2.input.InferenceDataProcessor.InferenceDataProcessor[source]#

Bases: torch.utils.data.IterableDataset

Methods

create_dataloader

Classmethod to create the dataloader object.

gen_data_samples

Preprocess raw text requests fetched from the EEH script into data samples consumable by the GPT2 model, and dump them to a numpy file.

__call__(*args: Any, **kwargs: Any) → Any#

Call self as a function.

__init__(params, samples_file_list, dataset_size)[source]#
static __new__(cls, *args: Any, **kwargs: Any) → Any#
create_dataloader()[source]#

Classmethod to create the dataloader object.
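A minimal usage sketch follows. The params keys, file path, and dataset size below are hypothetical placeholders; in practice the samples files and dataset size come from gen_data_samples, documented next.

    from modelzoo.transformers.pytorch.gpt2.input.InferenceDataProcessor import (
        InferenceDataProcessor,
    )

    # Placeholder inputs: the required params keys are not documented on
    # this page, so fill these in from the model's input config.
    params = {"batch_size": 4, "num_workers": 0}
    samples_file_list = ["/path/to/samples_0.npy"]
    dataset_size = 128

    processor = InferenceDataProcessor(params, samples_file_list, dataset_size)
    dataloader = processor.create_dataloader()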

static gen_data_samples(requests: List[Tuple[str, str]], batch_size: int, max_sequence_length: int, eos_token_id: int, samples_saver: modelzoo.common.pytorch.input.utils.SamplesSaver, tokenizer_file_path: Optional[str] = None) → Tuple[List[str], int, Tuple[int, int]][source]#

Preprocess raw text requests fetched from the EEH script into data samples consumable by the GPT2 model, and dump them to a numpy file.

Parameters
  • requests – List of raw text requests, each captured as a tuple pair of context string and continuation string

  • batch_size – The batch size

  • max_sequence_length – The maximum length of each sample

  • eos_token_id – Token id of the end-of-sentence token

  • samples_saver – SamplesSaver object to manage the saving of data samples to file.

  • tokenizer_file_path – Path to the tokenizer file if a custom tokenizer is used. If not specified, the gpt2 tokenizer is used by default.

Returns

A (List[str], int, Tuple[int, int]) tuple containing the list of file paths where the samples are dumped, the size of the dataset (total number of samples), and a tuple of the context and continuation token lengths
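A hedged usage sketch follows. The request strings and batch settings are illustrative, the SamplesSaver construction is a placeholder (its constructor arguments are not documented on this page), and eos_token_id=50256 assumes the standard gpt2 vocabulary.

    from modelzoo.common.pytorch.input.utils import SamplesSaver
    from modelzoo.transformers.pytorch.gpt2.input.InferenceDataProcessor import (
        InferenceDataProcessor,
    )

    # Each request is a (context, continuation) string pair.
    requests = [
        ("The capital of France is", " Paris"),
        ("2 + 2 =", " 4"),
    ]

    # Placeholder construction: SamplesSaver's arguments are not shown
    # here, so fill them in from its own documentation.
    samples_saver = SamplesSaver(...)

    samples_files, dataset_size, token_lengths = InferenceDataProcessor.gen_data_samples(
        requests=requests,
        batch_size=4,
        max_sequence_length=128,
        eos_token_id=50256,       # assumes the default gpt2 vocabulary
        samples_saver=samples_saver,
        tokenizer_file_path=None,  # None selects the default gpt2 tokenizer
    )
    # samples_files: paths of the dumped numpy files
    # dataset_size: total number of samples
    # token_lengths: (context, continuation) token lengths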