modelzoo.transformers.pytorch.gpt2.input.InferenceDataProcessor.InferenceDataProcessor#

class modelzoo.transformers.pytorch.gpt2.input.InferenceDataProcessor.InferenceDataProcessor[source]#

Bases: torch.utils.data.IterableDataset

Methods

create_dataloader

Classmethod to create the dataloader object.

gen_data_samples

Preprocess raw text requests fetched from the EEH script into data samples consumable by the GPT2 model, and dump them to a numpy file.

__call__(*args: Any, **kwargs: Any) → Any#

Call self as a function.

__init__(params, samples_file_list, dataset_size)[source]#
static __new__(cls, *args: Any, **kwargs: Any) → Any#
create_dataloader()[source]#

Classmethod to create the dataloader object.
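A minimal usage sketch follows. The params keys, file path, and dataset size below are hypothetical placeholders; in practice the samples files and dataset size come from gen_data_samples, documented next.

    from modelzoo.transformers.pytorch.gpt2.input.InferenceDataProcessor import (
        InferenceDataProcessor,
    )

    # Placeholder inputs: the required params keys are not documented on
    # this page, so fill these in from the model's input config.
    params = {"batch_size": 4, "num_workers": 0}
    samples_file_list = ["/path/to/samples_0.npy"]
    dataset_size = 128

    processor = InferenceDataProcessor(params, samples_file_list, dataset_size)
    dataloader = processor.create_dataloader()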

static gen_data_samples(requests: List[Tuple[str, str]], batch_size: int, max_sequence_length: int, eos_token_id: int, samples_saver: modelzoo.common.pytorch.input.utils.SamplesSaver, tokenizer_file_path: Optional[str] = None) → Tuple[List[str], int, Tuple[int, int]][source]#

Preprocess raw text requests fetched from the EEH script into data samples consumable by the GPT2 model, and dump them to a numpy file.

Parameters
  • requests – List of raw text requests, each captured as a tuple pair of context string and continuation string

  • batch_size – The batch size

  • max_sequence_length – The maximum length of each sample

  • eos_token_id – Token id of the end-of-sentence token

  • samples_saver – SamplesSaver object to manage the saving of data samples to file.

  • tokenizer_file_path – Path to the tokenizer file if a custom tokenizer is used. If not specified, the gpt2 tokenizer is used by default.

Returns

A (List[str], int, Tuple[int, int]) tuple containing the list of file paths where the samples are dumped, the size of the dataset (total number of samples), and a tuple of the context and continuation token lengths
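A hedged usage sketch follows. The request strings and batch settings are illustrative, the SamplesSaver construction is a placeholder (its constructor arguments are not documented on this page), and eos_token_id=50256 assumes the standard gpt2 vocabulary.

    from modelzoo.common.pytorch.input.utils import SamplesSaver
    from modelzoo.transformers.pytorch.gpt2.input.InferenceDataProcessor import (
        InferenceDataProcessor,
    )

    # Each request is a (context, continuation) string pair.
    requests = [
        ("The capital of France is", " Paris"),
        ("2 + 2 =", " 4"),
    ]

    # Placeholder construction: SamplesSaver's arguments are not shown
    # here, so fill them in from its own documentation.
    samples_saver = SamplesSaver(...)

    samples_files, dataset_size, token_lengths = InferenceDataProcessor.gen_data_samples(
        requests=requests,
        batch_size=4,
        max_sequence_length=128,
        eos_token_id=50256,       # assumes the default gpt2 vocabulary
        samples_saver=samples_saver,
        tokenizer_file_path=None,  # None selects the default gpt2 tokenizer
    )
    # samples_files: paths of the dumped numpy files
    # dataset_size: total number of samples
    # token_lengths: (context, continuation) token lengths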