modelzoo.transformers.pytorch.gpt2.input.InferenceDataProcessor.InferenceDataProcessor
- class modelzoo.transformers.pytorch.gpt2.input.InferenceDataProcessor.InferenceDataProcessor[source]
Bases: torch.utils.data.IterableDataset
Methods
- Classmethod to create the dataloader object.
- Preprocess raw text requests, as fetched from the EEH script, into data samples consumable by the GPT2 model, and dump them to a numpy file.
- __call__(*args: Any, **kwargs: Any) → Any
Call self as a function.
- static __new__(cls, *args: Any, **kwargs: Any) → Any
- static gen_data_samples(requests: List[Tuple[str, str]], batch_size: int, max_sequence_length: int, eos_token_id: int, samples_saver: modelzoo.common.pytorch.input.utils.SamplesSaver, tokenizer_file_path: Optional[str] = None) → Tuple[List[str], int, Tuple[int, int]][source]
Preprocess raw text requests, as fetched from the EEH script, into data samples consumable by the GPT2 model, and dump them to a numpy file.
- Parameters
requests – List of raw text requests, with each request captured as a tuple pair of context string and continuation string
batch_size – The batch size
max_sequence_length – The maximum length of each sample
eos_token_id – Integer ID of the end-of-sentence (EOS) token
samples_saver – SamplesSaver object that manages saving the data samples to file
tokenizer_file_path – Path to the tokenizer file if a custom tokenizer is used. If not specified, the GPT2 tokenizer is used by default.
- Returns
A (List[str], int, Tuple[int, int]) tuple containing the list of file paths where the samples are dumped, the size of the dataset (total number of samples), and a tuple of the context and continuation token lengths.
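A minimal usage sketch of gen_data_samples is shown below. The call follows the documented signature, but the SamplesSaver constructor arguments (data_dir, max_file_size, filename_prefix) are assumptions for illustration only and are not documented here; consult modelzoo.common.pytorch.input.utils.SamplesSaver for its actual signature.

```python
# Usage sketch only. NOTE: the SamplesSaver constructor arguments below are
# assumptions for illustration; check modelzoo.common.pytorch.input.utils
# for the real signature.
from modelzoo.common.pytorch.input.utils import SamplesSaver
from modelzoo.transformers.pytorch.gpt2.input.InferenceDataProcessor import (
    InferenceDataProcessor,
)

# Each request is a (context, continuation) string pair, as fetched from
# the EEH script.
requests = [
    ("The capital of France is", " Paris"),
    ("Water freezes at a temperature of", " 0 degrees Celsius"),
]

# Hypothetical construction -- argument names are assumptions.
samples_saver = SamplesSaver(
    data_dir="/tmp/inference_samples",
    max_file_size=1024 ** 3,  # e.g. cap each numpy file at 1 GiB
    filename_prefix="gpt2_eval_samples",
)

file_paths, dataset_size, token_lens = InferenceDataProcessor.gen_data_samples(
    requests=requests,
    batch_size=2,
    max_sequence_length=128,
    eos_token_id=50256,          # GPT-2's <|endoftext|> token ID
    samples_saver=samples_saver, # manages dumping samples to numpy file(s)
    tokenizer_file_path=None,    # fall back to the default GPT2 tokenizer
)

print(f"Dumped {dataset_size} samples to: {file_paths}")
print(f"(context_len, continuation_len) = {token_lens}")
```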