cerebras.modelzoo.data_preparation.nlp.gpt2.data_processor_utils.training_data_generator#

cerebras.modelzoo.data_preparation.nlp.gpt2.data_processor_utils.training_data_generator(input_files, vocab_file, encoder_file, max_sequence_length, buffer_size=1000000.0, overlap_size=None, short_seq_prob=0, inverted_mask=False, add_special_tokens=True, eos_token='<|endoftext|>', pad_token='<|endoftext|>', input_ids_dtype='int32', input_mask_dtype='int32', labels_dtype='int32')[source]#

Generator function used to create the input dataset for GPT2Model.

Parameters
  • input_files (list[str]) – List of input files.

  • vocab_file (str) – Vocabulary file used to build the tokenization.

  • encoder_file (str) – Encoder file mapping word pieces to token IDs for tokenization.

  • max_sequence_length (int) – Maximum length of the sequences to generate.

  • short_seq_prob (float) – Probability of generating a short sequence. Defaults to 0. Shorter sequences are sometimes used to minimize the mismatch between pre-training and fine-tuning.

  • buffer_size (int) – Read buffer size. Defaults to 1MB.

  • overlap_size (int) – Size of the overlap when forming sequences from buffered token IDs in a sliding-window fashion. Defaults to None, which sets the overlap to max_sequence_length / 4.

  • inverted_mask (bool) – If False, the mask has 0s on padded positions and 1s elsewhere. If True, the mask is inverted so that 1s are on padded positions and 0s elsewhere. Defaults to False.

  • eos_token (str) – End of sequence token. Defaults to “<|endoftext|>”.

  • pad_token (str) – Pad token. Defaults to “<|endoftext|>”.

  • input_ids_dtype (str) – Data type of the input IDs. Defaults to “int32”.

  • input_mask_dtype (str) – Data type of the input mask. Defaults to “int32”.

  • labels_dtype (str) – Data type of the labels. Defaults to “int32”.

Returns

A generator that yields training examples as (feature, label) tuples.
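
Example

A minimal usage sketch, assuming plain-text input files and GPT-2 BPE tokenizer assets on disk; the file paths and settings below are illustrative placeholders rather than values required by the API.

from cerebras.modelzoo.data_preparation.nlp.gpt2.data_processor_utils import (
    training_data_generator,
)

# Hypothetical paths to raw text shards and GPT-2 tokenizer assets.
input_files = ["data/train_00.txt", "data/train_01.txt"]
vocab_file = "gpt2-vocab.bpe"
encoder_file = "gpt2-encoder.json"

examples = training_data_generator(
    input_files=input_files,
    vocab_file=vocab_file,
    encoder_file=encoder_file,
    max_sequence_length=1024,
    overlap_size=256,      # defaults to max_sequence_length / 4 when left as None
    short_seq_prob=0.01,   # occasionally emit shorter sequences
    inverted_mask=False,
)

# Each yielded item is a (feature, label) pair as described above.
for feature, label in examples:
    ...
    break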