cerebras.modelzoo.data_preparation.nlp.gpt2.data_processor_utils.training_data_generator#

cerebras.modelzoo.data_preparation.nlp.gpt2.data_processor_utils.training_data_generator(input_files, vocab_file, encoder_file, max_sequence_length, buffer_size=1000000.0, overlap_size=None, short_seq_prob=0, inverted_mask=False, add_special_tokens=True, eos_token='<|endoftext|>', pad_token='<|endoftext|>', input_ids_dtype='int32', input_mask_dtype='int32', labels_dtype='int32')[source]#

Generator function used to create the input dataset for GPT2Model.

Parameters
  • input_files (list[str]) – List of input files.

  • vocab_file (str) – Vocabulary file used to build the tokenization.

  • encoder_file (str) – Encoder file mapping word pieces to token IDs for tokenization.

  • max_sequence_length (int) – Maximum length of the sequences to generate.

  • short_seq_prob (float) – Probability of generating a short sequence. Defaults to 0. Shorter sequences are sometimes used to minimize the mismatch between pre-training and fine-tuning.

  • buffer_size (int) – Read buffer size. Defaults to 1MB.

  • overlap_size (int) – Size of the overlap when forming sequences from buffered token IDs in a sliding-window fashion. Defaults to None, which sets the overlap to max_sequence_length / 4.

  • inverted_mask (bool) – If False, the mask has 0s on padded positions and 1s elsewhere. If True, the mask is inverted so that 1s are on padded positions and 0s elsewhere. Defaults to False.

  • eos_token (str) – End of sequence token. Defaults to “<|endoftext|>”.

  • pad_token (str) – Pad token. Defaults to “<|endoftext|>”.

  • input_ids_dtype (str) – Data type of the input IDs. Defaults to “int32”.

  • input_mask_dtype (str) – Data type of the input mask. Defaults to “int32”.

  • labels_dtype (str) – Data type of the labels. Defaults to “int32”.

Returns

A generator that yields training examples as (feature, label) tuples.
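
Example

A minimal usage sketch, assuming plain-text input files and GPT-2 BPE tokenizer assets on disk; the file paths and settings below are illustrative placeholders rather than values required by the API.

from cerebras.modelzoo.data_preparation.nlp.gpt2.data_processor_utils import (
    training_data_generator,
)

# Hypothetical paths to raw text shards and GPT-2 tokenizer assets.
input_files = ["data/train_00.txt", "data/train_01.txt"]
vocab_file = "gpt2-vocab.bpe"
encoder_file = "gpt2-encoder.json"

examples = training_data_generator(
    input_files=input_files,
    vocab_file=vocab_file,
    encoder_file=encoder_file,
    max_sequence_length=1024,
    overlap_size=256,      # defaults to max_sequence_length / 4 when left as None
    short_seq_prob=0.01,   # occasionally emit shorter sequences
    inverted_mask=False,
)

# Each yielded item is a (feature, label) pair as described above.
for feature, label in examples:
    ...
    break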