cerebras.modelzoo.data_preparation.nlp.bert.mlm_only_processor.data_generator#

cerebras.modelzoo.data_preparation.nlp.bert.mlm_only_processor.data_generator(metadata_files, vocab_file, do_lower, disable_masking, mask_whole_word, max_seq_length, max_predictions_per_seq, masked_lm_prob, dupe_factor, output_type_shapes, multiple_docs_in_single_file=False, multiple_docs_separator='\n', single_sentence_per_line=False, buffer_size=1000000.0, min_short_seq_length=None, overlap_size=None, short_seq_prob=0, spacy_model='en_core_web_sm', inverted_mask=False, allow_cross_document_examples=True, document_separator_token='[SEP]', seed=None, input_files_prefix='')[source]#

Generator function used to create the input dataset for MLM-only pre-training.

1. Generate raw examples with tokens based on overlap_size, max_seq_length, allow_cross_document_examples and document_separator_token, using a sliding-window approach. The exact steps are detailed in the _create_examples_from_document function.
2. Mask the raw examples based on max_predictions_per_seq and masked_lm_prob.
3. Pad each masked example to max_seq_length if it is shorter.
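The sliding-window step (1) can be sketched as follows. This is an illustrative, self-contained approximation, not the actual _create_examples_from_document implementation; the overlap default mirrors the max_seq_length / 4 behaviour described in the parameter list below.

```python
def sliding_window_examples(tokens, max_seq_length, overlap_size=None):
    """Split a flat token buffer into overlapping raw examples (sketch only)."""
    if overlap_size is None:
        overlap_size = max_seq_length // 4  # default noted in the docs
    stride = max_seq_length - overlap_size
    examples = []
    start = 0
    while start < len(tokens):
        examples.append(tokens[start:start + max_seq_length])
        if start + max_seq_length >= len(tokens):
            break  # buffer exhausted; last window may be shorter and is padded later
        start += stride
    return examples

tokens = [f"tok{i}" for i in range(10)]
# Three overlapping windows, each sharing one token with its predecessor.
windows = sliding_window_examples(tokens, max_seq_length=4, overlap_size=1)
```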

Parameters
  • metadata_files (str or list[str]) – A path, or list of paths, to metadata files. Each metadata file lists the paths of flat, cleaned text documents, one file path per line.

  • vocab_file (str) – Vocabulary file used to build the tokenizer.

  • do_lower (bool) – Whether to convert words to lowercase.

  • disable_masking (bool) – Whether masking should be disabled.

  • mask_whole_word (bool) – If True, all subtokens corresponding to a word will be masked.

  • max_seq_length (int) – Maximum length of the sequence to generate

  • max_predictions_per_seq (int) – Maximum number of masked tokens in a sequence

  • masked_lm_prob (float) – Proportion of tokens to be masked

  • dupe_factor (int) – Number of times to duplicate the dataset with different static masks

  • output_type_shapes (dict) – Dictionary indicating the shapes of different outputs

  • multiple_docs_in_single_file (bool) – Set to True when a single text file contains multiple documents separated by multiple_docs_separator.

  • multiple_docs_separator (str) – String which separates multiple documents in a single text file.

  • single_sentence_per_line (bool) – Set to True when the document is already split into sentences, one sentence per line, so no further sentence segmentation is required.

  • buffer_size (int) – Number of tokens to be processed at a time

  • min_short_seq_length (int) – When short_seq_prob > 0, the minimum number of tokens each example should have; i.e., the number of tokens (excluding padding) lies in the range [min_short_seq_length, max_seq_length].

  • overlap_size (int) – Number of tokens that overlap with the previous example when processing the buffer with a sliding-window approach. If None, defaults to max_seq_length / 4.

  • short_seq_prob (float) – Probability of generating a short sequence. Defaults to 0. Shorter sequences are sometimes used to minimize the mismatch between pre-training and fine-tuning.

  • spacy_model (str) – spaCy model to load, i.e. shortcut link, package name or path. Used to segment text into sentences.

  • inverted_mask (bool) – If set to False, has 0’s on padded positions and 1’s elsewhere. Otherwise, “inverts” the mask, so that 1’s are on padded positions and 0’s elsewhere.

  • allow_cross_document_examples (bool) – If True, the sequences can contain tokens from the next document.

  • document_separator_token (str) – String to separate tokens from one document and the next when sequences span documents

  • seed (int) – Random seed.

  • input_files_prefix (str) – Prefix to be added to paths of the input files.
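The interaction of masked_lm_prob, max_predictions_per_seq and mask_whole_word can be sketched as below. This is an illustrative approximation, not the Model Zoo code: with mask_whole_word=True, all subtokens of a word (marked here by the WordPiece "##" prefix convention) are masked together.

```python
import random

def mask_tokens(tokens, masked_lm_prob, max_predictions_per_seq,
                mask_whole_word=False, seed=None):
    """Statically mask a token list (sketch only)."""
    rng = random.Random(seed)
    # Group subtoken indices into word units when whole-word masking is on.
    units = []
    for i, tok in enumerate(tokens):
        if mask_whole_word and tok.startswith("##") and units:
            units[-1].append(i)
        else:
            units.append([i])
    # Mask roughly masked_lm_prob of the tokens, capped by max_predictions_per_seq.
    num_to_mask = min(max_predictions_per_seq,
                      max(1, int(round(len(tokens) * masked_lm_prob))))
    rng.shuffle(units)
    output, masked_positions = list(tokens), []
    for unit in units:
        if len(masked_positions) + len(unit) > num_to_mask:
            continue  # skip units that would exceed the budget
        for i in unit:
            output[i] = "[MASK]"
            masked_positions.append(i)
    return output, sorted(masked_positions)
```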

Returns

Yields training examples (feature, [])
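The padding step and the inverted_mask option can be sketched as follows; this is a hypothetical, self-contained illustration, not the Model Zoo implementation.

```python
def pad_and_mask(input_ids, max_seq_length, pad_id=0, inverted_mask=False):
    """Pad a token-id list to max_seq_length and build its attention mask (sketch only)."""
    num_pad = max_seq_length - len(input_ids)
    padded = input_ids + [pad_id] * num_pad
    # Standard mask: 1's on real tokens, 0's on padded positions.
    mask = [1] * len(input_ids) + [0] * num_pad
    if inverted_mask:
        mask = [1 - m for m in mask]  # 1's on padding, 0's elsewhere
    return padded, mask
```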