cerebras.modelzoo.data.nlp.bert.bert_utils.create_masked_lm_predictions

cerebras.modelzoo.data.nlp.bert.bert_utils.create_masked_lm_predictions(tokens, max_sequence_length, mask_token_id, max_predictions_per_seq, input_pad_id, attn_mask_pad_id, labels_pad_id, tokenize, vocab_size, masked_lm_prob, rng, exclude_from_masking, mask_whole_word, replacement_pool=None)[source]

Creates the predictions for the masked LM objective.

Parameters
  • tokens (list) – Tokens to process.

  • max_sequence_length (int) – Maximum sequence length.

  • mask_token_id (int) – ID used for the mask token.

  • max_predictions_per_seq (int) – Maximum number of masked LM predictions per sequence.

  • input_pad_id (int) – Input sequence padding id.

  • attn_mask_pad_id (int) – Attention mask padding id.

  • labels_pad_id (int) – Labels padding id.

  • tokenize (callable) – Method to tokenize the input sequence.

  • vocab_size (int) – Size of the vocabulary.

  • masked_lm_prob (float) – Masked LM probability.

  • rng (random.Random) – Random number generator that provides a shuffle method.

  • exclude_from_masking (list) – List of tokens to exclude from masking.

  • mask_whole_word (bool) – Whether to mask whole words.

  • replacement_pool (list) – List of token ids that may be used when replacing tokens with random words from the vocab. Defaults to None, meaning any token from the vocab may be used.

Returns

A tuple containing:

  • np.array[int32] input_ids: Numpy array with input token indices.

    Shape: (max_sequence_length).

  • np.array[int32] labels: Numpy array with labels.

    Shape: (max_sequence_length).

  • np.array[int32] attention_mask: Numpy array with the attention mask.

    Shape: (max_sequence_length).

  • np.array[int32] masked_lm_mask: Numpy array with a mask of predicted tokens.

    Shape: (max_predictions_per_seq). 0 indicates a non-masked token, and 1 indicates a masked token.
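
A minimal usage sketch (not taken from the library's documentation): it assumes string tokens, a toy whitespace tokenizer, and that the returned tuple unpacks in the order documented above. The special-token ids, the attn_mask_pad_id of 0, and the labels_pad_id of -100 are placeholder choices for illustration only.

import random

from cerebras.modelzoo.data.nlp.bert.bert_utils import create_masked_lm_predictions

# Toy vocabulary for illustration; a real pipeline would use the model's vocab.
vocab = {
    "[PAD]": 0, "[CLS]": 1, "[SEP]": 2, "[MASK]": 3, "[UNK]": 4,
    "hello": 5, "world": 6, "of": 7, "bert": 8,
}

def tokenize(text):
    # Hypothetical whitespace tokenizer returning vocab ids.
    return [vocab.get(tok, vocab["[UNK]"]) for tok in text.split()]

tokens = ["[CLS]", "hello", "world", "of", "bert", "[SEP]"]

input_ids, labels, attention_mask, masked_lm_mask = create_masked_lm_predictions(
    tokens=tokens,
    max_sequence_length=16,
    mask_token_id=vocab["[MASK]"],
    max_predictions_per_seq=3,
    input_pad_id=vocab["[PAD]"],
    attn_mask_pad_id=0,            # padded positions get 0 in the attention mask
    labels_pad_id=-100,            # placeholder "ignore" label id (assumption)
    tokenize=tokenize,
    vocab_size=len(vocab),
    masked_lm_prob=0.15,           # standard BERT masking probability
    rng=random.Random(0),          # seeded for reproducibility
    exclude_from_masking=["[CLS]", "[SEP]"],
    mask_whole_word=False,
)

print(input_ids.shape, labels.shape, attention_mask.shape, masked_lm_mask.shape)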