cerebras.modelzoo.data_preparation.utils.create_masked_lm_predictions

cerebras.modelzoo.data_preparation.utils.create_masked_lm_predictions(tokens, vocab_words, mask_whole_word, max_predictions_per_seq, masked_lm_prob, rng, exclude_from_masking=None)

Creates the predictions for the masked LM objective.

:param list tokens: List of tokens to process.
:param list vocab_words: List of all words present in the vocabulary.
:param bool mask_whole_word: If true, mask all the subtokens of a word.
:param int max_predictions_per_seq: Maximum number of masked LM predictions per sequence.
:param float masked_lm_prob: Masked LM probability.
:param rng: random.Random object with shuffle function.
:param Optional[list] exclude_from_masking: List of tokens to exclude from masking. Defaults to ["[CLS]", "[SEP]"].
:returns: Tuple of the tokens with masking applied, the positions of the masked tokens, and the corresponding labels for training.
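
The following is a minimal, self-contained sketch of what a function with the parameters documented above might do. It is not the Model Zoo implementation. The name sketch_masked_lm_predictions is hypothetical, and several details are assumptions borrowed from the common BERT recipe rather than taken from this page: the "[MASK]" token string, the 80/10/10 split between masking, keeping, and random replacement, the "##" subtoken prefix used for whole-word grouping, and the rule that at least one prediction is attempted per sequence.

```python
import random


def sketch_masked_lm_predictions(
    tokens,
    vocab_words,
    mask_whole_word,
    max_predictions_per_seq,
    masked_lm_prob,
    rng,
    exclude_from_masking=None,
):
    """Illustrative stand-in, NOT the library function."""
    if exclude_from_masking is None:
        exclude_from_masking = ["[CLS]", "[SEP]"]

    # Group token indices into candidate units. With mask_whole_word,
    # a "##"-prefixed subtoken joins the previous unit (assumed
    # WordPiece convention, not confirmed by the source).
    candidates = []
    for i, token in enumerate(tokens):
        if token in exclude_from_masking:
            continue
        if mask_whole_word and candidates and token.startswith("##"):
            candidates[-1].append(i)
        else:
            candidates.append([i])

    rng.shuffle(candidates)

    # Assumed budget rule: a fraction of the sequence, at least 1,
    # capped by max_predictions_per_seq.
    num_to_predict = min(
        max_predictions_per_seq,
        max(1, int(round(len(tokens) * masked_lm_prob))),
    )

    output_tokens = list(tokens)
    masked = []  # (position, original token) pairs
    for unit in candidates:
        # Skip units that would exceed the prediction budget.
        if len(masked) + len(unit) > num_to_predict:
            continue
        for i in unit:
            draw = rng.random()
            if draw < 0.8:
                output_tokens[i] = "[MASK]"  # assumed mask token
            elif draw >= 0.9:
                output_tokens[i] = rng.choice(vocab_words)
            # Otherwise (0.8 <= draw < 0.9) the original token is kept.
            masked.append((i, tokens[i]))

    masked.sort()
    positions = [p for p, _ in masked]
    labels = [t for _, t in masked]
    return output_tokens, positions, labels


# Usage with a toy vocabulary and a seeded generator.
tokens = ["[CLS]", "the", "quick", "brown", "fox", "[SEP]"]
vocab_words = ["the", "quick", "brown", "fox", "dog", "cat"]
output_tokens, positions, labels = sketch_masked_lm_predictions(
    tokens,
    vocab_words,
    mask_whole_word=False,
    max_predictions_per_seq=2,
    masked_lm_prob=0.3,
    rng=random.Random(0),
)
```

Passing a seeded random.Random as rng keeps the selection reproducible across runs, which is useful when regenerating a preprocessed dataset. The labels always hold the original tokens at the selected positions, even where the output token was left unchanged or replaced by a random vocabulary word.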
