cerebras.modelzoo.data_preparation.nlp.t5.utils.noise_token_span_to_unique_sentinel

cerebras.modelzoo.data_preparation.nlp.t5.utils.noise_token_span_to_unique_sentinel(tokens, noise_mask, vocab_size)

Replace each run of consecutive noise tokens with a different sentinel.

The idea here is to be able to align the dropped spans in the inputs with the markers in the targets. We want to generate training examples like “We hold <X> to be <Y> that” -> “<X> these truths <Y> self evident <Z>”.

Sentinels are assigned in decreasing order within the sequence, starting at vocab_size - 1. That is, we appropriate the last tokens in the vocabulary for additional use as sentinels.

:param list tokens: A list of uncorrupted token indices.
:param np.array noise_mask: A 1D boolean array marking the positions to treat as noise.
:param int vocab_size: Size of the vocabulary with tokens.
:return: np.array with sentinels of the same type and shape as tokens.
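The span replacement described above can be illustrated with a minimal NumPy sketch. This is not the library's source: the helper name `sketch_span_to_sentinel` is hypothetical, and the sketch assumes each noise run collapses to a single sentinel, as the “We hold <X> to be <Y> that” example suggests, so its output is shorter than `tokens` whenever a run spans more than one token. Building the targets by calling the helper with the inverted mask is likewise an assumption made to reproduce the example pair.

```python
import numpy as np


def sketch_span_to_sentinel(tokens, noise_mask, vocab_size):
    """Hypothetical sketch: collapse each noise run into one unique sentinel."""
    tokens = np.asarray(tokens)
    noise_mask = np.asarray(noise_mask, dtype=bool)

    # prev_is_noise[i] is True when position i - 1 is noise.
    prev_is_noise = np.zeros_like(noise_mask)
    prev_is_noise[1:] = noise_mask[:-1]

    first_noise = noise_mask & ~prev_is_noise       # start of each noise run
    subsequent_noise = noise_mask & prev_is_noise   # remainder of each run

    # The k-th run (k = 1, 2, ...) receives sentinel id vocab_size - k.
    sentinels = vocab_size - np.cumsum(first_noise)
    replaced = np.where(first_noise, sentinels, tokens)

    # Assumption: drop the rest of each run so one sentinel stands for the span.
    return replaced[~subsequent_noise]


# "We hold these truths to be self evident that" as toy token ids 1..9.
tokens = [1, 2, 3, 4, 5, 6, 7, 8, 9]
# Noise spans: "these truths" (ids 3, 4) and "self evident" (ids 7, 8).
noise_mask = np.array([False, False, True, True, False, False, True, True, False])
vocab_size = 100  # toy value; sentinels become 99, 98, 97

inputs = sketch_span_to_sentinel(tokens, noise_mask, vocab_size)
targets = sketch_span_to_sentinel(tokens, ~noise_mask, vocab_size)
```

With these toy values, `inputs` corresponds to “We hold <X> to be <Y> that” and `targets` to “<X> these truths <Y> self evident <Z>”, where <X>, <Y>, and <Z> are the ids 99, 98, and 97.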