cerebras.modelzoo.data_preparation.nlp.t5.utils

Functions

concatenate_documents

Concatenate unrelated documents together to reduce the need for padding.

construct_denoising_objective

Formats a raw sequence into a corrupted sequence and the corresponding denoising targets.

Parameters:
- tokens (list): A list of uncorrupted token indices.
- vocab_size (int): The size of the vocabulary.
- sos_token (int): The index of the SOS token in the vocabulary.
- eos_token (int): The index of the EOS token in the vocabulary.
- rng (np.random.Generator): The numpy random generator used as the source of randomness for this function.

Returns: a tuple (feature_dict, label) of denoising source and target numpy arrays.
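
A minimal usage sketch, assuming the positional argument order implied by the parameter list above; the token IDs, vocabulary size, and special-token indices are illustrative values rather than output from a real tokenizer:

    import numpy as np
    from cerebras.modelzoo.data_preparation.nlp.t5.utils import (
        construct_denoising_objective,
    )

    # Hypothetical toy inputs; values are illustrative only.
    tokens = [13, 7, 29, 4, 55, 6, 18, 90, 3, 42]
    vocab_size = 32000
    sos_token = 0
    eos_token = 1
    rng = np.random.default_rng(seed=0)

    features, label = construct_denoising_objective(
        tokens, vocab_size, sos_token, eos_token, rng
    )
    # `features` holds the corrupted source arrays and `label` the denoising
    # target array, as described in the return value above.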

create_transformer_input_features

Creates features for Transformer model input.

flat_map

Map a function over an iterator and flatten the result.

get_raw_sequence_lengths

T5 span corruption takes a sequence raw_sequence and corrupts spans of it to generate the sequences masked_input and target. This function computes the maximum possible length of raw_sequence such that masked_input is no longer than max_sequence_length, along with the maximum length of target for a raw_sequence of that length.

Parameters:
- max_sequence_length (int): The maximum length of the encoder inputs after masking.
- corruption_prob (float): The fraction of tokens that are corrupted for the denoising objective.
- mean_span_len (int): The average length of a corrupted span.

Returns: an integer such that a sequence clipped to this length before masking will have length at most max_sequence_length after masking, and an integer that is the maximum possible length of a decoder sequence.
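
A short usage sketch, assuming the call returns the two integers in the order described above (maximum raw length first, maximum decoder length second); the hyperparameter values are illustrative:

    from cerebras.modelzoo.data_preparation.nlp.t5.utils import get_raw_sequence_lengths

    # Illustrative settings, not prescribed by the library.
    max_sequence_length = 512   # encoder length budget after masking
    corruption_prob = 0.15      # fraction of tokens to corrupt
    mean_span_len = 3           # average corrupted-span length

    max_raw_len, max_target_len = get_raw_sequence_lengths(
        max_sequence_length, corruption_prob, mean_span_len
    )

    # Clip raw documents to max_raw_len before masking so the masked encoder
    # input never exceeds max_sequence_length, and size decoder buffers to
    # max_target_len.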

noise_token_span_to_unique_sentinel

Replace each run of consecutive noise tokens with a different sentinel. The idea is to align the dropped spans in the inputs with the markers in the targets; we want to generate training examples like "We hold <X> to be <Y> that" -> "<X> these truths <Y> self evident <Z>". Sentinels are assigned in decreasing order within the sequence, starting at vocab_size - 1. That is, we appropriate the last tokens in the vocabulary for additional use as sentinels.

Parameters:
- tokens (list): A list of uncorrupted token indices.
- noise_mask (np.array): A 1d boolean mask indicating where to apply noise.
- vocab_size (int): Size of the vocabulary.

Returns: an np.array with sentinels, of the same type and shape as tokens.
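
The sentinel-assignment rule can be sketched with numpy as below. This is an illustrative re-implementation of the behaviour described above (each noise span collapses to a single, decreasing sentinel ID, matching the example), not necessarily the exact code in this module:

    import numpy as np

    def noise_span_to_sentinel_sketch(tokens, noise_mask, vocab_size):
        """Illustrative sketch: collapse each run of noise tokens to one sentinel."""
        tokens = np.asarray(tokens)
        prev_is_noise = np.concatenate([[False], noise_mask[:-1]])
        first_noise = noise_mask & ~prev_is_noise    # first token of each noise span
        later_noise = noise_mask & prev_is_noise     # remaining tokens of each span

        # Sentinels count down from vocab_size - 1 for successive spans.
        sentinels = vocab_size - np.cumsum(first_noise)
        out = np.where(first_noise, sentinels, tokens)
        return out[~later_noise]                     # drop the collapsed span tails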

pad_t5_input_features

Provides padding for T5 input features.

parse_text

Postprocessing of the CSV file.

random_spans_noise_mask

Builds a noise mask consisting of random spans of noise tokens. The number of noise tokens and the number of noise and non-noise spans are determined deterministically as follows:

    num_noise_tokens = round(length * noise_density)
    num_nonnoise_spans = num_noise_spans = round(num_noise_tokens / mean_noise_span_length)

Spans alternate between non-noise and noise, beginning with non-noise. Subject to the above restrictions, all masks are equally likely.

Parameters:
- length (int): Length of the incoming token sequence.
- noise_density (float): Approximate density of the output mask.
- mean_noise_span_length (float): The mean length of a noise span, used in the noise mask calculation.
- rng (np.random.Generator): The numpy random generator used as the source of randomness for this function.

Returns: a boolean np.array with shape [length].
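
The procedure above can be sketched as follows; this is an illustrative re-implementation of the stated rules (deterministic counts, uniformly random segmentation into spans, alternation starting with a non-noise span), not necessarily identical to the module's code:

    import numpy as np

    def random_spans_noise_mask_sketch(length, noise_density, mean_noise_span_length, rng):
        """Illustrative sketch: boolean mask of random noise spans over `length` tokens."""
        num_noise_tokens = int(round(length * noise_density))
        num_noise_tokens = min(max(num_noise_tokens, 1), length - 1)
        num_nonnoise_tokens = length - num_noise_tokens
        num_spans = int(round(num_noise_tokens / mean_noise_span_length))
        num_spans = max(min(num_spans, num_noise_tokens, num_nonnoise_tokens), 1)

        def random_segmentation(num_items, num_segments):
            # Split num_items into num_segments non-empty parts, uniformly at random.
            cuts = np.sort(rng.permutation(num_items - 1)[: num_segments - 1]) + 1
            return np.diff(np.concatenate([[0], cuts, [num_items]]))

        noise_spans = random_segmentation(num_noise_tokens, num_spans)
        nonnoise_spans = random_segmentation(num_nonnoise_tokens, num_spans)

        # Interleave span lengths, beginning with a non-noise span.
        interleaved = np.stack([nonnoise_spans, noise_spans], axis=1).reshape(-1)
        span_starts = np.cumsum(interleaved)[:-1]
        start_indicator = np.zeros(length, dtype=int)
        start_indicator[span_starts] = 1
        span_id = np.cumsum(start_indicator)
        return span_id % 2 == 1   # odd-numbered spans are noise

For example, with length = 512, noise_density = 0.15, and mean_noise_span_length = 3.0, this yields 77 noise tokens split across 26 noise spans.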

select_random_chunk

Select a random chunk of a sample.

shuffle

Perform a buffered shuffle on an iterator.
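
A buffered shuffle keeps only a fixed-size window of the stream in memory and emits a random element from that window. A minimal sketch of the idea (the function name, buffer handling, and RNG interface below are illustrative, not the module's actual API):

    import numpy as np

    def buffered_shuffle_sketch(iterator, buffer_size, rng):
        """Illustrative sketch: approximately shuffle a stream with a bounded buffer."""
        buffer = []
        for item in iterator:
            buffer.append(item)
            if len(buffer) >= buffer_size:
                # Emit a random element; swap-and-pop keeps this O(1) per item.
                idx = rng.integers(len(buffer))
                buffer[idx], buffer[-1] = buffer[-1], buffer[idx]
                yield buffer.pop()
        rng.shuffle(buffer)   # drain the remaining items in random order
        yield from buffer

Larger buffers approximate a uniform shuffle more closely at the cost of memory; a buffer as large as the dataset degenerates into a full in-memory shuffle.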

split_sequences

Split a long sequence into shorter sequences of the specified length.
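
As a rough illustration of the operation (the helper below is hypothetical, not the module's API):

    def split_sequences_sketch(tokens, max_len):
        """Illustrative sketch: chop one long token list into pieces of at most max_len."""
        return [tokens[i : i + max_len] for i in range(0, len(tokens), max_len)]

    # e.g. split_sequences_sketch(list(range(10)), 4) -> [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]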