cerebras.modelzoo.data_preparation.nlp.t5.utils.get_raw_sequence_lengths

cerebras.modelzoo.data_preparation.nlp.t5.utils.get_raw_sequence_lengths(max_sequence_length, corruption_prob=0.15, mean_span_len=3)

T5 span corruption takes a sequence raw_sequence and corrupts spans of it to generate two sequences, masked_input and target. This function computes the maximum possible length of raw_sequence such that masked_input is no longer than max_sequence_length. It returns this length along with the maximum length of target for a raw_sequence of that length.

Parameters
  • max_sequence_length (int) – The maximum length of the encoder inputs after masking.

  • corruption_prob (float) – The fraction of tokens that are corrupted for the denoising objective.

  • mean_span_len (int) – The average length of a corrupted span.

Returns

Two integers. The first is the length to which a sequence should be clipped before masking so that its length after masking is at most max_sequence_length. The second is the maximum possible length of the corresponding decoder sequence.
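
Example

The following is a minimal sketch of how such lengths could be derived. It is not the Model Zoo source. The function name sketch_raw_sequence_lengths and its helper are hypothetical, and the length accounting is an assumption modeled on the usual T5 span-corruption bookkeeping: each corrupted span is replaced by one sentinel token, one EOS token is appended to both the encoder input and the target, and token and span counts are rounded to the nearest integer.

```python
def sketch_raw_sequence_lengths(max_sequence_length, corruption_prob=0.15, mean_span_len=3):
    """Hypothetical re-derivation of the two lengths; not the library code."""
    if max_sequence_length < 1:
        raise ValueError("max_sequence_length must be at least 1")
    if not 0 <= corruption_prob < 1:
        raise ValueError("corruption_prob must be in [0, 1)")

    def lengths_after_masking(raw_len):
        # Assumed accounting: every corrupted span collapses to a single
        # sentinel token, and one EOS token is appended to each sequence.
        num_noise_tokens = int(round(raw_len * corruption_prob))
        num_kept_tokens = raw_len - num_noise_tokens
        num_spans = int(round(num_noise_tokens / mean_span_len))
        masked_input_len = num_kept_tokens + num_spans + 1
        target_len = num_noise_tokens + num_spans + 1
        return masked_input_len, target_len

    # Grow the raw length while the masked input still fits the budget.
    raw_len = 0
    while lengths_after_masking(raw_len + 1)[0] <= max_sequence_length:
        raw_len += 1
    return raw_len, lengths_after_masking(raw_len)[1]


# Under the assumptions above, a 512-token encoder budget gives:
print(sketch_raw_sequence_lengths(512))  # (568, 114)
```

Under these assumptions the masked input never shrinks as the raw length grows, so the linear scan stops at the largest raw length that fits. The real function may round or count special tokens differently, so treat the printed values as illustrative only.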