cerebras.modelzoo.data.nlp.t5.t5_utils.select_random_chunk

cerebras.modelzoo.data.nlp.t5.t5_utils.select_random_chunk(tokens, max_length=65536, rng=None)

Select a random chunk of a sample. This is used to prevent bias towards very long passages in the corpus.

Parameters
  • tokens (list) – A list of token indices.

  • max_length (int) – The maximum allowed length of a sample before splitting.

  • rng (np.random.Generator) – The numpy random generator to be used as the source of randomness for this function.

Returns

A list containing a random chunk of tokens if len(tokens) > max_length; otherwise, tokens itself.
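
Example

The behaviour described above can be illustrated with a minimal sketch. The function select_random_chunk_sketch below is a hypothetical stand-in written for this example, not the Model Zoo implementation. It assumes that a long sample is split into consecutive segments of max_length tokens and that one segment is picked uniformly at random, and it assumes a fresh generator is created when rng is None. The real function may choose chunks differently.

```python
import math

import numpy as np


def select_random_chunk_sketch(tokens, max_length=65536, rng=None):
    """Hypothetical stand-in for illustration; not the library's code."""
    if rng is None:
        # Assumption: fall back to a fresh generator when none is supplied.
        rng = np.random.default_rng()

    n_tokens = len(tokens)
    if n_tokens <= max_length:
        # Short samples are returned as they are.
        return tokens

    # Assumption: split into consecutive max_length-sized segments
    # and pick one of them uniformly at random.
    num_segments = math.ceil(n_tokens / max_length)
    start = max_length * int(rng.integers(0, num_segments))
    end = min(start + max_length, n_tokens)
    return tokens[start:end]


rng = np.random.default_rng(seed=0)
short = list(range(5))
long = list(range(25))

short_out = select_random_chunk_sketch(short, max_length=10, rng=rng)
long_out = select_random_chunk_sketch(long, max_length=10, rng=rng)
```

Here short_out is the original five-token list, while long_out is a contiguous slice of at most 10 tokens taken from the 25-token list. Passing a seeded np.random.Generator makes the selection reproducible across runs.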