cerebras.modelzoo.data.nlp.t5.t5_utils.concatenate_documents

cerebras.modelzoo.data.nlp.t5.t5_utils.concatenate_documents(dataset, num_to_concatenate=128, pad_id=0)

Concatenate unrelated documents together to reduce the need for padding.

Parameters
  • dataset (iterable) – The input dataset.

  • num_to_concatenate (int) – How many documents to concatenate together.

  • pad_id (int) – The vocab id reserved for padding values. Must not occur anywhere in the dataset.

Yields

New samples made by concatenating samples from the dataset.
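
Example

The following is a minimal sketch of the general idea, not the Model Zoo implementation. It assumes each sample is a flat list of integer token ids, that any existing pad_id tokens are dropped before joining, and that a trailing group with fewer than num_to_concatenate documents is still emitted. The name concatenate_documents_sketch is hypothetical, and the real function may differ on all of these points.

```python
from itertools import islice


def concatenate_documents_sketch(dataset, num_to_concatenate=128, pad_id=0):
    """Hypothetical stand-in illustrating document concatenation.

    Assumption: each sample is a flat list of int token ids.
    """
    iterator = iter(dataset)
    while True:
        # Pull the next group of up to `num_to_concatenate` documents.
        group = list(islice(iterator, num_to_concatenate))
        if not group:
            return
        merged = []
        for sample in group:
            # Assumption: padding tokens are stripped before joining, which
            # is why pad_id must not appear as a real token in the data.
            merged.extend(tok for tok in sample if tok != pad_id)
        yield merged


docs = [[5, 6, 0, 0], [7, 0, 0, 0], [8, 9, 10, 0]]
result = list(concatenate_documents_sketch(docs, num_to_concatenate=2))
print(result)  # [[5, 6, 7], [8, 9, 10]]
```

In this sketch, three padded documents of length 4 (12 tokens, 6 of them padding) become two samples with no padding at all, which is the saving the function description refers to.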