modelzoo.transformers.pytorch.t5.input.utils.concatenate_documents

modelzoo.transformers.pytorch.t5.input.utils.concatenate_documents(dataset, num_to_concatenate=128, pad_id=0)

Concatenate unrelated documents together to reduce the need for padding.

Parameters
  • dataset (iterable) – The input dataset.

  • num_to_concatenate (int) – How many documents to concatenate together.

  • pad_id (int) – The vocab id reserved for padding values. Must not occur anywhere in the dataset.

Yields

New samples made by concatenating samples from dataset.
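
The behavior described above can be illustrated with a minimal sketch. It is not the modelzoo implementation: the name `concatenate_documents_sketch` is hypothetical, and it assumes each sample is a flat list of token ids (the real function may operate on other sample structures). It groups `num_to_concatenate` consecutive samples, joins them, and drops every `pad_id` token, which suggests why `pad_id` must not occur as a real token in the dataset.

```python
from itertools import islice


def concatenate_documents_sketch(dataset, num_to_concatenate=128, pad_id=0):
    """Hypothetical stand-in for illustration, not the modelzoo code.

    Assumes every sample in `dataset` is a flat list of integer token ids.
    """
    iterator = iter(dataset)
    while True:
        # Take the next group of up to `num_to_concatenate` samples.
        group = list(islice(iterator, num_to_concatenate))
        if not group:
            return
        # Join the group into one sample and drop padding tokens.
        # Any real token equal to pad_id would be lost here.
        yield [tok for doc in group for tok in doc if tok != pad_id]


# Three short, padded documents merged two at a time.
docs = [[5, 6, 0, 0], [7, 0, 0, 0], [8, 9, 10, 0]]
merged = list(concatenate_documents_sketch(docs, num_to_concatenate=2))
# merged == [[5, 6, 7], [8, 9, 10]]
```

In this sketch the last group may contain fewer than `num_to_concatenate` samples; whether the library keeps or drops such a remainder is not stated in this page.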