cerebras.modelzoo.data.common.h5_map_dataset.readers.Mixture

class cerebras.modelzoo.data.common.h5_map_dataset.readers.Mixture

Bases: object

Mix several map-style datasets according to provided weights.

Parameters
  • datasets – a list of objects implementing __len__ and __getitem__

  • weights – a list of weights, one per dataset. weights must have the same length as datasets and contain only nonnegative values. All weights will be normalized to sum to 1.

  • interleave – whether samples from different datasets should be interleaved. If every dataset was preprocessed into sequences and shuffled before being written to disk, setting this flag lets you skip run-time shuffling while still intermingling samples from the different datasets, which may be desirable for enabling sequential disk reads. Interleaving preserves the relative order of samples within each dataset, i.e. sample 0 of dataset 0 will always have a smaller index than sample 1 of dataset 0.

  • seed – the random seed used for interleaving. Ignored if interleave is False.
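
The two behaviours described above, weight normalization and order-preserving interleaving, can be illustrated with a minimal standalone sketch. This is not the Mixture implementation: the helper names normalize_weights and interleave_order are hypothetical, the sketch uses only the Python standard library, and it does not model how weights determine the number of samples taken from each dataset.

```python
import random
from typing import List, Sequence, Tuple


def normalize_weights(weights: Sequence[float]) -> List[float]:
    """Scale nonnegative weights so that they sum to 1 (hypothetical helper)."""
    if any(w < 0 for w in weights):
        raise ValueError("weights must be nonnegative")
    total = sum(weights)
    if total == 0:
        raise ValueError("at least one weight must be positive")
    return [w / total for w in weights]


def interleave_order(lengths: Sequence[int], seed: int = 0) -> List[Tuple[int, int]]:
    """Return (dataset_id, sample_id) pairs in an interleaved order.

    Only the sequence of dataset ids is shuffled. Sample ids are then
    handed out in increasing order per dataset, so the relative order
    of samples within each dataset is preserved.
    """
    # One slot per sample, labelled with the dataset it belongs to.
    dataset_ids = [d for d, n in enumerate(lengths) for _ in range(n)]
    random.Random(seed).shuffle(dataset_ids)

    next_sample = [0] * len(lengths)  # next unused sample id per dataset
    order = []
    for d in dataset_ids:
        order.append((d, next_sample[d]))
        next_sample[d] += 1
    return order


if __name__ == "__main__":
    print(normalize_weights([1, 3]))
    for dataset_id, sample_id in interleave_order([3, 2], seed=0):
        print(dataset_id, sample_id)
```

Because the shuffle is driven by a seeded generator, the same seed always yields the same interleaved order, which mirrors the role of the seed parameter when interleave is True.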

Methods

Attributes

by_sample

__init__(datasets: List[cerebras.modelzoo.data.common.h5_map_dataset.readers.H5Reader], weights: List[int], interleave: bool = False, seed: int = 0)