cerebras.modelzoo.data_preparation.nlp.hdf5_preprocessing.convert_dataset_to_HDF5.convert_dataset_to_HDF5

cerebras.modelzoo.data_preparation.nlp.hdf5_preprocessing.convert_dataset_to_HDF5.convert_dataset_to_HDF5(dataset: Union[torch.utils.data.IterableDataset, torch.utils.data.Dataset], output_dir='./hdf5_dataset/', name='dataset-partition', samples_per_file=2000, num_workers=8, batch_size=64, data_collator=None, dtype='i4', compression='gzip')

Iterates over a PyTorch dataset and writes the data to HDF5 files.

Parameters
  • dataset (IterableDataset, Dataset) – PyTorch dataset to fetch the data from.

  • output_dir (string) – directory where the HDF5 files will be stored. Defaults to ‘./hdf5_dataset/’

  • name (string) – name of the dataset, i.e., the prefix to use for HDF5 file names. Defaults to ‘dataset-partition’

  • samples_per_file (int) – number of samples written to each HDF5 file (the last file can have fewer samples if the dataset size isn't evenly divisible). Defaults to 2000

  • num_workers (int) – number of Python processes to use for generating data. Defaults to 8

  • batch_size (int) – batch size to use when fetching the data. Defaults to 64

  • data_collator (Callable) – merges a list of samples to form a mini-batch of Tensor(s). Used for batched loading from a map-style dataset.

  • dtype (string) – data type for the HDF5 dataset. Defaults to ‘i4’

  • compression (string) – compression strategy. Defaults to ‘gzip’
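
A minimal usage sketch follows, assuming a toy map-style dataset of fixed-length token-id tensors; ToyTokenDataset and its vocabulary and sequence-length values are illustrative placeholders, not part of the Model Zoo API.

import torch
from torch.utils.data import Dataset

from cerebras.modelzoo.data_preparation.nlp.hdf5_preprocessing.convert_dataset_to_HDF5 import (
    convert_dataset_to_HDF5,
)


class ToyTokenDataset(Dataset):
    """Map-style dataset yielding fixed-length token-id tensors (illustrative)."""

    def __init__(self, num_samples=10000, seq_len=128, vocab_size=50257):
        self.num_samples = num_samples
        self.seq_len = seq_len
        self.vocab_size = vocab_size

    def __len__(self):
        return self.num_samples

    def __getitem__(self, idx):
        # Deterministic dummy token ids; a real pipeline would tokenize text here.
        g = torch.Generator().manual_seed(idx)
        return torch.randint(0, self.vocab_size, (self.seq_len,), generator=g)


if __name__ == "__main__":
    # Writes ceil(10000 / 2000) = 5 HDF5 files, each prefixed with
    # 'dataset-partition', into ./hdf5_dataset/ (keyword values shown
    # are the documented defaults).
    convert_dataset_to_HDF5(
        dataset=ToyTokenDataset(),
        output_dir="./hdf5_dataset/",
        name="dataset-partition",
        samples_per_file=2000,
        num_workers=8,
        batch_size=64,
        dtype="i4",          # 4-byte signed integers
        compression="gzip",
    )

Because every sample here is a single tensor of a fixed shape, PyTorch's default collation is sufficient; data_collator only needs to be supplied when samples require custom merging into a mini-batch.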