cerebras.modelzoo.data_preparation.nlp.hdf5_preprocessing.convert_dataset_to_HDF5.convert_dataset_to_HDF5#
- cerebras.modelzoo.data_preparation.nlp.hdf5_preprocessing.convert_dataset_to_HDF5.convert_dataset_to_HDF5(dataset: Union[torch.utils.data.IterableDataset, torch.utils.data.Dataset], output_dir='./hdf5_dataset/', name='dataset-partition', samples_per_file=2000, num_workers=8, batch_size=64, data_collator=None, dtype='i4', compression='gzip')[source]#
Iterates PyTorch dataset and writes the data to HDF5 files.
- Parameters
dataset (IterableDataset, Dataset) – PyTorch dataset to fetch the data from.
output_dir (string) – directory where HDF5 will be stored. Defaults to ‘./hdf5_dataset/’
name (string) – name of the dataset; i.e. prefix to use for HDF5 file names. Defaults to ‘dataset-partition’
samples_per_file (int) – number of samples written to each HDF5 file (the last file can have fewer samples if the dataset size isn't evenly divisible). Defaults to 2000
num_workers (int) – number of Python processes to use for generating data. Defaults to 8
batch_size (int) – The batch size to use when fetching the data. Defaults to 64
data_collator (Callable) – merges a list of samples to form a mini-batch of Tensor(s). Used when performing batched loading from a map-style dataset.
dtype (string) – Data type for the HDF5 dataset. Defaults to ‘i4’
compression (string) – Compression strategy. Defaults to ‘gzip’
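To illustrate the behavior described above, here is a minimal, self-contained sketch of the same idea: iterate a map-style PyTorch dataset in batches and shard the samples into HDF5 files of at most `samples_per_file` rows, with the last file possibly holding fewer. `ToyTokenDataset` and `write_dataset_to_hdf5` are hypothetical names invented for this example; they are not part of the Cerebras modelzoo API, and the sketch omits the multiprocessing (`num_workers`) path of the real function.

```python
import os

import h5py
import numpy as np
import torch
from torch.utils.data import DataLoader, Dataset


class ToyTokenDataset(Dataset):
    """Hypothetical map-style dataset yielding fixed-length rows of token IDs."""

    def __init__(self, num_samples=10, seq_len=8):
        self.data = np.arange(
            num_samples * seq_len, dtype=np.int32
        ).reshape(num_samples, seq_len)

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        return self.data[idx]


def write_dataset_to_hdf5(dataset, output_dir="./hdf5_dataset/",
                          name="dataset-partition", samples_per_file=4,
                          batch_size=2, dtype="i4", compression="gzip"):
    """Sketch of the conversion: stream batches, shard samples into HDF5 files."""
    os.makedirs(output_dir, exist_ok=True)
    loader = DataLoader(dataset, batch_size=batch_size)
    files, buffer = [], []

    def flush(rows):
        # One HDF5 file per shard, named with the dataset-name prefix.
        path = os.path.join(output_dir, f"{name}-{len(files)}.h5")
        with h5py.File(path, "w") as f:
            f.create_dataset("data", data=np.stack(rows),
                             dtype=dtype, compression=compression)
        files.append(path)

    for batch in loader:
        for row in batch.numpy():
            buffer.append(row)
            if len(buffer) == samples_per_file:
                flush(buffer)
                buffer = []
    if buffer:  # last file can have fewer samples
        flush(buffer)
    return files
```

With 10 samples and `samples_per_file=4`, this produces three files holding 4, 4, and 2 samples respectively, mirroring the "last file can have fewer samples" note in the parameter list.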