cerebras.modelzoo.data_preparation.nlp.hdf5_preprocessing.utils.process_dataset

cerebras.modelzoo.data_preparation.nlp.hdf5_preprocessing.utils.process_dataset(files, dataset_processor, processes)

Process a dataset and write it into HDF5 format.

Parameters
  • files (list) – List of files to process.

  • dataset_processor – Class containing methods that specify how the dataset will be processed and written into HDF5 files.

  • processes (int) – Number of processes to use.

Returns

Dictionary containing the results of execution: the number of processed, discarded, and successful files, and the total number of examples across all processes.
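
The description above implies two steps around the actual HDF5 writing: dividing the input files among the workers, and combining each worker's counters into one results dictionary. The following is a minimal, self-contained sketch of that pattern only. It is not the Model Zoo implementation: `shard_files`, `fake_worker`, `merge_stats`, the round-robin split, and the key names (`processed_files`, `discarded_files`, `successful_files`, `examples`) are all hypothetical. The workers run sequentially here for simplicity, whereas the documented function uses the requested number of processes.

```python
from collections import Counter


def shard_files(files, processes):
    """Split `files` round-robin into at most `processes` non-empty shards."""
    if processes < 1:
        raise ValueError("processes must be at least 1")
    shards = [files[i::processes] for i in range(processes)]
    return [shard for shard in shards if shard]


def fake_worker(shard):
    """Hypothetical stand-in for one worker.

    Treats any file name ending in ".bad" as discarded and pretends that
    every other file yields 10 examples. No HDF5 output is written.
    """
    stats = Counter()
    for name in shard:
        stats["processed_files"] += 1
        if name.endswith(".bad"):
            stats["discarded_files"] += 1
        else:
            stats["successful_files"] += 1
            stats["examples"] += 10
    return stats


def merge_stats(per_worker_stats):
    """Sum the per-worker counters into a single plain dictionary."""
    total = Counter()
    for stats in per_worker_stats:
        total.update(stats)
    return dict(total)


files = ["a.jsonl", "b.jsonl", "c.bad", "d.jsonl", "e.jsonl"]
shards = shard_files(files, processes=2)
summary = merge_stats(fake_worker(shard) for shard in shards)
print(summary)
```

Summing counters with `Counter.update` keeps the merge independent of how many workers ran and of which keys each worker happened to populate.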