# Shuffling Samples for HDF5 Datasets of GPT Models

## Input file format

The script expects the dataset files to be in the `.h5` file format, located either directly in `input_dir` or in one of its subdirectories.
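As a rough illustration of the discovery step, the snippet below recursively collects every `.h5` file under `input_dir`. The function name `find_h5_files` is hypothetical and not part of the actual script; this is only a sketch of the scanning behavior described above.

```python
# Hypothetical helper (not the actual script): recursively discover .h5
# dataset files under input_dir, including any of its subdirectories.
import os


def find_h5_files(input_dir):
    """Return the sorted paths of every .h5 file under input_dir."""
    matches = []
    for root, _dirs, files in os.walk(input_dir):
        for name in files:
            if name.endswith(".h5"):
                matches.append(os.path.join(root, name))
    return sorted(matches)
```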

## Generating a Shuffled HDF5 Dataset

```bash
bash <path/to/input_dir> <path/to/output_dir> <num_chunks> <num_workers> <multi_modal_flag>
```

Here `num_chunks` is the number of output HDF5 files, `num_workers` is the number of worker processes, and `multi_modal_flag` indicates whether the dataset is multimodal. This flag should be either empty or `--multi_modal`.
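To make the `num_chunks` semantics concrete, here is an illustrative sketch, not the actual script, of shuffling sample indices and splitting them across `num_chunks` outputs, one per `data-XXXXX.h5` file. The function name and the round-robin assignment are assumptions for illustration.

```python
# Illustrative sketch (not the real implementation): shuffle all sample
# indices, then split them across num_chunks output files.
import random


def assign_samples_to_chunks(num_samples, num_chunks, seed=0):
    """Return a list of num_chunks lists of shuffled sample indices."""
    rng = random.Random(seed)  # fixed seed for a reproducible shuffle
    order = list(range(num_samples))
    rng.shuffle(order)
    # Round-robin assignment keeps the chunk sizes balanced.
    return [order[i::num_chunks] for i in range(num_chunks)]
```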

## Output file structure

The output directory will contain a set of `.h5` files, along with auxiliary directories, as shown below:

```
├── logs/
├── shuf_split/
├── workers/
├── 0/
├── 1/
├── 2/
├── ⋮
├── data-00000.h5
├── data-00001.h5
├── data-00002.h5
└── ⋮
```

- The numbered directories (`0/`, `1/`, ...) are temporary directories created while shuffling the samples; they can be removed after the run completes.
- `logs/` contains the logs of each worker. 
- `shuf_split/` contains a `.shuf` file for each HDF5 input file, indicating that all samples have been read from that file.
- `workers/` contains a `.txt` file for each worker, indicating that the worker has finished reading all samples from the HDF5 files assigned to it. These files are used to synchronize the workers.
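The sentinel-file synchronization in `workers/` can be sketched as follows. This is a hedged illustration, not the script's actual code; the function names, the `worker-<id>.txt` naming, and the polling loop are assumptions.

```python
# Sketch of sentinel-file synchronization: each worker drops a .txt marker
# into workers/ when it finishes reading, and callers poll until every
# expected marker exists.
import os
import time


def mark_done(workers_dir, worker_id):
    """Write this worker's completion marker (hypothetical naming scheme)."""
    os.makedirs(workers_dir, exist_ok=True)
    path = os.path.join(workers_dir, "worker-{}.txt".format(worker_id))
    with open(path, "w") as f:
        f.write("done")


def all_workers_done(workers_dir, num_workers):
    """True once at least num_workers completion markers exist."""
    markers = [n for n in os.listdir(workers_dir) if n.endswith(".txt")]
    return len(markers) >= num_workers


def wait_for_workers(workers_dir, num_workers, poll_s=0.01, timeout_s=5.0):
    """Poll until all workers are done or the timeout elapses."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if all_workers_done(workers_dir, num_workers):
            return True
        time.sleep(poll_s)
    return False
```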