HDF5 format for GPT dataloader#
Expected Dataset Structure by the Static Dataloader#
The current HDF5 dataloader for GPT models expects the input dataset to meet the structural criteria summarized below:
- The dataset files can be located in one directory, or in multiple directories passed as a list to the dataloader through the `data_dir` parameter. The files must have the `.h5` extension (not `.hdf5`, for example).
- Each HDF5 file must have an attribute `n_examples` that holds the number of input samples stored in that file.
- Each HDF5 file must store all of its input samples in an HDF5 dataset named `data`. The `data` dataset should have the dimensions `[n_examples, n_features, max_sequence_length]`.
- `n_features` must equal 3, as it refers to the features `input_ids`, `attention_mask`, and `labels`. The features must be written, and are read back, in this exact order.
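These criteria can be checked programmatically before handing a file to the dataloader. The snippet below is a minimal sketch of such a check; the helper name `validate_gpt_hdf5_file` and the default sequence length are assumptions for illustration and are not part of the dataloader API.

import h5py

def validate_gpt_hdf5_file(file_path, max_sequence_length=2048):
    # Hypothetical helper: verify the structure the static dataloader expects.
    assert file_path.endswith(".h5"), "files must use the .h5 extension"
    with h5py.File(file_path, mode="r") as h5_file:
        assert "n_examples" in h5_file.attrs, "missing the n_examples attribute"
        assert "data" in h5_file, "missing an HDF5 dataset named 'data'"
        data = h5_file["data"]
        n_examples = h5_file.attrs["n_examples"]
        # Expected shape: [n_examples, n_features, max_sequence_length] with n_features == 3.
        assert data.shape == (n_examples, 3, max_sequence_length), f"unexpected shape {data.shape}"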
Recommended Settings for Writing HDF5 Files#
To achieve high throughput with the HDF5 dataloader, enable `gzip` compression and chunk the dataset with a chunk size equal to the size of one data sample when producing the HDF5 files. Each data sample is then placed contiguously on disk, in a separate chunk from the other samples. Because compression is applied within each chunk, a sample can be decompressed without touching any other sample, which removes inter-sample data dependencies.
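If you already have HDF5 files and want to confirm they were produced with these settings, `h5py` exposes the chunk layout and the compression filter on the dataset object. A minimal sketch, assuming the file follows the structure described above:

import h5py

def check_storage_settings(file_path):
    # Inspect how the "data" dataset is laid out on disk.
    with h5py.File(file_path, mode="r") as h5_file:
        data = h5_file["data"]
        n_examples, n_features, max_sequence_length = data.shape
        print("compression:", data.compression)  # expected: "gzip"
        print("chunk shape:", data.chunks)       # expected: (1, n_features, max_sequence_length)
        assert data.compression == "gzip"
        assert data.chunks == (1, n_features, max_sequence_length)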
Note
Enabling `gzip` compression without chunking the dataset per sample can significantly hurt the performance of the dataloader and result in low throughput. This is because `gzip` relies on Lempel-Ziv coding (LZ77), which creates a dependency on preceding data while fetching the current data sample.
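A rough way to observe this effect is to write the same array twice, once with per-sample chunks and once letting `h5py` pick an automatic chunk layout (compressed HDF5 datasets are always chunked, but the automatic chunks generally span many samples), and then time random single-sample reads from each file. The file names, array sizes, and read counts below are placeholders for illustration; actual numbers depend on the data and the hardware.

import time

import h5py
import numpy as np

def time_random_reads(file_path, n_reads=200):
    # Time fetching individual samples in random order.
    with h5py.File(file_path, mode="r") as h5_file:
        data = h5_file["data"]
        indices = np.random.randint(0, data.shape[0], size=n_reads)
        start = time.perf_counter()
        for idx in indices:
            _ = data[idx]
        return time.perf_counter() - start

samples = np.random.randint(0, 50000, size=(1000, 3, 2048), dtype=np.int32)

# gzip with per-sample chunks: each sample decompresses independently.
with h5py.File("chunked.h5", mode="w") as f:
    f.create_dataset("data", data=samples, chunks=(1, 3, 2048), compression="gzip")

# gzip with automatic chunking: one read may decompress data belonging to many samples.
with h5py.File("auto_chunked.h5", mode="w") as f:
    f.create_dataset("data", data=samples, compression="gzip")

print("per-sample chunks:", time_random_reads("chunked.h5"))
print("automatic chunks: ", time_random_reads("auto_chunked.h5"))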
Example (writing HDF5 file)#
import h5py
import numpy as np

def write_hdf5_file():
    n_examples = 50000
    max_sequence_length = 2048
    file_path = "/home/user/data_file.h5"
    gpt_data = np.empty([n_examples, 3, max_sequence_length], dtype=np.int32)
    with h5py.File(file_path, mode='w') as h5_file:
        h5_file.attrs["n_examples"] = n_examples  # number of examples attribute
        h5_file.create_dataset(
            "data",                              # Name of dataset (must NOT change)
            data=gpt_data,                       # NumPy array that contains the data
            dtype="i4",                          # 32-bit int data type, matches the data
            chunks=(1, 3, max_sequence_length),  # chunk size equal to the data sample size
            compression="gzip",                  # enable gzip compression
        )

def read_hdf5_file(file_path):
    with h5py.File(file_path, mode='r') as h5_file:
        n_examples = h5_file.attrs["n_examples"]
        for idx in range(n_examples):
            data_sample = h5_file["data"][idx]
            data_features = {}
            data_features["input_ids"] = data_sample[0]
            data_features["attention_mask"] = data_sample[1]
            data_features["labels"] = data_sample[2]
            yield data_features
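A quick way to exercise the two functions above is to write the file once and then pull the first sample from the generator returned by `read_hdf5_file`; the path below is the same placeholder used in the example.

if __name__ == "__main__":
    write_hdf5_file()
    # Read back the first sample and inspect the three feature arrays.
    first_sample = next(read_hdf5_file("/home/user/data_file.h5"))
    print(first_sample["input_ids"].shape)       # (2048,)
    print(first_sample["attention_mask"].shape)  # (2048,)
    print(first_sample["labels"].shape)          # (2048,)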