HDF5 format for GPT dataloader

Dataset Structure Expected by the Static Dataloader

The current HDF5 dataloader for GPT models expects the input dataset to meet the following structural requirements:

  • The dataset files can reside in a single directory or in multiple directories, passed to the dataloader as a list through the data_dir parameter. The files must have the .h5 extension; other extensions, such as .hdf5, are not accepted.

  • Each HDF5 file must have an attribute n_examples that stores the number of input samples in that file.

  • Each HDF5 file must store all input samples in a single HDF5 dataset named data, with dimensions [n_examples, n_features, max_sequence_length]. n_features must equal 3, one per feature: input_ids, attention_mask, and labels. The features must be written, and are read back, in exactly this order.
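
The layout described above can be illustrated with a minimal sketch that packs tokenized sequences into an array of shape [n_examples, 3, max_sequence_length]. The helper names pack_example and pack_examples, the pad_id parameter, and the choice of labels as the input tokens shifted by one position are assumptions made for illustration; they are not specified by the dataloader requirements above.

```python
import numpy as np

def pack_example(token_ids, max_sequence_length, pad_id=0):
    """Pack one token sequence into a [3, max_sequence_length] int32 array.

    Row order is input_ids, attention_mask, labels.
    Assumption: labels are the input tokens shifted left by one position.
    """
    # One extra token is needed so inputs and shifted labels stay aligned.
    tokens = list(token_ids)[: max_sequence_length + 1]
    n_valid = len(tokens) - 1
    if n_valid < 1:
        raise ValueError("each sequence needs at least two tokens")

    sample = np.full((3, max_sequence_length), pad_id, dtype=np.int32)
    sample[0, :n_valid] = tokens[:-1]   # input_ids
    sample[1, :] = 0                    # attention_mask: 0 on padding
    sample[1, :n_valid] = 1             # attention_mask: 1 on real tokens
    sample[2, :n_valid] = tokens[1:]    # labels (assumed next-token targets)
    return sample

def pack_examples(token_id_lists, max_sequence_length, pad_id=0):
    """Stack packed samples into [n_examples, 3, max_sequence_length]."""
    return np.stack(
        [pack_example(t, max_sequence_length, pad_id) for t in token_id_lists]
    )

# Two short hypothetical sequences packed to length 4.
gpt_data = pack_examples([[5, 6, 7, 8], [9, 10]], max_sequence_length=4)
```

An array built this way has the three features along the second axis in the required order, so it could be passed as the data argument when creating the data dataset.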

Example (writing and reading an HDF5 file)

import h5py
import numpy as np

def write_hdf5_file():
  n_examples = 50000
  max_sequence_length = 2048
  file_path = "/home/user/data_file.h5"
  # Placeholder array with uninitialized values; replace with real tokenized data
  gpt_data = np.empty([n_examples, 3, max_sequence_length], dtype=np.int32)
  with h5py.File(file_path, mode='w') as h5_file:
    h5_file.attrs["n_examples"] = n_examples  # number of examples attribute
    h5_file.create_dataset(
        "data",                               # Name of dataset (must NOT change)
        data=gpt_data,                        # Numpy array that contains data
        dtype="i4",                           # Data type 32-bit int matches the data
        chunks=(1, 3, max_sequence_length),   # chunk size equal to the data sample size
        compression="gzip",                   # Enabling gzip compression
    )

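def read_hdf5_file(file_path):
  # Generator: yields one dict of features per example, in storage order
  with h5py.File(file_path, mode='r') as h5_file:
    n_examples = int(h5_file.attrs["n_examples"])
    data = h5_file["data"]                    # look up the dataset once, outside the loop
    for idx in range(n_examples):
      data_sample = data[idx]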
      data_features = {}
      data_features["input_ids"] = data_sample[0]
      data_features["attention_mask"] = data_sample[1]
      data_features["labels"] = data_sample[2]
      yield data_features
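
Before handing files to the dataloader, it can help to check them against the structure requirements listed earlier. The following is a minimal sketch, not part of the dataloader itself: find_h5_files and check_h5_file are hypothetical helper names, and the non-recursive directory search is an assumption about how files are collected.

```python
from pathlib import Path

import h5py

def find_h5_files(data_dirs):
    """Collect files ending in .h5 from one directory or a list of directories.

    Assumption: only the top level of each directory is searched.
    """
    if isinstance(data_dirs, (str, Path)):
        data_dirs = [data_dirs]
    files = []
    for directory in data_dirs:
        # "*.h5" does not match other extensions such as ".hdf5".
        files.extend(sorted(Path(directory).glob("*.h5")))
    return files

def check_h5_file(file_path):
    """Return a list of problems; an empty list means the checks passed."""
    problems = []
    with h5py.File(file_path, mode="r") as h5_file:
        n_examples = h5_file.attrs.get("n_examples")
        if n_examples is None:
            problems.append("missing attribute 'n_examples'")

        data = h5_file.get("data")
        if not isinstance(data, h5py.Dataset):
            problems.append("missing dataset 'data'")
            return problems

        shape = data.shape
        if len(shape) != 3 or shape[1] != 3:
            problems.append(
                f"'data' has shape {shape}, "
                "expected [n_examples, 3, max_sequence_length]"
            )
        elif n_examples is not None and shape[0] != int(n_examples):
            problems.append(
                f"n_examples is {int(n_examples)} but 'data' holds {shape[0]} samples"
            )
    return problems
```

A typical use would be to loop over find_h5_files(data_dir) and report any file for which check_h5_file returns a non-empty list.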