HDF5 format for GPT dataloader#
Expected Dataset Structure by the Static Dataloader#
The current HDF5 dataloader for GPT models expects the input dataset to meet the structural criteria summarized below:
- The dataset files can be located in one directory, or in multiple directories passed as a list to the dataloader through the `data_dir` parameter. The files must have the `.h5` extension (not `.hdf5`, for example).
- Each HDF5 file must have an attribute `n_examples` that holds the number of input samples stored in that file.
- Each HDF5 file must store all of its input samples in an HDF5 dataset named `data`. The `data` dataset should have the dimensions `[n_examples, n_features, max_sequence_length]`.
- `n_features` must equal 3, as it refers to the features `input_ids`, `attention_mask`, and `labels`. The features must be written, and are read back, in this exact order.
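These criteria can be checked programmatically before handing a file to the dataloader. The snippet below is a minimal sketch of such a check; the helper name `validate_gpt_hdf5_file` and the default sequence length are assumptions for illustration and are not part of the dataloader API.

import h5py

def validate_gpt_hdf5_file(file_path, max_sequence_length=2048):
    # Hypothetical helper: verify the structure the static dataloader expects.
    assert file_path.endswith(".h5"), "files must use the .h5 extension"
    with h5py.File(file_path, mode="r") as h5_file:
        assert "n_examples" in h5_file.attrs, "missing the n_examples attribute"
        assert "data" in h5_file, "missing an HDF5 dataset named 'data'"
        data = h5_file["data"]
        n_examples = h5_file.attrs["n_examples"]
        # Expected shape: [n_examples, n_features, max_sequence_length] with n_features == 3.
        assert data.shape == (n_examples, 3, max_sequence_length), f"unexpected shape {data.shape}"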
Recommended Settings for Writing HDF5 Files#
To achieve high throughput with the HDF5 dataloader, enable `gzip` compression and chunk the dataset with a chunk size equal to the size of one data sample when producing the HDF5 files. Each data sample is then placed contiguously on disk, in a separate chunk from the other samples. Because compression is applied within each chunk, a sample can be decompressed without touching any other sample, which removes inter-sample data dependencies.
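If you already have HDF5 files and want to confirm they were produced with these settings, `h5py` exposes the chunk layout and the compression filter on the dataset object. A minimal sketch, assuming the file follows the structure described above:

import h5py

def check_storage_settings(file_path):
    # Inspect how the "data" dataset is laid out on disk.
    with h5py.File(file_path, mode="r") as h5_file:
        data = h5_file["data"]
        n_examples, n_features, max_sequence_length = data.shape
        print("compression:", data.compression)  # expected: "gzip"
        print("chunk shape:", data.chunks)       # expected: (1, n_features, max_sequence_length)
        assert data.compression == "gzip"
        assert data.chunks == (1, n_features, max_sequence_length)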
Note
Enabling `gzip` compression without chunking the dataset per sample can significantly hurt the performance of the dataloader and result in low throughput. This is because `gzip` relies on Lempel-Ziv coding (LZ77), which creates a dependency on preceding data while fetching the current data sample.
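A rough way to observe this effect is to write the same array twice, once with per-sample chunks and once letting `h5py` pick an automatic chunk layout (compressed HDF5 datasets are always chunked, but the automatic chunks generally span many samples), and then time random single-sample reads from each file. The file names, array sizes, and read counts below are placeholders for illustration; actual numbers depend on the data and the hardware.

import time

import h5py
import numpy as np

def time_random_reads(file_path, n_reads=200):
    # Time fetching individual samples in random order.
    with h5py.File(file_path, mode="r") as h5_file:
        data = h5_file["data"]
        indices = np.random.randint(0, data.shape[0], size=n_reads)
        start = time.perf_counter()
        for idx in indices:
            _ = data[idx]
        return time.perf_counter() - start

samples = np.random.randint(0, 50000, size=(1000, 3, 2048), dtype=np.int32)

# gzip with per-sample chunks: each sample decompresses independently.
with h5py.File("chunked.h5", mode="w") as f:
    f.create_dataset("data", data=samples, chunks=(1, 3, 2048), compression="gzip")

# gzip with automatic chunking: one read may decompress data belonging to many samples.
with h5py.File("auto_chunked.h5", mode="w") as f:
    f.create_dataset("data", data=samples, compression="gzip")

print("per-sample chunks:", time_random_reads("chunked.h5"))
print("automatic chunks: ", time_random_reads("auto_chunked.h5"))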
Example (writing HDF5 file)#
import h5py
import numpy as np

def write_hdf5_file():
    n_examples = 50000
    max_sequence_length = 2048
    file_path = "/home/user/data_file.h5"
    gpt_data = np.empty([n_examples, 3, max_sequence_length], dtype=np.int32)
    with h5py.File(file_path, mode='w') as h5_file:
        h5_file.attrs["n_examples"] = n_examples  # number of examples attribute
        h5_file.create_dataset(
            "data",                              # Name of dataset (must NOT change)
            data=gpt_data,                       # NumPy array that contains the data
            dtype="i4",                          # 32-bit int data type, matches the data
            chunks=(1, 3, max_sequence_length),  # chunk size equal to the data sample size
            compression="gzip",                  # enable gzip compression
        )

def read_hdf5_file(file_path):
    with h5py.File(file_path, mode='r') as h5_file:
        n_examples = h5_file.attrs["n_examples"]
        for idx in range(n_examples):
            data_sample = h5_file["data"][idx]
            data_features = {}
            data_features["input_ids"] = data_sample[0]
            data_features["attention_mask"] = data_sample[1]
            data_features["labels"] = data_sample[2]
            yield data_features
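A quick way to exercise the two functions above is to write the file once and then pull the first sample from the generator returned by `read_hdf5_file`; the path below is the same placeholder used in the example.

if __name__ == "__main__":
    write_hdf5_file()
    # Read back the first sample and inspect the three feature arrays.
    first_sample = next(read_hdf5_file("/home/user/data_file.h5"))
    print(first_sample["input_ids"].shape)       # (2048,)
    print(first_sample["attention_mask"].shape)  # (2048,)
    print(first_sample["labels"].shape)          # (2048,)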