Creating HDF5 dataset for GPT models#

Overview#

We provide two methods for generating Hierarchical Data Format (HDF5) files (.h5) that you can use in the input pipeline to implement an efficient data loader for GPT-style models.

If you have a PyTorch dataset and need to convert it to an HDF5 format, follow section Converting a PyTorch dataset to HDF5 format.

If you have raw data and want to convert it to an HDF5 dataset, follow section Converting raw data to HDF5 data.

Converting a PyTorch dataset to HDF5 format#

Suppose you have a PyTorch dataset for GPT models (from sources such as HuggingFace, Map-Style, or Iterable). In that case, you can write the samples of your dataset in HDF5 format for use with the Cerebras-optimized HDF5 DataProcessor by calling the function convert_dataset_to_HDF5(), which is defined in convert_dataset_to_HDF5.py.

The function convert_dataset_to_HDF5() uses a PyTorch DataLoader to fetch samples from the specified dataset and writes those samples to .h5 files. The following table explains the arguments to the convert_dataset_to_HDF5() function:

Table 1: convert_dataset_to_HDF5 Arguments#

Argument	Default Value	Description
`dataset`	N/A	PyTorch dataset to fetch the data from (IterableDataset or Dataset).
`output_dir`	./hdf5_dataset/	Directory where the HDF5 files will be stored.
`name`	dataset-partition	Name of the dataset; i.e. prefix to use for HDF5 file names.
`samples_per_file`	2000	Number of samples written to each HDF5 file.
`num_workers`	8	Number of Python processes to use for generating data.
`batch_size`	64	The batch size to use when fetching the data.
`data_collator`	N/A	Merges a list of samples to form a mini-batch of Tensor(s).
`dtype`	i4	Data type for the HDF5 dataset.
`compression`	gzip	HDF5 Compression strategy.

While the function convert_dataset_to_HDF5() is generic and can be used with all transformer models, note that the PyTorch dataset features dictionary should have the following keys and values for GPT models:

input_ids: Input token IDs, padded with 0 to max_sequence_length.
- Shape: (batch_size, max_sequence_length)
- Type: torch.int32
attention_mask: Mask for padded positions. Has values 0 on the padded positions and 1 elsewhere.
- Shape: (batch_size, max_sequence_length)
- Type: torch.int32
labels: Labels for language modeling pre-training task, padded with 0 to max_sequence_length.
- Shape: (batch_size, max_sequence_length)
- Type: torch.int32
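
The layout above can be checked before conversion with a small helper. The following is a minimal sketch, not part of the Cerebras tooling: the function name validate_gpt_features is hypothetical, and nested Python lists stand in for torch.int32 tensors so that the example has no dependencies.

```python
REQUIRED_KEYS = ("input_ids", "attention_mask", "labels")


def validate_gpt_features(features, max_sequence_length):
    """Check that a batch of features matches the layout described above.

    `features` maps each key to a list of rows (one row per sample); nested
    lists stand in for torch.int32 tensors (an assumption of this sketch).
    Returns the batch size, or raises ValueError on the first problem found.
    """
    missing = [key for key in REQUIRED_KEYS if key not in features]
    if missing:
        raise ValueError(f"missing keys: {missing}")

    batch_size = len(features["input_ids"])
    for key in REQUIRED_KEYS:
        rows = features[key]
        if len(rows) != batch_size:
            raise ValueError(f"{key}: expected {batch_size} rows, got {len(rows)}")
        for row in rows:
            if len(row) != max_sequence_length:
                raise ValueError(
                    f"{key}: row length {len(row)} != {max_sequence_length}"
                )

    # attention_mask must be 0 on padded positions and 1 elsewhere.
    for row in features["attention_mask"]:
        if any(value not in (0, 1) for value in row):
            raise ValueError("attention_mask must contain only 0 and 1")
    return batch_size


batch = {
    "input_ids": [[5, 6, 7, 0], [8, 9, 0, 0]],
    "attention_mask": [[1, 1, 1, 0], [1, 1, 0, 0]],
    "labels": [[6, 7, 3, 0], [9, 3, 0, 0]],
}
print(validate_gpt_features(batch, max_sequence_length=4))
```

With real tensors, the same checks would compare tensor.shape against (batch_size, max_sequence_length) and tensor.dtype against torch.int32.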

NOTE: For more information on using HuggingFace datasets, refer to Using HuggingFace datasets for auto-regressive LM.

Converting raw data to HDF5 data#

We currently offer three modes for generating HDF5 files:

LMData: for processing language modeling datasets in .jsonl or .txt format.
Summarization: for processing fine-tuning datasets in .txt format.
Customize: for any dataset format, but requires a module specifying how to read the raw dataset files.

Run the script create_hdf5_dataset.py with the appropriate subcommand (mode) to generate the .h5 files for GPT-style models. Each subcommand takes a set of arguments, described below in the Generating HDF5 files section.

Set up environment#

The following setup is needed to enable a clean run of the script.

NOTE: This assumes commands are run from here.

For setting up a Python virtual environment:

python -m venv data_env
source ./data_env/bin/activate

pip install --upgrade pip
pip install -r requirements.txt

Ignore error messages such as:

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed

as they shouldn’t affect the rest of the steps.

Input files format#

Ensure the input text documents are in one of the supported file formats before using the provided script (this does not apply to the Customize mode). The acceptable file formats are '.jsonl', '.json.gz', '.jsonl.zst', '.jsonl.zst.tar', '.txt'. These files should hold the data in the structure described in the Input data format section.

To process the files efficiently, we recommend that each file in any of the above formats other than .txt contains a substantial amount of text. The recommended size for each file is on the order of gigabytes.

Conversely, if you are processing smaller files in .txt format, provide a metadata file containing a list of paths to these files to make better use of multi-processing.
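
A metadata file for many small .txt files can be produced with a few lines of Python. This is a minimal sketch under stated assumptions: the helper name write_metadata_file is hypothetical, and the one-path-per-line layout is an assumption that should be checked against what the script expects.

```python
import tempfile
from pathlib import Path


def write_metadata_file(input_dir, metadata_path, pattern="*.txt"):
    """Write the paths of all files matching `pattern` to `metadata_path`.

    Assumption: the metadata file holds one path per line.
    Keep the metadata file outside `input_dir` so that it is never
    listed as an input document itself.
    Returns the list of paths written, sorted for reproducibility.
    """
    paths = sorted(str(p) for p in Path(input_dir).rglob(pattern) if p.is_file())
    Path(metadata_path).write_text("\n".join(paths) + ("\n" if paths else ""))
    return paths


# Small demonstration on a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    data_dir = Path(tmp) / "raw"
    data_dir.mkdir()
    for name in ("b.txt", "a.txt", "notes.json"):
        (data_dir / name).write_text("sample text\n")
    metadata_path = Path(tmp) / "metadata.txt"
    listed = write_metadata_file(data_dir, metadata_path)
    written = metadata_path.read_text().splitlines()

print([Path(p).name for p in listed])
```

The resulting file can then be passed through the metadata_files argument described in Table 2.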

Input data format#

As mentioned above, the preprocessing script accepts two primary kinds of input files: .json-based or .txt-based. For each kind, the input data must follow a specific structure to be converted accurately into HDF5 files.

Format for `jsonl` files#

The raw text and metadata for generation should be represented in .jsonl-based files as:

{"text": "Any text excerpt from the dataset of interest...\nThe new lines should be represented as a newline character.",
"meta": {"info": "any other metadata dict"}}

For the jsonl files, as shown above, the raw text is extracted from key=text in the input files by default. If your input files do not contain a text key, identify the key that holds the text you need to extract, and pass it with the command line argument --jsonl_key=<your key name>.

For example, if your jsonl files have the content such as the following:

{"idea": "Any text excerpt from the dataset of interest, with custom key: 'idea'...\nThe new lines should be represented as a newline character.",
"meta": {"info": "any other metadata dict"}}

then you’d need to pass --jsonl_key=idea in the command line arguments.
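
To illustrate what selecting a key means, the sketch below pulls the text out of jsonl records by a configurable key. It is a simplified stand-in and not the script's reader; the function name iter_jsonl_text is hypothetical.

```python
import json


def iter_jsonl_text(lines, jsonl_key="text"):
    """Yield the string stored under `jsonl_key` for each JSON line."""
    for line_number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        if jsonl_key not in record:
            raise KeyError(f"line {line_number}: no key {jsonl_key!r}")
        yield record[jsonl_key]


# Two records that store their text under the custom key "idea".
sample = [
    '{"idea": "First document.\\nSecond line.", "meta": {"info": "a"}}',
    '{"idea": "Another document.", "meta": {"info": "b"}}',
]
texts = list(iter_jsonl_text(sample, jsonl_key="idea"))
print(texts)
```

Running the same records with the default key would raise a KeyError, which is the situation the --jsonl_key argument is meant to avoid.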

Format for `txt` based files in `LMData` mode#

In .txt-based files, always represent the raw text for generation as:

Any text excerpt from the dataset of interest...
The new lines may not be represented as a newline character.

Note that there are no special tags or anchors in the above. If any are present, the whole file, tags included, is treated as a single document and may not represent natural language.

For example, the following text is tokenized in its entirety, tags included:

<DOC>
<DOCNO>TRC2-2008-01-01-0000</DOCNO>
<BLOGS08DAY>-13</BLOGS08DAY>
<CONTENT>
Example content in the format that may be outside of the
</DOCS>

Definition of vocab_file and encoder_file#

The script supports two kinds of tokenizers: GPT2Tokenizer and NeoXTokenizer.

Supply the correct vocab_file and encoder_file when using the desired tokenizer.

For GPT2Tokenizer, vocab_file=gpt2-vocab.bpe and encoder_file=gpt2-encoder.json
For NeoXTokenizer, encoder_file=/neox-encoder.json

These files can be found here.

Note: For the GPT2Tokenizer, we follow the nomenclature used by OpenAI in their implementation which is slightly different from Hugging Face’s nomenclature where they call the vocab_file as merges_file and encoder_file as vocab_file. However, the content of the files is the same. For NeoXTokenizer, we use the same nomenclature to avoid confusion.

Generating HDF5 files#

Once you have a text dataset that meets the above requirement, you can generate HDF5 files using the create_hdf5_dataset.py script:

python create_hdf5_dataset.py [mode] [--arguments]

As mentioned before, the mode can be one of {LMData, Summarization, Customize}. The three modes share the same setup and processing arguments but differ in their dataset arguments, as detailed below:

Table 2: Setup Arguments#

Argument	Default Value	Description
`params`	N/A	Path to YAML config file for setting dataset preprocessing parameters. Optional alternative for providing command line arguments.
`input_dir`	N/A	Directory where raw data is stored. Supports only the formats: [`'.jsonl', '.jsonl.zst', '.jsonl.zst.tar', '.txt'`].
`metadata_files`	N/A	Path to text file containing a list of file names corresponding to the raw input documents to be processed and stored; can handle multiple metadata files separated by comma.
`output_dir`	`./data_dir/`	Directory where HDF5 files will be stored.
`processes`	cpu count	Number of processes to use.
`module`	N/A	Name of the Python file that contains the custom dataset processor; for `Customize` mode only.
`dataset_processor`	N/A	Name of the custom dataset processor for `Customize` mode only.

Note: You have to provide either the input_dir or the metadata_files argument. If you provide both, only the files referenced in metadata_files are processed.

Table 3: Processing Arguments#

Argument	Default Value	Description
`tokenizer_type`	required arg	Type of tokenizer to use for HDF5 dataset generation. Can be one of `GPT2Tokenizer` or `NeoXTokenizer`.
`vocab_file`	N/A	Path to the vocabulary file.
`encoder_file`	N/A	Path to the encoder file.
`max_seq_length`	`2048`	Maximum sequence length.
`short_seq_prob`	`0.0`	Probability of creating sequences which are shorter than the maximum sequence length.
`output_name`	`examples`	Name of the dataset; i.e. prefix to use for HDF5 file names.
`files_per_record`	`50000`	Text files to write per HDF5 file.
`write_in_batch`	`False`	Whether to write the samples in batches for the HDF5 format. Setting it to False saves memory but is slightly slower.
`write_remainder`	`True`	Write the remainder files when data is left over from processing.
`resume_from_checkpoint`	`False`	Resume record writing from a given checkpoint.
`display_pbar`	`True`	Display a progress bar while the script runs.
`seed`	`0`	Random seed.

Table 4: Dataset Arguments (`LMData` mode)#

Argument	Default Value	Description
`ftfy`	`False`	Fix text with ftfy.
`ftfy_normalizer`	`NFC`	Choose what kind of unicode normalization is applied. Usually, we apply `NFC` normalization, so that letters followed by combining characters become single combined characters. Using `None` applies no normalization while fixing text.
`wikitext_detokenize`	`False`	Use wikitext detokenizer to fix text.
`jsonl_key`	`text`	The key name in input jsonl files from which the raw text will be extracted in order to further process it.
`pack_sequences`	`True`	Concatenate a document smaller than maximum sequence length with other documents, instead of filling it with Padding token.
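
The difference between packing and padding can be sketched as follows. This is a simplified illustration and not the script's algorithm: the function names are hypothetical, and appending an end-of-document token (eos_id) after each document is an assumption made here to keep document boundaries visible.

```python
def pack_documents(tokenized_docs, max_seq_length, eos_id, pad_id=0):
    """Concatenate tokenized documents and cut fixed-length sequences.

    Each document is followed by `eos_id` (an assumption of this sketch).
    Only the final sequence can need padding.
    """
    stream = []
    for doc in tokenized_docs:
        stream.extend(doc)
        stream.append(eos_id)

    sequences = []
    for start in range(0, len(stream), max_seq_length):
        chunk = stream[start:start + max_seq_length]
        chunk += [pad_id] * (max_seq_length - len(chunk))
        sequences.append(chunk)
    return sequences


def pad_documents(tokenized_docs, max_seq_length, eos_id, pad_id=0):
    """One sequence per document, truncated or padded (no packing)."""
    sequences = []
    for doc in tokenized_docs:
        chunk = (doc + [eos_id])[:max_seq_length]
        chunk += [pad_id] * (max_seq_length - len(chunk))
        sequences.append(chunk)
    return sequences


docs = [[11, 12], [21], [31]]
packed = pack_documents(docs, max_seq_length=4, eos_id=99)
padded = pad_documents(docs, max_seq_length=4, eos_id=99)
print(len(packed), len(padded))
```

In this toy input, packing produces two sequences with a single padding token, while padding produces three sequences of which more than half the positions are padding.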

Table 5: Dataset Arguments (`Summarization` mode)#

Argument	Default Value	Description
`ftfy`	`False`	Fix text with ftfy.
`ftfy_normalizer`	`NFC`	Choose what kind of unicode normalization is applied. Usually, we apply `NFC` normalization, so that letters followed by combining characters become single combined characters. Using `None` applies no normalization while fixing text.
`wikitext_detokenize`	`False`	Use wikitext detokenizer to fix text.
`sep_token`	`<\|sep\|>`	Token added between prompt and completion in preprocessed sequences.
`prompt_key`	required arg	Json key for the prompt.
`completion_key`	required arg	Json key for the completion.

Usage of create_hdf5_dataset.py file#

You can provide the above arguments either as command line arguments or in a YAML config file:

Command line

python create_hdf5_dataset.py LMData --input_dir /path/to/data --tokenizer_type NeoXTokenizer --encoder_file /path/to/encoder --max_seq_length 4096 --ftfy True --pack_sequences False

YAML config file

python create_hdf5_dataset.py LMData --params ./configs/autoregressive_lm_preprocessing.yaml

Sample YAML files for LMData and Summarization are located in the Cerebras Model Zoo.

Note: You can use both, but command-line arguments override the values of any arguments that also appear in the YAML configuration file.
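
The precedence rule can be pictured as a dictionary merge. The sketch below is an illustration only, not the script's implementation: merge_params is a hypothetical helper, the flat dictionary layout is an assumption (the real YAML files may be nested), and an argument left unset on the command line is modelled as None.

```python
def merge_params(yaml_params, cli_args):
    """Return the effective settings: YAML values overridden by CLI values.

    Assumption: an argument that was not given on the command line is None
    and therefore does not override the YAML value.
    """
    merged = dict(yaml_params)
    for name, value in cli_args.items():
        if value is not None:
            merged[name] = value
    return merged


yaml_params = {"max_seq_length": 2048, "ftfy": True, "output_dir": "./data_dir/"}
cli_args = {"max_seq_length": 4096, "ftfy": None}

effective = merge_params(yaml_params, cli_args)
print(effective)
```

Here max_seq_length comes from the command line, while ftfy and output_dir keep their YAML values.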

Customize mode steps#

Create a Python file, or add your code to ./hdf5_dataset_preprocessors.py.
Import HDF5Preprocessor in that file as follows:

from modelzoo.transformers.data_processing.scripts.hdf5_preprocessing.hdf5_preprocessor import HDF5Preprocessor

Create a class that inherits from HDF5Preprocessor (e.g., CustomDataset).
Implement __init__ so that it takes as input a dictionary containing the dataset parameters needed by HDF5Preprocessor.
Implement the methods file_read_generator and preprocessing_generator, following the Write customized preprocessor section.
Run the create_hdf5_dataset.py script.

Write customized preprocessor#

You can create customized preprocessors for various datasets or objectives. We provide two reference implementations in hdf5_dataset_preprocessors.py:

LMDataPreprocessor: the preprocessor for autoregressive language modeling tasks
SummarizationPreprocessor: the preprocessor for summarization tasks

They both inherit from the HDF5BasePreprocessor at hdf5_base_preprocessor.py with two functions that can be overridden to customize for various cases:

file_read_generator() takes a file path, reads from the file, and yields the corresponding text documents. You can customize how the file is read based on its format (e.g., csv, zip). Our default preprocessors use the lm_dataformat reader with specific JSON keys.
preprocessing_generator() takes the output of file_read_generator(), performs tokenization and other preprocessing, and yields the data samples in np.array format.
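
As an example of a custom reader, the sketch below yields one document per CSV row. It runs on its own, so BasePreprocessorStub is only a stand-in for the Model Zoo base class; CsvPreprocessor and the text_column parameter are hypothetical names introduced for illustration, and the real base class may expect a different constructor or method signature.

```python
import csv
import os
import tempfile


class BasePreprocessorStub:
    """Stand-in for the Model Zoo base class so the sketch runs on its own."""

    def __init__(self, params):
        self.params = params


class CsvPreprocessor(BasePreprocessorStub):
    """Hypothetical preprocessor that reads one document per CSV row."""

    def __init__(self, params):
        super().__init__(params)
        # Hypothetical parameter: the CSV column that holds the raw text.
        self.text_column = params.get("text_column", "text")

    def file_read_generator(self, file_path):
        """Yield the text of every row whose text column is non-empty."""
        with open(file_path, newline="", encoding="utf-8") as handle:
            for row in csv.DictReader(handle):
                text = (row.get(self.text_column) or "").strip()
                if text:
                    yield text


# Small demonstration on a temporary CSV file.
with tempfile.TemporaryDirectory() as tmp:
    csv_path = os.path.join(tmp, "docs.csv")
    with open(csv_path, "w", newline="", encoding="utf-8") as handle:
        handle.write("id,body\n1,Hello world\n2,\n3,Second doc\n")
    reader = CsvPreprocessor({"text_column": "body"})
    documents = list(reader.file_read_generator(csv_path))

print(documents)
```

A generator is used so that large files are streamed row by row instead of being loaded into memory at once.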

For example, in the autoregressive language modeling task, file_read_generator yields a str object, and preprocessing_generator produces an np.array with shape [3, max_sequence_length], with the following three features concatenated on the first dimension:

input_ids: Input token ids, padded with 0’s to max_sequence_length.
input_mask: Loss mask for the sequence. It has 0’s on positions excluded from the loss, such as prompts or padding tokens, and 1’s elsewhere.
labels: input_ids shifted to the right by one position as the target labels.
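
The three features can be assembled as in the following sketch. It is a simplified illustration rather than the reference preprocessors' code: build_sample is a hypothetical helper, plain lists stand in for the np.array output, and the shift is interpreted as next-token prediction (labels[i] is the token that follows input_ids[i]), which is an assumption.

```python
def build_sample(prompt_ids, completion_ids, max_sequence_length, pad_id=0):
    """Return [input_ids, input_mask, labels], each of max_sequence_length."""
    tokens = prompt_ids + completion_ids
    tokens = tokens[: max_sequence_length + 1]  # one extra token for the shift
    input_ids = tokens[:-1]
    labels = tokens[1:]

    # The loss is computed only where the label is a completion token.
    first_completion_label = max(len(prompt_ids) - 1, 0)
    input_mask = [
        1 if position >= first_completion_label else 0
        for position in range(len(labels))
    ]

    pad = max_sequence_length - len(input_ids)
    return [
        input_ids + [pad_id] * pad,
        input_mask + [0] * pad,
        labels + [pad_id] * pad,
    ]


sample = build_sample([1, 2, 3], [4, 5], max_sequence_length=6)
for row in sample:
    print(row)
```

For plain language modeling without a prompt, passing an empty prompt_ids list leaves the mask at 1 on every non-padded position.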

Best practices#

It is recommended to use the ftfy module to fix the datasets. Enable with the --ftfy argument.
The NeoXTokenizer uses the HuggingFace library’s inbuilt tokenizer and handles NFC normalization independently. When using this tokenizer_type, set the --ftfy_normalizer argument to None. For the GPT2Tokenizer, use the default NFC value for the normalizer.
To process HDF5 for training, we recommend using multi-processing. We also suggest using several input files, such that the total number of input files is greater than or equal to the number of processes provided by --processes. Note that this requires a high-spec CPU server that can handle the concurrently running processes in RAM as well as the I/O for reads and writes. If the I/O of the server is slow, the processes may appear to hang for a long time.
For very large datasets (several files, each on the order of gigabytes), we recommend splitting the data into smaller subsets and writing out each subset separately. You can then place all the HDF5 files in a common folder for use by the data pipeline, or provide the locations of each subset in a list. The overall time to write out the HDF5 files depends on the CPU server used.
It is better to split the input dataset into multiple files with similar sizes to leverage the full potential of parallel processing.
For CodeGen models processing, please use GPT2Tokenizer along with the updated vocab files such that the vocabulary of GPT-2 is extended by special tokens representing repeating tokens of tabs and white spaces.
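
The advice above about splitting the input into files of similar size can be approximated with a greedy heuristic. This is a minimal sketch and not part of the provided scripts: split_evenly is a hypothetical helper, and document length in characters is used as a rough proxy for file size.

```python
import heapq


def split_evenly(documents, num_files):
    """Group documents so that the total length per group is similar.

    Greedy heuristic: take documents from longest to shortest and place
    each one into the group that is currently the smallest.
    """
    if num_files < 1:
        raise ValueError("num_files must be at least 1")
    heap = [(0, index) for index in range(num_files)]  # (total_length, group)
    heapq.heapify(heap)
    groups = [[] for _ in range(num_files)]
    for doc in sorted(documents, key=len, reverse=True):
        total, index = heapq.heappop(heap)
        groups[index].append(doc)
        heapq.heappush(heap, (total + len(doc), index))
    return groups


documents = ["a" * 8, "b" * 7, "c" * 5, "d" * 4, "e" * 2]
groups = split_evenly(documents, num_files=2)
group_sizes = [sum(len(doc) for doc in group) for group in groups]
print(group_sizes)
```

Each group can then be written to its own input file, ideally with at least as many files as the value passed to --processes.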

Output files structure#

The output directory will contain many h5 files, as shown below (with two processes):

<path/to/output_dir>
├── checkpoint_0.txt
├── checkpoint_1.txt
├── data_params.json
├── examples_0_0.h5
├── examples_0_1.h5
├── examples_1_0.h5
├── examples_1_1.h5
├── examples_2_0.h5
├── examples_2_1.h5
├── examples_3_0.h5
├── examples_3_1.h5
├── examples_4_0.h5
├── examples_4_1.h5
├── examples_5_0.h5
├── examples_6_0.h5
├── examples_7_0.h5
└── examples_8_0.h5

Here, data_params.json stores the parameters used for generating this set of files. The checkpoint_*.txt files can be used to resume processing if the script is killed for some reason; there is one checkpoint_*.txt file per process. To resume, rerun the previous command with the additional command line argument --resume_from_checkpoint.

Example for HuggingFace Eli5 dataset#

The following example shows the conversion of the HuggingFace Eli5 dataset to HDF5:

from modelzoo.transformers.data_processing.huggingface.HuggingFace_Eli5 import (
    HuggingFace_Eli5,
)
from modelzoo.transformers.data_processing.scripts.hdf5_preprocessing.convert_dataset_to_HDF5 import (
    convert_dataset_to_HDF5,
)

dataset, data_collator = HuggingFace_Eli5(split="train", num_workers=8)

convert_dataset_to_HDF5(
    dataset=dataset,
    data_collator=data_collator,
    output_dir="./eli5_hdf5_dataset/",
    num_workers=8,
)
