Creating HDF5 dataset for GPT Models#
To enable an efficient data loader for GPT-style models, we provide a script that generates .h5 files, which can then be used in the input pipeline for these models.
We provide two flavors of HDF5 file generation:
1. Convert your text input, stored in various file formats, into .h5 files containing the raw text documents.
2. Convert your text input, stored in various file formats, into .h5 files containing the documents tokenized and converted into token ids, along with labels and attention masks.
Each of the above conversions can be done by running the provided script create_hdf5_dataset.py. The script has two sub-commands, raw_text and preprocessed_text, which generate the .h5 files in one of the two flavors described above. Each sub-command takes a set of arguments, which are described below in the Generating HDF5 files section.
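Once the environment described in the next section is set up, and assuming you run from the directory containing the script, you can also print each sub-command’s full argument list at any time:
python create_hdf5_dataset.py raw_text --help
python create_hdf5_dataset.py preprocessed_text --help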
Before generating the files, you need to set up a Python virtual environment through conda, as described below.
Environment Setup#
The following prerequisites are needed to enable a clean run of the script. Below is a setup for a conda environment:
conda create --name data_env python=3.7.4 pip -y
conda activate data_env
conda install -c conda-forge cudatoolkit=10.1 pyyaml regex -y
conda install -c anaconda cudnn=7.6.4 tqdm -y
pip install lm-dataformat ftfy tokenizers h5py
During the environment setup, if you encounter errors like ERROR: pip's dependency resolver does not currently take into account all the packages that are installed..., please ignore the message, as it shouldn’t affect the rest of the steps.
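Optionally, you can verify that the main Python dependencies import correctly inside the activated environment; this quick check is just a convenience and is not required:
python -c "import h5py, ftfy, tokenizers, lm_dataformat; print('environment OK')"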
Input files format#
The input text documents need to be in a specific file format before utilizing the provided script. The acceptable file formats are '.jsonl', '.jsonl.zst', '.jsonl.zst.tar', '.txt'. These files should have the data in a specific structure as described in the Input data format section.
To process the files optimally, it is recommended that all files in any of the above formats other than .txt contain enough text in a single file; the recommended size for each file is on the order of GBs. Conversely, if you are processing many smaller files in txt format, please provide a metadata file containing a list of paths to these files to better leverage multiprocessing, as sketched below.
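For example, such a metadata file can be built by listing the paths of your .txt files; the directory and file names here are hypothetical:
find /path/to/txt_data -name "*.txt" > metadata.txt
The resulting metadata.txt can then be passed to the script with --metadata_files metadata.txt.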
Input data format#
As mentioned above, there are two primary types of input files accepted by the preprocessing script: json based or txt based. For each type, the input data needs to follow a certain structure in order to be converted into hdf5 files accurately.
Format for JSONL files#
The raw text and metadata for generation should be represented in the jsonl based files as:
{"text": "Any text excerpt from the dataset of interest...\nThe new lines should be represented as a newline character.",
"meta": {"info": "any other metadata dict"}}
For the jsonl files, as shown above, the raw text is by default extracted from the text key in the input files. If your input files do not contain a text key, you need to know the key corresponding to the text you want to extract. You can then use the command line argument --jsonl_key=<your key name> to extract the text.
For example, if your jsonl files have content like the following:
{"idea": "Any text excerpt from the dataset of interest, with custom key: 'idea'...\nThe new lines should be represented as a newline character.",
"meta": {"info": "any other metadata dict"}}
then you’d need to pass --jsonl_key=idea in the command line arguments.
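A raw_text invocation for such files might look like the following; the input directory is hypothetical and should be adapted to your setup:
python create_hdf5_dataset.py raw_text --input_dir /path/to/jsonl_data --jsonl_key idea --output_dir ./data_dir/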
Format for TXT based files#
The raw text for generation should be represented in the txt based files as:
Any text excerpt from the dataset of interest...
The new lines may not be represented as a newline character.
Note that in the above there are no special tags or anchors. If any exist, all of them will be treated as part of a single document and may not represent natural language.
For example, the below text will be entirely tokenized as is:
<DOC>
<DOCNO>TRC2-2008-01-01-0000</DOCNO>
<BLOGS08DAY>-13</BLOGS08DAY>
<CONTENT>
Example content in the format that may be outside of the
</DOCS>
If your input files do not follow the above structure, the script may write erroneous data into the hdf5 files, affecting the quality of the fine-tuning.
Generating HDF5 files#
Once you have a text dataset that meets the above requirements, you can generate HDF5 files by running the create_hdf5_dataset.py script with one of the raw_text or preprocessed_text sub-commands.
The arguments for the raw_text sub-command are detailed below:
usage: create_hdf5_dataset.py raw_text [-h] --input_dir INPUT_DIR
[--metadata_files METADATA_FILES]
[--jsonl_key JSONL_KEY]
[--output_dir OUTPUT_DIR]
[--output_name OUTPUT_NAME]
[--seed SEED] [--processes PROCESSES]
[--write_remainder]
[--resume_from_checkpoint]
[--display_pbar]
[--files_per_record FILES_PER_RECORD]
[--write_in_batch]
optional arguments:
-h, --help show this help message and exit
--input_dir INPUT_DIR
Directory where raw data is stored.
--metadata_files METADATA_FILES
Path to text file containing a list of file names
corresponding to the raw input documents to be
processed and stored; can handle multiple metadata
files separated by comma.
--jsonl_key JSONL_KEY
The key name in input jsonl files from which the raw
text will be extracted in order to further process it. Default: "text".
--output_dir OUTPUT_DIR
Directory where HDF5 files will be stored. Defaults to
`./data_dir/`.
--output_name OUTPUT_NAME
Name of the dataset; i.e. prefix to use for HDF5 file
names.Defaults to `examples`.
--seed SEED Random seed. Defaults to `0`.
--processes PROCESSES
Number of processes to use. Default to cpu count.
--write_remainder Write the remainder files when data is left over from
processing.
--resume_from_checkpoint
Resume record writing from a given checkpoint.
--display_pbar Display progress while runs.
--files_per_record FILES_PER_RECORD
Text files to write per HDF5 file.
--write_in_batch Whether to write the samples in batch for the HDF5
format, setting to false will save memory but a bit
slower.
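For illustration, a raw_text run might look like the following; all paths and the process count are hypothetical and should be adapted to your setup:
python create_hdf5_dataset.py raw_text \
    --input_dir /path/to/raw_data \
    --output_dir ./data_dir/ \
    --output_name examples \
    --processes 8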
The arguments for the preprocessed_text sub-command are detailed below:
usage: create_hdf5_dataset.py preprocessed_text [-h] --input_dir INPUT_DIR
[--metadata_files METADATA_FILES]
[--jsonl_key JSONL_KEY]
[--output_dir OUTPUT_DIR]
[--output_name OUTPUT_NAME]
[--seed SEED]
[--processes PROCESSES]
[--write_remainder]
[--resume_from_checkpoint]
[--display_pbar]
[--files_per_record FILES_PER_RECORD]
[--write_in_batch]
--tokenizer_type
{GPT2Tokenizer,NeoXTokenizer}
--vocab_file VOCAB_FILE
[--encoder_file ENCODER_FILE]
[--max_seq_length MAX_SEQ_LENGTH]
[--short_seq_prob SHORT_SEQ_PROB]
[--ftfy]
[--ftfy_normalizer {NFC,None}]
[--wikitext-detokenize]
[--eos_id EOS_ID]
[--pad_id PAD_ID]
optional arguments:
-h, --help show this help message and exit
--input_dir INPUT_DIR
Directory where raw data is stored.
--metadata_files METADATA_FILES
Path to text file containing a list of file names
corresponding to the raw input documents to be
processed and stored; can handle multiple metadata
files separated by comma.
--jsonl_key JSONL_KEY
The key name in input jsonl files from which the raw
text will be extracted in order to further process it. Default: "text".
--output_dir OUTPUT_DIR
Directory where HDF5 files will be stored. Defaults to
`./data_dir/`.
--output_name OUTPUT_NAME
Name of the dataset; i.e. prefix to use for HDF5 file
names.Defaults to `examples`.
--seed SEED Random seed. Defaults to `0`.
--processes PROCESSES
Number of processes to use. Default to cpu count.
--write_remainder Write the remainder files when data is left over from
processing.
--resume_from_checkpoint
Resume record writing from a given checkpoint.
--display_pbar Display progress while runs.
--files_per_record FILES_PER_RECORD
Text files to write per HDF5 file.
--write_in_batch Whether to write the samples in batch for the HDF5
format, setting to false will save memory but a bit
slower.
--tokenizer_type {GPT2Tokenizer,NeoXTokenizer}
Type of tokenizer to use for tfrecord/HDF5 dataset
generation. Can be one of `GPT2Tokenizer` or
`NeoXTokenizer`.
--vocab_file VOCAB_FILE
Path to vocabulary.
--encoder_file ENCODER_FILE
Path to BPE encoder.
--max_seq_length MAX_SEQ_LENGTH
Maximum sequence length. Defaults to `2048`.
--short_seq_prob SHORT_SEQ_PROB
Probability of creating sequences which are shorter
than the maximum sequence length. Defaults to `0.0`.
--ftfy Fix text with ftfy.
--ftfy_normalizer {NFC,None}
Choose what kind of unicode normalization is applied.
Usually, we apply `NFC` normalization, so that letters
followed by combining characters become single
combined characters. Using `None` applies
no normalization while fixing text.
--wikitext-detokenize
Use wikitext detokenizer to fix text.
--eos_id EOS_ID Id for padding out shorter sequences. Defaults to
50256, which is `<|endoftext|>` in tokens.
--pad_id PAD_ID Id for padding out shorter sequences. Defaults to
50256, which is `<|endoftext|>` in tokens.
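For illustration, a preprocessed_text run with the GPT2Tokenizer might look like the following; the input directory and the vocabulary and encoder file paths are hypothetical placeholders for your own files:
python create_hdf5_dataset.py preprocessed_text \
    --input_dir /path/to/raw_data \
    --output_dir ./data_dir/ \
    --tokenizer_type GPT2Tokenizer \
    --vocab_file /path/to/vocab_file \
    --encoder_file /path/to/encoder_file \
    --max_seq_length 2048 \
    --ftfy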
Generation Notes#
- It is recommended to use the ftfy module to fix the datasets. This can be enabled with the --ftfy argument.
- The NeoXTokenizer uses the HuggingFace library’s inbuilt tokenizer and handles NFC normalization on its own. When using this tokenizer_type, it is recommended to set the --ftfy_normalizer argument to None. For the GPT2Tokenizer, use the default NFC value for the normalizer.
- To process HDF5 for training, we recommend using multi-processing. Moreover, we suggest using several input files, such that the total number of input files is greater than or equal to the number of processes provided by --processes. Note that this requires a high-spec CPU server, which can handle not only the concurrent running processes in RAM but also the I/O for reads and writes. If the I/O of the server is slow, the processes can appear to be hung for a very long while.
- For a very large dataset (several files, with each file on the order of GBs), the recommendation is to split the data into smaller subsets and write out each subset. You can then mix all HDF5 files in a common folder for use by the data pipeline, or just provide the locations of each subset in a list. The overall time to write out the HDF5 files can depend on the CPU server used.
- It is better to split the input dataset into multiple files of similar size to leverage the full potential of parallel processing.
- For CodeGen model processing, please use the GPT2Tokenizer along with the updated vocab files, such that the GPT-2 vocabulary is extended by special tokens representing repeating tokens of tabs and white spaces.
Output files structure#
The output directory will contain a number of h5 files, as shown below:
<path/to/output_dir>
├── checkpoint.txt
├── data_params.json
├── examples_0_0.h5
├── examples_0_1.h5
├── examples_1_0.h5
├── examples_1_1.h5
├── examples_2_0.h5
├── examples_2_1.h5
├── examples_3_0.h5
├── examples_3_1.h5
├── examples_4_0.h5
├── examples_4_1.h5
├── examples_5_0.h5
├── examples_6_0.h5
├── examples_7_0.h5
└── examples_8_0.h5
Here, data_params.json is the file which stores the parameters used for generating this set of files, and checkpoint.txt can be used to resume processing in case the run script gets killed for some reason. To use it, simply re-run the previous command with the additional command line argument --resume_from_checkpoint <path/to/output_dir>/checkpoint.txt.
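To spot-check a generated file, you can list the datasets it contains with h5py, which was installed during the environment setup; the file name below is taken from the example tree above, and the output directory path is a placeholder:
python -c "import h5py; f = h5py.File('<path/to/output_dir>/examples_0_0.h5', 'r'); print(list(f.keys())); f.close()"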