Generating HDF5 data for GPT-style models using data chunk preprocessing#
Overview#
This document describes the process of generating HDF5 files from raw data for use with GPT-style models. Three data processing modes are currently supported:

- `LMData`: Processes language modeling datasets in `.jsonl`, `.jsonl.zst`, `.parquet`, or `.txt` format.
- `Summarization`: Processes fine-tuning datasets in `.jsonl`, `.jsonl.zst`, `.parquet`, or `.txt` format.
- `FIM`: Processes language modeling data for the Fill-in-the-Middle objective (requires datasets in `.jsonl`, `.jsonl.zst`, `.parquet`, or `.txt` format).
Each of the above processing modes can be run with the provided `create_hdf5_dataset.py` script by invoking the appropriate sub-command (mode) to generate the `.h5` files for GPT-style models. Each sub-command takes a set of arguments, which are described below in the Generating HDF5 files section.
Before proceeding, set up a Python virtual environment (or conda environment) as described in the following section.
Environment Setup#
NOTE: Skip this section if you are running on a Cerebras Wafer-Scale Cluster and have already set up the Model Zoo environment as described in PYTHON-SETUP.md. If you are running locally, follow the steps below.
The following prerequisites are needed to enable a clean run of the script. Below is a setup for a Python virtual environment:
Prerequisites#
NOTE: This assumes commands are run from the current directory.
```bash
python -m venv data_env
source ./data_env/bin/activate
pip install --upgrade pip
pip install -r requirements.txt
```
Input files format#
The input text documents need to be in a specific file format before utilizing the provided script. The acceptable file formats are .jsonl
, .json.gz
, .jsonl.zst
, .jsonl.zst.tar
, .parquet
, .txt
. These files should have the data in a specific structure as described in data format section.
For optimal performance when processing input text files, adhere to the following guidelines:
Large Files (Non-txt)
For formats other than .txt
, ensure each individual file contains a substantial amount of text data, ideally several gigabytes in size. This facilitates efficient multi-processing and minimizes overhead.
Small Files (txt)
For .txt
files, which are designed for smaller data volumes, provide a separate “metadata” file alongside your input files. This metadata file should contain a list of paths to each individual .txt
file. This configuration enables efficient multi-processing, maximizing resource utilization.
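For example, a metadata file can be produced with a short script like the one below. This is a minimal sketch, assuming one path per line; the directory and file names are placeholders.

```python
from pathlib import Path

input_dir = Path("/path/to/txt_data")   # placeholder: directory containing the .txt files
metadata_path = Path("metadata.txt")    # file to pass via --metadata_files

with metadata_path.open("w") as f:
    for txt_file in sorted(input_dir.rglob("*.txt")):
        f.write(f"{txt_file.resolve()}\n")  # one absolute path per line (assumed layout)
```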
Input data format#
As previously mentioned, the pre-processing script accepts three primary types of input files: JSON-based, TXT-based, and Parquet-based. Each type requires a specific data structure for accurate conversion to HDF5 format.
Format for `JSONL` files#
The raw text and metadata for generation should be represented in the `jsonl` based files as:

```json
{"text": "Any text excerpt from the dataset of interest...\nThe new lines should be represented as a newline character.",
 "meta": {"info": "any other metadata dict"}}
```
For the `jsonl` files, as shown above, the raw text is extracted from the `text` key by default. If your input files do not contain a `text` key, you need to know the key corresponding to the text you want to extract, and pass it with the command line argument `--jsonl_key=<your key name>`.
For example, if your jsonl files have content like the following:

```json
{"idea": "Any text excerpt from the dataset of interest, with custom key: 'idea'...\nThe new lines should be represented as a newline character.",
 "meta": {"info": "any other metadata dict"}}
```

then you would need to pass `--jsonl_key=idea` in the command line arguments.
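As a quick illustration, the snippet below writes a couple of records in this layout with a custom `idea` key; the file name and contents are placeholders.

```python
import json

records = [
    {"idea": "First document...\nSecond line of the same document.",
     "meta": {"info": "optional metadata dict"}},
    {"idea": "Another document in the corpus.",
     "meta": {"source": "example"}},
]

# One JSON object per line, as expected for .jsonl inputs.
with open("my_dataset.jsonl", "w", encoding="utf-8") as f:
    for rec in records:
        f.write(json.dumps(rec, ensure_ascii=False) + "\n")
```

A file written this way would then be processed with `--jsonl_key=idea`.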
Format for `TXT` based files in `LMData` mode#
The raw text for generation should be represented in the `txt` based files as:

```text
Any text excerpt from the dataset of interest...
The new lines may not be represented as a newline character.
```
Note that in the above, there are no special tags or anchors. If any such tags exist, they will all be treated as part of a single document and may not represent natural language.
For example, the text below will be tokenized entirely as-is:

```text
<DOC>
<DOCNO>TRC2-2008-01-01-0000</DOCNO>
<BLOGS08DAY>-13</BLOGS08DAY>
<CONTENT>
Example content in the format that may be outside of the
</DOCS>
```
Format for `PARQUET` based files in `LMData` mode#
```parquet
Column Name = "text": "Any text excerpt from the dataset of interest...\nThe new lines should be represented as a newline character.",
Column Name = "abc": "...."
etc
```
For the `parquet` files, as shown above, the raw text is extracted from the column named `text` by default. If your input files do not contain a `text` column, you need to know the column name corresponding to the text you want to extract, and pass it with the command line argument `--jsonl_key=<your key name>`.
For example, if your parquet files have the content below and you want to extract the value from the column named `idea`:
```parquet
Column Name = "idea": "Any text excerpt from the dataset of interest, with custom key: 'idea'...\nThe new lines should be represented as a newline character.",
Column Name = "abc": "...."
etc
```

then you would need to pass `--jsonl_key=idea` in the command line arguments.
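For reference, a parquet file with this layout can be produced with pandas. This is a sketch only; the column names and output path are placeholders, and `pandas` with a parquet engine such as `pyarrow` is assumed to be installed.

```python
import pandas as pd

df = pd.DataFrame({
    "idea": [
        "Any text excerpt from the dataset of interest...\nNew lines as newline characters.",
        "Another text excerpt.",
    ],
    "abc": ["other column", "ignored by preprocessing"],
})

# Requires a parquet engine (e.g. pyarrow or fastparquet) to be installed.
df.to_parquet("my_dataset.parquet", index=False)
```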
Definition of `vocab_file` and `encoder_file`#
We support three different tokenizers with this script:

- `GPT2Tokenizer`
- `NeoXTokenizer`
- `HuggingFaceTokenizer`
We need to supply the following parameters when using a specific tokenizer:

- For `GPT2Tokenizer`, `vocab_file=gpt2-vocab.bpe` and `encoder_file=gpt2-encoder.json`.
- For `NeoXTokenizer`, `encoder_file=neox-encoder.json`.
- For `HuggingFaceTokenizer`, `huggingface_tokenizer` should be specified, for example `huggingface_tokenizer=tiiuae/falcon-7b`.

These files can be found here.
Note: For `GPT2Tokenizer` we follow the nomenclature used by OpenAI in their implementation, which is slightly different from Hugging Face's nomenclature, where the `vocab_file` is called `merges_file` and the `encoder_file` is called `vocab_file`. However, the contents of the files are the same. For `NeoXTokenizer`, we use the same nomenclature to avoid confusion.
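When using `HuggingFaceTokenizer`, you can quickly check whether the tokenizer defines its own EOS and PAD tokens before deciding whether to pass `eos_id`/`pad_id` explicitly. This is a small sketch, assuming the `transformers` package is installed; the model name is only the example used above.

```python
from transformers import AutoTokenizer

# Example model name taken from the docs above; any HF tokenizer works the same way.
tok = AutoTokenizer.from_pretrained("tiiuae/falcon-7b")
print("eos token:", tok.eos_token, "-> id", tok.eos_token_id)
print("pad token:", tok.pad_token, "-> id", tok.pad_token_id)  # may be None
```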
Generating HDF5 files#
Once you have a text dataset that meets the above requirements, you can generate HDF5 files using the `create_hdf5_dataset.py` script:

```bash
python create_hdf5_dataset.py [mode] [--arguments]
```
The mode, as mentioned before, can be one of {`LMData`, `Summarization`, `FIM`}. The three modes share the same setup and processing arguments, but differ in their dataset arguments as detailed below:
Table 2: Setup Arguments#
Argument | Default Value | Description
---|---|---
`--params` | N/A | Path to YAML config file for setting dataset preprocessing parameters. Optional alternative for providing command line arguments.
`--input_dir` | N/A | Directory where raw data is stored. Supports only the formats: [`.jsonl`, `.json.gz`, `.jsonl.zst`, `.jsonl.zst.tar`, `.parquet`, `.txt`].
`--metadata_files` | N/A | Path to a text file containing a list of file names corresponding to the raw input documents to be processed and stored; can handle multiple metadata files separated by commas.
`--output_dir` | | Directory where HDF5 files will be stored.
`--processes` | cpu count | Number of processes to use.
Note: You have to provide either the `input_dir` or `metadata_files` argument. If you provide both, only the files referenced in `metadata_files` will be processed.
Table 3: Processing Arguments#
Argument | Default Value | Description
---|---|---
`--tokenizer_type` | required arg | Type of tokenizer to use for HDF5 dataset generation. Can be one of `GPT2Tokenizer`, `NeoXTokenizer`, or `HuggingFaceTokenizer`.
`--vocab_file` | N/A | Path to the vocabulary file.
`--encoder_file` | N/A | Path to the encoder file.
`--eos_id` | | Token id of the end-of-sentence token. Will be used if the tokenizer doesn't have a default eos_id.
`--pad_id` | | Token id of the padding token. Will be used if the tokenizer doesn't have a default pad_id.
`--max_seq_length` | | Maximum sequence length.
`--short_seq_prob` | | Probability of creating sequences which are shorter than the maximum sequence length.
`--output_name` | | Name of the dataset, i.e. the prefix to use for HDF5 file names.
`--files_per_record` | | Text files to write per HDF5 file.
`--write_in_batch` | | Whether to write the samples in batches in the HDF5 format; setting this to False saves memory but is a bit slower.
`--write_remainder` | | Write the remainder files when data is left over from processing.
`--resume_from_checkpoint` | | Resume record writing from a given checkpoint.
`--display_pbar` | | Display a progress bar while the script runs.
`--seed` | | Random seed.
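To make a few of these processing arguments concrete, the toy sketch below shows how a tokenized document might be turned into a fixed-length sample using `max_seq_length`, `eos_id`, and `pad_id`. This is only an illustration of the general idea, not the script's actual implementation.

```python
def to_fixed_length(token_ids, max_seq_length, eos_id, pad_id):
    """Append EOS, then truncate/pad to max_seq_length; return ids and a toy loss mask."""
    ids = token_ids[:max_seq_length - 1] + [eos_id]
    mask = [1] * len(ids)                       # 1 = position contributes to the loss
    n_pad = max_seq_length - len(ids)
    return ids + [pad_id] * n_pad, mask + [0] * n_pad

ids, mask = to_fixed_length([11, 12, 13, 14], max_seq_length=8, eos_id=2, pad_id=0)
print(ids)   # [11, 12, 13, 14, 2, 0, 0, 0]
print(mask)  # [1, 1, 1, 1, 1, 0, 0, 0]
```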
Table 4: Dataset Arguments (`LMData` mode)#

Argument | Default Value | Description
---|---|---
`--use_ftfy` | | Fix text with ftfy.
`--ftfy_normalizer` | | Choose what kind of unicode normalization is applied. Usually, we apply `NFC` normalization.
`--wikitext-detokenize` | | Use wikitext detokenizer to fix text.
`--jsonl_key` | | The key name in input jsonl files from which the raw text will be extracted in order to further process it.
`--pack_sequences` | | Concatenate a document smaller than the maximum sequence length with other documents, instead of filling it with padding tokens.
`--min_sequence_len` | | Minimum token length below which a sample is skipped.
`--input_ids_dtype` | | dtype of processed input_ids.
`--input_mask_dtype` | | dtype of processed input loss masks.
`--inverted_mask` | | If False, 0 represents masked positions. If True, 1 represents masked positions.
`--split_text_to_tokenize` | | Whether to split the text into smaller chunks before tokenization. This is helpful for very long documents with tokenizers such as the Llama tokenizer, whose runtime grows quadratically with the text length.
`--chunk_len_to_split` | | Length of the text chunks to split the text into before tokenization, for slower tokenizers. Can optionally be used with the `split_text_to_tokenize` flag above.
`--remove_bos_in_chunks` | | Whether to remove the BOS token from the beginning of the chunks.
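The effect of `pack_sequences` can be pictured with a small sketch: instead of padding each short document up to `max_seq_length`, consecutive documents are concatenated, separated by an EOS token, until a sequence is full. This is illustrative only; the real packing logic in the script is more involved.

```python
def pack(documents, max_seq_length, eos_id):
    """Greedily concatenate EOS-separated documents into fixed-length sequences."""
    sequences, current = [], []
    for doc in documents:
        for tok in doc + [eos_id]:
            current.append(tok)
            if len(current) == max_seq_length:
                sequences.append(current)
                current = []
    return sequences  # the partial tail sequence is simply dropped in this toy version

docs = [[11, 12, 13], [21, 22], [31, 32, 33, 34, 35]]
print(pack(docs, max_seq_length=8, eos_id=2))
# [[11, 12, 13, 2, 21, 22, 2, 31]]
```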
Table 5: Dataset Arguments (`Summarization` mode)#

Argument | Default Value | Description
---|---|---
`--use_ftfy` | | Fix text with ftfy.
`--ftfy_normalizer` | | Choose what kind of unicode normalization is applied. Usually, we apply `NFC` normalization.
`--wikitext-detokenize` | | Use wikitext detokenizer to fix text.
`--min_sequence_len` | | Minimum token length below which a sample is skipped.
`--sep_token` | | Token added between prompt and completion in preprocessed sequences. If supplied with a non-None value, the tokenizer will add the token to the vocabulary, modifying the vocabulary size. This may not be advisable when fine-tuning a pre-trained model of a type that does not provision for extra tokens.
`--prompt_key` | required arg | Json key for the prompt.
`--completion_key` | required arg | Json key for the completion.
`--input_ids_dtype` | | dtype of processed input_ids.
`--input_mask_dtype` | | dtype of processed input loss masks.
`--inverted_mask` | | If False, 0 represents masked positions. If True, 1 represents masked positions.
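For `Summarization` mode, each record is expected to carry a prompt and a completion under the keys you pass via `--prompt_key` and `--completion_key`. Below is a minimal sketch of producing such a `.jsonl` file; the key names, file name, and contents are placeholders.

```python
import json

samples = [
    {"prompt": "Summarize: The quick brown fox jumps over the lazy dog.",
     "completion": "A fox jumps over a dog."},
]

with open("summarization.jsonl", "w", encoding="utf-8") as f:
    for s in samples:
        f.write(json.dumps(s) + "\n")
```

A file like this would then be processed with `--prompt_key=prompt --completion_key=completion`.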
Table 6: Dataset Arguments (`FIM` mode)#

Argument | Default Value | Description
---|---|---
`--fim_rate` | | Float specifying the percentage of data to apply the FIM transformation to, instead of leaving it as auto-regressive.
`--spm_rate` | | Float specifying the percentage of FIM transformations to convert to prefix-suffix-middle (PSM) vs. suffix-prefix-middle (SPM) format.

The `FIM` mode is very similar to the `LMData` mode, and uses all the same other arguments listed in the `LMData` table. The two additional parameters above determine what percentage of samples have the FIM transformation applied, and what percentage of those end up in PSM (prefix, suffix, middle) or SPM (suffix, prefix, middle) format.
Note: For CodeLlama, specify the EOT token as the EOS token in the config.
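Conceptually, the FIM transformation cuts a document into prefix, middle, and suffix spans and re-orders them with sentinel tokens. The sketch below illustrates the idea on plain strings; the sentinel names are placeholders, and the actual script operates on token ids using the sentinel tokens defined by the model's tokenizer.

```python
import random

FIM_RATE, SPM_RATE = 0.9, 0.5     # mirror --fim_rate / --spm_rate

def fim_transform(text, rng):
    """Toy FIM: with probability FIM_RATE, split text and emit PSM or SPM ordering."""
    if rng.random() > FIM_RATE or len(text) < 3:
        return text                                    # keep as plain autoregressive text
    i, j = sorted(rng.sample(range(1, len(text)), 2))
    prefix, middle, suffix = text[:i], text[i:j], text[j:]
    if rng.random() < SPM_RATE:                        # SPM: suffix, prefix, middle
        return f"<SUF>{suffix}<PRE>{prefix}<MID>{middle}"
    return f"<PRE>{prefix}<SUF>{suffix}<MID>{middle}"  # PSM: prefix, suffix, middle

print(fim_transform("def add(a, b):\n    return a + b\n", random.Random(0)))
```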
You can provide the above arguments either as command line arguments, for example:
```bash
cd ${MODELZOO_BASE_DIR}/transformers/data_processing/scripts/chunk_preprocessing/
python create_hdf5_dataset.py LMData --input_dir /path/to/data --tokenizer_type NeoXTokenizer --encoder_file /path/to/encoder --max_seq_length 4096 --use_ftfy True --pack_sequences False
```
or as a YAML config file:
```bash
cd ${MODELZOO_BASE_DIR}/transformers/data_processing/scripts/chunk_preprocessing/
python create_hdf5_dataset.py LMData --params ../hdf5_preprocessing/configs/autoregressive_lm_preprocessing.yaml
```
Example YAML files for `LMData` and `Summarization` are located under `./configs`.
Note: You can also provide both, but command line arguments will override any common arguments in the YAML configuration file.
Note: The behavior of the `eos` and `pad` ids depends on the tokenizer used. For `GPT2Tokenizer`, the `eos` and `pad` ids are the same. For `NeoXTokenizer` and `HuggingFaceTokenizer`, the `eos` and `pad` ids are the same as the `eos` and `pad` ids in the tokenizer. If the tokenizer does not have a default `pad_id`, then the `pad_id` argument will be used. If `pad_id` is not provided, then the default `pad_id` will be set to the same value as `eos_id`.
Generation Notes#
- It is recommended to use the ftfy module to fix the datasets. This can be enabled with the `--use_ftfy` argument (see the sketch after this list).
- The `NeoXTokenizer` uses the Hugging Face library's built-in tokenizer and handles NFC normalization on its own. When using this tokenizer type, it is recommended to set the `--ftfy_normalizer` argument to `None`. For the `GPT2Tokenizer`, use the default `NFC` value for the normalizer.
- For processing data for CodeGen models, please use `GPT2Tokenizer` along with updated vocab files, such that the GPT-2 vocabulary is extended by special tokens representing repeating tokens of tabs and white spaces.
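As a quick reference, this is roughly what the ftfy cleanup corresponds to. A minimal sketch, assuming the `ftfy` package is installed; the mojibake example string is made up.

```python
import ftfy

raw = "The word schÃ¶n was mangled by a bad encoding."
print(ftfy.fix_text(raw, normalization="NFC"))  # NFC normalization, as used with GPT2Tokenizer
print(ftfy.fix_text(raw, normalization=None))   # no normalization, as recommended for NeoXTokenizer
```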
Data pre-processing pipeline#
To optimize the processing of large datasets for training, we recommend utilizing multi-processing. Two approaches are available:
- Task-based multi-processing (for >2 processes): This approach splits the data into smaller tasks and distributes them across multiple processes for parallel processing. This method is ideal for larger datasets and utilizes available resources efficiently when using more than two processes.
- File-based multi-processing (for <=2 processes): This approach processes individual data files concurrently, treating each file as a separate task. This method is suitable for smaller datasets or when only a limited number of processes (2 or fewer) are available.
The following sections explain these two approaches in more detail:
Task-Based Multi-Processing in Data Pipeline#
This pipeline consists of three main types of processes:
- Reader Process: Responsible for reading raw data in chunks and distributing it across multiple tokenizer queues.
- Tokenizer Processes: Each process takes chunks of raw data from a queue, tokenizes them, and places the tokenized data onto a writer queue.
- Writer Process: Collects the tokenized data and writes it to disk in HDF5 format, tracking progress.
Process Responsibilities#
Reader Process#
- Breaks down input files into manageable chunks. The chunk size can be provided as a parameter in the config YAML file; by default, it is 64 KB.
- Distributes chunks in a round-robin fashion to tokenizer processes.
- Ensures balanced load across tokenizers.
- Emits a sentinel value to indicate the end of input.
Tokenizer Process#
- Retrieves data chunks from its queue.
- Performs tokenization using a predefined `token_generator`.
- Forwards tokenized data to the corresponding writer process.
- Handles the sentinel value by signaling completion to the writer process.
Writer Process#
- Writes tokenized data to HDF5 files.
- Maintains statistics such as the number of discarded and successful chunks, and character and byte counts.
- Manages a checkpoint system for robustness and recovery.
- Sends cumulative data stats to the main control process.
Pipeline Flow#
1. The `task_split_process_dataset` function calculates the total dataset size and the number of chunks.
2. It initializes all tokenizer and writer processes.
3. The reader process starts reading data, pushing it to the tokenizer queues.
4. Tokenizer processes tokenize the data and pass it on to the writer process.
5. The writer process manages output files and writes tokenized data.
6. Progress is tracked and logged, with a progress bar displayed in the console.
7. Upon completion, the writer process gathers and returns final data statistics.
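The overall shape of this pipeline can be sketched with Python's `multiprocessing` primitives. This is heavily simplified: the worker names, placeholder "tokenization", and toy input files below are inventions for illustration, and the real implementation adds checkpointing, statistics, and HDF5 writing.

```python
import multiprocessing as mp

SENTINEL = None  # marks end of input on a queue

def reader(paths, tok_queues, chunk_size=64 * 1024):
    """Read raw files in fixed-size chunks and distribute them round-robin."""
    i = 0
    for path in paths:
        with open(path, "rb") as f:
            while chunk := f.read(chunk_size):
                tok_queues[i % len(tok_queues)].put(chunk)
                i += 1
    for q in tok_queues:
        q.put(SENTINEL)

def tokenizer_worker(tok_queue, writer_queue):
    """Stand-in for the real token_generator: 'tokenize' each chunk and forward it."""
    while (chunk := tok_queue.get()) is not SENTINEL:
        writer_queue.put(chunk.split())          # placeholder tokenization
    writer_queue.put(SENTINEL)                   # signal completion to the writer

def writer_worker(writer_queue, num_tokenizers):
    """Stand-in for the HDF5 writer: count chunks until all tokenizers are done."""
    done = written = 0
    while done < num_tokenizers:
        item = writer_queue.get()
        if item is SENTINEL:
            done += 1
        else:
            written += 1                         # a real writer would append to .h5 here
    print(f"received {written} tokenized chunks")

if __name__ == "__main__":
    # Tiny placeholder inputs so the sketch runs end to end.
    paths = []
    for n in range(2):
        path = f"toy_input_{n}.txt"
        with open(path, "w") as f:
            f.write("some raw text to be chunked and tokenized\n" * 1000)
        paths.append(path)

    tok_queues = [mp.Queue() for _ in range(2)]
    writer_queue = mp.Queue()
    workers = [mp.Process(target=tokenizer_worker, args=(q, writer_queue)) for q in tok_queues]
    workers.append(mp.Process(target=writer_worker, args=(writer_queue, len(tok_queues))))
    for w in workers:
        w.start()
    reader(paths, tok_queues, chunk_size=4096)   # small chunks just for the demo
    for w in workers:
        w.join()
```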
Key Features#
- Concurrency: Utilizes multiple processes for different stages in the pipeline to ensure maximum CPU utilization.
- Fault Tolerance: Implements a checkpoint system allowing for recovery from the last saved state.
- Progress Tracking: Includes a real-time progress bar and logging to monitor the pipeline's performance.
File Split Parallel Data Processing#
This section outlines the `file_split_process_dataset` method within our data processing framework, which uses file-based parallelism to process large datasets efficiently.
Overview#
The `file_split_process_dataset` function orchestrates the distribution of data files across multiple processes, enabling parallel processing. This approach ensures that each process works on a separate chunk of the dataset to maximize utilization of CPU resources.
How It Works#
The method executes the following steps:
1. Initialization: It calculates the total data size and the number of chunks to process.
2. Checkpointing: Reads checkpoints to resume processing if previously interrupted, keeping track of files already written.
3. File Distribution: Assigns files to processes evenly to ensure a balanced workload.
4. Progress Tracking: Implements a progress bar using `tqdm` for real-time visual feedback on the processing progress.
5. Parallel Processing: Starts multiple processes, each handling its assigned list of files.
6. Statistics Aggregation: Collects and aggregates data processing statistics from all processes, such as the number of processed and discarded sequences, as well as the counts of raw and normalized characters and bytes.
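The file distribution step can be pictured as follows; this is a simplified sketch of evenly assigning input files to worker processes, with placeholder file names.

```python
def assign_files(files, num_processes):
    """Round-robin assignment of input files to worker processes."""
    assignments = [[] for _ in range(num_processes)]
    for idx, path in enumerate(files):
        assignments[idx % num_processes].append(path)
    return assignments

files = [f"shard_{i}.jsonl.zst" for i in range(7)]   # placeholder file names
for rank, chunk in enumerate(assign_files(files, 2)):
    print(f"process {rank}: {chunk}")
```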
Notes:#
- Task-based processing is more efficient because it always ensures that the load is balanced across processes.
- It is especially useful for large files or large entries.
- Since it processes fixed-size chunks of data, we can estimate the time to complete data pre-processing, which is very helpful for users.
- Therefore, we recommend using task-based multi-processing. The framework will automatically switch to this mode if the number of `--processes` provided is 3 or more.
Online Shuffling in HDF5 File Storage#
Overview#
Our data processing pipeline integrates an innovative online shuffling feature that seamlessly interacts with HDF5 storage. This crucial functionality prevents machine learning models from learning the data order and ensures randomized distribution of data sequences, leading to improved model robustness and performance.
How Online Shuffling Works#
Our online shuffling mechanism operates seamlessly with HDF5 storage, eliminating the need for post-processing and its associated memory overhead. By interleaving shuffling with data serialization, it efficiently randomizes data sequences as they are written to files, significantly reducing processing time. To activate this functionality, simply specify the `shuffle` flag and provide a seed with the `shuffle_seed` flag within your data pre-processing configuration file.
Implementation Details#
- During the HDF5 file writing operation, each sequence of tokenized data is assigned a random index that determines its placement in the output files.
- This randomization ensures that upon reading the data back for training purposes, the sequences are already shuffled.
- The shuffling operation is handled efficiently, allowing the system to process and shuffle large datasets without excessive memory usage.
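A minimal sketch of the idea is shown below, using `h5py` and a seeded NumPy RNG to shuffle each buffered batch of sequences before it is appended to a resizable dataset. This is not the actual `append_to_hdf5` implementation; the file and dataset names are placeholders, and it only illustrates interleaving shuffling with serialization.

```python
import h5py
import numpy as np

def append_shuffled(h5_path, buffer, rng):
    """Shuffle a buffer of fixed-length sequences and append it to a resizable dataset."""
    batch = np.asarray(buffer, dtype=np.int32)
    rng.shuffle(batch)                                   # in-place row shuffle
    with h5py.File(h5_path, "a") as f:
        if "data" not in f:                              # "data" is just this sketch's name
            f.create_dataset("data", shape=(0, batch.shape[1]),
                             maxshape=(None, batch.shape[1]), dtype=np.int32)
        ds = f["data"]
        ds.resize(ds.shape[0] + batch.shape[0], axis=0)
        ds[-batch.shape[0]:] = batch

rng = np.random.default_rng(seed=0)                      # seed analogous to shuffle_seed
append_shuffled("toy_chunk.h5", [[1, 2, 3], [4, 5, 6], [7, 8, 9]], rng)
```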
Advantages of Online Shuffling#
- Efficiency: By eliminating the need for a separate shuffling step, we save on both processing time and memory.
- Scalability: This approach scales elegantly with the size of the dataset, making it suitable for large-scale machine learning tasks.
- Simplicity: Reduces the complexity of the data preparation pipeline by integrating shuffling into the data writing process.
For the detailed implementation, please refer to the `append_to_hdf5` function in our codebase.
Output files structure#
The output directory will contain a number of `.h5` files, as shown below (with 2 processes):
```
<path/to/output_dir>
├── checkpoint_process_0.txt
├── checkpoint_process_1.txt
├── data_params.json
├── output_chunk_0_0_0_0.h5
├── output_chunk_1_0_0_0.h5
├── output_chunk_1_0_16_1.h5
├── output_chunk_0_0_28_1.h5
├── output_chunk_0_0_51_2.h5
├── output_chunk_1_0_22_2.h5
├── output_chunk_0_1_0_3.h5
├── ...
```
Here `data_params.json` is the file that stores the parameters used for generating this set of files. The `checkpoint_*.txt` files can be used to resume processing in case the run script gets killed for some reason; there is one `checkpoint_*.txt` file per process. To use them, simply re-run the previous command with the additional command line argument `--resume_from_checkpoint`.
For three or more processes, the structure is different because task-based splitting is used, as explained above:
```
<path/to/output_dir>
├── checkpoint.txt
├── data_params.json
├── output_chunk_0_0_0_0.h5
├── output_chunk_1_0_0_0.h5
├── output_chunk_1_0_16_1.h5
├── output_chunk_0_0_28_1.h5
├── output_chunk_0_0_51_2.h5
├── output_chunk_1_0_22_2.h5
├── output_chunk_0_1_0_3.h5
├── ...
├── output_chunk_0_2_5_10.h5
├── output_chunk_0_3_0_16.h5
```
Note that there is only one checkpoint file because there is only one reader process.
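You can inspect the generated chunks with `h5py`, for example to list the datasets they contain along with their shapes and dtypes. A short sketch; the file name below is just one of the example chunk names above.

```python
import h5py

def show(name, obj):
    # Print only datasets (skip groups), with their shape and dtype.
    if isinstance(obj, h5py.Dataset):
        print(name, obj.shape, obj.dtype)

with h5py.File("output_chunk_0_0_0_0.h5", "r") as f:
    f.visititems(show)
```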
We collect data statistics during and after data preprocessing; these are stored in `data_params.json`. The `h5_dataset_stats` section in the generated `data_params.json` file contains the following statistics:

```
"h5_dataset_stats": {
  "detokenized_bytes": # total bytes in the detokenized text, using the same tokenizer as used for tokenization,
  "detokenized_chars": # total characters in the detokenized text, using the same tokenizer as used for tokenization,
  "loss_valid_tokens": # total number of tokens that are not padding tokens or prompt tokens,
  "non_pad_tokens": # total number of tokens that are not padding tokens,
  "num_sequences": # total number of sequences in the dataset,
  "num_tokens": # total number of tokens in the dataset,
}
```
Additional statistics are generated in the `post-process` section:

```
"post-process": {
  "average_bytes_per_sequence": # the average number of bytes per sequence after processing
  "average_chars_per_sequence": # the average number of characters per sequence after processing
  "discarded_files": # the number of files that were discarded during processing due to errors or being filtered out
  "eos_id": # the token ID used to signify the end of a sequence
  "n_examples": # the total number of examples (sequences) that were processed
  "normalized_bytes_count": # the total number of bytes after normalization (e.g., UTF-8 encoding)
  "normalized_chars_count": # the total number of characters after normalization (e.g., lowercasing, removing special characters)
  "num_masked_tokens": # the total number of tokens that were masked (used in tasks like masked language modeling)
  "num_pad_tokens": # the total number of padding tokens used to equalize the length of the sequences
  "pad_id": # the token ID used as padding
  "processed_files": # the number of files successfully processed
  "raw_bytes_count": # the total number of bytes before any processing
  "raw_chars_count": # the total number of characters before any processing
  "successful_files": # the number of files that were successfully processed without any issues
  "vocab_size": # the size of the vocabulary used in the tokenizer
}
```
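Both sets of statistics can be read back programmatically from `data_params.json`. A short sketch, assuming the keys shown above:

```python
import json

with open("data_params.json") as f:
    params = json.load(f)

print("tokens:", params["h5_dataset_stats"]["num_tokens"])
print("sequences:", params["h5_dataset_stats"]["num_sequences"])
print("processed files:", params["post-process"]["processed_files"])
```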
It is important to note that some of the statistics currently collected, including `normalized_chars_count`, `normalized_bytes_count`, `num_masked_tokens`, and `num_pad_tokens`, are calculated on the fly during data pre-processing. In a future update, we plan to remove the collection of these statistics from the post-processing phase to improve efficiency.
Implementation notes#
While we strive for robustness, potential errors may arise during data pre-processing. To ensure smooth operation and data preparation for different corpora, the following safeguards and workarounds are in place:

- Curation Corpus, VSL Language Model, and VSL Summarization: Pre-processing for these specific data formats is not yet supported in the latest script. In such cases, use the script provided at the path `create_hdf5_dataset.py`.
- Vocabulary Size and Sequence Length Mismatch: Improper configuration of the vocabulary size and sequence length can lead to model crashes. To prevent such issues, the script verifies these parameters against the dataset specifications. If a mismatch is detected, an error message is displayed, prompting you to correct the configuration and ensuring accurate data pre-processing.