Preprocess a dataset saved in the Eleuther lm_dataformat format, such as Pile, for use in a data processor such as the GptG5MapDataProcessor, which is backed by an H5Reader.
The basic logic of this script is to convert each input file to a single H5 output file by applying unicode normalization, tokenizing, and concatenating documents with an end-of-document token in between.
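The per-file logic described above can be sketched roughly as follows. This is a minimal illustration, not the script's actual implementation: `pack_documents` and the character-level `toy_tokenize` are hypothetical names introduced here, and appending the end-of-document id after every document (including the last) is an assumption. Writing the resulting token stream to an H5 file is left out.

```python
import unicodedata
from typing import Callable, Iterable, List


def pack_documents(
    docs: Iterable[str],
    tokenize: Callable[[str], List[int]],
    eos_id: int,
    normalizer: str = "NFC",
) -> List[int]:
    """Normalize, tokenize, and concatenate documents into one token stream.

    Assumption: the end-of-document id is appended after every document,
    so consecutive documents are always separated by exactly one eos_id.
    """
    tokens: List[int] = []
    for doc in docs:
        # Unicode normalization (e.g. NFC) so equivalent strings tokenize alike.
        text = unicodedata.normalize(normalizer, doc)
        tokens.extend(tokenize(text))
        tokens.append(eos_id)
    return tokens


def toy_tokenize(text: str) -> List[int]:
    """Hypothetical stand-in tokenizer: one id per character (its code point)."""
    return [ord(ch) for ch in text]


# "e" followed by a combining acute accent collapses to a single "é" under NFC.
packed = pack_documents(["e\u0301", "ab"], toy_tokenize, eos_id=50256)
print(packed)
```

In practice the stand-in tokenizer would be replaced by the GPT-2 tokenizer loaded from the file passed via --tokenizer.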
This script is meant to be run in parallel across several nodes using a tool
such as sbatch. For example, to preprocess Pile from the raw artifacts
downloaded from https://the-eye.eu/public/AI/pile/, run the following slurm
script using sbatch --array 0-29:
python preprocess_pile.py --input_path /path/to/raw/pile/train/*.jsonl.zst --output_dir /path/to/output/dir --tokenizer /path/to/gpt2/tokenizer.json --eos_id 50256 --normalizer NFC --rank $SLURM_ARRAY_TASK_ID --world_size $SLURM_ARRAY_TASK_COUNT
The files provided are automatically sharded between workers based on the
provided rank and world size. In the example above, this results in each worker
processing a single file. The script also works, with less parallelism, if you
reduce the worker pool (potentially to a single worker) and let each worker
process multiple files. The only change needed would be in the --array sbatch
argument.
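One way such rank-based sharding could work is sketched below. The strided assignment and the `shard_files` helper are assumptions made for illustration; the script's own sharding rule may differ.

```python
from typing import List, Sequence


def shard_files(files: Sequence[str], rank: int, world_size: int) -> List[str]:
    """Return the subset of files handled by the worker with the given rank.

    Assumption: files are sorted and dealt out round-robin, so worker `rank`
    takes every `world_size`-th file starting at index `rank`.
    """
    if world_size < 1 or not 0 <= rank < world_size:
        raise ValueError("rank must satisfy 0 <= rank < world_size")
    # Sorting makes the assignment identical on every worker.
    return sorted(files)[rank::world_size]


# Hypothetical file names, used only to illustrate the split.
files = [f"{i:02d}.jsonl.zst" for i in range(30)]

# One worker per file: each rank receives exactly one file.
one_each = shard_files(files, rank=3, world_size=30)

# Smaller pool: each worker receives several files.
several = shard_files(files, rank=0, world_size=4)
print(one_each, len(several))
```

With this scheme every file is assigned to exactly one worker for any world size, which is why only the size of the job array has to change.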
This script assumes that the documents in the source dataset are already shuffled, which is the case for the typical Pile download.