cerebras.modelzoo.data.common.h5_map_dataset.preprocess_pile

Preprocess a dataset saved in the Eleuther lm_dataformat format, such as Pile, for use in a data processor such as the GptHDF5MapDataProcessor, which is backed by an H5Reader.

The basic logic in this script is to convert each input file into a single H5 output file by applying unicode normalization, tokenizing, and concatenating documents with an end-of-document token in between.
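The per-document flow can be sketched roughly as follows. This is a minimal illustration, not the script's actual internals: the function name, argument names, and the use of the Hugging Face tokenizers library here are assumptions based on the description above.

```python
import unicodedata

import numpy as np
from tokenizers import Tokenizer


def tokenize_documents(documents, tokenizer_path, normalizer="NFC", eos_id=50256):
    """Normalize, tokenize, and concatenate documents with an EOS token in between.

    Illustrative sketch only; names and library choices are assumptions.
    """
    tokenizer = Tokenizer.from_file(tokenizer_path)
    ids = []
    for doc in documents:
        text = unicodedata.normalize(normalizer, doc)
        ids.extend(tokenizer.encode(text).ids)
        ids.append(eos_id)  # end-of-document separator
    # uint16 is sufficient for GPT-2's 50257-token vocabulary
    return np.array(ids, dtype=np.uint16)
```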

This script is meant to be run in parallel across several nodes using a tool such as sbatch. For example, to preprocess Pile from the raw artifacts downloaded from https://the-eye.eu/public/AI/pile/, run the following slurm script using `sbatch --array 0-29`:

```bash
#!/bin/bash
python preprocess_pile.py \
    --input_path /path/to/raw/pile/train/*.jsonl.zst \
    --output_dir /path/to/output/dir \
    --tokenizer /path/to/gpt2/tokenizer.json \
    --eos_id 50256 \
    --normalizer NFC \
    --rank $SLURM_ARRAY_TASK_ID \
    --world_size $SLURM_ARRAY_TASK_COUNT
```

The files provided are automatically sharded between workers based on the provided rank and world size, which results in each worker processing a single file. The script also works, although with less parallelism, if you reduce the worker pool (potentially to a single worker) and let each worker process multiple files; the only change needed is to the `--array` sbatch argument.
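The rank-based sharding described above can be sketched as follows. The helper name and the strided assignment are assumptions used for illustration; the script's actual sharding logic may differ.

```python
def shard_files(input_files, rank, world_size):
    """Assign a deterministic subset of the input files to this worker."""
    files = sorted(input_files)  # same ordering on every worker
    return files[rank::world_size]


# Example: with 30 files and --array 0-29, each worker gets exactly one file;
# with a smaller array, each worker processes several files.
```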

This script assumes that the documents in the source dataset are already shuffled, which is the case for the typical Pile download.

Functions

main

parse_args

save_run_info