Preprocessing Scripts
- Using Hugging Face datasets for auto-regressive LM
- Creating HDF5 dataset for GPT models
- Generating HDF5 data for GPT-style models using data chunk preprocessing
- Data pre-processing pipeline
- Online Shuffling in HDF5 File Storage
- Output files structure
- Implementation notes
- Shuffling Samples for HDF5 dataset of GPT Models
- Step-by-step guide to pre-processing SlimPajama