Data Processing and Dataloaders
To train or evaluate your model using Cerebras, you will need to convert your raw data into samples that the Cerebras Wafer-Scale Cluster can process. This involves two stages:
Offline preprocessing of your data
Raw data comes in different formats and orders, and it may include repeated or malformed samples. Cerebras Model Zoo includes tools to clean up your data and prepare it for efficient processing on the Cerebras Wafer-Scale Cluster. This includes:
- Step by step guide to pre-process SlimPajama, an example of how to preprocess a large corpus for training large language models.
- How to port Hugging Face datasets to train LLMs
- Instructions on creating an HDF5 dataset for GPT models. HDF5 files provide an efficient implementation of online dataloaders for the Cerebras Wafer-Scale Cluster.
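After offline preprocessing, a quick sanity check is to open one of the generated HDF5 files and list what it contains before launching a run. The snippet below is a minimal sketch using h5py; the file path is hypothetical, and the dataset names and shapes you see depend on the preprocessing script you used.

```python
import h5py

# Hypothetical path to one shard produced by offline preprocessing.
sample_file = "./preprocessed_data/data-00000.h5"

with h5py.File(sample_file, "r") as f:
    # Print every dataset stored in the file along with its shape and dtype.
    def show(name, obj):
        if isinstance(obj, h5py.Dataset):
            print(f"{name}: shape={obj.shape}, dtype={obj.dtype}")

    f.visititems(show)
```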
Online data processing using dataloaders
Dataloaders are an efficient way to store, distribute, and retrieve data. You can either:
- Use one of the dataloaders already available in Cerebras Model Zoo. In this documentation, you will find the API reference for the dataloaders commonly used with the language models in Cerebras Model Zoo, including dataloaders that process HDF5 files.
- Create your own custom dataloader. In this documentation, you will find best practices for creating efficient dataloaders for the Cerebras Wafer-Scale Cluster; a generic skeleton is sketched below.
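As a starting point for a custom dataloader, the sketch below shows a generic PyTorch IterableDataset wrapped in a standard DataLoader. The .npy file format, field names, and parameter names are placeholders chosen for illustration, not the Model Zoo API; best practices such as sharding across input workers are omitted for brevity.

```python
import numpy as np
import torch
from torch.utils.data import DataLoader, IterableDataset


class MyTokenDataset(IterableDataset):
    """Streams fixed-length token sequences from a preprocessed file.

    Hypothetical format for illustration only: a single .npy array of
    shape (num_samples, max_seq_len) holding token IDs.
    """

    def __init__(self, data_path, max_seq_len):
        self.data_path = data_path
        self.max_seq_len = max_seq_len

    def __iter__(self):
        data = np.load(self.data_path, mmap_mode="r")
        for row in data:
            input_ids = torch.as_tensor(row[: self.max_seq_len], dtype=torch.long)
            # Next-token prediction labels: inputs shifted left by one position.
            labels = torch.roll(input_ids, shifts=-1)
            yield {"input_ids": input_ids, "labels": labels}


def get_train_dataloader(params):
    # params mirrors the train_input section of the YAML config shown below.
    dataset = MyTokenDataset(params["data_dir"], params["max_sequence_length"])
    return DataLoader(dataset, batch_size=params["batch_size"], drop_last=True)
```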
Note
Cerebras Model Zoo models are configured using a YAML configuration file. By convention, data processing and the dataloader are defined in the train_input (for training) and eval_input (for evaluation) sections:
train_input:
    data_dir: ...
    data_processor: ...
    ...

eval_input:
    data_dir: ...
    data_processor: ...
    ...
In these sections, you will need to specify:
the data_dir, which corresponds to the path of the data after offline preprocessing.
the data_processor, which corresponds to one of the data processors available in Cerebras Model Zoo or to your own custom data processor. You may need to import the corresponding data processor in the data.py file of your target model. For example, for the GPT-2 model, the data.py file imports the GptHDF5DataProcessor.
Make sure that the data in your data_dir is compatible with the input data expected by the dataloader specified in data_processor. To learn more about the dataloader APIs, visit the Cerebras Model Zoo dataloader APIs or the README.md for the specific model you are interested in.
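To make the wiring concrete, the sketch below shows how the data_processor and data_dir values could flow from the YAML file into a dataloader. It reuses the hypothetical MyTokenDataset from the earlier sketch; the module name, processor name, and dispatch logic are illustrative and not the actual Model Zoo implementation of data.py.

```python
import yaml
from torch.utils.data import DataLoader

from my_dataloaders import MyTokenDataset  # hypothetical module holding the earlier sketch


def build_dataloader(input_params):
    # Dispatch on the data_processor name given in the YAML section.
    if input_params["data_processor"] == "MyTokenDataProcessor":  # hypothetical name
        dataset = MyTokenDataset(
            input_params["data_dir"], input_params["max_sequence_length"]
        )
        return DataLoader(
            dataset, batch_size=input_params["batch_size"], drop_last=True
        )
    raise ValueError(f"Unknown data_processor: {input_params['data_processor']}")


with open("params.yaml") as f:  # hypothetical config file name
    params = yaml.safe_load(f)

train_loader = build_dataloader(params["train_input"])
eval_loader = build_dataloader(params["eval_input"])
```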
Note
Cerebras software does not support automatic download of PyTorch datasets as part of the dataloader. This is because the input workers that execute the dataloaders inside the Cerebras cluster do not have internet access. In addition, having multiple input workers download the dataset to the same shared location could result in synchronization issues.
Therefore, if you want to use PyTorch datasets, prepare them offline, outside of the dataloader functions. Example scripts can be found in the PyTorch FC MNIST implementation.
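For example, with a torchvision dataset such as MNIST, you would run a small preparation script once on a machine with internet access, and then have the dataloader read the already-downloaded copy with download=False. This is a generic sketch, not the Model Zoo's FC MNIST script; the directory name is arbitrary.

```python
# prepare_mnist.py -- run once, before launching the run, on a machine with
# internet access. The Cerebras input workers never download anything.
from torchvision import datasets, transforms

DATA_DIR = "./mnist_data"  # should match the data_dir in your YAML config

# download=True fetches and caches the data here. The dataset objects created
# inside your dataloader at training time should use download=False and point
# at the same directory.
datasets.MNIST(root=DATA_DIR, train=True, download=True, transform=transforms.ToTensor())
datasets.MNIST(root=DATA_DIR, train=False, download=True, transform=transforms.ToTensor())
```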
- Preprocessing Scripts
- Using Hugging Face datasets for auto-regressive LM
- Creating HDF5 dataset for GPT models
- Generating HDF5 data GPT-style models using data chunk preprocessing
- Data pre-processing pipeline
- Online Shuffling in HDF5 File Storage
- Output files structure
- Implementation notes
- Shuffling Samples for HDF5 dataset of GPT Models
- Step by step guide to pre-process SlimPajama
- Create your own dataloader
- modelzoo.transformers.data_processing
- modelzoo.transformers.data_processing.GenericDataProcessor
- modelzoo.transformers.data_processing.HDF5IterableDataProcessor
- modelzoo.transformers.data_processing.HDF5IterableDataset
- modelzoo.transformers.data_processing.bert
- modelzoo.transformers.data_processing.h5_map_dataset
- modelzoo.transformers.data_processing.huggingface
- modelzoo.transformers.data_processing.qa
- modelzoo.transformers.data_processing.scripts
- modelzoo.transformers.data_processing.slimpajama
- modelzoo.transformers.data_processing.tokenizers
- modelzoo.transformers.data_processing.utils