Data processing and dataloaders#

To train or evaluate your model using Cerebras, you will need to convert your raw data into samples that can be processed by the Cerebras Wafer-Scale Cluster. This involves two stages: offline preprocessing and online processing with dataloaders.


Offline Preprocessing: Preparing Your Data for Optimal Performance

Raw data comes in different formats and orderings, and it may include repeated or bad samples. The Cerebras Model Zoo includes tools to clean up your data and prepare it for efficient processing in the Cerebras Wafer-Scale Cluster. This includes:

How to port Hugging Face datasets to train LLM: Convert Hugging Face datasets to HDF5 for offline use with Cerebras Model Zoo or directly use them online as a dataloader for language model training and evaluation.
Creating HDF5 dataset for GPT models using chunk data preprocessing: Generate HDF5 data for GPT models, preprocessing in chunks for efficient, flexible, and parallelizable training on large datasets.
Creating HDF5 dataset for GPT models: HDF5 files provide an efficient implementation of online dataloaders for the Cerebras Wafer-Scale Cluster.
Step by step guide to pre-process SlimPajama: An example of how to preprocess a large corpus for training Large Language Models.
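
As an illustration of the kind of cleanup mentioned above (dropping repeated or bad samples), here is a minimal sketch using only the Python standard library. It is not the Model Zoo tooling: the function name `clean_samples`, the `min_chars` threshold, and the choice of exact-match deduplication via SHA-256 are all assumptions made for this example.

```python
import hashlib
from typing import Iterable, Iterator


def clean_samples(samples: Iterable[object], min_chars: int = 1) -> Iterator[str]:
    """Yield each usable text sample once, in first-seen order.

    Hypothetical helper for illustration; not part of Cerebras Model Zoo.
    """
    seen = set()
    for text in samples:
        # Treat non-string entries as bad samples and skip them.
        if not isinstance(text, str):
            continue
        # Collapse runs of whitespace so trivially different copies match.
        normalized = " ".join(text.split())
        # Skip samples that are empty (or shorter than the assumed threshold).
        if len(normalized) < min_chars:
            continue
        # Hash the normalized text so exact repeats are detected cheaply.
        digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if digest in seen:
            continue
        seen.add(digest)
        yield normalized


raw = ["Hello  world", "Hello world", "", "   ", "Another sample", None]
cleaned = list(clean_samples(raw))
print(cleaned)
```

Exact-match hashing only catches identical samples after whitespace normalization. Large corpora usually also need near-duplicate detection, which this sketch does not attempt.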

Online Processing with Dataloaders: Streamlining Data Handling

Dataloaders are an efficient way to store, distribute and retrieve the data. You can either:

Use one of the dataloaders already available in Cerebras Model Zoo: In this documentation, you will find the API reference to the commonly used dataloaders in Language Models included in Cerebras Model Zoo, including dataloaders processing HDF5 files.
Create your custom dataloader: In this documentation, you will find best practices to create efficient dataloaders for Cerebras Wafer-Scale Cluster.

Configuration with Cerebras YAML file#

Models in the Cerebras Model Zoo use a Cerebras YAML file for configuration, defining data processing scripts and dataloaders in the train_input and eval_input sections for training and evaluation, respectively.

Important

Use absolute paths for all path parameters, especially those in the train_input and eval_input sections of the YAML file, to ensure compatibility with PyTorch dataloaders.

train_input:
    data_dir: ...
    data_processor: ...
    ...
eval_input:
    data_dir: ...
    data_processor: ...
    ...
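
One way to catch relative paths before launching a run is a small check over the parsed configuration. The sketch below assumes the YAML file has already been loaded into a Python dict and that `data_dir` is the only path-valued key of interest. The helper name `relative_path_entries` and the sample paths are hypothetical.

```python
import os


def relative_path_entries(config, sections=("train_input", "eval_input"),
                          keys=("data_dir",)):
    """Return 'section.key' names whose value is not an absolute path.

    Hypothetical helper for illustration; not part of Cerebras Model Zoo.
    """
    problems = []
    for section in sections:
        params = config.get(section, {})
        for key in keys:
            value = params.get(key)
            # Only string values are checked; missing keys are ignored.
            if isinstance(value, str) and not os.path.isabs(value):
                problems.append(f"{section}.{key}")
    return problems


# Hypothetical config, as it might look after parsing the YAML file.
config = {
    "train_input": {"data_dir": os.path.abspath(os.path.join("data", "train"))},
    "eval_input": {"data_dir": os.path.join("data", "eval")},  # relative path
}
print(relative_path_entries(config))
```

Here only `eval_input.data_dir` is reported, because its value is relative to the current working directory.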

Configuration sections#

data_dir: Path of the preprocessed data (after offline preprocessing)

data_processor: Either a data processor available in Cerebras Model Zoo or your own custom data processor. You may need to import the corresponding data processor in the data.py file of your target model. For example, for the GPT-2 model, the data.py file imports the GptHDF5DataProcessor.

Compatibility and Dataloader information#

Make sure that the data specified in your data_dir is compatible with the input data expected by the dataloader specified in data_processor.

To learn more about the dataloader APIs, visit the Cerebras Model Zoo dataloader APIs or the README.md for the specific model you are interested in.

Note on PyTorch datasets#

Cerebras software does not support automatic download of PyTorch datasets as part of the dataloader. This is because the input workers that execute the dataloaders inside the Cerebras cluster do not have internet access. In addition, having multiple input workers download the dataset to the same shared location could result in synchronization issues.

Therefore, if you want to use PyTorch datasets that need to be downloaded, prepare them offline, outside of the dataloader functions. Example scripts can be found in the PyTorch FC MNIST implementation.
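
The separation described above can be sketched as two functions: one that is run once, offline, on a machine with internet access, and one that the dataloader calls and that only ever reads local files. This is a generic pattern, not the FC MNIST scripts. The names `prepare_dataset`, `load_dataset`, the `fetch` callable, and the file name `dataset.bin` are assumptions made for this example.

```python
import os
import tempfile


def prepare_dataset(data_dir, fetch):
    """Run once, offline, before training (hypothetical helper).

    `fetch` is a caller-supplied callable that returns the dataset as bytes,
    for example by downloading it. It is never called from the dataloader.
    """
    os.makedirs(data_dir, exist_ok=True)
    path = os.path.join(data_dir, "dataset.bin")
    if not os.path.exists(path):
        with open(path, "wb") as f:
            f.write(fetch())
    return path


def load_dataset(data_dir):
    """Called from the dataloader: reads local files only, never downloads."""
    path = os.path.join(data_dir, "dataset.bin")
    if not os.path.exists(path):
        raise FileNotFoundError(
            f"{path} not found; run prepare_dataset offline before training"
        )
    with open(path, "rb") as f:
        return f.read()


# Usage: the lambda stands in for a real download step.
data_dir = os.path.join(tempfile.mkdtemp(), "mnist")
prepare_dataset(data_dir, fetch=lambda: b"example bytes")
print(load_dataset(data_dir))
```

Because the loader fails fast with a clear error instead of attempting a download, a missing dataset is reported immediately rather than leaving several input workers to write to the same location. For torchvision datasets that accept a `download` argument, the analogous step is to download during offline preparation and pass `download=False` inside the dataloader.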