Prepare your data#
Samples are processed on the input preprocessing servers before being sent to the CS-2 system. To transform your raw data, you will:
Prepare your data using offline preprocessing scripts. Consider preprocessing your data offline for two reasons:
Choose a format that is compatible with the Cerebras ecosystem. Depending on the framework, some data formats provide the best performance with the Cerebras ecosystem. For example, Cerebras recommends using TFRecords for offline data processing in TensorFlow.
Reduce the time spent online performing expensive transformations on the data.
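The two points above can be sketched with a minimal offline preprocessing script. The tokenizer, vocabulary, and JSON Lines output format here are hypothetical stand-ins (not part of the Cerebras tooling); a real pipeline would write the format your framework performs best with, such as TFRecords for TensorFlow, with the fields your dataloader expects.

```python
# Minimal sketch of an offline preprocessing script (illustrative only).
# VOCAB, tokenize, and the JSONL output are assumptions for this example.
import json

VOCAB = {"<pad>": 0, "<bos>": 1, "<eos>": 2, "hello": 3, "world": 4}

def tokenize(text, vocab):
    """Lowercase, split on whitespace, and drop out-of-vocabulary words."""
    return [vocab[w] for w in text.lower().split() if w in vocab]

def preprocess(raw_samples, out_path, max_len=8):
    """Tokenize each sample once, offline, and persist the token ids,
    so this expensive step never runs online in the input workers."""
    with open(out_path, "w") as f:
        for text in raw_samples:
            ids = tokenize(text, VOCAB)[:max_len]
            f.write(json.dumps({"input_ids": ids}) + "\n")

preprocess(["Hello world", "hello unknown world"], "train.jsonl")
```

At training time, the dataloader only has to read back the already-tokenized ids, which is cheap compared with re-tokenizing raw text on every epoch.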
Create dataloaders that will be executed online to create samples for the CS-2 system. Dataloaders store and retrieve data in a way that allows fast parallel I/O operations and efficient storage of sequence data, and, because each dataloader is self-contained, the data is easier to distribute. Some operations performed in the dataloaders are:
Removing symbols that aren’t part of the vocabulary
Tokenizing words
Adding special symbols to indicate the beginning, middle, or end of a sentence
Converting tokens into features that can be fed into the model
Note
Learn more about dataloaders and supported data formats in the Cerebras ecosystem in Dataloaders for PyTorch and Dataloaders for TensorFlow.
Cerebras Model Zoo contains scripts for offline preprocessing as well as dataloaders for different model configurations and popular datasets. Therefore, you can:
Use offline preprocessing scripts and dataloaders from Cerebras Model Zoo
Create offline preprocessing scripts that match the dataloaders in Cerebras Model Zoo for your model of interest
Create your own offline preprocessing scripts and dataloaders to match a Cerebras Model Zoo model or your own model.
Note
Cerebras software does not support automatic download of PyTorch or TensorFlow datasets as part of the dataloader. This is because the input workers that execute the dataloaders inside the Cerebras cluster do not have internet access. In addition, having multiple input workers download the dataset to the same shared location could result in synchronization issues.
Therefore, if you want to use PyTorch or TensorFlow datasets, prepare them offline, outside of the dataloader functions. Some example scripts can be found in the TensorFlow FC MNIST implementation and the PyTorch FC MNIST implementation.
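One common way to structure this is a prepare-once helper that runs on a machine with internet access before training starts, so the input workers only ever read from the shared location. This is a hedged sketch of that pattern; `fetch` is a hypothetical placeholder for whatever actually downloads the data (for example, a `torchvision.datasets.MNIST(..., download=True)` call).

```python
# Sketch: download a dataset offline, once, before training.
# The `.complete` marker file is an assumption of this example, used to
# make the helper idempotent and avoid partial-download races.
from pathlib import Path

def ensure_dataset(root, fetch):
    """Populate `root` with the dataset only if it is not already there."""
    root = Path(root)
    marker = root / ".complete"
    if marker.exists():
        return root                      # already prepared; workers just read
    root.mkdir(parents=True, exist_ok=True)
    fetch(root)                          # runs once, outside the dataloader
    marker.touch()                       # written only after fetch succeeds
    return root
```

The dataloaders then receive `root` as a plain path and never attempt any network access themselves.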