Prepare your data

Samples are processed on the input preprocessing servers before being sent to the CS-2 system. To transform your raw data, you will:

  • Prepare your data using offline preprocessing scripts. You may consider preprocessing your data offline for two reasons:

    1. To use a format that is compatible with the Cerebras ecosystem. Depending on the framework, some data formats perform better than others on Cerebras systems. For example, Cerebras recommends using TFRecords for offline data processing in TensorFlow.

    2. To reduce the time spent online on expensive data transformations.

  • Create dataloaders that run online to produce samples for the CS-2. Dataloaders are a way to store and retrieve data that allows fast parallel I/O and efficient storage of sequence data and, because they are self-contained, makes the data easier to distribute. Some operations performed in the dataloaders are:

    • Removing symbols that aren’t part of the vocabulary

    • Tokenizing words

    • Adding certain symbols to indicate the beginning, middle, or end of the sentence

    • Converting tokens into features that can be fed into the model
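The four dataloader operations listed above can be sketched in plain Python. This is only an illustration under stated assumptions, not the Cerebras Model Zoo implementation: the vocabulary, the `[CLS]`/`[SEP]`/`[PAD]` symbols, the fixed length of 10, and the function names `tokenize` and `make_features` are all hypothetical.

```python
# Hypothetical vocabulary and special symbols, chosen only for illustration.
VOCAB = {"[PAD]": 0, "[CLS]": 1, "[SEP]": 2,
         "the": 3, "cat": 4, "sat": 5, "dog": 6}
MAX_LEN = 10  # assumed fixed sequence length


def tokenize(text, vocab):
    """Split text into lowercase words and keep only those in the vocabulary."""
    tokens = []
    for raw in text.lower().split():
        # Strip punctuation and other non-alphanumeric characters.
        word = "".join(ch for ch in raw if ch.isalnum())
        # Drop anything that is not part of the vocabulary.
        if word in vocab:
            tokens.append(word)
    return tokens


def make_features(first, second, vocab=VOCAB, max_len=MAX_LEN):
    """Turn a pair of raw sentences into fixed-length model features."""
    # Mark the beginning ([CLS]), the middle ([SEP]), and the end ([SEP]).
    tokens = (["[CLS]"] + tokenize(first, vocab) + ["[SEP]"]
              + tokenize(second, vocab) + ["[SEP]"])
    # Naive truncation; a real loader would preserve the final [SEP].
    tokens = tokens[:max_len]
    # Convert tokens into integer features and pad to the fixed length.
    input_ids = [vocab[t] for t in tokens]
    attention_mask = [1] * len(input_ids)
    padding = max_len - len(input_ids)
    input_ids += [vocab["[PAD]"]] * padding
    attention_mask += [0] * padding
    return {"input_ids": input_ids, "attention_mask": attention_mask}


features = make_features("The cat sat!", "the dog @home")
```

Here the word `home` is dropped because it is not in the assumed vocabulary, and the two trailing positions are padding. In practice, logic like this would sit inside a framework-specific dataloader rather than a standalone function.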

Note

Learn more about dataloaders and supported data formats in the Cerebras ecosystem in Dataloaders for PyTorch and Dataloaders for TensorFlow.

The Cerebras Model Zoo contains scripts for offline preprocessing as well as dataloaders for different model configurations and popular datasets. Therefore, you can:

  1. Use the offline preprocessing scripts and dataloaders from the Cerebras Model Zoo.

  2. Create offline preprocessing scripts that match the Cerebras Model Zoo dataloaders for your model of interest.

  3. Create your own offline preprocessing scripts and dataloaders to match a Cerebras Model Zoo model or your own model.
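For the third option, the split between an offline script and an online dataloader could look roughly like the sketch below. It uses only the Python standard library and writes JSON Lines as a stand-in storage format; this is an assumption made to keep the example self-contained, not a format recommended by Cerebras (for TensorFlow, the recommendation above is TFRecords). The function names `preprocess_offline` and `load_batches` and the file name are hypothetical.

```python
import json
import os
import tempfile


def preprocess_offline(raw_lines, out_path):
    """Offline step: do the expensive cleaning once and store the result."""
    with open(out_path, "w", encoding="utf-8") as f:
        for line in raw_lines:
            tokens = line.lower().split()
            if tokens:  # skip empty lines
                f.write(json.dumps({"tokens": tokens}) + "\n")


def load_batches(path, batch_size):
    """Online step: stream preprocessed records and group them into batches."""
    batch = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            batch.append(json.loads(line)["tokens"])
            if len(batch) == batch_size:
                yield batch
                batch = []
    if batch:  # emit the final, possibly smaller, batch
        yield batch


# Usage: preprocess three non-empty lines once, then read them in batches of two.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "train.jsonl")
    preprocess_offline(["The cat sat", "", "A dog ran fast", "Birds fly"], path)
    batches = list(load_batches(path, batch_size=2))
```

The point of the split is that the offline function runs once per dataset, while the online generator does only cheap work (reading and batching) each time samples are needed.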