DataLoader Overview

Large datasets often do not fit in memory, which makes them difficult to use for machine learning. TensorFlow and PyTorch address this with DataLoaders. A DataLoader stores and retrieves data in a way that allows fast parallel I/O operations, stores sequence data efficiently, and, because it is self-contained, makes the data easier to distribute. This page covers the different types of DataLoaders that Cerebras uses to prepare data.

Cerebras supports general data processing for text data to be used by both PyTorch and TensorFlow. For details, refer to our Model Zoo documentation.

This data processing covers the following steps:

  • Removing symbols that aren’t part of the vocabulary

  • Tokenizing words

  • Adding certain symbols to indicate the beginning, middle, or end of the sentence

  • Converting tokens into features that can be fed into the model
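The four steps above can be sketched in a few lines of plain Python. This is only an illustration, not the Model Zoo implementation: the toy vocabulary, the special-token names ([CLS], [SEP], [PAD], [UNK]), the whitespace tokenizer, and the feature names input_ids and attention_mask are all assumptions made for the example.

```python
import re
from typing import Dict, List

# Hypothetical toy vocabulary; a real one is loaded from the model's vocab file.
VOCAB = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3, "the": 4, "cat": 5, "sat": 6}


def preprocess(text: str, max_length: int) -> Dict[str, List[int]]:
    """Turn raw text into fixed-length integer features (illustrative only)."""
    if max_length < 2:
        raise ValueError("max_length must leave room for [CLS] and [SEP]")

    # Step 1: remove symbols the toy vocabulary cannot represent
    # (assumed here: anything other than letters, digits, and whitespace).
    cleaned = re.sub(r"[^a-z0-9\s]", "", text.lower())

    # Step 2: tokenize. Whitespace splitting stands in for a real tokenizer.
    words = cleaned.split()

    # Step 3: add boundary symbols, truncating so that they still fit.
    tokens = ["[CLS]"] + words[: max_length - 2] + ["[SEP]"]

    # Step 4: convert tokens into padded integer features.
    input_ids = [VOCAB.get(token, VOCAB["[UNK]"]) for token in tokens]
    n_pad = max_length - len(input_ids)
    return {
        "input_ids": input_ids + [VOCAB["[PAD]"]] * n_pad,
        "attention_mask": [1] * len(input_ids) + [0] * n_pad,
    }
```

Calling preprocess("The cat sat!", max_length=6) drops the exclamation mark, wraps the three words in [CLS] and [SEP], and pads the result to six positions. Words missing from the vocabulary map to the [UNK] ID.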

Data Loaders for PyTorch

Data Formats

PyTorch provides two types of datasets as part of torch.utils.data: map-style datasets (Dataset) and iterable-style datasets (IterableDataset). Cerebras extends IterableDataset, as shown in the files listed below. These Cerebras implementations inherit part of their functionality from IterableDataset and add the functionality (namely, input encoding and tokenization) required by the model that consumes their output.

IterableDataset (torch.utils.data — PyTorch 1.12 documentation)

For more details about PyTorch dataloaders, refer to PyTorch DataLoader.

Custom DataLoaders with PyTorch

To create your own DataLoader, make sure that the input it prepares matches the format required by the model you are training. Each model can only be trained on data in a specific format, and the README file of every model that Cerebras provides explains the format that model requires. For example, if you are using GPT-2, your input function must produce the features dictionary that the model expects.

You can create your own dataset by extending one of the native dataset types. Currently, Cerebras only supports files of types HDF5, TSV, CSV, and TXT. Other types of files may not work as expected, as they have not been tested by Cerebras.
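As a minimal sketch of extending a native dataset type, the class below subclasses IterableDataset and streams samples from a TSV file. Several things here are assumptions made for illustration: the class name TsvTokenDataset, the file layout (one sequence of integer token IDs per line), and the feature names input_ids, attention_mask, and labels. Check the README of your model for the features it actually requires. The import fallback only exists so the sketch runs where PyTorch is not installed.

```python
import csv
from typing import Dict, Iterator, List

try:
    from torch.utils.data import IterableDataset
except ImportError:  # fallback so the sketch still runs without PyTorch
    IterableDataset = object


class TsvTokenDataset(IterableDataset):
    """Streams fixed-length samples from a TSV file of token IDs.

    Hypothetical layout: each line holds one sequence of tab-separated
    integer token IDs.
    """

    def __init__(self, path: str, max_sequence_length: int, pad_id: int = 0):
        super().__init__()
        self.path = path
        self.max_sequence_length = max_sequence_length
        self.pad_id = pad_id

    def __iter__(self) -> Iterator[Dict[str, List[int]]]:
        with open(self.path, newline="") as handle:
            for row in csv.reader(handle, delimiter="\t"):
                if not row:
                    continue  # skip blank lines
                # Keep one extra token so inputs and labels can be shifted.
                ids = [int(tok) for tok in row][: self.max_sequence_length + 1]
                inputs, labels = ids[:-1], ids[1:]
                n_pad = self.max_sequence_length - len(inputs)
                # Assumed feature names; every list has max_sequence_length items.
                yield {
                    "input_ids": inputs + [self.pad_id] * n_pad,
                    "attention_mask": [1] * len(inputs) + [0] * n_pad,
                    "labels": labels + [self.pad_id] * n_pad,
                }
```

The sketch yields plain Python lists to stay dependency-light. In practice, you would likely convert each list to a tensor before yielding it so that torch.utils.data.DataLoader can stack samples into batches. Note also that an IterableDataset read by several DataLoader workers repeats the data in every worker unless __iter__ splits the file between them.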

Data Loaders for TensorFlow

Cerebras uses TFRecords for TensorFlow data loaders. TFRecords were chosen primarily for their storage efficiency and because they can be read very quickly using parallel I/O operations, which the CS systems can take advantage of.

Data Formats

The following are the input DataLoaders that Cerebras provides in TensorFlow.

These extend TfRecordsProcessor, which creates datasets from pre-compiled TFRecords using the map_fn provided in the child class.

  • BertTfRecordsProcessor - Static-based data loader that creates dataset for BERT from pre-compiled TFRecords

  • BertMlmOnlyTfRecordsStaticMaskProcessor - Creates dataset from pre-compiled TFRecords for MLM task only; these TFRecords do not contain the segment_ids and next_sentence_labels features

  • BertMlmOnlyTfRecordsDynamicMaskProcessor - Reads TFRecords containing sequences of tokens, adds MLM features on the fly; resulting dataset is for MLM task only; these TFRecords do not contain segment_ids and next_sentence_labels

  • GptTfRecordsProcessor - Creates dataset from pre-compiled TFRecords

  • T5DynamicDataProcessor - Dataset generator for the T5 model; it also performs on-the-fly processing of data from text

  • DAGM2007Dataset - Creates dataset for the DAGM 2007 dataset, which consists of PNG images with labels in CSV; it also performs on-the-fly augmentations

  • SeverstalTFRecordsDataset - Creates dataset for the Severstal dataset, which consists of PNG images with labels in CSV; it also performs on-the-fly augmentations
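To make "adds MLM features on the fly" more concrete, the sketch below masks a random subset of token IDs each time a sequence is read. It is not the Cerebras implementation. The function name, the 15% masking fraction, and the feature names masked_lm_positions, masked_lm_ids, and masked_lm_weights are assumptions for illustration. The sketch also simplifies common practice: it always substitutes the mask ID and does not protect special tokens from masking.

```python
import random
from typing import Dict, List, Optional


def add_mlm_features(
    token_ids: List[int],
    mask_id: int,
    max_predictions: int,
    masked_fraction: float = 0.15,
    seed: Optional[int] = None,
) -> Dict[str, list]:
    """Mask a random subset of tokens and record what was hidden."""
    rng = random.Random(seed)

    # Number of positions to mask, capped by the sequence length and by the
    # fixed number of prediction slots.
    n_mask = min(
        max_predictions,
        len(token_ids),
        max(1, round(len(token_ids) * masked_fraction)),
    )
    positions = sorted(rng.sample(range(len(token_ids)), n_mask))

    input_ids = list(token_ids)
    original_ids = []
    for pos in positions:
        original_ids.append(token_ids[pos])  # label: the token that was hidden
        input_ids[pos] = mask_id             # input: the mask token

    # Pad the prediction slots to a fixed size; a weight of 0.0 marks padding.
    n_pad = max_predictions - n_mask
    return {
        "input_ids": input_ids,
        "masked_lm_positions": positions + [0] * n_pad,
        "masked_lm_ids": original_ids + [0] * n_pad,
        "masked_lm_weights": [1.0] * n_mask + [0.0] * n_pad,
    }
```

Because the mask is drawn at read time, the same stored sequence can yield a different masked sample on every epoch. A static-mask loader instead reads masks that were fixed when the TFRecords were written.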

This section described the Cerebras-recommended method of creating data loaders for TensorFlow (that is, via TFRecords) to get optimal performance. To use TFRecords, you must first convert your data to the TFRecords format, which might require additional steps and may not be straightforward. Alternatively, you can still use other native formats, such as NumPy; however, they may run at sub-optimal performance.

For more details about TensorFlow data loaders, refer to Basics of TFRecords.

Custom DataLoaders with TensorFlow

To create your own custom DataLoader, you must prepare the input so that it matches the format required by the model. For example, if you are using GPT-2, you must ensure that your own input function produces a features dictionary. Currently, Cerebras only supports files of types TSV, CSV, TXT, and PNG. Other types of files may not work as expected, as they have not been tested by Cerebras.

Performance Analyzer

An additional tool that Cerebras provides for TensorFlow is a performance analyzer called perf_input_fn. This function takes in three arguments: input_fn, params, and time_stop. You can use it to estimate the number of steps (and samples) per second for a given input function input_fn and params.

To do this, create a Python file trial.py as shown below. The test runs for 30 seconds, as set by time_stop=30:

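from cerebras.tf.tf_helper import perf_input_fn
from data import input_fn
from utils import get_params

# Path to the model's parameter file (for FC-MNIST, the default is configs/params.yaml)
params_file = "configs/params.yaml"
params = get_params(params_file)
# Measure the input function for 30 seconds
perf_input_fn(input_fn, params, time_stop=30)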

Then run csrun_cpu python trial.py. For example, running this function on the train_input_fn of the FC-MNIST model with its default parameters (configs/params.yaml) produces the following output:

total steps: 19332, time: 30.002246379852295, perf: 644.3517513735975 steps/sec/worker
Without counting first step, total steps: 19331, time: 25.694870948791504, perf: 752.32913364330508 steps/sec/worker
total number of inputs: 2
Shapes: {features: (100, 784), labels: (100,), }
(19332, 30.002246379852295, 4.307375431060791)

The fields in this output are as follows:

  • total steps is the number of training steps completed during the measurement.

  • time is the duration, in seconds, over which the performance of the input_fn was measured. It is set through the time_stop argument.

  • perf is the estimated number of training steps per second per worker.

  • The second line reports the same statistics without counting the first step. The first step is slower because it includes loading the model and its activations, so excluding it gives a more accurate estimate of the average performance of the input_fn.

  • In Shapes:

    • features: 100 is the batch size and 784 is the number of features per example

    • labels: because we have 100 examples in the batch, we have 100 labels

  • The three numbers in the last line are:

    • total steps

    • total time, which is approximately equal to time_stop

    • time taken for the first step
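The statistics described above can be reproduced with a small timing loop. The sketch below is not the implementation of perf_input_fn. It is a plain-Python approximation of what such a tool measures, and the function name measure_input_throughput and its make_batches argument are hypothetical. It pulls batches from an iterable until time_stop seconds have passed, then reports steps per second with and without the first step.

```python
import time
from typing import Callable, Iterable, Tuple


def measure_input_throughput(
    make_batches: Callable[[], Iterable], time_stop: float
) -> Tuple[int, float, float]:
    """Return (total steps, total time, time taken for the first step)."""
    start = time.perf_counter()
    steps = 0
    first_step_time = 0.0

    for _batch in make_batches():
        steps += 1
        now = time.perf_counter()
        if steps == 1:
            # The first batch includes one-off setup cost, so time it separately.
            first_step_time = now - start
        if now - start >= time_stop:
            break

    elapsed = time.perf_counter() - start
    if steps > 0 and elapsed > 0:
        print(f"total steps: {steps}, time: {elapsed}, "
              f"perf: {steps / elapsed} steps/sec")

    remaining = elapsed - first_step_time
    if steps > 1 and remaining > 0:
        print(f"Without counting first step, total steps: {steps - 1}, "
              f"time: {remaining}, perf: {(steps - 1) / remaining} steps/sec")

    return steps, elapsed, first_step_time
```

The same arithmetic applies to the tuple in the last line of the sample output: 19332 steps over about 30.0022 seconds is roughly 644.35 steps per second, and 19331 steps over the remaining 25.6949 seconds is roughly 752.33 steps per second.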