Dataloaders for TensorFlow

Cerebras recommends using TFRecords to create TensorFlow dataloaders with optimal performance. TFRecords offers efficient storage and can be read very quickly using parallel I/O operations, which the CS systems can take advantage of.

To use TFRecords, you must first convert your raw data offline into the TFRecords format.
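As an illustration of the offline conversion step, here is a minimal sketch that serializes in-memory NumPy arrays into a single TFRecords file using standard TensorFlow calls. The feature keys ("features", "label"), the array shapes, and the output path are assumptions chosen for the example; they are not names required by Cerebras.

```python
import os
import tempfile

import numpy as np
import tensorflow as tf


def write_tfrecords(features, labels, path):
    """Serialize (feature vector, label) pairs into one TFRecords file."""
    with tf.io.TFRecordWriter(path) as writer:
        for x, y in zip(features, labels):
            example = tf.train.Example(
                features=tf.train.Features(
                    feature={
                        # Hypothetical key names; pick the ones your model expects.
                        "features": tf.train.Feature(
                            float_list=tf.train.FloatList(value=x.tolist())
                        ),
                        "label": tf.train.Feature(
                            int64_list=tf.train.Int64List(value=[int(y)])
                        ),
                    }
                )
            )
            writer.write(example.SerializeToString())


# Synthetic stand-in for raw data: 8 examples with 784 features each.
rng = np.random.default_rng(0)
features = rng.random((8, 784), dtype=np.float32)
labels = rng.integers(0, 10, size=8)

path = os.path.join(tempfile.mkdtemp(), "train.tfrecords")
write_tfrecords(features, labels, path)
```

For larger datasets, a common choice is to split the output across several files so that they can be read in parallel.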

Note

Alternatively, you can create TensorFlow dataloaders that ingest other native formats, such as NumPy arrays. However, they may run at suboptimal performance.

Cerebras Model Zoo Dataloaders

Cerebras Model Zoo dataloaders extend the TfRecordsProcessor, which creates datasets from pre-compiled TFRecords using the map_fn provided in the child class. Some examples:

  • BertTfRecordsProcessor - Static-based data loader that creates a dataset for BERT from pre-compiled TFRecords

  • BertMlmOnlyTfRecordsStaticMaskProcessor - Creates a dataset from pre-compiled TFRecords for the MLM task only; these TFRecords do not contain the segment_ids and next_sentence_labels features

  • BertMlmOnlyTfRecordsDynamicMaskProcessor - Reads TFRecords containing sequences of tokens and adds MLM features on the fly; the resulting dataset is for the MLM task only; these TFRecords do not contain segment_ids and next_sentence_labels

  • GptTfRecordsProcessor - Creates a dataset from pre-compiled TFRecords

  • T5DynamicDataProcessor - Dataset generator for the T5 model; it also performs on-the-fly processing of data from text

  • DAGM2007Dataset - Creates a dataset for the DAGM 2007 dataset, which consists of PNG images with labels in CSV; it also performs on-the-fly augmentations

  • SeverstalTFRecordsDataset - Creates a dataset for the Severstal dataset, which consists of PNG images with labels in CSV; it also performs on-the-fly augmentations
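The general pattern behind these processors, reading serialized records and decoding each one with a map_fn, can be sketched with plain tf.data calls. This is not the Model Zoo implementation: the function create_dataset, the feature specification, and the key names are hypothetical, and drop_remainder=True is used on the assumption that a fixed batch shape is wanted.

```python
import tensorflow as tf

AUTOTUNE = tf.data.experimental.AUTOTUNE

# Hypothetical schema; it must match how the TFRecords were written.
FEATURE_SPEC = {
    "features": tf.io.FixedLenFeature([784], tf.float32),
    "label": tf.io.FixedLenFeature([], tf.int64),
}


def map_fn(serialized):
    """Decode one serialized tf.train.Example into (features, label)."""
    parsed = tf.io.parse_single_example(serialized, FEATURE_SPEC)
    return parsed["features"], tf.cast(parsed["label"], tf.int32)


def create_dataset(file_pattern, batch_size, shuffle=True):
    """Build a batched dataset from all TFRecords files matching file_pattern."""
    files = tf.data.Dataset.list_files(file_pattern, shuffle=shuffle)
    # Read several files concurrently to benefit from parallel I/O.
    dataset = files.interleave(tf.data.TFRecordDataset, num_parallel_calls=AUTOTUNE)
    dataset = dataset.map(map_fn, num_parallel_calls=AUTOTUNE)
    # Drop the last partial batch so every batch has the same shape.
    dataset = dataset.batch(batch_size, drop_remainder=True)
    return dataset.prefetch(AUTOTUNE)
```

A child class in this style would mainly differ in its feature specification and in what map_fn returns.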

Create a custom dataloader with TensorFlow

To create your own dataloader, keep these tips in mind:

1. Coherence between the output of the dataloader and the input of the neural network model. For example, if you are using one of the models from the Cerebras Model Zoo, the README file of each model explains the input format that the model requires to run and train. For instance, if you are using GPT-2, you must ensure that your own input function produces a features dictionary in the format its README describes.

2. Cerebras-supported file types. You can create your own dataset by extending one of the native dataset types. Currently, the Cerebras ecosystem supports only TSV, CSV, TXT, and PNG files. Other file types have not been tested.
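To make both tips concrete, the following is a minimal sketch of a custom input function that reads a headerless CSV file and returns a features dictionary together with labels. The layout of params (a "train_input" section with "data_file", "num_features", and "batch_size"), the "inputs" key, and the assumption that the last CSV column holds an integer label are all hypothetical; check the README of your model for the keys and shapes it actually expects.

```python
import tensorflow as tf


def parse_line(line, num_features):
    """Split one CSV line into a features dictionary and an integer label."""
    # num_features float columns followed by one integer label column.
    defaults = [[0.0]] * num_features + [[0]]
    columns = tf.io.decode_csv(line, record_defaults=defaults)
    features = {"inputs": tf.stack(columns[:-1])}  # hypothetical key name
    label = tf.cast(columns[-1], tf.int32)
    return features, label


def input_fn(params):
    """Return a batched, repeating dataset of (features dict, labels)."""
    cfg = params["train_input"]  # hypothetical params layout
    dataset = tf.data.TextLineDataset(cfg["data_file"])
    dataset = dataset.map(lambda line: parse_line(line, cfg["num_features"]))
    # Keep every batch the same shape by dropping the last partial batch.
    dataset = dataset.batch(cfg["batch_size"], drop_remainder=True)
    return dataset.repeat()
```

Returning a dictionary rather than a bare tensor makes it easier to match a model that looks up its inputs by name.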

Performance Analyzer

An additional tool that Cerebras provides for TensorFlow is a performance analyzer called perf_input_fn. This function takes in three arguments: input_fn, params, and time_stop. You can use it to estimate the number of steps (and samples) per second for a given input function input_fn and params.

To do this, create a Python file trial.py as shown below, where time_stop=30 makes the test run for 30 seconds:

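from cerebras.tf.tf_helper import perf_input_fn
from data import input_fn
from utils import get_params

# Path to the model's parameter file (here, the FC-MNIST defaults).
params_file = "configs/params.yaml"

# Load the parameters and measure input_fn for 30 seconds.
params = get_params(params_file)
perf_input_fn(input_fn, params, time_stop=30)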

Then run csrun_cpu python trial.py. For example, running the train_input_fn of the FC-MNIST model with its default parameters (configs/params.yaml) through this function produces the following output:

total steps: 19332, time: 30.00224627985225, perf: 644.3517513735975 steps/sec/worker
Without counting first step, total steps: 19331, time: 25.694870948791504, perf: 752.32913364330508 steps/sec/worker
total number of inputs: 2
Shapes: {features: (100, 784), labels: (100,), }
(19332, 30.002246379852295, 4.307375431060791)

The values in this output are as follows:

  • total steps is the number of training steps.

  • time is the amount of time, in seconds, over which the performance of the input_fn was measured. It is set by the time_stop argument.

  • perf is an estimated number of training steps per second per worker.

  • The second line reports the same statistics without counting the first step. The first step is slower because of loading the model and its activations, so this line is a more accurate estimate of the average performance of the input_fn.

  • In Shapes:

    • features: 100 is the batch size and 784 is the number of features per example

    • labels: because there are 100 examples in the batch, there are 100 labels

  • The three numbers in the last line are:

    • total steps

    • total elapsed time, which corresponds to time_stop

    • time taken for the first step
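The relationship between these numbers can be sketched in plain Python. The code below is not the implementation of perf_input_fn; the helper names measure and rates are hypothetical and only illustrate how the two steps-per-second figures follow from the three numbers in the last line of the output.

```python
import time


def measure(batches, time_stop):
    """Pull batches from an iterator for roughly time_stop seconds.

    Returns (total_steps, total_time, first_step_time).
    """
    start = time.perf_counter()
    next(batches)  # the first step includes any one-time setup cost
    first_step_time = time.perf_counter() - start
    steps = 1
    while time.perf_counter() - start < time_stop:
        next(batches)
        steps += 1
    return steps, time.perf_counter() - start, first_step_time


def rates(total_steps, total_time, first_step_time):
    """Steps per second overall and with the first step excluded."""
    overall = total_steps / total_time
    steady = (total_steps - 1) / (total_time - first_step_time)
    return overall, steady


# The three numbers from the last line of the output above.
overall, steady = rates(19332, 30.002246379852295, 4.307375431060791)
print(f"{overall:.2f} steps/sec overall")
print(f"{steady:.2f} steps/sec without the first step")
# With a batch size of 100, samples per second is 100 times the step rate.
print(f"{steady * 100:.0f} samples/sec without the first step")
```

With these inputs, the two rates come out at about 644.35 and 752.33 steps per second, which matches the perf values reported in the output.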