Dataloaders for PyTorch#

Overview#

To speed up data loading, PyTorch supports parallelized data loading, fetching batches of indices rather than one index at a time, and streaming to progressively download datasets.

PyTorch provides a data loading utility class, torch.utils.data.DataLoader. Its most important argument is dataset, the dataset object to load the data from. There are two types of datasets.

A map-style dataset (Dataset) is a map from indices/keys to data samples. For example, accessing dataset[idx] could read the idx-th image and its label from a directory on disk.

An iterable-style dataset (IterableDataset) represents an iterable over data samples. It is well suited to cases where random reads are expensive or even improbable, and where the batch size depends on the fetched data. For example, calling iter(dataset) could return a stream of data read from a database, a remote server, or even logs generated in real time.
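The difference between the two dataset types can be sketched as follows. This is a minimal illustration, not code from Cerebras Model Zoo; the class names SquaresMapDataset and SquaresIterableDataset are hypothetical, and the samples are generated in memory rather than read from disk or a remote source.

```python
import torch
from torch.utils.data import Dataset, IterableDataset


class SquaresMapDataset(Dataset):
    """Map-style: supports len() and random access by index."""

    def __init__(self, n):
        self.n = n

    def __len__(self):
        return self.n

    def __getitem__(self, idx):
        # A real dataset might read the idx-th file from disk here.
        return torch.tensor(idx), torch.tensor(idx * idx)


class SquaresIterableDataset(IterableDataset):
    """Iterable-style: only supports sequential iteration."""

    def __init__(self, n):
        self.n = n

    def __iter__(self):
        # A real dataset might read from a database or a socket here.
        for i in range(self.n):
            yield torch.tensor(i), torch.tensor(i * i)


map_ds = SquaresMapDataset(5)
iter_ds = SquaresIterableDataset(5)

x, y = map_ds[3]                        # random access by index
first_x, first_y = next(iter(iter_ds))  # sequential access only
```

Note that the iterable-style class defines neither __len__ nor __getitem__, so the only way to reach a given sample is to iterate up to it.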

The dataloaders in Cerebras Model Zoo extend one of these two dataset types to create dataloader classes that implement additional functionality. For example, BertCSVDynamicMaskDataProcessor (code) extends IterableDataset, and BertClassifierDataProcessor (code) extends Dataset.

Properties of PyTorch Dataloader#

Data loading order#

For an IterableDataset, the loading order is controlled entirely by the user-defined iterable, which makes chunk-reading and dynamic batch sizes easy to implement.

For a map-style Dataset, a default Sampler is used. It is also possible to supply a custom Sampler object that, at each step, yields the next index/key to fetch.

There is no sampler for iterable-style datasets, since such datasets have no notion of a key or an index.
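As a rough sketch of a custom Sampler for a map-style dataset, the example below yields indices from last to first. The ReverseSampler class is hypothetical and exists only to show the interface: a sampler needs __iter__ to yield indices and, typically, __len__.

```python
import torch
from torch.utils.data import DataLoader, Sampler, TensorDataset


class ReverseSampler(Sampler):
    """Hypothetical sampler that yields indices from last to first."""

    def __init__(self, data_source):
        self.data_source = data_source

    def __iter__(self):
        return iter(range(len(self.data_source) - 1, -1, -1))

    def __len__(self):
        return len(self.data_source)


dataset = TensorDataset(torch.arange(6))
loader = DataLoader(dataset, batch_size=2, sampler=ReverseSampler(dataset))

# Each batch is a list with one tensor, because TensorDataset returns 1-tuples.
batches = [batch[0].tolist() for batch in loader]
```

Here the DataLoader groups the sampler's indices into batches of two, so the data arrives in reverse order.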

Loading batched/non-batched data#

Automatic batching is the most common (and default) case: the data loader fetches a minibatch of samples and collates them into a batched sample (usually the first dimension of each Tensor is the batch dimension).

When batch_size (default 1) is not None, the data loader yields batched samples instead of individual samples. The drop_last argument specifies how the data loader handles a final batch that is smaller than batch_size.

When drop_last=True, the last batch is dropped if the number of examples in your dataset is not divisible by batch_size. When drop_last=False, that last batch is kept and is smaller than batch_size.
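The effect of drop_last can be checked with a small experiment. This sketch assumes a toy dataset of 10 samples and a batch size of 4, chosen only so that the dataset size is not divisible by the batch size.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.arange(10))  # 10 samples, batch size 4

keep_last = DataLoader(dataset, batch_size=4, drop_last=False)
drop_last = DataLoader(dataset, batch_size=4, drop_last=True)

# Record the size of every batch each loader produces.
sizes_keep = [batch[0].shape[0] for batch in keep_last]
sizes_drop = [batch[0].shape[0] for batch in drop_last]

print(sizes_keep)  # [4, 4, 2]
print(sizes_drop)  # [4, 4]
```

With drop_last=True every batch has the same shape, at the cost of skipping the two leftover samples in each pass over the data.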

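With automatic batching enabled, loading from a map-style dataset is roughly equivalent to:

def load_map_style(dataset, batch_sampler, collate_fn):
    # Fetch each sample by index, then collate the list into one batch.
    for indices in batch_sampler:
        yield collate_fn([dataset[i] for i in indices])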

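Loading from an iterable-style dataset is roughly equivalent to:

def load_iterable_style(dataset, batch_sampler, collate_fn):
    # Samples are pulled sequentially; the indices only set the batch size.
    dataset_iter = iter(dataset)
    for indices in batch_sampler:
        yield collate_fn([next(dataset_iter) for _ in indices])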

`collate_fn`#

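When automatic batching is disabled, the default collate_fn simply converts NumPy arrays into PyTorch tensors. When automatic batching is enabled, the default collate_fn has the following properties:

It prepends a new dimension as the batch dimension.

It automatically converts NumPy arrays and Python numerical values into PyTorch tensors.

It preserves the data structure: for example, if each sample is a dictionary, it outputs a dictionary with the same keys, with batched tensors as values.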
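These properties can be observed by calling the default collate function directly. The sketch below assumes a PyTorch version that exposes default_collate from torch.utils.data; the keys "tokens" and "label" are placeholders.

```python
import numpy as np
import torch
from torch.utils.data import default_collate

# Two samples, each a dictionary holding a NumPy array and a Python int.
samples = [
    {"tokens": np.array([1, 2, 3]), "label": 0},
    {"tokens": np.array([4, 5, 6]), "label": 1},
]
batch = default_collate(samples)

print(type(batch))            # the dictionary structure is preserved
print(batch["tokens"].shape)  # torch.Size([2, 3]): new leading batch dimension
print(batch["label"])         # Python ints become a tensor of shape [2]
```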

A custom collate_fn can be used to customize collation, e.g., padding sequences of various lengths to the maximum length in a batch, collating along a dimension other than the first, or adding support for custom data types.
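As one possible illustration of a custom collate_fn, the sketch below pads variable-length sequences to the longest sequence in each batch. The function name pad_collate and the padding value of 0 are assumptions made for this example.

```python
import torch
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader


def pad_collate(samples):
    """Pad variable-length 1-D tensors to the longest one in the batch."""
    lengths = torch.tensor([len(s) for s in samples])
    padded = pad_sequence(samples, batch_first=True, padding_value=0)
    return padded, lengths


# A plain list works as a map-style dataset (it has __getitem__ and __len__).
sequences = [torch.tensor([1, 2, 3]), torch.tensor([4]), torch.tensor([5, 6])]
loader = DataLoader(sequences, batch_size=3, collate_fn=pad_collate)

padded, lengths = next(iter(loader))
print(padded)   # rows: [1, 2, 3], [4, 0, 0], [5, 6, 0]
print(lengths)  # original lengths: 3, 1, 2
```

Returning the original lengths alongside the padded tensor lets downstream code build an attention or loss mask if it needs one.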

Multiple Workers#

PyTorch makes parallel data loading very easy. You can parallelize data loading with the num_workers argument of a PyTorch DataLoader and get a higher throughput.

Under the hood, the DataLoader starts num_workers processes. Each process reloads the dataset passed to the DataLoader and is used to query examples. When the dataset is memory-mapped from disk, reloading it inside a worker doesn't fill up your RAM, since the worker simply memory-maps the dataset again from your disk.

Stream data#

Loading a dataset in streaming mode lets you progressively download only the data you need while iterating over the dataset. If the dataset is split into several shards (i.e., if the dataset consists of multiple data files), then you can stream in parallel using num_workers.
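One way an iterable-style dataset can divide shards among workers is to query torch.utils.data.get_worker_info() inside __iter__. The sketch below is a simplified assumption of how such sharding might look: ShardedDataset and shards_for are hypothetical names, and the shards are in-memory lists standing in for data files.

```python
from torch.utils.data import DataLoader, IterableDataset, get_worker_info


class ShardedDataset(IterableDataset):
    """Hypothetical dataset whose samples live in several shards."""

    def __init__(self, shards):
        self.shards = shards  # stand-in for a list of data files

    def shards_for(self, worker_id, num_workers):
        # Worker k takes shards k, k + num_workers, k + 2 * num_workers, ...
        return self.shards[worker_id::num_workers]

    def __iter__(self):
        info = get_worker_info()  # None when running in the main process
        if info is None:
            my_shards = self.shards
        else:
            my_shards = self.shards_for(info.id, info.num_workers)
        for shard in my_shards:
            yield from shard  # a real dataset would open and stream the file


dataset = ShardedDataset([[0, 1], [2, 3], [4, 5], [6, 7]])
# num_workers=0 loads in the main process; with a positive value, each
# worker would iterate only over the shards returned by shards_for().
loader = DataLoader(dataset, batch_size=None, num_workers=0)
samples = list(loader)
```

Without this kind of split, every worker would iterate over the full dataset and each sample would be returned once per worker.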

Putting these options together, the DataLoader constructor, which supports both map-style and iterable-style datasets, has the following signature:

DataLoader(dataset, batch_size=1, shuffle=False, sampler=None,
   batch_sampler=None, num_workers=0, collate_fn=None,
   pin_memory=False, drop_last=False, timeout=0,
   worker_init_fn=None, *, prefetch_factor=2,
   persistent_workers=False)

Cerebras Model Zoo Dataloaders#

Some example dataloaders in Cerebras Model Zoo that extend these dataset types and provide additional functionality (e.g., input encoding and tokenization) are:

BertCSVDataProcessor - Reads CSV files containing the input text tokens and the MLM and NSP features

GptHDF5MapDataProcessor - An HDF5 map-style dataset processor that reads HDF5 files for GPT pre-training

T5DynamicDataProcessor - Reads text files containing the input text tokens and adds extra IDs for the language modeling task on the fly

Create a custom dataloader with PyTorch#

To create your own dataloader, keep these tips in mind:

Coherence between the output of the dataloader and the input of the neural network model

For example, if you are using one of the models from Cerebras Model Zoo, the README file of each model explains the input format required to run and train that model. For instance, if you are using GPT-2, you must ensure your own input function produces the features dictionary that the model expects.

Cerebras-supported file types

You can create your own dataset by extending one of the native dataset types. Currently, the Cerebras ecosystem only supports files of types HDF5, TSV, CSV, and TXT. Other file types have not been tested.
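Both tips can be combined in a small sketch: a map-style dataset that reads CSV rows and returns a features dictionary. Everything specific here is an assumption for illustration. The class name CsvFeaturesDataset, the column names, and the dictionary keys "input_ids" and "label" are placeholders, and the keys and dtypes a given model actually requires are defined in that model's README.

```python
import csv
import io

import torch
from torch.utils.data import DataLoader, Dataset

# Stand-in for a CSV file on disk; the column names are placeholders.
CSV_TEXT = "input_ids,label\n1 2 3,0\n4 5 6,1\n7 8 9,0\n"


class CsvFeaturesDataset(Dataset):
    """Map-style dataset that turns CSV rows into a features dictionary."""

    def __init__(self, csv_file):
        self.rows = list(csv.DictReader(csv_file))

    def __len__(self):
        return len(self.rows)

    def __getitem__(self, idx):
        row = self.rows[idx]
        token_ids = [int(tok) for tok in row["input_ids"].split()]
        # Hypothetical feature keys; match them to what your model expects.
        return {
            "input_ids": torch.tensor(token_ids, dtype=torch.int32),
            "label": torch.tensor(int(row["label"]), dtype=torch.int32),
        }


dataset = CsvFeaturesDataset(io.StringIO(CSV_TEXT))
loader = DataLoader(dataset, batch_size=2, drop_last=True)
features = next(iter(loader))
```

Because the default collate function preserves dictionaries, each batch is itself a dictionary with the same keys and a leading batch dimension on every value.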

To learn more about creating a custom dataloader, refer to our step-by-step tutorial on Creating custom dataloaders.