Dataloaders for PyTorch#
To speed up data loading, PyTorch supports parallelized data loading, retrieving batches of indices instead of individual samples, and streaming to progressively download datasets.
PyTorch provides a data loading utility class, torch.utils.data.DataLoader. Its most important argument is the dataset, the dataset object to load data from. There are two different types of datasets:
Map-style datasets (Dataset) map indices/keys to data samples. For example, accessing dataset[idx] could read the idx-th image and its label from a directory on disk.
Iterable-style datasets (IterableDataset) represent an iterable over data samples. They are well suited to cases where random reads are expensive or infeasible, and where the batch size depends on the fetched data. For example, calling iter(dataset) could return a stream of data read from a database, a remote server, or even logs generated in real time.
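The two styles can be sketched as follows; the datasets and their contents here are illustrative placeholders, not from Cerebras Model Zoo:

```python
import torch
from torch.utils.data import Dataset, IterableDataset

class MapStyleDataset(Dataset):
    """Map-style: supports random access via __getitem__ and has a length."""
    def __init__(self, num_samples):
        self.data = torch.arange(num_samples, dtype=torch.float32)

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        # dataset[idx] reads the idx-th sample
        return self.data[idx]

class StreamDataset(IterableDataset):
    """Iterable-style: yields samples sequentially, no random access."""
    def __init__(self, num_samples):
        self.num_samples = num_samples

    def __iter__(self):
        # iter(dataset) returns a stream of samples
        for i in range(self.num_samples):
            yield torch.tensor(float(i))

map_ds = MapStyleDataset(4)
stream_ds = StreamDataset(4)
```

A real iterable-style dataset would typically read from a file, socket, or database inside __iter__ instead of generating values in memory.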
The dataloaders in Cerebras Model Zoo extend one of these two classes to create dataloader classes that implement additional functionality. For example, BertCSVDynamicMaskDataProcessor (code) extends IterableDataset, and BertClassifierDataProcessor (code) extends Dataset.
Properties of PyTorch Dataloader#
Data loading order#
For an IterableDataset, the loading order is entirely controlled by the user-defined iterable, which makes chunk reading and dynamic batch sizes easy.
A Dataset has a default Sampler, and it is also possible to create a custom Sampler object that yields the next index/key to fetch at each step.
There is no iterable-style dataset sampler, since such datasets have no notion of a key or an index.
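A minimal custom Sampler might look like the following; yielding indices in reverse order is an illustrative choice, not something prescribed by the source:

```python
import torch
from torch.utils.data import Sampler, DataLoader, TensorDataset

class ReverseSampler(Sampler):
    """Yields dataset indices from last to first."""
    def __init__(self, data_source):
        self.data_source = data_source

    def __iter__(self):
        # Each step yields the next index for the DataLoader to fetch
        return iter(range(len(self.data_source) - 1, -1, -1))

    def __len__(self):
        return len(self.data_source)

dataset = TensorDataset(torch.arange(5))
loader = DataLoader(dataset, batch_size=1, sampler=ReverseSampler(dataset))
order = [int(batch[0]) for batch in loader]  # indices fetched in reverse
```

Note that sampler is mutually exclusive with shuffle; passing a custom sampler takes over the loading order entirely.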
Loading batched/non-batched data#
Automatic batching is the most common (and default) case: the loader fetches a minibatch of samples and collates them into a batch (usually the first dimension of the resulting tensors is the batch dimension). When batch_size (default 1) is not None, the data loader yields batched samples instead of individual samples. The drop_last argument specifies what happens when the number of examples in your dataset is not divisible by batch_size: with drop_last=True, the last (incomplete) batch is dropped, while drop_last=False keeps a last batch that is smaller than batch_size.
This leads to loading from map-style datasets as follows:

```python
for indices in batch_sampler:
    yield collate_fn([dataset[i] for i in indices])
```
And from an iterable-style dataset:

```python
dataset_iter = iter(dataset)
for indices in batch_sampler:
    yield collate_fn([next(dataset_iter) for _ in indices])
```
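The effect of drop_last can be checked directly. With a 10-sample dataset and batch_size=3, the final partial batch of one sample is either discarded or kept:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.arange(10))

# drop_last=True: the final partial batch (10 % 3 == 1 sample) is discarded
dropped = [batch[0] for batch in DataLoader(dataset, batch_size=3, drop_last=True)]
# drop_last=False (the default): the final batch is smaller than batch_size
kept = [batch[0] for batch in DataLoader(dataset, batch_size=3, drop_last=False)]

batch_sizes_dropped = [len(b) for b in dropped]  # [3, 3, 3]
batch_sizes_kept = [len(b) for b in kept]        # [3, 3, 3, 1]
```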
If automatic batching is disabled, the loader simply converts NumPy arrays into PyTorch tensors. When batching is enabled, the default collation has the following properties:
Prepends a new dimension as the batch dimension
Automatically converts NumPy arrays and Python numerical values into PyTorch tensors
Preserves the data structure, with the contents converted into PyTorch tensors
The collate_fn argument can be used to customize collation: for example, padding sequences of various lengths to the maximum length in a batch, collating along a dimension other than the first, or adding support for custom data types.
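A sketch of a custom collate_fn that pads variable-length sequences to the longest sequence in the batch; the token sequences here are synthetic stand-ins for real data:

```python
import torch
from torch.utils.data import DataLoader
from torch.nn.utils.rnn import pad_sequence

# Variable-length "token" sequences; a real dataset would read these from disk
sequences = [torch.tensor([1, 2, 3]), torch.tensor([4, 5]), torch.tensor([6])]

def pad_collate(batch):
    # Pad every sequence in the batch to the length of the longest one
    return pad_sequence(batch, batch_first=True, padding_value=0)

loader = DataLoader(sequences, batch_size=3, collate_fn=pad_collate)
padded = next(iter(loader))  # a single (3, 3) tensor, zero-padded
```

Without the custom collate_fn, the default collation would raise an error here, because it cannot stack tensors of different lengths.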
PyTorch makes parallel data loading easy: set the num_workers argument of a DataLoader to parallelize data loading and get higher throughput. Under the hood, the DataLoader spawns num_workers worker processes, each of which loads its own copy of the dataset and uses it to fetch examples. For memory-mapped datasets, reloading the dataset inside a worker does not fill up your RAM, since each worker simply memory-maps the dataset again from disk.
By loading a dataset in streaming mode, you progressively download only the data you need while iterating over the dataset. If the dataset is split into several shards (i.e., it consists of multiple data files), you can stream the shards in parallel across multiple workers.
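A sketch of sharded streaming under these assumptions: each worker of an IterableDataset claims a disjoint subset of shards via get_worker_info(). The shards here are in-memory lists standing in for data files:

```python
import torch
from torch.utils.data import IterableDataset, DataLoader, get_worker_info

class ShardedStream(IterableDataset):
    """Streams shards; each worker processes a disjoint subset of shards."""
    def __init__(self, shards):
        self.shards = shards  # e.g., a list of file paths; here, lists of ints

    def __iter__(self):
        info = get_worker_info()
        # In single-process loading, info is None and we handle every shard
        worker_id = info.id if info is not None else 0
        num_workers = info.num_workers if info is not None else 1
        for shard in self.shards[worker_id::num_workers]:
            # A real loader would open and parse the shard file here
            yield from shard

shards = [[0, 1], [2, 3], [4, 5]]
loader = DataLoader(ShardedStream(shards), batch_size=2, num_workers=0)
batches = [batch.tolist() for batch in loader]
```

Without this kind of shard splitting, every worker of an IterableDataset would yield the full dataset and produce duplicated samples.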
Now, a dataloader that supports both map-style and iterable-style datasets looks like this:

```python
DataLoader(dataset, batch_size=1, shuffle=False, sampler=None,
           batch_sampler=None, num_workers=0, collate_fn=None,
           pin_memory=False, drop_last=False, timeout=0,
           worker_init_fn=None, *, prefetch_factor=2,
           persistent_workers=False)
```
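Putting the arguments together, a typical configuration for a map-style dataset might look like this (the tensors are placeholders, and num_workers=0 keeps the example single-process):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

features = torch.randn(8, 4)  # 8 samples, 4 features each
labels = torch.arange(8)
dataset = TensorDataset(features, labels)

loader = DataLoader(
    dataset,
    batch_size=4,
    shuffle=True,    # reshuffles the indices at every epoch
    num_workers=0,   # > 0 would spawn that many worker processes
    drop_last=False,
)

x, y = next(iter(loader))
# x has shape (4, 4): a new batch dimension is prepended
```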
Cerebras Model Zoo Dataloaders#
Some example dataloaders in Cerebras Model Zoo that extend IterableDataset and provide additional functionality (e.g., input encoding and tokenization) are:
BertCSVDataProcessor - Reads CSV files containing the input text tokens and precomputed features
BertCSVDynamicMaskDataProcessor - Reads CSV files containing the input text tokens and adds NSP features on the fly
GptHDF5DataProcessor - An HDF5 dataset processor that reads data in HDF5 format for GPT pre-training
GptTextDataProcessor - A text dataset processor for GPT pre-training; performs on-the-fly processing of data from text files
T5DynamicDataProcessor - Reads text files containing the input text tokens, adds extra ids for language modelling task on the fly
TransformerDynamicDataProcessor - Reads text files containing the input text tokens and generates features on the fly
Create a custom dataloader with PyTorch#
To create your own dataloader, keep these tips in mind:
1. Ensure coherence between the output of the dataloader and the input of the neural network model. If you are using one of the models from Cerebras Model Zoo, the README file of every model explains the input format required to run and train that model. For instance, if you are using GPT-2, you must ensure your own input function produces the features dictionary described in its README.
2. Use Cerebras-supported file types. You can create your own dataset by extending one of the native dataset types. Currently, the Cerebras ecosystem only supports TXT files; other file types have not been tested.
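Following these tips, a hypothetical map-style dataset over pre-tokenized text lines might look like the sketch below. The feature names ("input_ids", "labels"), the line format, and the padding scheme are all illustrative assumptions; check the README of the target Model Zoo model for the exact dictionary keys, shapes, and dtypes it expects:

```python
import torch
from torch.utils.data import Dataset

class TxtClassificationDataset(Dataset):
    """Hypothetical dataset over lines of the form '<label> <tok> <tok> ...'.

    The feature names below are assumptions for illustration only; the
    required features dictionary is defined by the model's README.
    """
    def __init__(self, lines, max_len=8):
        self.lines = lines      # in practice, read from a TXT file on disk
        self.max_len = max_len

    def __len__(self):
        return len(self.lines)

    def __getitem__(self, idx):
        label, *tokens = self.lines[idx].split()
        ids = [int(t) for t in tokens][: self.max_len]
        ids += [0] * (self.max_len - len(ids))  # pad to a fixed length
        return {
            "input_ids": torch.tensor(ids, dtype=torch.int32),
            "labels": torch.tensor(int(label), dtype=torch.int32),
        }

sample_lines = ["1 5 6 7", "0 8 9"]
ds = TxtClassificationDataset(sample_lines)
example = ds[0]
```

Because every sample is padded to the same fixed length, the default collation can stack these dictionaries into batches without a custom collate_fn.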