To speed up data loading, PyTorch supports parallelized data loading, retrieving batches of indices instead of one index at a time, and streaming to progressively download datasets.
PyTorch provides a data loading utility class, torch.utils.data.DataLoader. Its most important argument is dataset, the dataset object to load the data from. There are two types of datasets:
Map-style datasets (Dataset) map indices/keys to data samples. For example, accessing dataset[idx] could read the idx-th image and its label from a directory on disk.
Iterable-style datasets (IterableDataset) represent an iterable over data samples. This style is well suited to cases where random reads are expensive or even improbable, and where the batch size depends on the fetched data. For example, calling iter(dataset) could return a stream of data from a database, a remote server, or even logs generated in real time.
We extend either of these two classes to create our own dataset classes and implement additional functionality.
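A minimal sketch of both styles may help here. The class names SquaresDataset and CountingStream are hypothetical and chosen only for illustration; the in-memory data stands in for files on disk or a remote stream.

```python
import torch
from torch.utils.data import Dataset, IterableDataset


class SquaresDataset(Dataset):
    """Hypothetical map-style dataset: supports len() and access by index."""

    def __init__(self, n):
        self.n = n

    def __len__(self):
        return self.n

    def __getitem__(self, idx):
        # Return a (sample, label) pair for the requested index.
        x = torch.tensor(float(idx))
        return x, x ** 2


class CountingStream(IterableDataset):
    """Hypothetical iterable-style dataset: supports iteration only."""

    def __init__(self, limit):
        self.limit = limit

    def __iter__(self):
        # Stand-in for a stream from a database, remote server, or log.
        for i in range(self.limit):
            yield torch.tensor(float(i))


map_ds = SquaresDataset(5)
stream_ds = CountingStream(3)

sample, label = map_ds[3]                 # random access by index
streamed = [t.item() for t in stream_ds]  # sequential access only
```

Note that the iterable-style class defines neither __len__ nor __getitem__; the only way to read it is to iterate over it.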
Data loading order
For an IterableDataset, the loading order is controlled entirely by the user-defined iterable, which makes chunk reading and dynamic batch sizes easy to implement.
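For a map-style Dataset, the order is determined by a Sampler. A default Sampler is constructed automatically based on the shuffle argument, and it is also possible to pass a custom Sampler object that yields the next index/key to fetch each time.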
There is no iterable-style dataset sampler, since such datasets have no notion of a key or an index.
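As an illustration of a custom sampler for a map-style dataset, here is a small sketch. ReverseSampler is a hypothetical name, and a plain Python list is used as the dataset on the assumption that anything with __getitem__ and __len__ is enough for this purpose.

```python
from torch.utils.data import DataLoader, Sampler


class ReverseSampler(Sampler):
    """Hypothetical sampler that yields indices from last to first."""

    def __init__(self, data_source):
        self.data_source = data_source

    def __iter__(self):
        # Yield one index at a time: n-1, n-2, ..., 0.
        return iter(range(len(self.data_source) - 1, -1, -1))

    def __len__(self):
        return len(self.data_source)


data = [10, 20, 30, 40]  # a list behaves like a map-style dataset here
loader = DataLoader(data, batch_size=2, sampler=ReverseSampler(data))

# Each batch is a tensor built from the samples at the yielded indices.
order = [batch.tolist() for batch in loader]
```

The sampler only decides which indices are visited and in what order; grouping them into batches of two is still handled by the DataLoader.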
Loading batched and non-batched data
Automatic batching is the most common (and default) case: the data loader fetches a minibatch of data and collates it into batched samples, i.e., Tensors whose first dimension is usually the batch dimension.
When batch_size (default 1) is not None, the data loader yields batched samples instead of individual samples.
The batch_size and drop_last arguments specify how the data loader obtains batches of dataset keys.
With drop_last=True, the last batch is dropped when the number of examples in your dataset is not divisible by your batch_size; with drop_last=False, the last batch is kept and is simply smaller than the others.
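The effect of drop_last can be checked with a short sketch. The dataset of 10 samples and the batch size of 4 are arbitrary values chosen so that the last batch is incomplete.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.arange(10))  # 10 samples, not divisible by 4

loader_keep = DataLoader(dataset, batch_size=4, drop_last=False)
loader_drop = DataLoader(dataset, batch_size=4, drop_last=True)

# TensorDataset yields 1-tuples, so each batch is a list with one tensor.
sizes_keep = [batch[0].shape[0] for batch in loader_keep]
sizes_drop = [batch[0].shape[0] for batch in loader_drop]
```

Here sizes_keep ends with a smaller batch of 2 samples, while sizes_drop contains only full batches.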
Loading from a map-style dataset is then roughly equivalent to:
for indices in batch_sampler:
    yield collate_fn([dataset[i] for i in indices])
And loading from an iterable-style dataset is roughly equivalent to:
dataset_iter = iter(dataset)
for indices in batch_sampler:
    yield collate_fn([next(dataset_iter) for _ in indices])
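To make the map-style pseudocode concrete, the sketch below builds the batches by hand and compares them with what a DataLoader yields. It assumes a PyTorch version that exports default_collate from torch.utils.data; the toy dataset is made up for the example.

```python
import torch
from torch.utils.data import (
    BatchSampler,
    DataLoader,
    SequentialSampler,
    default_collate,
)

# Toy map-style dataset: five samples, each a tensor of shape (2,).
dataset = [torch.tensor([i, i + 1]) for i in range(5)]

# A batch sampler yields lists of indices, e.g. [0, 1], [2, 3], [4].
batch_sampler = BatchSampler(
    SequentialSampler(dataset), batch_size=2, drop_last=False
)

# Manual version of the loop: fetch samples by index, then collate them.
manual = [
    default_collate([dataset[i] for i in indices]) for indices in batch_sampler
]

# The same batches produced by a DataLoader with default settings.
automatic = list(DataLoader(dataset, batch_size=2))
```

Each full batch in manual has shape (2, 2): the collate step stacks the samples along a new first dimension.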
If batching is disabled, the default collate_fn simply converts NumPy arrays into PyTorch tensors. When batching is enabled, the default collate_fn has the following properties:
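It always prepends a new dimension as the batch dimension.
It automatically converts NumPy arrays and Python numerical values into PyTorch tensors.
It preserves the data structure; e.g., if each sample is a dictionary, the output is a dictionary with the same keys and batched tensors as values.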
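These properties can be observed by calling the default collate function directly. The sketch assumes default_collate is importable from torch.utils.data and that NumPy is installed; the keys "pixels" and "label" are made up for the example.

```python
import numpy as np
import torch
from torch.utils.data import default_collate

# Two samples, each a dict holding a NumPy array and a Python int.
samples = [
    {"pixels": np.zeros((2, 2), dtype=np.float32), "label": 0},
    {"pixels": np.ones((2, 2), dtype=np.float32), "label": 1},
]

batch = default_collate(samples)

pixels_shape = tuple(batch["pixels"].shape)  # new leading batch dimension
labels = batch["label"].tolist()             # Python ints became a tensor
```

The result is still a dict with the same keys, but every value is now a batched tensor.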
A custom collate_fn can be used to customize collation, e.g., padding sequences of various lengths to the maximum length in a batch, collating along a dimension other than the first, or adding support for custom data types.
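As a sketch of the padding use case, the function below pads variable-length sequences to the longest one in the batch. The name pad_collate and the choice of 0 as the padding value are assumptions made for this example.

```python
import torch
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader


def pad_collate(batch):
    """Hypothetical collate_fn: pad 1-D tensors to the batch's max length."""
    lengths = torch.tensor([len(seq) for seq in batch])
    # Result has shape (batch_size, max_len); shorter rows are filled with 0.
    padded = pad_sequence(batch, batch_first=True, padding_value=0)
    return padded, lengths


sequences = [torch.tensor([1, 2, 3]), torch.tensor([4]), torch.tensor([5, 6])]
loader = DataLoader(sequences, batch_size=3, collate_fn=pad_collate)

padded, lengths = next(iter(loader))
```

Returning the original lengths alongside the padded tensor lets downstream code mask out or ignore the padding.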
PyTorch makes parallel data loading very easy. You can parallelize data loading with the num_workers argument of a PyTorch DataLoader and get a higher throughput.
Under the hood, the DataLoader starts num_workers worker processes. Each process reloads the dataset passed to the DataLoader and is used to query examples. Reloading the dataset inside a worker doesn’t fill up your RAM, since it simply memory-maps the dataset again from your disk.
By loading a dataset in streaming mode, you can progressively download the data you need while iterating over the dataset. If the dataset is split into several shards (i.e., if the dataset consists of multiple data files), then you can stream the shards in parallel using multiple workers.
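One way to picture sharded loading is an iterable-style dataset that assigns shards to workers with get_worker_info. This is only a sketch: ShardedStream is a hypothetical class, each shard is an in-memory list rather than a downloaded file, and num_workers=0 is used so the example runs in the main process.

```python
from torch.utils.data import DataLoader, IterableDataset, get_worker_info


class ShardedStream(IterableDataset):
    """Hypothetical dataset in which each shard is just a list of records."""

    def __init__(self, shards):
        self.shards = shards

    def __iter__(self):
        info = get_worker_info()  # None when loading in the main process
        worker_id = info.id if info is not None else 0
        num_workers = info.num_workers if info is not None else 1
        # Worker k reads shards k, k + num_workers, k + 2 * num_workers, ...
        for shard in self.shards[worker_id::num_workers]:
            yield from shard


shards = [[0, 1], [2, 3], [4, 5], [6, 7]]

# batch_size=None disables automatic batching, so records come out one by one.
loader = DataLoader(ShardedStream(shards), batch_size=None, num_workers=0)
records = list(loader)
```

With num_workers set to 2, for instance, each worker would read only half of the shards, so no record would be produced twice.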
The full DataLoader signature, which supports both map-style and iterable-style datasets, looks like this:
DataLoader(dataset, batch_size=1, shuffle=False, sampler=None, batch_sampler=None, num_workers=0, collate_fn=None, pin_memory=False, drop_last=False, timeout=0, worker_init_fn=None, *, prefetch_factor=2, persistent_workers=False)
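A short usage sketch ties several of these arguments together. The random features and the specific argument values are placeholders, and the remaining arguments are left at their defaults.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

features = torch.randn(8, 3)  # 8 samples with 3 features each
labels = torch.arange(8)      # one integer label per sample
dataset = TensorDataset(features, labels)

loader = DataLoader(
    dataset,
    batch_size=4,
    shuffle=True,     # reshuffle the indices at every epoch
    num_workers=0,    # load in the main process
    drop_last=False,
)

seen = []
for x, y in loader:
    # x has shape (4, 3) and y has shape (4,)
    seen.extend(y.tolist())
```

Shuffling changes the order in which samples are visited, but every sample is still seen exactly once per epoch.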