Special considerations for CV dataloaders
Adding support for using worker_cache with a new dataloader
ML training is often bottlenecked at the dataloader stage. On the Cerebras Wafer-Scale Cluster, the create_worker_cache function improves dataloading speed by avoiding network dataset streaming: it caches a dataset on the worker node's local SSD, which has significantly faster read speeds than the network.
To enable worker_cache for a new dataloader, ensure that the data directory is added to the worker_cache on the worker node. The utility function create_worker_cache does this: it looks at the src directory and caches it on the worker_cache if it is not already cached and there is enough space on the cache (usage must not exceed 80% after the directory is cached). It returns the path to the directory on the worker_cache.
Users just need to replace the original data_dir with the returned directory in the dataloader.
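The behaviour described above (skip if already cached, check the 80% limit, copy, return the cached path) can be sketched in plain Python. This is only an illustration under stated assumptions, not the actual create_worker_cache implementation: the names cache_directory, dir_size_bytes, cache_root, and max_usage are hypothetical, and mirroring the source path under the cache root is an assumed layout.

```python
import os
import shutil
from pathlib import Path


def dir_size_bytes(path):
    """Return the total size in bytes of all files under `path`."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            total += os.path.getsize(os.path.join(root, name))
    return total


def cache_directory(src_dir, cache_root, max_usage=0.8,
                    disk_usage=shutil.disk_usage):
    """Hypothetical sketch of a directory cache with a usage limit.

    Copies `src_dir` under `cache_root` unless it is already there, and
    refuses to copy if the cache would end up more than `max_usage` full.
    `disk_usage` is injectable so the limit can be exercised in tests.
    """
    src = Path(src_dir).resolve()
    # Assumed layout: mirror the absolute source path under the cache root.
    dest = Path(cache_root) / src.relative_to(src.anchor)

    # Already cached: reuse the existing copy.
    if dest.exists():
        return str(dest)

    # Project the cache usage after the copy and enforce the limit.
    usage = disk_usage(cache_root)
    projected = (usage.used + dir_size_bytes(src)) / usage.total
    if projected > max_usage:
        raise RuntimeError(
            f"Caching {src} would fill the cache to {projected:.0%}, "
            f"above the {max_usage:.0%} limit"
        )

    shutil.copytree(src, dest)
    return str(dest)
```

A caller would then use the returned string in place of the original data directory, which is the substitution the paragraph above describes.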
Note
create_worker_cache should be called only for the worker task.
Example
Below is an example of adding worker cache support for InriaAerialDataset:
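import cerebras_pytorch as cstorch
import cerebras_pytorch.distributed as dist

# create_worker_cache is assumed to be imported already (its module path is not shown here).
# The check below runs inside the dataset class, where self.root holds the data directory.
if use_worker_cache and dist.is_streamer():
    if not cstorch.use_cs():
        # Worker cache is only available when running on the appliance.
        raise RuntimeError(
            "use_worker_cache not supported for non-appliance runs"
        )
    else:
        # Point the dataset at the cached copy on the worker's local SSD.
        self.root = create_worker_cache(self.root)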
The above snippet calls create_worker_cache on the VisionDataset's self.root and overwrites it when two conditions hold: use_worker_cache is True (controlled by the YAML config), and the dataloader is being called by a worker task (determined with dist.is_streamer()). If both hold but the run is not an appliance run (cstorch.use_cs() returns False), the snippet raises a RuntimeError instead.
The updated self.root (containing the SSD path) will then be used by the dataloader.
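To show how the overwritten root flows into the rest of a dataset, here is a minimal, self-contained sketch that runs without a Cerebras installation. Everything in it is hypothetical: CachedRootDataset is a toy class, and the is_worker and cache_fn parameters are injected stand-ins for dist.is_streamer and create_worker_cache respectively. It is not how InriaAerialDataset is implemented.

```python
from pathlib import Path


class CachedRootDataset:
    """Toy dataset that reads files under `root`, optionally via a cache.

    `is_worker` and `cache_fn` are hypothetical stand-ins for
    dist.is_streamer and create_worker_cache, injected so that the
    sketch can run anywhere.
    """

    def __init__(self, root, use_worker_cache=False,
                 is_worker=lambda: False, cache_fn=None):
        self.root = str(root)

        # Only worker tasks swap the root for the cached copy.
        if use_worker_cache and is_worker():
            if cache_fn is None:
                raise RuntimeError(
                    "use_worker_cache requires a caching function"
                )
            self.root = cache_fn(self.root)

        # Everything after this point reads from self.root, so the
        # cached path is picked up without further changes.
        self.files = sorted(
            p for p in Path(self.root).rglob("*") if p.is_file()
        )

    def __len__(self):
        return len(self.files)

    def __getitem__(self, index):
        return self.files[index].read_bytes()
```

Because the substitution happens before any file listing, the rest of the dataset code does not need to know whether it is reading from the network path or the local cache.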