Special considerations for CV dataloaders

Adding support for using worker_cache with a new dataloader

ML training is often bottlenecked at the dataloader stage. On the Cerebras Wafer-Scale Cluster, the create_worker_cache function improves data loading speed by avoiding streaming the dataset over the network. It caches the dataset on the worker node's local SSD, which has significantly faster read speeds than network storage.

To enable worker_cache for a new dataloader, ensure that the data directory is added to the worker_cache on the worker node. The utility function create_worker_cache does this. It looks at the src directory and copies it to the worker_cache if it is not already cached and there is enough space on the cache (cache usage must not exceed 80% after the directory is cached). It returns the path to the directory on the worker_cache. Users then only need to replace the original data_dir in the dataloader with the returned path.
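The caching rule described above can be sketched in plain Python. This is an illustrative stand-in, not the Cerebras implementation: the function name sketch_worker_cache, its parameters, the choice to raise an error when space is insufficient, and the use of the source directory's base name as the cache subdirectory are all assumptions made for the example.

```python
import os
import shutil


def dir_size_bytes(path):
    """Return the total size of all regular files under `path`."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            total += os.path.getsize(os.path.join(root, name))
    return total


def sketch_worker_cache(src_dir, cache_root, max_usage=0.8):
    """Hypothetical stand-in for create_worker_cache (NOT the Cerebras code).

    Copies `src_dir` under `cache_root` unless it is already cached, and
    refuses to copy if the cache filesystem would be more than `max_usage`
    full afterwards. Returns the path of the cached directory.
    """
    src_dir = os.path.abspath(src_dir)
    # Assumption: the cached copy is named after the source directory.
    dest = os.path.join(cache_root, os.path.basename(src_dir))

    # Already cached: reuse the existing copy without touching the disk.
    if os.path.isdir(dest):
        return dest

    # Project the cache usage after the copy and enforce the limit.
    usage = shutil.disk_usage(cache_root)
    projected = (usage.used + dir_size_bytes(src_dir)) / usage.total
    if projected > max_usage:
        raise RuntimeError(
            f"Caching {src_dir} would fill the cache to {projected:.0%}, "
            f"above the {max_usage:.0%} limit."
        )

    shutil.copytree(src_dir, dest)
    return dest
```

In a dataloader, the returned path would take the place of the original data directory, for example data_dir = sketch_worker_cache(data_dir, cache_root) with cache_root pointing at the local SSD mount.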

NOTE: create_worker_cache works only in appliance mode and should be called only for the worker task. Use the cm.is_appliance and cm.is_streamer APIs to check for appliance mode and whether the call is being made for a worker task, respectively.

Example

Below is an example of adding worker cache support for InriaAerialDataset:

https://github.com/Cerebras/modelzoo/tree/main/modelzoo/vision/pytorch/unet/input/InriaAerialDataProcessor.py#L68-L75

The above snippet calls create_worker_cache on the VisionDataset’s self.root and overwrites it when both of the following hold:
1. use_worker_cache is True (controlled by the YAML config), and
2. the dataloader is being called by a worker task (determined with cm.is_streamer).

The updated self.root (containing the SSD path) will be used by the dataloader.
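The guard pattern described here can be sketched as follows. This is a minimal, self-contained illustration and not the code at the link above: the class CachedRootDataset is hypothetical, and the three callables passed to it are stand-ins for cm.is_appliance, cm.is_streamer, and create_worker_cache so that the example runs without a Cerebras installation. Raising an error outside appliance mode is also an assumption of this sketch.

```python
class CachedRootDataset:
    """Hypothetical dataset showing only the root-swapping logic."""

    def __init__(self, root, use_worker_cache, is_appliance, is_streamer, cache_fn):
        # `is_appliance`, `is_streamer`, and `cache_fn` are injected stand-ins
        # for cm.is_appliance, cm.is_streamer, and create_worker_cache.
        self.root = root

        # Only a worker task with caching enabled should touch the cache.
        if use_worker_cache and is_streamer():
            if not is_appliance():
                # Assumption: fail loudly rather than silently skip caching.
                raise RuntimeError("use_worker_cache requires appliance mode")
            # Overwrite the root so later reads come from the local SSD copy.
            self.root = cache_fn(self.root)
```

With this structure, the rest of the dataset code keeps reading from self.root and does not need to know whether the path points at network storage or at the cache.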
