Number of dataset files is less than number of workers

Error observed

If the number of dataset files is less than the number of workers when training a PyTorch BERT model in pipelined execution, the following assertion error is shown during execution:

AssertionError: Number of processes should be less than number of files, Got `num_workers` equal to <num_workers> and `num_files` equal to <num_files>.

Explanation

The dataloader implementation in Cerebras Model Zoo enforces the constraint num_workers * num_workers_per_csx < number_of_files_in_dataset (a short worked example follows the list below), where:

  • num_workers: corresponds to the number of processes created per dataloader iterator. This parameter is a native PyTorch DataLoader argument. For Cerebras Model Zoo models, it is defined in train_input.num_workers in the configuration YAML file.

  • num_workers_per_csx: corresponds to the number of input workers used per CS-2 system. This parameter is specific to the Cerebras environment. For pipelined execution, its default value is 8. You can set this parameter with the --num_workers_per_csx flag of the run.py script.

  • number_of_files_in_dataset: corresponds to the number of files that compose the dataset. If you are using the Cerebras Model Zoo offline preprocessing scripts, you can set this value with the --num_output_files flag, which defaults to 10 files.
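
The following is a minimal sketch of this check, not the actual Model Zoo code; the function name is hypothetical, and the numbers in the example are the defaults cited above.

    def check_dataloader_constraint(num_workers: int,
                                    num_workers_per_csx: int,
                                    num_files: int) -> None:
        # Illustrative version of the constraint: the total number of
        # worker processes on the CS-2 system must be strictly less
        # than the number of dataset files.
        total_workers = num_workers * num_workers_per_csx
        assert total_workers < num_files, (
            f"Number of processes should be less than number of files, "
            f"got {total_workers} workers and {num_files} files."
        )

    # With the defaults in this section (num_workers_per_csx=8,
    # --num_output_files=10), num_workers=2 would fail: 2 * 8 = 16 >= 10.
    check_dataloader_constraint(num_workers=1,
                                num_workers_per_csx=8,
                                num_files=10)  # 1 * 8 = 8 < 10, passes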

Note

If you are using multi-replica in pipelined mode, make sure that num_workers_per_csx >= num_replicas. By default, num_workers_per_csx is set to 8 to allow the maximum possible replica count. We recommend keeping this default and reducing num_workers so that the dataloader constraint above is satisfied; a sketch of this adjustment follows.
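
For illustration only, this sketch derives the largest num_workers that still satisfies the strict constraint, assuming num_workers_per_csx stays at its default of 8; the helper name is hypothetical.

    def max_valid_num_workers(num_files: int,
                              num_workers_per_csx: int = 8) -> int:
        # Largest integer num_workers such that
        # num_workers * num_workers_per_csx < num_files.
        return (num_files - 1) // num_workers_per_csx

    # With the preprocessing default of 10 output files:
    #   (10 - 1) // 8 = 1, so set train_input.num_workers to 1
    #   (1 * 8 = 8 < 10).
    print(max_valid_num_workers(num_files=10))  # -> 1

If the result is 0, no num_workers value can satisfy the constraint; regenerate the dataset with a larger --num_output_files instead.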