Number of dataset files is less than number of workers
Error observed
If the number of dataset files is less than the number of workers when training a PyTorch BERT model in pipelined execution, the following assertion error is raised during execution:
AssertionError: Number of processes should be less than number of files, Got `num_workers` equal to <num_workers> and `num_files` equal to <num_files>.
Explanation
The dataloader implementation in Cerebras Model Zoo has the constraint:

num_workers * num_workers_per_csx < number_of_files_in_dataset

Here:

- num_workers: the number of processes created per dataloader iterator. This is a native PyTorch DataLoader argument. For Cerebras Model Zoo models, it is defined as train_input.num_workers in the configuration YAML file (see the example after this list).
- num_workers_per_csx: the number of input workers used per CS-2 system. This parameter is specific to the Cerebras environment. For pipelined execution, its default value is 8. You can set it with the --num_workers_per_csx flag of the run.py script.
- number_of_files_in_dataset: the number of files that compose the dataset. If you are using the Cerebras Model Zoo offline preprocessing scripts, you can control it with the --num_output_files flag, whose default value is 10 files.
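For example, assuming a typical Model Zoo params file layout (only the train_input.num_workers field and the two flags are confirmed above; the preprocessing script name below is a placeholder for whichever Model Zoo preprocessing script you use), the three values are set as follows:

    # params.yaml (excerpt)
    train_input:
        num_workers: 2    # PyTorch DataLoader processes per iterator

    # override the per-CS-2 input worker count at launch
    python run.py --num_workers_per_csx 4 ...

    # offline preprocessing: choose how many files the dataset is split into
    python <preprocessing_script>.py --num_output_files 32 ...

With these values, 2 * 4 = 8 < 32, so the constraint is satisfied.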
Note
If you are using multi-replica in pipelined mode, make sure that num_workers_per_csx >= num_replicas. By default, num_workers_per_csx is set to 8 to allow the maximum possible replica count. We recommend keeping this default and reducing num_workers so that the dataloader constraint above is satisfied.
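To pick a valid num_workers, you can check the constraint directly. A minimal sketch of the arithmetic in Python, using the default values mentioned above (a standalone illustration, not Model Zoo code):

    # Constraint: num_workers * num_workers_per_csx < number_of_files_in_dataset
    num_files = 10           # e.g., the --num_output_files default
    num_workers_per_csx = 8  # pipelined-execution default

    # Largest num_workers that still satisfies the strict inequality.
    max_num_workers = (num_files - 1) // num_workers_per_csx
    print(max_num_workers)   # 1 -> set train_input.num_workers to at most 1,
                             # or regenerate the dataset with more output files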