Number of dataset files is less than number of workers
If the number of dataset files is less than the number of workers when training a PyTorch BERT model in pipelined execution, the following assertion is shown during execution:
AssertionError: Number of processes should be less than number of files, Got `num_workers` equal to <num_workers> and `num_files` equal to <num_files>.
The dataloader implementation in Cerebras Model Zoo has the constraint:
num_workers * num_workers_per_csx < number_of_files_in_dataset, where:
num_workers: the number of processes created per dataloader iterator. This parameter is a native PyTorch DataLoader argument. For Cerebras Model Zoo models, it is defined as train_input.num_workers in the configuration YAML file.
num_workers_per_csx: the number of input workers used per CS-2 system. This parameter is specific to the Cerebras environment. For pipelined execution, its default value is 8. You can set this parameter with the --num_workers_per_csx flag.
number_of_files_in_dataset: the number of files that compose the dataset. If you are using the Cerebras Model Zoo offline preprocessing scripts, you can set this parameter with the --num_output_files flag, which defaults to 10 files.
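The constraint above can be sketched as a small check. This is an illustrative helper, not code from Cerebras Model Zoo; the parameter names simply mirror those in the text, and the error wording is a paraphrase of the assertion shown earlier.

```python
def check_worker_file_constraint(num_workers, num_workers_per_csx, num_files):
    """Verify num_workers * num_workers_per_csx < num_files.

    Illustrative sketch of the dataloader constraint described above;
    not the actual Cerebras Model Zoo implementation.
    """
    total_workers = num_workers * num_workers_per_csx
    if total_workers >= num_files:
        raise AssertionError(
            f"Number of processes ({total_workers}) should be less than "
            f"number of files ({num_files})."
        )
    return True

# Example: 2 dataloader workers x 8 workers per CS-2 = 16 processes,
# which exceeds the default of 10 output files, so the check fails.
try:
    check_worker_file_constraint(num_workers=2, num_workers_per_csx=8, num_files=10)
except AssertionError as err:
    print(err)
```

With num_workers=1 the same configuration passes, since 1 * 8 = 8 is less than 10 files.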
If you are using multi-replica in pipelined mode, make sure that num_workers_per_csx >= num_replicas. By default, num_workers_per_csx is set to 8 to allow the maximum possible replica count. We recommend keeping this default and reducing num_workers so that the dataloader constraint is satisfied.
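Following that recommendation (keep num_workers_per_csx at 8 and reduce num_workers), the largest valid num_workers for a given file count can be computed directly. This is a hypothetical helper for illustration, not part of Cerebras Model Zoo.

```python
def max_num_workers(num_files, num_workers_per_csx=8):
    """Largest num_workers satisfying num_workers * num_workers_per_csx < num_files.

    Hypothetical helper; keeps the default num_workers_per_csx of 8 per the
    recommendation above. Returns 0 when the dataset has too few files for
    even one worker process.
    """
    return max(0, (num_files - 1) // num_workers_per_csx)

# With the default 10 output files, only num_workers=1 satisfies
# 1 * 8 = 8 < 10; num_workers=2 would give 16 processes and fail.
print(max_num_workers(10))
```

Alternatively, increase --num_output_files during offline preprocessing so the dataset has enough files for the worker count you want.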