cerebras.modelzoo.data.common.input_utils.get_data_for_task

cerebras.modelzoo.data.common.input_utils.get_data_for_task(task_id, meta_data_values_cum_sum, num_examples_per_task, meta_data_values, meta_data_filenames)

Distributes files, given the number of examples in each, so that every distributed task has access to exactly the same number of examples.

Parameters
  • task_id (int) – Integer id for a task.

  • meta_data_values_cum_sum (int) – Cumulative sum of the file sizes, in lines, from the meta data file.

  • num_examples_per_task (int) – Number of examples per Slurm task. Equal to batch_size * num_batch_per_task.

  • meta_data_values (list[int]) – List of the file sizes, in lines, from the meta data file.

  • meta_data_filenames (list[str]) – List of the file names from the meta data file.

Returns

A list of tuples of length 3. The list represents the files that should be considered for this task_id. Each tuple contains:

  • index 0: filepath.

  • index 1: number of examples to be considered for this task_id.

  • index 2: start index in the file from where these examples should be considered.
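To make the shape of the return value concrete, here is a minimal sketch of the splitting idea described above. It is not the modelzoo implementation: `split_files_for_task` is a hypothetical helper written for illustration, the file names and sizes are made up, and the sketch computes the cumulative sum internally (with a leading 0, which is an assumption) rather than taking `meta_data_values_cum_sum` as an argument.

```python
from itertools import accumulate


def split_files_for_task(task_id, num_examples_per_task,
                         meta_data_values, meta_data_filenames):
    """Hypothetical illustration only; not the library's code."""
    # Global range of example indices [start, end) assigned to this task.
    start = task_id * num_examples_per_task
    end = start + num_examples_per_task

    # Cumulative file sizes with a leading 0 (assumption of this sketch),
    # so file i covers global indices [cum_sum[i], cum_sum[i + 1]).
    cum_sum = [0, *accumulate(meta_data_values)]

    files_in_task = []
    for i, filename in enumerate(meta_data_filenames):
        file_start, file_end = cum_sum[i], cum_sum[i + 1]
        # Overlap between this file's range and the task's range.
        lo, hi = max(start, file_start), min(end, file_end)
        if lo < hi:
            # (filepath, number of examples, start index within the file)
            files_in_task.append((filename, hi - lo, lo - file_start))
    return files_in_task


# Made-up meta data: three files with 5, 3 and 4 lines.
sizes = [5, 3, 4]
names = ["a.txt", "b.txt", "c.txt"]

for task in range(3):
    print(task, split_files_for_task(task, 4, sizes, names))
```

With 4 examples per task, task 1 in this sketch takes the last example of `a.txt` (start index 4) and all 3 examples of `b.txt`, so a single task can span several files while still receiving exactly `num_examples_per_task` examples.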