cerebras.modelzoo.data_preparation.nlp.chunk_data_processing.chunk_data_preprocessor.ChunkDataPreprocessor#

class cerebras.modelzoo.data_preparation.nlp.chunk_data_processing.chunk_data_preprocessor.ChunkDataPreprocessor[source]#

Bases: object

Initialize the class with given parameters and logger.

Parameters
  • params (dict) – Configuration parameters.

  • logger (Logger) – Logging interface.

Methods

add_token

Add a token to the tokenizer.

calculate_total_chunks

Calculate the total number of chunks based on the given total size and the predefined max chunk size.

calculate_total_size

Calculate the total size of all input files, taking compression factors into consideration.

check_unused_params

Check for any unused parameters and log them as warnings.

file_split_process_dataset

Process the dataset by splitting files across multiple processes.

get_output_dir

Retrieve the output directory path.

get_params_file

Retrieve the path to the JSON parameters file.

get_vocab_size

Get the tokenizer vocabulary size.

handle_metadata_files

Handle metadata files based on provided configuration.

initialize_gpt2tokenizer

Initialize GPT-2 tokenizer.

initialize_huggingfacetokenizer

Initialize Hugging Face tokenizer.

initialize_miscellaneous_attributes

Initialize miscellaneous attributes.

initialize_neoxtokenizer

Initialize Neox tokenizer.

initialize_tokenizer

Initialize tokenizer based on the provided tokenizer_type parameter.

process_dataset

Process the dataset either through file split or task split methods.

process_dataset_params

Process dataset specific parameters.

process_files

Process the given files, tokenize the data chunks, and save to HDF5 format.

process_params

Process parameters by calling various initialization methods.

process_processing_params

Process the processing parameters and initialize relevant class attributes.

process_setup_params

Set up the number of processes based on provided configuration.

read_checkpoint

Read the checkpoint args from the created checkpoint file.

reader_process

Reads data from input files and distributes them to the tokenizer queues.

setup_output_directory

Set up the output directory based on provided configuration.

shuffle_second_pass

Perform the second pass of shuffling.

split_shuffle_second_pass

Divide the output HDF5 files among different processes and prepare them for the second pass of shuffling.

stats_collation

Collate the stats obtained from the different writer processes into combined final stats.

task_split_process_dataset

Split the dataset processing tasks across multiple processes.

tokenizer_process

Tokenizes data and forwards the tokenized data to the writer queue.

write_remaining_prefix

Write the prefix remaining after processing LMData when pack_sequences is set to True.

writer_process

Process that writes tokenized data to HDF5 format.

__init__(params, logger)[source]#

Initialize the class with given parameters and logger.

Parameters
  • params (dict) – Configuration parameters.

  • logger (Logger) – Logging interface.
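
A minimal instantiation sketch in Python. The exact schema of params is defined by the preprocessing configuration; the setup/processing/dataset sections and the keys inside them are shown here only as assumptions for illustration, not the authoritative schema.

    import logging

    from cerebras.modelzoo.data_preparation.nlp.chunk_data_processing.chunk_data_preprocessor import (
        ChunkDataPreprocessor,
    )

    logging.basicConfig(level=logging.INFO)
    logger = logging.getLogger("chunk_preprocessing")

    # Hypothetical configuration; real configs are typically loaded from YAML
    # and may use different section and key names.
    params = {
        "setup": {
            "input_dir": "./raw_data",      # assumed key: location of the input files
            "output_dir": "./hdf5_output",  # assumed key: where HDF5 chunks are written
            "processes": 4,                 # assumed key: number of worker processes
        },
        "processing": {
            "tokenizer_type": "HuggingFaceTokenizer",  # drives initialize_tokenizer dispatch
            "max_seq_length": 2048,                    # assumed key
        },
        "dataset": {},
    }

    preprocessor = ChunkDataPreprocessor(params, logger)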

process_params() None[source]#

Process parameters by calling various initialization methods.

setup_output_directory() None[source]#

Set up the output directory based on provided configuration.

handle_metadata_files() None[source]#

Handle metadata files based on provided configuration.

process_setup_params() None[source]#

Set up the number of processes based on provided configuration.

check_unused_params() None[source]#

Check for any unused parameters and log them as warnings.

process_dataset_params() None[source]#

Process dataset specific parameters.

process_processing_params() None[source]#

Process the processing parameters and initialize relevant class attributes.

add_token(token)[source]#

Add a token to the tokenizer.

Parameters

token (str) – Token to be added to the tokenizer.
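
For example, a special token could be registered with the tokenizer before processing starts; the token string below is purely illustrative.

    # Register an additional token with the underlying tokenizer.
    preprocessor.add_token("<|pad|>")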

initialize_tokenizer(processing_params: Dict[str, Any]) None[source]#

Initialize tokenizer based on the provided tokenizer_type parameter.

Parameters

processing_params (Dict[str, Any]) – Dictionary of processing parameters.
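
The dispatch is driven by the tokenizer_type entry of the processing parameters. A hedged sketch of such a dictionary follows; apart from tokenizer_type, the keys and values shown are assumptions, not the authoritative schema.

    # Illustrative processing parameters; the concrete keys accepted for each
    # tokenizer type may differ in the actual implementation.
    processing_params = {
        "tokenizer_type": "NeoXTokenizer",  # assumed value; intended to select the NeoX path
        "max_seq_length": 2048,             # assumed key
    }
    preprocessor.initialize_tokenizer(processing_params)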

initialize_gpt2tokenizer(processing_params: Dict[str, Any]) None[source]#

Initialize GPT-2 tokenizer.

Parameters

processing_params (Dict[str, Any]) – Dictionary of processing parameters.

initialize_neoxtokenizer(processing_params: Dict[str, Any]) None[source]#

Initialize Neox tokenizer.

Parameters

processing_params (Dict[str, Any]) – Dictionary of processing parameters.

initialize_huggingfacetokenizer(processing_params: Dict[str, Any]) None[source]#

Initialize Hugging Face tokenizer.

Parameters

processing_params (Dict[str, Any]) – Dictionary of processing parameters.

initialize_miscellaneous_attributes() None[source]#

Initialize miscellaneous attributes.

get_params_file() str[source]#

Retrieve the path to the JSON parameters file.

Returns

Path to the JSON parameters file.

Return type

str

get_output_dir() str[source]#

Retrieve the output directory path.

Returns

Path to the output directory.

Return type

str
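
Both getters are plain accessors; continuing the instantiation sketch above:

    # Query where artifacts are written and where the JSON parameters file lives.
    output_dir = preprocessor.get_output_dir()
    params_file = preprocessor.get_params_file()
    logger.info("HDF5 chunks go to %s; parameters recorded in %s", output_dir, params_file)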

calculate_total_size() int[source]#

Calculate the total size of all input files, taking compression factors into consideration.

Returns

The total size of all input files in bytes.

Return type

int

calculate_total_chunks(total_size: int) int[source]#

Calculate the total number of chunks based on the given total size and the predefined max chunk size.

Parameters

total_size (int) – The total size of the data in bytes.

Returns

Total number of chunks.

Return type

int
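
Conceptually this is a ceiling division of the (compression-adjusted) total size by the maximum chunk size. The numbers below only illustrate the arithmetic and are not taken from the implementation.

    import math

    total_size = 10 * 1024**3        # bytes, e.g. as returned by calculate_total_size()
    max_chunk_size = 64 * 1024**2    # bytes; assumed value for the predefined limit
    total_chunks = math.ceil(total_size / max_chunk_size)
    print(total_chunks)  # 160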

read_checkpoint(num_writers) List[Tuple[int, int, int]][source]#

Read the checkpoint args from the created checkpoint file.

Parameters

num_writers – The number of writer processes.

write_remaining_prefix(chunk_locks, pid) Tuple[int, Dict][source]#

Write the prefix remaining after processing LMData when pack_sequences is set to True.

Parameters
  • chunk_locks – List of locks for appending to HDF5 files during shuffling.

  • pid – Process id of the current process.

shuffle_second_pass(file_list, progress_counter, pid) None[source]#

Perform the second pass of shuffling.

Parameters
  • file_list – List of HDF5 file paths to shuffle.

  • progress_counter – A shared counter to track progress across processes.

  • pid – Process id of the current process.

split_shuffle_second_pass()[source]#

Divide the output HDF5 files among different processes and prepare them for the second pass of shuffling.

stats_collation(num_writer_processes) None[source]#

Collate the stats obtained from the different writer processes into combined final stats.

Parameters

num_writer_processes – Number of writer processes.

process_files(file_paths, process_idx, checkpoint_args, progress_counter, chunk_locks) None[source]#

Process the given files, tokenize the data chunks, and save to HDF5 format.

Parameters
  • file_paths – List of file paths.

  • process_idx – Index of the current process among all processes spawned for file split.

  • checkpoint_args (Tuple[int, int, int]) – File index, doc start index, and hdf5 index.

  • progress_counter (Value[int]) – Shared counter tracking number of processed chunks.

  • chunk_locks – List of locks for appending to HDF5 files during shuffling.

file_split_process_dataset() None[source]#

Process the dataset by splitting files across multiple processes.

reader_process(checkpoint_args: Tuple) None[source]#

Reads data from input files and distributes them to the tokenizer queues.

Parameters

checkpoint_args (Tuple[int, int, int]) – File index, doc start index, and hdf5 index.

tokenizer_process(idx: int) None[source]#

Tokenizes data and forwards the tokenized data to the writer queue.

Parameters

idx (int) – Queue ID to forward tokenized chunks of data.

writer_process(progress_counter: Value[int], num_sentinels: int, writer_idx: int, chunk_locks) None[source]#

Process that writes tokenized data to HDF5 format.

Parameters
  • progress_counter (Value[int]) – Shared counter tracking number of processed chunks.

  • num_sentinels – Number of sentinels to be received for the current writer process.

  • writer_idx – The index of the current writer process.

  • chunk_locks – List of locks for appending to HDF5 files during shuffling.

task_split_process_dataset() None[source]#

Split the dataset processing tasks across multiple processes.
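
In task-split mode the docstrings above describe a reader process that feeds tokenizer processes, which in turn feed writer processes. The following is a self-contained, illustrative producer/consumer pipeline built on multiprocessing queues; it is not the class's actual implementation, and every name in it is hypothetical (for instance, the real writers emit HDF5, not text).

    import multiprocessing as mp

    SENTINEL = None

    def reader(lines, tok_queues):
        # Distribute raw lines round-robin across the tokenizer queues.
        for i, line in enumerate(lines):
            tok_queues[i % len(tok_queues)].put(line)
        for q in tok_queues:
            q.put(SENTINEL)

    def tokenizer(tok_queue, writer_queue):
        # Stand-in "tokenization": split on whitespace.
        while (item := tok_queue.get()) is not SENTINEL:
            writer_queue.put(item.split())
        writer_queue.put(SENTINEL)

    def writer(writer_queue, num_sentinels, out_path):
        # Drain the queue until every tokenizer has signalled completion.
        seen = 0
        with open(out_path, "w") as f:
            while seen < num_sentinels:
                item = writer_queue.get()
                if item is SENTINEL:
                    seen += 1
                    continue
                f.write(" ".join(item) + "\n")

    if __name__ == "__main__":
        lines = ["hello world", "chunked data preprocessing", "task split example"]
        tok_queues = [mp.Queue() for _ in range(2)]
        writer_queue = mp.Queue()
        procs = [mp.Process(target=tokenizer, args=(q, writer_queue)) for q in tok_queues]
        procs.append(mp.Process(target=writer, args=(writer_queue, len(tok_queues), "out.txt")))
        for p in procs:
            p.start()
        reader(lines, tok_queues)
        for p in procs:
            p.join()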

process_dataset() dict[source]#

Process the dataset either through file split or task split methods.
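
A minimal end-to-end usage sketch, assuming a preprocessor and logger constructed as above. Whether process_params() must be invoked explicitly before process_dataset() is an assumption made here for clarity.

    preprocessor.process_params()           # resolve config, set up output dir and tokenizer
    stats = preprocessor.process_dataset()  # returns a dict of aggregated statistics
    logger.info("Preprocessing stats: %s", stats)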

get_vocab_size()[source]#

Get the tokenizer vocabulary size.

Returns

Vocabulary size of the tokenizer.

Return type

int
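
Continuing the sketch above:

    vocab_size = preprocessor.get_vocab_size()
    logger.info("Tokenizer vocabulary size: %d", vocab_size)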