cerebras.modelzoo.data_preparation.nlp.chunk_data_processing.chunk_data_preprocessor.ChunkDataPreprocessor#
- class cerebras.modelzoo.data_preparation.nlp.chunk_data_processing.chunk_data_preprocessor.ChunkDataPreprocessor[source]#
Bases:
object
Initialize the class with given parameters and logger.
- Parameters
params (dict) – Configuration parameters.
logger (Logger) – Logging interface.
Methods
Add a token to the tokenizer.
Calculate the total number of chunks based on the given total size and the predefined max chunk size.
Calculate the total size of all input files, taking compression factors into consideration.
Check for any unused parameters and log them as warnings.
Process the dataset by splitting files across multiple processes.
Retrieve the output directory path.
Retrieve the path to the JSON parameters file.
Get the tokenizer vocabulary size.
Handle metadata files based on provided configuration.
Initialize GPT-2 tokenizer.
Initialize Hugging Face tokenizer.
Initialize miscellaneous attributes.
Initialize Neox tokenizer.
Initialize tokenizer based on the provided tokenizer_type parameter.
Process the dataset either through file split or task split methods.
Process dataset-specific parameters.
Process the given files, tokenize the data chunks, and save to HDF5 format.
Process parameters by calling various initialization methods.
Process the processing parameters and initialize relevant class attributes.
Set up the number of processes based on the provided configuration.
Read the checkpoint args from the created checkpoint file.
Reads data from input files and distributes them to the tokenizer queues.
Set up the output directory based on provided configuration.
Perform the second pass of shuffling.
Divide the output HDF5 files among different processes and prepare them for the second pass of shuffling.
Collate the stats obtained from the different writer processes into combined final stats.
Split the dataset processing tasks across multiple processes.
Tokenizes data and forwards the tokenized data to the writer queue.
Verify whether the HDF5 files have been written correctly.
Write the prefix remaining after processing LMData when pack_sequences is set to true.
Process that writes tokenized data to HDF5 format.
- __init__(params, logger)[source]#
Initialize the class with given parameters and logger.
- Parameters
params (dict) – Configuration parameters.
logger (Logger) – Logging interface.
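A minimal construction sketch. The contents of the params dict below are illustrative placeholders only; the real schema comes from the preprocessing configuration files shipped with ModelZoo:

```python
import logging

from cerebras.modelzoo.data_preparation.nlp.chunk_data_processing.chunk_data_preprocessor import (
    ChunkDataPreprocessor,
)

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("chunk_data_preprocessor")

# Hypothetical configuration dict; keys shown here are placeholders,
# not the verified configuration schema.
params = {
    "setup": {"output_dir": "./hdf5_out", "processes": 4},
    "processing": {"max_seq_length": 2048},
    "dataset": {},
}

preprocessor = ChunkDataPreprocessor(params, logger)
```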
- setup_output_directory() None [source]#
Set up the output directory based on provided configuration.
- process_setup_params() None [source]#
Set up the number of processes based on the provided configuration.
- process_processing_params() None [source]#
Process the processing parameters and initialize relevant class attributes.
- add_token(token)[source]#
Add a token to the tokenizer.
- Parameters
token (str) – Token to be added to the tokenizer.
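For example, an extra special token can be registered before tokenization starts (a hedged sketch; the token string is arbitrary):

```python
# Register an extra special token with the underlying tokenizer.
preprocessor.add_token("<|endofdoc|>")
```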
- initialize_tokenizer(processing_params: Dict[str, Any]) None [source]#
Initialize tokenizer based on the provided tokenizer_type parameter.
- Parameters
processing_params (Dict[str, Any]) – Dictionary of processing parameters.
- initialize_gpt2tokenizer(processing_params: Dict[str, Any]) None [source]#
Initialize GPT-2 tokenizer.
- Parameters
processing_params (Dict[str, Any]) – Dictionary of processing parameters.
- initialize_neoxtokenizer(processing_params: Dict[str, Any]) None [source]#
Initialize Neox tokenizer.
- Parameters
processing_params (Dict[str, Any]) – Dictionary of processing parameters.
- initialize_huggingfacetokenizer(processing_params: Dict[str, Any]) None [source]#
Initialize Hugging Face tokenizer.
- Parameters
processing_params (Dict[str, Any]) – Dictionary of processing parameters.
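The tokenizer_type value in the processing parameters selects which of the initializers above is used. A hedged illustration follows; both the key names and the accepted values shown are assumptions, not the verified schema:

```python
# Hypothetical processing parameters; key names and values are assumptions.
processing_params = {
    "tokenizer_type": "HuggingFaceTokenizer",
    "huggingface_tokenizer": "gpt2",
    "max_seq_length": 2048,
}

# Dispatches to one of the specific initializers (GPT-2, Neox, or
# Hugging Face) based on the tokenizer_type value.
preprocessor.initialize_tokenizer(processing_params)
```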
- get_params_file() str [source]#
Retrieve the path to the JSON parameters file.
- Returns
Path to the JSON parameters file.
- Return type
str
- get_output_dir() str [source]#
Retrieve the output directory path.
- Returns
Path to the output directory.
- Return type
str
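Both getters return plain path strings, for example:

```python
output_dir = preprocessor.get_output_dir()    # directory receiving the HDF5 output
params_file = preprocessor.get_params_file()  # JSON copy of the parameters used
print(output_dir, params_file)
```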
- calculate_total_size() int [source]#
Calculate the total size of all input files, taking compression factors into consideration.
- Returns
The total size of all input files in bytes.
- Return type
int
- calculate_total_chunks(total_size: int) int [source]#
Calculate the total number of chunks based on the given total size and the predefined max chunk size.
- Parameters
total_size (int) – The total size of the data in bytes.
- Returns
Total number of chunks.
- Return type
int
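A sketch of how the two sizing helpers fit together; the chunk arithmetic in the comment is conceptual, since the actual max chunk size is an internal attribute of the class:

```python
# Total input size in bytes, adjusted for compression factors.
total_size = preprocessor.calculate_total_size()

# Number of chunks derived from that size; conceptually a ceiling division
# by the configured max chunk size, e.g. ceil(total_size / max_chunk_size).
total_chunks = preprocessor.calculate_total_chunks(total_size)
print(f"{total_size} bytes -> {total_chunks} chunks")
```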
- read_checkpoint(num_writers) List[Tuple[int, int, int]] [source]#
Read the checkpoint args from the created checkpoint file.
- Parameters
num_writers – The number of writer processes.
- verify_hdf5_files(chunk_data) None [source]#
Verify whether the HDF5 files have been written correctly.
- Parameters
chunk_data – The DataFrame chunk which needs to be written to an HDF5 file.
- write_remaining_prefix(chunk_locks, pid) Tuple[int, Dict] [source]#
Write the prefix remaining after processing LMData when pack_sequences is set to true.
- Parameters
chunk_locks – List of locks for appending to HDF5 files during shuffling.
pid – Process ID of the current process.
- shuffle_second_pass(file_list, progress_counter, pid) None [source]#
Perform the second pass of shuffling.
- Parameters
file_list – List of HDF5 file paths to shuffle.
progress_counter – A shared counter to track progress across processes.
- split_shuffle_second_pass()[source]#
Divide the output HDF5 files among different processes and prepare them for the second pass of shuffling.
- stats_collation(num_writer_processes) None [source]#
Collate the stats obtained from the different writer processes into combined final stats.
- Parameters
num_writer_processes – Number of writer processes.
- process_files(file_paths, process_idx, checkpoint_args, progress_counter, chunk_locks) None [source]#
Process the given files, tokenize the data chunks, and save to HDF5 format.
- Parameters
file_paths – List of file paths.
process_idx – Index of the current process among all processes spawned for the file split.
checkpoint_args (Tuple[int, int, int]) – File index, doc start index, and hdf5 index.
progress_counter (Value[int]) – Shared counter tracking number of processed chunks.
chunk_locks – List of locks for appending to HDF5 files during shuffling.
- file_split_process_dataset() None [source]#
Process the dataset by splitting files across multiple processes.
- reader_process(checkpoint_args: Tuple) None [source]#
Reads data from input files and distributes them to the tokenizer queues.
- Parameters
checkpoint_args (Tuple[int, int, int]) – File index, doc start index, and hdf5 index.
- tokenizer_process(idx: int) None [source]#
Tokenizes data and forwards the tokenized data to the writer queue.
- Parameters
idx (int) – Queue ID to forward tokenized chunks of data.
- writer_process(progress_counter: Value[int], num_sentinels: int, writer_idx: int, chunk_locks) None [source]#
Process that writes tokenized data to HDF5 format.
- Parameters
progress_counter (Value[int]) – Shared counter tracking number of processed chunks.
num_sentinels – Number of sentinels to be received by the current writer process.
writer_idx – The index of the current writer process.
chunk_locks – List of locks for appending to HDF5 files during shuffling.
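reader_process, tokenizer_process, and writer_process together form a producer-consumer pipeline connected by multiprocessing queues. The sketch below is not the class's implementation, just a minimal standalone illustration of that pattern:

```python
import multiprocessing as mp

def reader(files, tok_queue):
    # Producer: push raw text chunks onto the tokenizer queue.
    for path in files:
        with open(path) as f:
            tok_queue.put(f.read())
    tok_queue.put(None)  # sentinel: no more data

def tokenizer(tok_queue, writer_queue):
    # Middle stage: "tokenize" chunks and forward them to the writer.
    while (chunk := tok_queue.get()) is not None:
        writer_queue.put(chunk.split())  # toy tokenization
    writer_queue.put(None)

def writer(writer_queue):
    # Consumer: persist tokenized chunks (HDF5 in the real class).
    while (tokens := writer_queue.get()) is not None:
        print(f"writing {len(tokens)} tokens")

if __name__ == "__main__":
    tok_q, wr_q = mp.Queue(), mp.Queue()
    procs = [
        mp.Process(target=reader, args=(["input.txt"], tok_q)),  # placeholder input path
        mp.Process(target=tokenizer, args=(tok_q, wr_q)),
        mp.Process(target=writer, args=(wr_q,)),
    ]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
```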
- task_split_process_dataset() None [source]#
Split the dataset processing tasks across multiple processes.
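A hedged end-to-end sketch of driving the preprocessor. The call order is an assumption inferred from the method descriptions above, not a verified recipe; in practice the ModelZoo preprocessing entry-point script orchestrates these steps:

```python
preprocessor = ChunkDataPreprocessor(params, logger)

# Resolve output location, process counts, and processing attributes
# from the configuration.
preprocessor.setup_output_directory()
preprocessor.process_setup_params()
preprocessor.process_processing_params()

# Either split input files across processes, or split the work by task.
preprocessor.file_split_process_dataset()
# ...or: preprocessor.task_split_process_dataset()
```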