modelzoo.transformers.data_processing.scripts.chunk_preprocessing.chunk_data_preprocessor.ChunkDataPreprocessor#

class modelzoo.transformers.data_processing.scripts.chunk_preprocessing.chunk_data_preprocessor.ChunkDataPreprocessor[source]#

Bases: object

Initialize the class with given parameters and logger.

Parameters
  • params (dict) – Configuration parameters.

  • logger (Logger) – Logging interface.

Methods

add_token

Add a token to the tokenizer.

calculate_total_chunks

Calculate the total number of chunks based on the given total size and the predefined max chunk size.

calculate_total_size

Calculate the total size of all input files, taking compression factors into consideration.

check_unused_params

Check for any unused parameters and log them as warnings.

file_split_process_dataset

Process the dataset by splitting files across multiple processes.

file_split_read_checkpoint

get_output_dir

Retrieve the output directory path.

get_params_file

Retrieve the path to the JSON parameters file.

get_vocab_size

Get the tokenizer vocabulary size.

handle_metadata_files

Handle metadata files based on provided configuration.

initialize_gpt2tokenizer

Initialize GPT-2 tokenizer.

initialize_huggingfacetokenizer

Initialize Hugging Face tokenizer.

initialize_miscellaneous_attributes

Initialize miscellaneous attributes.

initialize_neoxtokenizer

Initialize NeoX tokenizer.

initialize_tokenizer

Initialize tokenizer based on the provided tokenizer_type parameter.

process_dataset

Process the dataset either through file split or task split methods.

process_dataset_params

Process dataset specific parameters.

process_files

Process the given files, tokenize the data chunks, and save to HDF5 format.

process_params

Process parameters by calling various initialization methods.

process_processing_params

Process the processing parameters and initialize relevant class attributes.

process_setup_params

Set up the number of processes based on the provided configuration.

reader_process

Reads data from input files and distributes it to the tokenizer queues.

set_shuffle_seed

Sets the shuffle seed for NumPy.

setup_output_directory

Set up the output directory based on provided configuration.

task_split_process_dataset

Split the dataset processing tasks across multiple processes.

task_split_read_checkpoint

tokenizer_process

Tokenizes data and forwards the tokenized data to the writer queue.

verify_hdf5_files

write_remaining_prefix

writer_process

Process that writes tokenized data to HDF5 format.

__init__(params, logger)[source]#

Initialize the class with given parameters and logger.

Parameters
  • params (dict) – Configuration parameters.

  • logger (Logger) – Logging interface.
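
A minimal construction sketch follows. The import path comes from this page; the params keys and values shown (setup, processing, dataset and their contents) are illustrative assumptions, not the authoritative configuration schema, so consult your reference config for the actual required fields.

    import logging

    from modelzoo.transformers.data_processing.scripts.chunk_preprocessing.chunk_data_preprocessor import (
        ChunkDataPreprocessor,
    )

    logging.basicConfig(level=logging.INFO)
    logger = logging.getLogger("chunk_preprocessing")

    # Illustrative configuration only; the real schema is defined by the
    # preprocessing config files shipped with the package.
    params = {
        "setup": {"output_dir": "./hdf5_dataset", "processes": 4},
        "processing": {"tokenizer_type": "GPT2Tokenizer", "max_seq_length": 2048},
        "dataset": {},
    }

    preprocessor = ChunkDataPreprocessor(params, logger)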

add_token(token)[source]#

Add a token to the tokenizer.

Parameters

token (str) – Token to be added to the tokenizer.

calculate_total_chunks(total_size: int) int[source]#

Calculate the total number of chunks based on the given total size and the predefined max chunk size.

Parameters

total_size (int) – The total size of the data in bytes.

Returns

Total number of chunks.

Return type

int
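
As a rough illustration, this computation presumably amounts to a ceiling division of the total size by the maximum chunk size. The sketch below uses a hypothetical max_chunk_size value and is not the class's actual implementation.

    import math

    def total_chunks(total_size: int, max_chunk_size: int) -> int:
        # Every started chunk counts, including a partial final chunk.
        return math.ceil(total_size / max_chunk_size)

    # 10 GiB of input with a hypothetical 64 MiB chunk limit -> 160 chunks
    print(total_chunks(10 * 1024**3, 64 * 1024**2))  # 160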

calculate_total_size() int[source]#

Calculate the total size of all input files, taking compression factors into consideration.

Returns

The total size of all input files in bytes.

Return type

int
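
A sketch of the idea, assuming hypothetical compression factors that scale compressed inputs toward their uncompressed size; the real factors and supported extensions are internal to the class and may differ.

    import os

    # Hypothetical factors; the preprocessor's actual values may differ.
    COMPRESSION_FACTORS = {".gz": 3, ".zst": 4, ".bz2": 4}

    def estimated_total_size(file_paths):
        total = 0
        for path in file_paths:
            factor = COMPRESSION_FACTORS.get(os.path.splitext(path)[1], 1)
            total += os.path.getsize(path) * factor
        return total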

check_unused_params() None[source]#

Check for any unused parameters and log them as warnings.

file_split_process_dataset() None[source]#

Process the dataset by splitting files across multiple processes.

get_output_dir() str[source]#

Retrieve the output directory path.

Returns

Path to the output directory.

Return type

str

get_params_file() str[source]#

Retrieve the path to the JSON parameters file.

Returns

Path to the JSON parameters file.

Return type

str

get_vocab_size()[source]#

Get the tokenizer vocabulary size.

Returns

The tokenizer vocabulary size.

Return type

int

handle_metadata_files() None[source]#

Handle metadata files based on provided configuration.

initialize_gpt2tokenizer(processing_params: Dict[str, Any]) None[source]#

Initialize GPT-2 tokenizer.

Parameters

processing_params (Dict[str, Any]) – Dictionary of processing parameters.

initialize_huggingfacetokenizer(processing_params: Dict[str, Any]) None[source]#

Initialize Hugging Face tokenizer.

Parameters

processing_params (Dict[str, Any]) – Dictionary of processing parameters.

initialize_miscellaneous_attributes() None[source]#

Initialize miscellaneous attributes.

initialize_neoxtokenizer(processing_params: Dict[str, Any]) None[source]#

Initialize NeoX tokenizer.

Parameters

processing_params (Dict[str, Any]) – Dictionary of processing parameters.

initialize_tokenizer(processing_params: Dict[str, Any]) None[source]#

Initialize tokenizer based on the provided tokenizer_type parameter.

Parameters

processing_params (Dict[str, Any]) – Dictionary of processing parameters.
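
The dispatch implied by the tokenizer-specific initializers above can be pictured as in the sketch below. The accepted tokenizer_type strings are assumptions inferred from the initializer method names, not a confirmed list, and dispatch_tokenizer is a hypothetical stand-in for the internal logic.

    def dispatch_tokenizer(preprocessor, processing_params):
        # Assumed mapping from tokenizer_type to the initializers documented above.
        tokenizer_type = processing_params["tokenizer_type"].lower()
        if tokenizer_type == "gpt2tokenizer":
            preprocessor.initialize_gpt2tokenizer(processing_params)
        elif tokenizer_type == "neoxtokenizer":
            preprocessor.initialize_neoxtokenizer(processing_params)
        elif tokenizer_type == "huggingfacetokenizer":
            preprocessor.initialize_huggingfacetokenizer(processing_params)
        else:
            raise ValueError(f"Unsupported tokenizer_type: {tokenizer_type}")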

process_dataset() dict[source]#

Process the dataset either through file split or task split methods.
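
One plausible driver flow, pieced together from the methods documented on this page; the bundled preprocessing script may order these calls differently, and run is a hypothetical helper.

    def run(preprocessor) -> dict:
        preprocessor.process_params()           # tokenizer, output dir, process counts
        preprocessor.check_unused_params()      # warn about ignored config keys
        stats = preprocessor.process_dataset()  # file-split or task-split processing
        return stats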

process_dataset_params() None[source]#

Process dataset specific parameters.

process_files(file_paths, process_idx, checkpoint_args, progress_counter) int[source]#

Process the given files, tokenize the data chunks, and save to HDF5 format.

Parameters
  • file_paths (list) – List of file paths to process.

  • process_idx (int) – Index of the current process among all processes spawned for the file split.

Returns

The count of processed chunks.

Return type

int

process_params() None[source]#

Process parameters by calling various initialization methods.

process_processing_params() None[source]#

Process the processing parameters and initialize relevant class attributes.

process_setup_params() None[source]#

Set up the number of processes based on the provided configuration.

reader_process(checkpoint_args: Tuple) None[source]#

Reads data from input files and distributes it to the tokenizer queues.

Parameters

checkpoint_args (Tuple[int, int, int]) – File index, doc start index, and HDF5 index.

set_shuffle_seed()[source]#

Sets the shuffle seed for NumPy.

setup_output_directory() None[source]#

Set up the output directory based on provided configuration.

task_split_process_dataset() None[source]#

Split the dataset processing tasks across multiple processes.

tokenizer_process(idx: int) None[source]#

Tokenizes data and forwards the tokenized data to the writer queue.

Parameters
  • tokenizer_queue (Queue) – Queue containing chunks of data for tokenization.

  • idx (int) – Queue ID to forward tokenized chunks of data.

writer_process(progress_counter: Value[int]) None[source]#

Process that writes tokenized data to HDF5 format.

Parameters
  • writer_queue (Queue) – Queue from which tokenized chunks of data are taken for writing.

  • progress_counter (Value[int]) – Shared counter tracking number of processed chunks.
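
Taken together, reader_process, tokenizer_process, and writer_process form a reader → tokenizer → writer pipeline connected by queues. The generic, self-contained sketch below only illustrates that data flow with stand-in tokenization and a shared progress counter; the real class's queue wiring, checkpointing, and HDF5 writing are more involved.

    import multiprocessing as mp

    SENTINEL = None

    def reader(chunks, tok_queues):
        # Stand-in for reader_process: distribute raw chunks round-robin.
        for i, chunk in enumerate(chunks):
            tok_queues[i % len(tok_queues)].put(chunk)
        for q in tok_queues:
            q.put(SENTINEL)

    def tokenizer(tok_queue, writer_queue):
        # Stand-in for tokenizer_process: whitespace split instead of real tokenization.
        while True:
            chunk = tok_queue.get()
            if chunk is SENTINEL:
                break
            writer_queue.put(chunk.split())
        writer_queue.put(SENTINEL)

    def writer(writer_queue, n_tokenizers, progress_counter):
        # Stand-in for writer_process: count chunks instead of writing HDF5.
        finished = 0
        while finished < n_tokenizers:
            item = writer_queue.get()
            if item is SENTINEL:
                finished += 1
                continue
            with progress_counter.get_lock():
                progress_counter.value += 1

    if __name__ == "__main__":
        n_tok = 2
        tok_queues = [mp.Queue() for _ in range(n_tok)]
        writer_queue = mp.Queue()
        counter = mp.Value("i", 0)
        procs = [mp.Process(target=reader, args=(["a b", "c d", "e f"], tok_queues))]
        procs += [mp.Process(target=tokenizer, args=(q, writer_queue)) for q in tok_queues]
        procs += [mp.Process(target=writer, args=(writer_queue, n_tok, counter))]
        for p in procs:
            p.start()
        for p in procs:
            p.join()
        print("chunks written:", counter.value)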