modelzoo.transformers.data_processing.scripts.chunk_preprocessing.chunk_data_preprocessor.ChunkDataPreprocessor#
- class modelzoo.transformers.data_processing.scripts.chunk_preprocessing.chunk_data_preprocessor.ChunkDataPreprocessor[source]#
Bases:
object
Initialize the class with given parameters and logger.
- Parameters
params (dict) – Configuration parameters.
logger (Logger) – Logging interface.
Methods
Add a token to the tokenizer.
Calculate the total number of chunks based on the given total size and the predefined max chunk size.
Calculate the total size of all input files, taking compression factors into consideration.
Check for any unused parameters and log them as warnings.
Process the dataset by splitting files across multiple processes.
file_split_read_checkpoint
Retrieve the output directory path.
Retrieve the path to the JSON parameters file.
Get the tokenizer vocabulary size.
Handle metadata files based on provided configuration.
Initialize GPT-2 tokenizer.
Initialize Hugging Face tokenizer.
Initialize miscellaneous attributes.
Initialize Neox tokenizer.
Initialize tokenizer based on the provided tokenizer_type parameter.
Process the dataset either through file split or task split methods.
Process dataset-specific parameters.
Process the given files, tokenize the data chunks, and save to HDF5 format.
Process parameters by calling various initialization methods.
Process the processing parameters and initialize relevant class attributes.
Set up the number of processes based on provided configuration.
Reads data from input files and distributes them to the tokenizer queues.
Sets the shuffle seed for NumPy.
Set up the output directory based on provided configuration.
Split the dataset processing tasks across multiple processes.
task_split_read_checkpoint
Tokenizes data and forwards the tokenized data to the writer queue.
verify_hdf5_files
write_remaining_prefix
Process that writes tokenized data to HDF5 format.
- __init__(params, logger)[source]#
Initialize the class with given parameters and logger.
- Parameters
params (dict) – Configuration parameters.
logger (Logger) – Logging interface.
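A minimal construction sketch follows; the nested key names inside params (setup, processing, and their fields) are illustrative assumptions, not the documented schema, and must match your actual preprocessing config.
```python
import logging

# Illustrative params layout; actual keys must match your preprocessing config.
params = {
    "setup": {"input_dir": "./raw_data", "output_dir": "./hdf5_out", "processes": 4},
    "processing": {"tokenizer_type": "GPT2Tokenizer", "max_seq_length": 2048},
}
logger = logging.getLogger("chunk_preprocessing")
preprocessor = ChunkDataPreprocessor(params, logger)
```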
- add_token(token)[source]#
Add a token to the tokenizer.
- Parameters
token (str) – Token to be added to the tokenizer.
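For instance, a special token can be registered before processing begins (the token value here is hypothetical):
```python
preprocessor.add_token("<pad>")  # "<pad>" is an illustrative special token
```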
- calculate_total_chunks(total_size: int) → int [source]#
Calculate the total number of chunks based on the given total size and the predefined max chunk size.
- Parameters
total_size (int) – The total size of the data in bytes.
- Returns
Total number of chunks.
- Return type
int
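The description implies a ceiling division of the total size by the predefined maximum chunk size; a standalone sketch of that arithmetic (passing max_chunk_size explicitly is an assumption, since the class holds it internally):
```python
import math

def total_chunks(total_size: int, max_chunk_size: int) -> int:
    # One chunk per max_chunk_size bytes, rounded up so a partial
    # tail still occupies its own chunk.
    return math.ceil(total_size / max_chunk_size)

assert total_chunks(10_000, 4_096) == 3
```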
- calculate_total_size() → int [source]#
Calculate the total size of all input files, taking compression factors into consideration.
- Returns
The total size of all input files in bytes.
- Return type
int
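A sketch of how compression-aware size accounting could look; the extension-to-factor mapping and its values are assumptions for illustration only:
```python
import os

# Assumed expansion ratios for compressed inputs (illustrative values only).
COMPRESSION_FACTORS = {".gz": 3.0, ".zst": 4.0}

def total_input_size(file_paths):
    total = 0
    for path in file_paths:
        ext = os.path.splitext(path)[1]
        # Scale the on-disk size by the expected decompression ratio.
        total += int(os.path.getsize(path) * COMPRESSION_FACTORS.get(ext, 1.0))
    return total
```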
- file_split_process_dataset() → None [source]#
Process the dataset by splitting files across multiple processes.
- get_output_dir() → str [source]#
Retrieve the output directory path.
- Returns
Path to the output directory.
- Return type
str
- get_params_file() → str [source]#
Retrieve the path to the JSON parameters file.
- Returns
Path to the JSON parameters file.
- Return type
str
- get_vocab_size()[source]#
Get the tokenizer vocabulary size.
- Returns
The vocabulary size of the tokenizer.
- Return type
int
- initialize_gpt2tokenizer(processing_params: Dict[str, Any]) → None [source]#
Initialize GPT-2 tokenizer.
- Parameters
processing_params (Dict[str, Any]) – Dictionary of processing parameters.
- initialize_huggingfacetokenizer(processing_params: Dict[str, Any]) → None [source]#
Initialize Hugging Face tokenizer.
- Parameters
processing_params (Dict[str, Any]) – Dictionary of processing parameters.
- initialize_neoxtokenizer(processing_params: Dict[str, Any]) → None [source]#
Initialize Neox tokenizer.
- Parameters
processing_params (Dict[str, Any]) – Dictionary of processing parameters.
- initialize_tokenizer(processing_params: Dict[str, Any]) → None [source]#
Initialize tokenizer based on the provided tokenizer_type parameter.
- Parameters
processing_params (Dict[str, Any]) – Dictionary of processing parameters.
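Given the three initializers documented above, the dispatch plausibly resembles the sketch below; the exact tokenizer_type strings accepted are assumptions:
```python
def initialize_tokenizer(self, processing_params):
    # Dispatch to the matching initializer; string values are assumed.
    tokenizer_type = processing_params["tokenizer_type"].lower()
    if tokenizer_type == "gpt2tokenizer":
        self.initialize_gpt2tokenizer(processing_params)
    elif tokenizer_type == "neoxtokenizer":
        self.initialize_neoxtokenizer(processing_params)
    elif tokenizer_type == "huggingfacetokenizer":
        self.initialize_huggingfacetokenizer(processing_params)
    else:
        raise ValueError(f"Unsupported tokenizer_type: {tokenizer_type}")
```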
- process_dataset() → dict [source]#
Process the dataset either through file split or task split methods.
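This is the main entry point; a usage sketch follows (the keys of the returned dict are not documented here, so treating it as summary statistics is an assumption):
```python
preprocessor = ChunkDataPreprocessor(params, logger)
stats = preprocessor.process_dataset()  # dispatches to file split or task split
logger.info("Preprocessing finished: %s", stats)
```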
- process_files(file_paths, process_idx, checkpoint_args, progress_counter) → int [source]#
Process the given files, tokenize the data chunks, and save to HDF5 format.
- Parameters
file_paths – List of file paths to process.
process_idx – Index of the current process among all processes spawned for the file split.
- Returns
The count of processed chunks.
- Return type
int
- process_processing_params() → None [source]#
Process the processing parameters and initialize relevant class attributes.
- process_setup_params() → None [source]#
Set up the number of processes based on provided configuration.
- reader_process(checkpoint_args: Tuple) → None [source]#
Reads data from input files and distributes them to the tokenizer queues.
- Parameters
checkpoint_args (Tuple[int, int, int]) – File index, doc start index, and hdf5 index.
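The checkpoint tuple allows an interrupted run to resume mid-dataset; for a fresh run the indices would presumably all start at zero:
```python
checkpoint_args = (0, 0, 0)  # (file index, doc start index, hdf5 index)
```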
- setup_output_directory() → None [source]#
Set up the output directory based on provided configuration.
- task_split_process_dataset() → None [source]#
Split the dataset processing tasks across multiple processes.