modelzoo.transformers.data_processing.scripts.hdf5_preprocessing.utils
Functions
Add the common command-line arguments to each subcommand parser so that argparse can parse them for every subcommand. |
|
Collect statistics of the dataset. |
|
Given a list of token_ids, generate input sequence and labels. |
|
Given a list of prompt_ids and completion_ids, generate input sequence and labels. |
|
Write the input params to file. |
|
Write the outputs of execution. |
|
Get all files of given filetypes from input directory. |
|
Retrieve configuration parameters. |
|
Argparser definition for command line arguments from user. |
|
Get arguments for verifying the HDF5 dataset. Parameters: params (dict), a dictionary containing parameters for verifying the HDF5 dataset; data_processor, a class containing methods that specify how the dataset will be processed and written into HDF5 files. |
|
Process a dataset and write it into HDF5 format. |
|
Checkpoint reader for execution. |
|
Update config parameters with CLI arguments. |
|
Run sanity checks after the HDF5 files have been created. This function loads every generated .h5 file and checks: 1. The data type. 2. The shape of the dataset. 3. That the labels and inputs are as expected. |
|
Verify the generated HDF5 dataset. |
|
Detokenizer for wikitext. |
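The entry above described as "Given a list of token_ids, generate input sequence and labels" can be illustrated with a minimal sketch of next-token feature creation. This is not the modelzoo implementation: the helper name `make_lm_features`, its signature, and the choice to return a separate loss mask are assumptions made for illustration only. It assumes labels are the inputs shifted left by one token and that short sequences are padded with `pad_id`.

```python
from typing import List, Tuple


def make_lm_features(
    token_ids: List[int], max_seq_length: int, pad_id: int
) -> Tuple[List[int], List[int], List[int]]:
    """Hypothetical helper, not part of modelzoo.

    Builds (input_ids, labels, loss_mask) for next-token prediction,
    where labels[i] is the token that follows input_ids[i].
    """
    if len(token_ids) < 2:
        raise ValueError("need at least two tokens to form an input/label pair")

    # max_seq_length input/label pairs require max_seq_length + 1 tokens.
    window = token_ids[: max_seq_length + 1]
    input_ids = window[:-1]
    labels = window[1:]

    # Real positions contribute to the loss; padded positions do not.
    loss_mask = [1] * len(labels)

    # Pad all three lists to exactly max_seq_length.
    num_pad = max_seq_length - len(input_ids)
    input_ids = input_ids + [pad_id] * num_pad
    labels = labels + [pad_id] * num_pad
    loss_mask = loss_mask + [0] * num_pad
    return input_ids, labels, loss_mask


if __name__ == "__main__":
    print(make_lm_features([5, 6, 7, 8], max_seq_length=5, pad_id=0))
```

Returning the mask separately keeps padded positions distinguishable from real tokens even when `pad_id` is also a valid vocabulary id; whether the library does the same is not stated on this page.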
Classes
DatasetStats(num_sequences: int, num_tokens: int, detokenized_bytes: int, detokenized_chars: int, non_pad_tokens: int, loss_valid_tokens: int) |
|
VerificationArgs(processes: int, files_per_record: int, max_seq_length: int, tokenizer_obj: object, eos_id: int, pad_id: int) |