modelzoo.transformers.data_processing.scripts.hdf5_preprocessing.utils

Functions

add_common_args

So that argparse can parse arguments for subcommands, this function adds the common command-line arguments to each subcommand parser.
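The pattern described here can be sketched with the standard `argparse` module alone. The subcommand names (`LMData`, `FIM`) and the flags (`--input_dir`, `--output_dir`) below are hypothetical and chosen only for illustration; the real function's argument list may differ.

```python
import argparse


def add_common_args_sketch(parser: argparse.ArgumentParser) -> None:
    """Attach arguments shared by every subcommand (hypothetical flags)."""
    parser.add_argument("--input_dir", type=str, required=True)
    parser.add_argument("--output_dir", type=str, default="./out")


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="illustrative preprocessing CLI")
    subparsers = parser.add_subparsers(dest="mode", required=True)
    for name in ("LMData", "FIM"):  # hypothetical subcommand names
        sub = subparsers.add_parser(name)
        # Each subparser has its own namespace of options, so the shared
        # arguments must be registered on every one of them.
        add_common_args_sketch(sub)
    return parser


args = build_parser().parse_args(["FIM", "--input_dir", "data/"])
```

Registering the shared flags through one helper keeps the subcommands consistent: a new common flag is added in a single place.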

add_lm_args

The language-modeling format is common enough (FIM is very similar) that we can reuse its arguments.

check_fim_special_tokens

chunk

Since we do character-level FIM we need to detokenize, determine boundaries to split, and re-tokenize after splitting.

collect_stats

Collect statistics of the dataset.

create_features_auto_lm

Given a list of token_ids, generate input sequence and labels.
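A minimal sketch of one common way to build next-token-prediction features from a token list: shift by one position, then pad to a fixed length. The function name, the `pad_id` parameter, and the returned layout are assumptions made for illustration and are not taken from the real implementation.

```python
from typing import List, Tuple


def auto_lm_features_sketch(
    token_ids: List[int], max_seq_length: int, pad_id: int
) -> Tuple[List[int], List[int], List[int]]:
    """Return (input_ids, attention_mask, labels), each of length max_seq_length.

    Expects len(token_ids) <= max_seq_length + 1.
    """
    input_ids = token_ids[:-1]  # model input: every token except the last
    labels = token_ids[1:]      # target: the next token at each position
    num_pad = max_seq_length - len(input_ids)
    if num_pad < 0:
        raise ValueError("sequence longer than max_seq_length + 1")
    mask = [1] * len(input_ids) + [0] * num_pad  # 0 marks padded positions
    input_ids = input_ids + [pad_id] * num_pad
    labels = labels + [pad_id] * num_pad
    return input_ids, mask, labels


features = auto_lm_features_sketch([5, 6, 7, 8], max_seq_length=5, pad_id=0)
```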

create_features_auto_lm_vsl

Given a list of VSL sequences, generate input features and labels.

create_features_summarization

Given a list of prompt_ids and completion_ids, generate input sequence and labels.

create_features_summarization_vsl

Given a list of VSL sequences, generate input features and labels.

dump_args

Write the input params to file.

dump_result

Write the outputs of execution.

fim

Takes an array of input_ids, mask, and labels and, with some probability, performs the FIM operation to rearrange the sequence into PSM or SPM format.
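The rearrangement can be illustrated with a simplified token-level sketch. It is not the library's implementation: it works on a plain list rather than on input_ids, mask, and labels arrays, it splits at token boundaries instead of character boundaries, and the sentinel ids, the `fim_rate` and `spm_rate` parameters, and the SPM ordering shown (one of several variants in use) are assumptions.

```python
import random
from typing import List

PRE, SUF, MID = -1, -2, -3  # hypothetical sentinel token ids


def fim_sketch(
    tokens: List[int], fim_rate: float, spm_rate: float, rng: random.Random
) -> List[int]:
    """With probability fim_rate, rearrange tokens into PSM or SPM order."""
    tokens = list(tokens)
    if rng.random() >= fim_rate:
        return tokens  # leave as an ordinary autoregressive sample
    # Pick two cut points to split into prefix / middle / suffix.
    lo, hi = sorted(rng.randint(0, len(tokens)) for _ in range(2))
    prefix, middle, suffix = tokens[:lo], tokens[lo:hi], tokens[hi:]
    if rng.random() < spm_rate:
        # SPM: suffix first, then prefix, then middle (one common variant).
        return [SUF] + suffix + [PRE] + prefix + [MID] + middle
    # PSM: prefix, suffix, then the middle to be predicted.
    return [PRE] + prefix + [SUF] + suffix + [MID] + middle


rng = random.Random(0)
sample = [10, 11, 12, 13, 14]
psm = fim_sketch(sample, fim_rate=1.0, spm_rate=0.0, rng=rng)
```

In both orderings the middle section comes last, so the model learns to generate it conditioned on the surrounding context.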

format_fim

Takes a list of prefix/middle/suffix token lists, along with their respective FIM (or AR) formats.

get_files

Get all files of given filetypes from input directory.
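A sketch of this kind of file discovery using `pathlib`. The recursive search, the sorted output, and the example extensions are assumptions; the real function may walk the directory differently or accept other arguments.

```python
import tempfile
from pathlib import Path
from typing import Iterable, List


def get_files_sketch(input_dir: str, filetypes: Iterable[str]) -> List[str]:
    """Return sorted paths under input_dir whose names end with a given suffix."""
    suffixes = tuple(filetypes)
    return sorted(
        str(p)
        for p in Path(input_dir).rglob("*")
        # endswith (rather than Path.suffix) also matches compound
        # extensions such as ".jsonl.gz".
        if p.is_file() and p.name.endswith(suffixes)
    )


# Small demonstration in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("a.jsonl", "b.txt", "c.csv"):
        Path(tmp, name).write_text("")
    found = [Path(f).name for f in get_files_sketch(tmp, [".jsonl", ".txt"])]
```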

get_params

Retrieve configuration parameters.

get_parser

Argparser definition for command line arguments from user.

get_tokenizer_vocab

get_verification_args

Get arguments for verifying an HDF5 dataset.

Parameters:
params (dict): Dictionary containing parameters for verifying the HDF5 dataset.
data_processor: Class containing methods that specify how the dataset will be processed and written into HDF5 files.

handle_bos_token_default

When performing FIM, we tokenize each chunk again after splitting.

handle_jsonl

has_valid_extension

listdir_or_file

pad_helper

Helper for padding.

process_dataset

Process a dataset and write it into HDF5 format.

read_checkpoint

Checkpoint reader for execution.

set_defaults

split_text_and_tokenize

Function to split the text into smaller sequences of length max_tok_len and then tokenize each of the smaller sequences.

truncate_helper

The goal of our truncation scheme is to avoid removing tokens from the middle section.
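One way to honor that goal is to drop tokens from the end of the suffix first, then from the start of the prefix, and to touch the middle only when nothing else is left. The sketch below assumes that order and a simple three-list layout; the real helper's signature and bookkeeping may differ.

```python
from typing import List, Tuple


def truncate_sketch(
    prefix: List[int], middle: List[int], suffix: List[int], diff: int
) -> Tuple[List[int], List[int], List[int]]:
    """Remove `diff` tokens while protecting the middle for as long as possible."""
    # 1. Trim from the end of the suffix.
    cut = min(diff, len(suffix))
    suffix = suffix[: len(suffix) - cut]
    diff -= cut
    # 2. Trim from the start of the prefix.
    cut = min(diff, len(prefix))
    prefix = prefix[cut:]
    diff -= cut
    # 3. Only as a last resort, trim the end of the middle.
    if diff > 0:
        middle = middle[: max(len(middle) - diff, 0)]
    return prefix, middle, suffix


result = truncate_sketch([1, 2, 3], [4, 5], [6, 7], diff=3)
```

Here the two suffix tokens and one prefix token are removed, and the middle section is left intact.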

truncate_or_pad_helper

Since we perform FIM at character-level, we potentially split characters in the middle of a word.

update_params

Update config parameters with CLI arguments.

validate_tokens

verify_saved_hdf5_files

Performs sanity checks after the HDF5 files have been created. The function loads every generated .h5 file and checks: (1) the data type, (2) the shape of the dataset, and (3) that the labels and inputs are as expected.

verify_saved_hdf5_files_mp

Verify the generated HDF5 dataset.

wikitext_detokenizer

Detokenizer for wikitext.

Classes

DatasetStats

DatasetStats(num_sequences: int, num_tokens: int, detokenized_bytes: int, detokenized_chars: int, non_pad_tokens: int, loss_valid_tokens: int)
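The signature above has the shape of a generated dataclass constructor, so a container with these fields could be sketched as follows. The class name, the sample values, and the reading of `num_tokens` as a count that includes padding are assumptions.

```python
from dataclasses import asdict, dataclass


@dataclass
class DatasetStatsSketch:
    """Illustrative stand-in mirroring the field list shown above."""

    num_sequences: int
    num_tokens: int
    detokenized_bytes: int
    detokenized_chars: int
    non_pad_tokens: int
    loss_valid_tokens: int


stats = DatasetStatsSketch(
    num_sequences=2,
    num_tokens=16,
    detokenized_bytes=60,
    detokenized_chars=58,
    non_pad_tokens=12,
    loss_valid_tokens=10,
)

# Derived metric, assuming num_tokens counts padded positions as well.
pad_fraction = 1 - stats.non_pad_tokens / stats.num_tokens
stats_dict = asdict(stats)  # plain dict, convenient for writing to JSON/YAML
```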

Reader

VerificationArgs

VerificationArgs(processes: int, files_per_record: int, max_seq_length: int, tokenizer_obj: object, eos_id: int, pad_id: int, use_vsl: bool)