cerebras.modelzoo.data_preparation.data_preprocessing.utils#

Functions

append_eos_to_multiple_semantic_regions

check_fim_special_tokens

chunk

Since we do character-level FIM, we need to detokenize, determine the split boundaries, and re-tokenize after splitting.
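A minimal sketch of the character-level split, assuming a NumPy `Generator` for randomness and an HF-style tokenizer with `decode`/`encode`; the helper name and signature here are illustrative, not the module's actual API:

```python
def char_level_split(text, rng):
    """Cut text into prefix/middle/suffix at two random character boundaries."""
    lo, hi = sorted(rng.integers(low=0, high=len(text) + 1, size=2))
    return text[:lo], text[lo:hi], text[hi:]

# Usage (HF-style tokenizer assumed):
#   import numpy as np
#   text = tokenizer.decode(input_ids)
#   prefix, middle, suffix = char_level_split(text, np.random.default_rng(0))
#   prefix_ids = tokenizer.encode(prefix)
```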

create_features_auto_lm_vsl

Given a list of VSL sequences, generate input features and labels.
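A minimal sketch of the idea, assuming each VSL sequence is already tokenized; the real function also emits VSL-specific fields (e.g. attention spans), which are omitted here, and all names are illustrative:

```python
import numpy as np

def vsl_features(vsl_sequences, max_seq_len, pad_id):
    # Pack the tokenized VSL sequences into one sample and build
    # shifted next-token labels; right-pad up to the fixed length.
    tokens = [tok for seq in vsl_sequences for tok in seq]
    input_ids = tokens[:-1][:max_seq_len]
    labels = tokens[1:][:max_seq_len]
    mask = [1] * len(input_ids) + [0] * (max_seq_len - len(input_ids))
    input_ids += [pad_id] * (max_seq_len - len(input_ids))
    labels += [pad_id] * (max_seq_len - len(labels))
    return np.stack([input_ids, mask, labels])
```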

dump_args

Write the input params to file.

dump_result

Write the outputs of execution to file.

fim

Takes in an array of input_ids, mask, and labels, and performs the FIM operation, rearranging them into PSM or SPM format with some probability.

find_token_range

format_fim

Takes in a list of prefix/middle/suffix token lists, along with the respective FIM (or AR) formats.
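The PSM and SPM layouts below follow the standard fill-in-the-middle formats; the sentinel-token handling and all names are assumptions for illustration, not the function's actual signature:

```python
def build_fim_sequence(prefix, middle, suffix, mode, pre_id, mid_id, suf_id):
    # Arrange token lists with FIM sentinel ids (the sentinel ids would
    # come from the tokenizer's special tokens).
    if mode == "psm":
        # <PRE> prefix <SUF> suffix <MID> middle
        return [pre_id] + prefix + [suf_id] + suffix + [mid_id] + middle
    if mode == "spm":
        # One common SPM variant: <PRE> <SUF> suffix <MID> prefix middle
        return [pre_id, suf_id] + suffix + [mid_id] + prefix + middle
    return prefix + middle + suffix  # AR (plain autoregressive) fallback
```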

get_files

Get all files of the given filetypes from the input directory.
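A hypothetical sketch of recursive file discovery by extension; the helper name and return type are assumptions:

```python
from pathlib import Path

def find_files(input_dir, filetypes):
    # Recursively collect files whose suffix matches one of filetypes,
    # e.g. {".jsonl", ".txt"}.
    return sorted(
        str(p) for p in Path(input_dir).rglob("*")
        if p.is_file() and p.suffix in set(filetypes)
    )
```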

get_params

Retrieve configuration parameters.

get_parser

Argument parser definition for command-line arguments from the user.

get_size

Recursively finds the size of objects.
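A common recipe for this, sketched below; the module's version may differ in detail:

```python
import sys

def deep_size(obj, seen=None):
    # Recursively sum sys.getsizeof over containers, tracking ids to
    # avoid double counting shared or cyclic references.
    seen = set() if seen is None else seen
    if id(obj) in seen:
        return 0
    seen.add(id(obj))
    size = sys.getsizeof(obj)
    if isinstance(obj, dict):
        size += sum(deep_size(k, seen) + deep_size(v, seen)
                    for k, v in obj.items())
    elif isinstance(obj, (list, tuple, set, frozenset)):
        size += sum(deep_size(item, seen) for item in obj)
    return size
```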

get_tokenizer_vocab

handle_bos_token_default

When performing FIM, we tokenize each chunk again after splitting.
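One plausible reading, sketched under the assumption (inferred from the function name, not confirmed by the source) that a tokenizer which adds BOS by default would leave a spurious BOS on every re-tokenized chunk:

```python
def strip_default_bos(chunk_ids, bos_id, tokenizer_adds_bos):
    # If the tokenizer prepends BOS on every encode() call, re-tokenized
    # FIM chunks would each carry a spurious leading BOS; drop it.
    if tokenizer_adds_bos and chunk_ids and chunk_ids[0] == bos_id:
        return chunk_ids[1:]
    return chunk_ids
```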

has_valid_extension

listdir_or_file

pad_helper

Helper for padding.
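For illustration, a padding helper might look like the following; which fields and which side the real helper pads is not specified here:

```python
def pad_to_length(token_ids, max_seq_len, pad_id, side="right"):
    # Pad a token list up to max_seq_len with pad_id.
    padding = [pad_id] * (max_seq_len - len(token_ids))
    return token_ids + padding if side == "right" else padding + token_ids
```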

set_defaults

split_text_and_tokenize

Split the text into smaller sequences of length max_tok_len, then tokenize each of the smaller sequences.
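A rough sketch, assuming an HF-style tokenizer with `.encode()`; re-encoding the growing chunk per word is inefficient and purely illustrative:

```python
def split_and_tokenize(text, tokenizer, max_tok_len):
    # Grow a chunk word by word until it reaches max_tok_len tokens,
    # then start a new one; finally tokenize each chunk separately.
    chunks, current = [], []
    for word in text.split():
        current.append(word)
        if len(tokenizer.encode(" ".join(current))) >= max_tok_len:
            chunks.append(" ".join(current))
            current = []
    if current:
        chunks.append(" ".join(current))
    return [tokenizer.encode(chunk) for chunk in chunks]
```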

truncate_helper

The goal of our truncation scheme is to avoid removing tokens from the middle section.
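Under such a scheme, excess tokens would come off the outer ends of the prefix and suffix; how the trim is split between the two below is an assumption, not the module's documented behavior:

```python
def truncate_outer_ends(prefix, middle, suffix, max_len):
    # Drop tokens from the start of the prefix and the end of the
    # suffix, keeping the middle section intact.
    excess = len(prefix) + len(middle) + len(suffix) - max_len
    if excess <= 0:
        return prefix, middle, suffix
    trim_prefix = min(excess // 2 + excess % 2, len(prefix))
    trim_suffix = min(excess - trim_prefix, len(suffix))
    return prefix[trim_prefix:], middle, suffix[:len(suffix) - trim_suffix]
```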

truncate_or_pad_helper

Since we perform FIM at the character level, the split boundaries may fall in the middle of a word.

update_params

Update config parameters with CLI arguments.
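A minimal sketch of such an overlay, assuming CLI values take precedence only when explicitly set; the actual precedence rules may differ:

```python
def merge_cli_args(params, cli_args):
    # Overlay non-None CLI values (an argparse.Namespace) onto the
    # loaded config dict.
    for key, value in vars(cli_args).items():
        if value is not None:
            params[key] = value
    return params
```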

validate_tokens

wikitext_detokenizer

Detokenizer for wikitext.
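This resembles the WikiText detokenizer popularized by Megatron-LM's evaluation code; an abbreviated sketch (the module's version may differ):

```python
import re

def detokenize_wikitext(string):
    # Undo WikiText's tokenization artifacts: "@-@"-style number
    # separators, spaced punctuation, and padded brackets/quotes.
    string = string.replace("s '", "s'")          # contractions
    string = string.replace(" @-@ ", "-")         # number separators
    string = string.replace(" @,@ ", ",")
    string = string.replace(" @.@ ", ".")
    for punct in [":", ";", ".", "!", "?", ","]:  # spaced punctuation
        string = string.replace(f" {punct} ", f"{punct} ")
    string = re.sub(r"\(\s*([^\)]*?)\s*\)", r"(\1)", string)   # brackets
    string = re.sub(r"\"\s*([^\"]*?)\s*\"", r'"\1"', string)   # quotes
    string = string.replace("= = =", "===").replace("= =", "==")
    return string.strip()
```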