cerebras.modelzoo.data_preparation.data_preprocessing.vsl_finetuning_token_generator.VSLFinetuningTokenGenerator#

class cerebras.modelzoo.data_preparation.data_preprocessing.vsl_finetuning_token_generator.VSLFinetuningTokenGenerator(params, tokenizer, eos_id, pad_id)[source]#

Bases: cerebras.modelzoo.data_preparation.data_preprocessing.finetuning_token_generator.FinetuningTokenGenerator

Token generator for variable sequence length (VSL) fine-tuning. Extends FinetuningTokenGenerator with additional functionality for VSL: merging shorter tokenized sequences into a single sample within the maximum sequence length.

Initialize VSLFinetuningTokenGenerator with dataset parameters, tokenizer, and token IDs.

Methods

append_within_max_length

Optimize the representation of tokenized data by merging shorter sequences within the specified maximum sequence length.

clean_text

Clean the provided text.

encode

Tokenize and encode the document for text summarization.

get_data_ranges

Get data ranges for the conversation data.

get_data_stats

Get data statistics from the sample.

get_tokenized_semantic_regions

parse_semantic_data_array

process_chunks

Process chunks of tokenized text and return processed features along with the total padding added.

tokenize_data

Attributes

use_vsl

process_chunks(tokenized_data)[source]#

Process chunks of tokenized text and return processed features along with the total padding added.

Parameters

tokenized_data (List[List[tuple]]) – List of tokenized text chunks, where each chunk is represented as a list of (prompt, completion) tuples.

Returns

Tuple containing the list of processed results and the total number of padding tokens added.

Return type

Tuple[List[Any], int]
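The shape of this contract can be illustrated with a small self-contained sketch (not the library's actual implementation): flatten each chunk of (prompt, completion) tuples, pad it to a maximum sequence length with a pad token, and report the total padding added. `max_seq_length` and `pad_id` are illustrative parameters here.

```python
from typing import Any, List, Tuple


def process_chunks_sketch(
    tokenized_data: List[List[Tuple[list, list]]],
    max_seq_length: int,
    pad_id: int,
) -> Tuple[List[Any], int]:
    """Pad each chunk to max_seq_length and count total padding added."""
    results, total_padding = [], 0
    for chunk in tokenized_data:
        # Flatten the (prompt, completion) tuples into one token list.
        tokens = [t for prompt, completion in chunk for t in prompt + completion]
        pad_len = max_seq_length - len(tokens)
        total_padding += pad_len
        results.append(tokens + [pad_id] * pad_len)
    return results, total_padding
```

The returned pair mirrors the documented `Tuple[List[Any], int]` type: processed features first, padding count second.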

encode(semantic_data_array)[source]#

Tokenize and encode the document for text summarization.

Parameters

semantic_data_array (Union[List[Dict], Tuple]) – Data either as a (prompt, completion) tuple or as a multi-turn dialogue.

Returns

List of tokenized data and a stats dictionary

Return type

Tuple[List[tuple], Dict]

append_within_max_length(tokenized_data)[source]#

Optimize the representation of tokenized data by merging shorter sequences within the specified maximum sequence length.

Parameters

tokenized_data (List[List[tuple]]) – List of tokenized text data, where each inner list contains (prompt, completion) tuples.

Returns

Optimized list after merging shorter sequences.

Return type

List[List[tuple]]
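The merging idea can be sketched as a greedy first-fit packing pass (a minimal illustration, not the library's implementation): sort sequences longest-first, then place each one into the first merged list that still has room under `max_seq_length`, which is an assumed parameter here.

```python
from typing import List, Tuple


def merge_within_max_length(
    tokenized_data: List[List[Tuple[list, list]]],
    max_seq_length: int,
) -> List[List[Tuple[list, list]]]:
    """Greedily pack sequences so each merged list fits in max_seq_length tokens."""

    def num_tokens(seq: List[Tuple[list, list]]) -> int:
        # Total tokens across all (prompt, completion) tuples in one sequence.
        return sum(len(p) + len(c) for p, c in seq)

    bins: List[List[Tuple[list, list]]] = []
    sizes: List[int] = []
    # Longest-first ordering tends to pack tighter with first-fit.
    for seq in sorted(tokenized_data, key=num_tokens, reverse=True):
        n = num_tokens(seq)
        for i, used in enumerate(sizes):
            if used + n <= max_seq_length:
                bins[i].extend(seq)
                sizes[i] += n
                break
        else:
            bins.append(list(seq))
            sizes.append(n)
    return bins
```

Packing several short sequences into one sample is what lets VSL training waste fewer positions on padding.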

clean_text(data)#

Clean the provided text.

Parameters

data (str) – Text to clean.

Returns

Cleaned text.

Return type

str

get_data_ranges(semantic_regions, formatted_data)#

Get data ranges for the conversation data.

Parameters
  • semantic_regions (List[Dict[str, str]]) – List of semantic regions for the conversation data.

  • formatted_data (str) – Formatted conversation data.

Returns

Ranges for system, user, and assistant data.

Return type

Tuple[List[Tuple[int, int]], List[Tuple[int, int]], List[Tuple[int, int]]]
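A simplified sketch of the underlying idea, assuming turns are dictionaries with `role` and `content` keys (the real method operates on semantic regions, and these field names are illustrative): locate each turn's text inside the formatted string and record its (start, end) character range per role.

```python
from typing import Dict, List, Tuple


def get_role_ranges(
    turns: List[Dict[str, str]], formatted: str
) -> Dict[str, List[Tuple[int, int]]]:
    """Map each role to the (start, end) spans of its text in `formatted`."""
    ranges: Dict[str, List[Tuple[int, int]]] = {
        "system": [], "user": [], "assistant": []
    }
    cursor = 0
    for turn in turns:
        # Search forward from the last match so repeated text maps to later turns.
        start = formatted.index(turn["content"], cursor)
        end = start + len(turn["content"])
        ranges[turn["role"]].append((start, end))
        cursor = end
    return ranges
```

Ranges like these are what allow downstream masking (e.g. computing loss only on assistant spans).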

get_data_stats(sample)#

Get data statistics from the sample.

Parameters

sample (np.ndarray) – Tokenized sample.

Returns

Data statistics.

Return type

Dict[str, int]
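A hedged sketch of per-sample statistics in this style (the exact keys the library returns may differ; `pad_id` is an illustrative parameter): count total, padding, and non-padding tokens in a tokenized NumPy sample.

```python
import numpy as np


def sample_stats(sample: np.ndarray, pad_id: int) -> dict:
    """Return basic token counts for a tokenized sample."""
    flat = sample.ravel()
    num_pad = int(np.sum(flat == pad_id))
    return {
        "num_tokens": int(flat.size),
        "num_pad_tokens": num_pad,
        "non_pad_tokens": int(flat.size) - num_pad,
    }
```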