cerebras.modelzoo.data_preparation.data_preprocessing.vsl_finetuning_token_generator.VSLFinetuningTokenGenerator#
- class cerebras.modelzoo.data_preparation.data_preprocessing.vsl_finetuning_token_generator.VSLFinetuningTokenGenerator(params, tokenizer, eos_id, pad_id)[source]#
-
Token generator for variable sequence length (VSL) fine-tuning. Extends FinetuningTokenGenerator with the additional packing functionality needed for VSL.
Initialize VSLFinetuningTokenGenerator with dataset parameters, tokenizer, and token IDs.
Methods
Optimize representation of tokenized data by merging shorter sequences within the specified maximum sequence length.
Clean the provided text.
Tokenize and encode the document for fine-tuning.
Get data ranges for the conversation data.
Get data statistics from the sample.
get_tokenized_semantic_regions
parse_semantic_data_array
Process chunks of tokenized text and return processed features along with the total padding added.
tokenize_data
Attributes
use_vsl
- process_chunks(tokenized_data)[source]#
Process chunks of tokenized text and return processed features along with the total padding added.
- Parameters
tokenized_data (List[List[tuple]]) – List of tokenized text chunks, where each chunk is represented as a list of (prompt, completion) tuples.
- Returns
Tuple containing a list of processed results and the total number of padding tokens added.
- Return type
Tuple[List[Any], int]
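The behavior described above can be sketched in plain Python: each chunk of merged token sequences is padded up to the maximum sequence length, and the padding added across all chunks is totaled. The function name, the flat `List[int]` chunk representation, and the keyword arguments here are illustrative assumptions, not the library's actual implementation.

```python
from typing import List, Tuple


def process_chunks_sketch(
    chunks: List[List[int]], max_seq_length: int, pad_id: int
) -> Tuple[List[List[int]], int]:
    """Pad each chunk to max_seq_length with pad_id; count padding added."""
    results: List[List[int]] = []
    total_padding = 0
    for tokens in chunks:
        pad_needed = max_seq_length - len(tokens)
        total_padding += pad_needed
        results.append(tokens + [pad_id] * pad_needed)
    return results, total_padding


padded, n_pad = process_chunks_sketch([[1, 2, 3], [4, 5]], max_seq_length=4, pad_id=0)
# padded -> [[1, 2, 3, 0], [4, 5, 0, 0]], n_pad -> 3
```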
- encode(semantic_data_array)[source]#
Tokenize and encode the document for fine-tuning.
- Parameters
semantic_data_array (Union[List[Dict], Tuple]) – Data provided either as a (prompt, completion) tuple or as a multi-turn dialogue.
- Returns
List of tokenized data and a stats dictionary
- Return type
Tuple[List[tuple], Dict]
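To make the return shape concrete, here is a toy sketch of encoding a (prompt, completion) pair into token IDs plus a statistics dictionary. The whitespace "tokenizer", the stat names, and the function name are hypothetical; the real method uses the configured tokenizer and the library's own counters.

```python
from typing import Dict, List, Tuple


def encode_sketch(
    prompt: str, completion: str, eos_id: int = 2
) -> Tuple[List[Tuple[List[int], List[int]]], Dict[str, int]]:
    """Toy encoder: whitespace-split text into IDs, append EOS, report stats."""
    vocab: Dict[str, int] = {}

    def tok(text: str) -> List[int]:
        # Assign IDs on first sight; IDs 0-2 reserved for special tokens.
        return [vocab.setdefault(w, len(vocab) + 3) for w in text.split()]

    prompt_ids = tok(prompt)
    completion_ids = tok(completion) + [eos_id]
    stats = {
        "raw_chars_count": len(prompt) + len(completion),
        "num_tokens": len(prompt_ids) + len(completion_ids),
    }
    return [(prompt_ids, completion_ids)], stats


pairs, stats = encode_sketch("summarize this", "short summary")
# pairs holds one (prompt_ids, completion_ids) tuple; stats["num_tokens"] -> 5
```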
- append_within_max_length(tokenized_data)[source]#
Optimize representation of tokenized data by merging shorter sequences within the specified maximum sequence length.
- Parameters
tokenized_data (List[List[tuple]]) – List of tokenized text data, where each inner list contains (prompt, completion) tuples.
- Returns
Optimized list after merging shorter sequences.
- Return type
List[List[tuple]]
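The merging idea can be sketched as a greedy packing pass: sort examples longest-first, then place each into the first merged sequence it still fits in. This first-fit-decreasing strategy and the function name are assumptions for illustration; the library may pack differently.

```python
from typing import List, Tuple

Example = List[Tuple[List[int], List[int]]]  # list of (prompt, completion) pairs


def merge_within_max_length(
    tokenized_data: List[Example], max_seq_length: int
) -> List[Example]:
    """Greedily pack short examples together without exceeding max_seq_length."""

    def seq_len(example: Example) -> int:
        return sum(len(p) + len(c) for p, c in example)

    bins: List[Example] = []
    for example in sorted(tokenized_data, key=seq_len, reverse=True):
        for b in bins:
            if seq_len(b) + seq_len(example) <= max_seq_length:
                b.extend(example)  # merge into an existing sequence
                break
        else:
            bins.append(list(example))  # start a new sequence
    return bins


data = [[([1, 2], [3])], [([4], [5])], [([6], [7])]]  # lengths 3, 2, 2
merged = merge_within_max_length(data, max_seq_length=5)
# -> two merged sequences instead of three separate ones
```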
- clean_text(data)#
Clean the provided text.
- Parameters
data (str) – Text to clean.
- Returns
Cleaned text.
- Return type
str
- get_data_ranges(semantic_regions, formatted_data)#
Get data ranges for the conversation data.
- Parameters
semantic_regions (List[Dict[str, str]]) – Semantic regions of the conversation data.
formatted_data (str) – Formatted conversation data.
- Returns
Ranges for system, user, and assistant data.
- Return type
Tuple[List[Tuple[int, int]], List[Tuple[int, int]], List[Tuple[int, int]]]
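A minimal sketch of range computation, assuming each turn's content appears verbatim in the formatted string: scan forward with `str.find` and record the (start, end) character span per role. The function name, the dict return shape, and the verbatim-content assumption are illustrative; the real method operates on semantic regions produced by the chat template.

```python
from typing import Dict, List, Tuple


def get_data_ranges_sketch(
    conversation_data: List[Dict[str, str]], formatted_data: str
) -> Dict[str, List[Tuple[int, int]]]:
    """Locate each turn's character span inside the formatted conversation."""
    ranges: Dict[str, List[Tuple[int, int]]] = {
        "system": [], "user": [], "assistant": []
    }
    cursor = 0
    for turn in conversation_data:
        start = formatted_data.find(turn["content"], cursor)
        if start == -1:
            continue  # content not found verbatim (e.g. template rewrote it)
        end = start + len(turn["content"])
        ranges[turn["role"]].append((start, end))
        cursor = end  # keep searching forward, never backward
    return ranges
```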
- get_data_stats(sample)#
Get data statistics from the sample.
- Parameters
sample (np.ndarray) – Tokenized sample.
- Returns
Data statistics.
- Return type
Dict[str, int]
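As a rough illustration of deriving statistics from a tokenized sample, the sketch below counts total, padding, and non-padding tokens in an `np.ndarray`. The stat names and function name are assumptions; the real method reports the library's own counters.

```python
from typing import Dict

import numpy as np


def get_data_stats_sketch(sample: np.ndarray, pad_id: int) -> Dict[str, int]:
    """Count total, pad, and non-pad tokens in a tokenized sample."""
    num_pad = int(np.sum(sample == pad_id))
    return {
        "num_tokens": int(sample.size),
        "num_pad_tokens": num_pad,
        "non_pad_tokens": int(sample.size) - num_pad,
    }


stats = get_data_stats_sketch(np.array([1, 2, 0, 0]), pad_id=0)
# -> {"num_tokens": 4, "num_pad_tokens": 2, "non_pad_tokens": 2}
```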