cerebras.modelzoo.data_preparation.nlp.chunk_data_processing.summarization_vsl_data_token_generator.VSLSummarizationTokenGenerator#
- class cerebras.modelzoo.data_preparation.nlp.chunk_data_processing.summarization_vsl_data_token_generator.VSLSummarizationTokenGenerator[source]#
Token generator for variable-length sequence summarization (VSLS). Extends SummarizationTokenGenerator with additional functionality for VSLS.
Initialize VSLSummarizationTokenGenerator with dataset parameters, tokenizer, and token IDs.
Methods

append_within_max_length
Optimize representation of tokenized data by merging shorter sequences within the specified maximum sequence length.

check_valid_doc

encode
Tokenize and encode the document for text summarization.

get_token_id
Get the token ID for the given token.

prepend_prefix
Prepends prefixes to prompt ids and completion ids and manages the beginning of sentence (BOS) tokens.

process_chunks
Process chunks of tokenized text and return processed features along with the total padding added.

tokenize_prefix_tags

tokenize_text
Tokenize the provided text.

Attributes

use_vsl
- __init__(params, tokenizer, eos_id, pad_id)[source]#
Initialize VSLSummarizationTokenGenerator with dataset parameters, tokenizer, and token IDs.
- process_chunks(tokenized_data: List[List[tuple]]) Tuple[List[Any], int] [source]#
Process chunks of tokenized text and return processed features along with the total padding added.
- Parameters
tokenized_data (List[List[tuple]]) – List of tokenized text chunks, where each chunk is represented as a list of (prompt, completion) tuples.
- Returns
Tuple containing a list of processed results and the total number of padding tokens added.
- Return type
Tuple[List[Any], int]
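The padding bookkeeping can be illustrated with a simplified stand-in that flattens each packed chunk, pads it to a maximum sequence length, and totals the padding added. The flat feature layout, `max_seq_length`, and `pad_id` defaults here are illustrative assumptions, not the actual implementation:

```python
from typing import Any, List, Tuple


def process_chunks(
    tokenized_data: List[List[Tuple[List[int], List[int]]]],
    max_seq_length: int,
    pad_id: int = 0,
) -> Tuple[List[Any], int]:
    """Flatten each chunk's (prompt, completion) pairs into one token
    sequence, pad to max_seq_length, and report total padding added."""
    results: List[Any] = []
    total_padding = 0
    for chunk in tokenized_data:
        tokens = [t for prompt, completion in chunk for t in prompt + completion]
        padding = max_seq_length - len(tokens)
        total_padding += padding
        results.append(tokens + [pad_id] * padding)
    return results, total_padding
```

Tracking total padding this way makes it easy to measure how much capacity the VSL packing recovers compared to padding every sequence individually.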
- encode(doc: List[tuple]) Tuple[List[tuple], Dict] [source]#
Tokenize and encode the document for text summarization.
- Parameters
doc (List[tuple]) – Contains a list of (prompt, completion) data to encode.
- Returns
List of tokenized tuples (prompt, completion) and a stats dictionary
- Return type
Tuple[List[tuple], Dict]
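In outline, encoding tokenizes each (prompt, completion) pair and collects statistics alongside. The sketch below is a hypothetical simplification: the `tokenize` callback and the stats keys (`num_pairs`, `num_tokens`) are assumptions for illustration:

```python
from typing import Callable, Dict, List, Tuple


def encode(
    doc: List[Tuple[str, str]],
    tokenize: Callable[[str], List[int]],
) -> Tuple[List[Tuple[List[int], List[int]]], Dict]:
    """Tokenize each (prompt, completion) pair and collect simple stats."""
    tokenized: List[Tuple[List[int], List[int]]] = []
    stats: Dict = {"num_pairs": 0, "num_tokens": 0}
    for prompt, completion in doc:
        p_ids, c_ids = tokenize(prompt), tokenize(completion)
        tokenized.append((p_ids, c_ids))
        stats["num_pairs"] += 1
        stats["num_tokens"] += len(p_ids) + len(c_ids)
    return tokenized, stats
```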
- append_within_max_length(tokenized_data: List[List[tuple]]) List[List[List[tuple]]] [source]#
Optimize representation of tokenized data by merging shorter sequences within the specified maximum sequence length.
- Parameters
tokenized_data (List[List[tuple]]) – List of tokenized text data where each inner list contains (prompt, completion) tuples.
- Returns
Optimized list after merging shorter sequences.
- Return type
List[List[List[tuple]]]
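The merging step above can be sketched as a greedy first-fit pass over the chunks. This helper is a simplified stand-in, assuming sequence length is the combined prompt and completion token count and that `max_seq_length` is passed in directly; it is not the actual implementation:

```python
from typing import List, Tuple

Chunk = List[Tuple[List[int], List[int]]]


def append_within_max_length(
    tokenized_data: List[Chunk],
    max_seq_length: int,
) -> List[List[Chunk]]:
    """Greedily merge short (prompt, completion) chunks into bins whose
    combined token count stays within max_seq_length."""
    def seq_len(chunk: Chunk) -> int:
        return sum(len(p) + len(c) for p, c in chunk)

    bins: List[List[Chunk]] = []
    # Place longest chunks first so short ones can fill the gaps.
    for chunk in sorted(tokenized_data, key=seq_len, reverse=True):
        for b in bins:
            if sum(seq_len(c) for c in b) + seq_len(chunk) <= max_seq_length:
                b.append(chunk)
                break
        else:
            bins.append([chunk])
    return bins
```

For example, chunks of 5, 3, and 2 tokens with `max_seq_length=5` pack into two bins rather than three, which is exactly the padding reduction VSL packing is after.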
- get_token_id(token: str) int #
Get the token ID for the given token.
- Parameters
token (str) – Token for which the ID is needed.
- Returns
Token ID.
- Return type
int
- prepend_prefix(prompt_ids: List[int], completion_ids: List[int], index: int) Tuple[List[int], List[int]] #
Prepends prefixes to prompt ids and completion ids and manages the beginning of sentence (BOS) tokens.
- Parameters
prompt_ids – A list of integer IDs representing the prompt.
completion_ids – A list of integer IDs representing the completion.
index – The index indicating the position of the sequence being processed.
- Returns
A tuple of two lists: the updated prompt_ids and completion_ids.
- Return type
Tuple[List[int], List[int]]
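A minimal sketch of the BOS handling: the rule shown here, prepending a BOS token only when the sequence is first in its pack (`index == 0`), and the `bos_id` default are illustrative assumptions, not the library's exact logic:

```python
from typing import List, Tuple


def prepend_prefix(
    prompt_ids: List[int],
    completion_ids: List[int],
    index: int,
    bos_id: int = 1,
) -> Tuple[List[int], List[int]]:
    """Prepend a BOS token to the prompt of the first sequence only,
    so sequences packed later in the same sample do not each get a BOS."""
    if index == 0:
        prompt_ids = [bos_id] + prompt_ids
    return prompt_ids, completion_ids
```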
- tokenize_text(text: str) List[int] #
Tokenize the provided text.
- Parameters
text (str) – Text to tokenize.
- Returns
List of token IDs.
- Return type
List[int]
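End to end, tokenize_text maps raw text to token IDs via the configured tokenizer. A toy whitespace tokenizer illustrates the contract; the real generator delegates to a proper subword tokenizer, so the vocab dict and `unk_id` fallback here are assumptions for illustration only:

```python
from typing import Dict, List


def tokenize_text(text: str, vocab: Dict[str, int], unk_id: int = 0) -> List[int]:
    """Toy whitespace tokenizer: split on spaces and look up each piece,
    falling back to unk_id for out-of-vocabulary words."""
    return [vocab.get(piece, unk_id) for piece in text.split()]
```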