cerebras.modelzoo.data_preparation.nlp.chunk_data_processing.summarization_vsl_data_token_generator.VSLSummarizationTokenGenerator#

class cerebras.modelzoo.data_preparation.nlp.chunk_data_processing.summarization_vsl_data_token_generator.VSLSummarizationTokenGenerator[source]#

Bases: cerebras.modelzoo.data_preparation.nlp.chunk_data_processing.summarization_data_token_generator.SummarizationTokenGenerator

Token generator for variable sequence length (VSL) summarization. Extends SummarizationTokenGenerator with additional functionality for VSL sequence packing.

Initialize VSLSummarizationTokenGenerator with dataset parameters, tokenizer, and token IDs.

Methods

append_within_max_length – Optimize representation of tokenized data by merging shorter sequences within the specified maximum sequence length.

check_valid_doc

encode – Tokenize and encode the document for text summarization.

get_token_id – Get the token ID for the given token.

prepend_prefix – Prepend prefixes to prompt IDs and completion IDs and manage the beginning-of-sentence (BOS) tokens.

process_chunks – Process chunks of tokenized text and return processed features along with the total padding added.

tokenize_prefix_tags

tokenize_text – Tokenize the provided text.

Attributes

use_vsl

__init__(params, tokenizer, eos_id, pad_id)[source]#

Initialize VSLSummarizationTokenGenerator with dataset parameters, tokenizer, and token IDs.
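A minimal construction sketch, not taken from the documentation: the params keys and the HuggingFace-style tokenizer shown here are assumptions, not a documented schema.

from transformers import AutoTokenizer
from cerebras.modelzoo.data_preparation.nlp.chunk_data_processing.summarization_vsl_data_token_generator import (
    VSLSummarizationTokenGenerator,
)

tokenizer = AutoTokenizer.from_pretrained("gpt2")

# Hypothetical dataset parameters; the exact keys are assumptions.
params = {
    "dataset": {"use_vsl": True},
    "processing": {"max_seq_length": 2048},
}

generator = VSLSummarizationTokenGenerator(
    params,
    tokenizer,
    eos_id=tokenizer.eos_token_id,
    pad_id=tokenizer.eos_token_id,  # GPT-2 has no dedicated pad token; reuse EOS
)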

process_chunks(tokenized_data: List[List[tuple]]) Tuple[List[Any], int][source]#

Process chunks of tokenized text and return processed features along with the total padding added.

Parameters

tokenized_data (List[List[tuple]]) – List of tokenized text chunks, where each chunk is represented as a list of (prompt, completion) tuples.

Returns

Tuple containing a list of processed results and the total number of padding tokens added.

Return type

Tuple[List[Any], int]
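A sketch of the expected call shape, reusing the generator from the construction sketch above. The token IDs are placeholders, and the exact structure of each processed feature is not shown here.

# Each chunk is a list of (prompt_ids, completion_ids) tuples.
tokenized_data = [
    [([101, 2023], [2003, 102])],       # chunk with one sequence
    [([101, 7592], [2088, 102]),
     ([101, 2129], [2024, 102])],       # chunk with two packed sequences
]

features, total_padding = generator.process_chunks(tokenized_data)
print(len(features), "processed chunks,", total_padding, "pad tokens added")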

encode(doc: List[tuple]) Tuple[List[tuple], Dict][source]#

Tokenize and encode the document for text summarization.

Parameters

doc (List[tuple]) – List of (prompt, completion) pairs to encode.

Returns

List of tokenized (prompt, completion) tuples and a stats dictionary.

Return type

Tuple[List[tuple], Dict]
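A usage sketch under the same assumptions as above; the keys of the returned stats dictionary are not documented here.

doc = [(
    "Summarize: The quick brown fox jumps over the lazy dog.",
    "A fox jumps over a dog.",
)]

tokenized_doc, stats = generator.encode(doc)
# tokenized_doc is a list of (prompt_ids, completion_ids) tuples,
# ready for append_within_max_length and process_chunks.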

append_within_max_length(tokenized_data: List[List[tuple]]) List[List[List[tuple]]][source]#

Optimize representation of tokenized data by merging shorter sequences within the specified maximum sequence length.

Parameters

tokenized_data (List[List[Tuple[List, List]]]) – List of tokenized text data where each inner list contains (prompt, completion) tuples.

Returns

Optimized list after merging shorter sequences.

Return type

List[List[List[tuple]]]
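A sketch of the packing behavior, with placeholder token IDs. Whether two sequences are merged depends on the configured maximum sequence length, so the commented result is illustrative only.

tokenized_data = [
    [([1, 2, 3], [4, 5])],   # one short sequence per document
    [([6, 7], [8])],
]

packed = generator.append_within_max_length(tokenized_data)
# If both sequences fit within max_seq_length, they may be packed into
# a single chunk, e.g. [[[([1, 2, 3], [4, 5]), ([6, 7], [8])]]].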

get_token_id(token: str) int#

Get the token ID for the given token.

Parameters

token (str) – Token for which the ID is needed.

Returns

Token ID.

Return type

int
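For example (the token string is illustrative and must exist in the tokenizer's vocabulary):

eos = generator.get_token_id("<|endoftext|>")  # GPT-2's EOS token string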

prepend_prefix(prompt_ids: List[int], completion_ids: List[int], index: int) Tuple[List[int], List[int]]#

Prepends prefixes to prompt IDs and completion IDs and manages the beginning-of-sentence (BOS) tokens.

Parameters
  • prompt_ids – A list of integer IDs representing the prompt.

  • completion_ids – A list of integer IDs representing the completion.

  • index – The index indicating the position of the sequence being processed.

Returns

A tuple of two lists: the updated prompt_ids and completion_ids.

Return type

Tuple[List[int], List[int]]
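A sketch of a typical call, reusing the generator above. The effect of index on BOS handling is an assumption based on the description; index=0 is taken to mark the first sequence in a pack.

prompt_ids = generator.tokenize_text("Summarize: some article text")
completion_ids = generator.tokenize_text("A short summary.")

# index=0: assumed to mark the first sequence, where a BOS token applies.
prompt_ids, completion_ids = generator.prepend_prefix(
    prompt_ids, completion_ids, index=0
)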

tokenize_text(text: str) List[int]#

Tokenize the provided text.

Parameters

text (str) – Text to tokenize.

Returns

List of token IDs.

Return type

List[int]
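For example, with the generator constructed above:

ids = generator.tokenize_text("The quick brown fox.")
# ids is a list of integer token IDs from the tokenizer's vocabulary.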