cerebras.modelzoo.data_preparation.nlp.chunk_data_processing.summarization_vsl_data_token_generator.VSLSummarizationTokenGenerator#

class cerebras.modelzoo.data_preparation.nlp.chunk_data_processing.summarization_vsl_data_token_generator.VSLSummarizationTokenGenerator[source]#

Bases: cerebras.modelzoo.data_preparation.nlp.chunk_data_processing.summarization_data_token_generator.SummarizationTokenGenerator

Token generator for variable sequence length (VSL) summarization. Extends SummarizationTokenGenerator with additional functionality for VSL sequence packing.

Initialize VSLSummarizationTokenGenerator with dataset parameters, tokenizer, and token IDs.

Methods

append_within_max_length – Optimize representation of tokenized data by merging shorter sequences within the specified maximum sequence length.

check_valid_doc

encode – Tokenize and encode the document for text summarization.

get_token_id – Get the token ID for the given token.

prepend_prefix – Prepend prefixes to prompt IDs and completion IDs and manage the beginning-of-sentence (BOS) tokens.

process_chunks – Process chunks of tokenized text and return processed features along with the total padding added.

tokenize_prefix_tags

tokenize_text – Tokenize the provided text.

Attributes

use_vsl

__init__(params, tokenizer, eos_id, pad_id)[source]#

Initialize VSLSummarizationTokenGenerator with dataset parameters, tokenizer, and token IDs.
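A minimal construction sketch, not taken from the documentation: the params keys and the HuggingFace-style tokenizer shown here are assumptions, not a documented schema.

from transformers import AutoTokenizer
from cerebras.modelzoo.data_preparation.nlp.chunk_data_processing.summarization_vsl_data_token_generator import (
    VSLSummarizationTokenGenerator,
)

tokenizer = AutoTokenizer.from_pretrained("gpt2")

# Hypothetical dataset parameters; the exact keys are assumptions.
params = {
    "dataset": {"use_vsl": True},
    "processing": {"max_seq_length": 2048},
}

generator = VSLSummarizationTokenGenerator(
    params,
    tokenizer,
    eos_id=tokenizer.eos_token_id,
    pad_id=tokenizer.eos_token_id,  # GPT-2 has no dedicated pad token; reuse EOS
)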

process_chunks(tokenized_data: List[List[tuple]]) Tuple[List[Any], int][source]#

Process chunks of tokenized text and return processed features along with the total padding added.

Parameters

tokenized_data (List[List[tuple]]) – List of tokenized text chunks, where each chunk is represented as a list of (prompt, completion) tuples.

Returns

Tuple containing a list of processed results and the total number of padding tokens added.

Return type

Tuple[List[Any], int]
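A sketch of the expected call shape, reusing the generator from the construction sketch above. The token IDs are placeholders, and the exact structure of each processed feature is not shown here.

# Each chunk is a list of (prompt_ids, completion_ids) tuples.
tokenized_data = [
    [([101, 2023], [2003, 102])],       # chunk with one sequence
    [([101, 7592], [2088, 102]),
     ([101, 2129], [2024, 102])],       # chunk with two packed sequences
]

features, total_padding = generator.process_chunks(tokenized_data)
print(len(features), "processed chunks,", total_padding, "pad tokens added")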

encode(doc: List[tuple]) Tuple[List[tuple], Dict][source]#

Tokenize and encode the document for text summarization.

Parameters

doc (List[tuple]) – List of (prompt, completion) pairs to encode.

Returns

List of tokenized (prompt, completion) tuples and a stats dictionary.

Return type

Tuple[List[tuple], Dict]
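A usage sketch under the same assumptions as above; the keys of the returned stats dictionary are not documented here.

doc = [(
    "Summarize: The quick brown fox jumps over the lazy dog.",
    "A fox jumps over a dog.",
)]

tokenized_doc, stats = generator.encode(doc)
# tokenized_doc is a list of (prompt_ids, completion_ids) tuples,
# ready for append_within_max_length and process_chunks.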

append_within_max_length(tokenized_data: List[List[tuple]]) List[List[List[tuple]]][source]#

Optimize representation of tokenized data by merging shorter sequences within the specified maximum sequence length.

Parameters

tokenized_data (List[List[Tuple[List, List]]]) – List of tokenized text data where each inner list contains (prompt, completion) tuples.

Returns

Optimized list after merging shorter sequences.

Return type

List[List[List[tuple]]]
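A sketch of the packing behavior, with placeholder token IDs. Whether two sequences are merged depends on the configured maximum sequence length, so the commented result is illustrative only.

tokenized_data = [
    [([1, 2, 3], [4, 5])],   # one short sequence per document
    [([6, 7], [8])],
]

packed = generator.append_within_max_length(tokenized_data)
# If both sequences fit within max_seq_length, they may be packed into
# a single chunk, e.g. [[[([1, 2, 3], [4, 5]), ([6, 7], [8])]]].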

get_token_id(token: str) int#

Get the token ID for the given token.

Parameters

token (str) – Token for which the ID is needed.

Returns

Token ID.

Return type

int
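For example (the token string is illustrative and must exist in the tokenizer's vocabulary):

eos = generator.get_token_id("<|endoftext|>")  # GPT-2's EOS token string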

prepend_prefix(prompt_ids: List[int], completion_ids: List[int], index: int) Tuple[List[int], List[int]]#

Prepends prefixes to prompt IDs and completion IDs and manages the beginning-of-sentence (BOS) tokens.

Parameters
  • prompt_ids – A list of integer IDs representing the prompt.

  • completion_ids – A list of integer IDs representing the completion.

  • index – The index indicating the position of the sequence being processed.

Returns

A tuple of two lists: the updated prompt_ids and completion_ids.

Return type

Tuple[List[int], List[int]]
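A sketch of a typical call, reusing the generator above. The effect of index on BOS handling is an assumption based on the description; index=0 is taken to mark the first sequence in a pack.

prompt_ids = generator.tokenize_text("Summarize: some article text")
completion_ids = generator.tokenize_text("A short summary.")

# index=0: assumed to mark the first sequence, where a BOS token applies.
prompt_ids, completion_ids = generator.prepend_prefix(
    prompt_ids, completion_ids, index=0
)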

tokenize_text(text: str) List[int]#

Tokenize the provided text.

Parameters

text (str) – Text to tokenize.

Returns

List of token IDs.

Return type

List[int]
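For example, with the generator constructed above:

ids = generator.tokenize_text("The quick brown fox.")
# ids is a list of integer token IDs from the tokenizer's vocabulary.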