cerebras.modelzoo.data_preparation.nlp.chunk_data_processing.summarization_data_token_generator

SummarizationTokenGenerator Module

This module provides the SummarizationTokenGenerator class, which tokenizes prompt/completion data and creates features suitable for summarization tasks. The class uses the BPETokenizer from the modelzoo.transformers.data_processing.tokenizers package for tokenization.

Usage:

    tokenizer = SummarizationTokenGenerator(dataset_params, max_sequence_length, tokenizer)
    tokenized_features = tokenizer.encode(("prompt_text", "completion_text"))

Functions

create_features_summarization

Given a list of prompt_ids and a list of completion_ids, generate the input sequence and labels.
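
The idea behind this function can be illustrated with a minimal sketch. This is not the Model Zoo implementation: the function name sketch_create_features, its parameters, the eos_id and pad_id defaults, the loss mask, and the choice to skip over-long examples are all assumptions made for illustration. It shows one common way to turn prompt and completion token IDs into a shifted input/label pair where only completion tokens contribute to the loss.

```python
from typing import List, Optional, Tuple


def sketch_create_features(
    prompt_ids: List[int],
    completion_ids: List[int],
    max_sequence_length: int,
    eos_id: int = 0,  # hypothetical default, not taken from the library
    pad_id: int = 0,  # hypothetical default, not taken from the library
) -> Optional[Tuple[List[int], List[int], List[int]]]:
    """Illustrative only: build (input_ids, labels, loss_mask) lists."""
    # Concatenate prompt and completion, then terminate with EOS.
    token_ids = prompt_ids + completion_ids + [eos_id]

    # Assumption: examples that do not fit are skipped rather than truncated.
    if len(token_ids) > max_sequence_length + 1:
        return None

    # Next-token prediction: labels are the inputs shifted left by one.
    input_ids = token_ids[:-1]
    labels = token_ids[1:]

    # Mask out positions whose label is still a prompt token, so the loss
    # covers only the completion tokens and the final EOS.
    n_prompt_labels = max(len(prompt_ids) - 1, 0)
    loss_mask = [0] * n_prompt_labels + [1] * (len(labels) - n_prompt_labels)

    # Right-pad everything to the fixed sequence length.
    pad = max_sequence_length - len(input_ids)
    input_ids = input_ids + [pad_id] * pad
    labels = labels + [pad_id] * pad
    loss_mask = loss_mask + [0] * pad
    return input_ids, labels, loss_mask


features = sketch_create_features([5, 6, 7], [8, 9], max_sequence_length=8, eos_id=1)
```

Returning None for over-long examples keeps the sketch short; a real pipeline might truncate the prompt instead.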

Classes

SummarizationTokenGenerator

Initialize the SummarizationTokenGenerator class.