cerebras.modelzoo.data_preparation.nlp.chunk_data_processing.summarization_data_token_generator.SummarizationTokenGenerator
- class cerebras.modelzoo.data_preparation.nlp.chunk_data_processing.summarization_data_token_generator.SummarizationTokenGenerator
Bases: object
Initialize the SummarizationTokenGenerator class.
- Parameters
vocab_file (str) – Path to the vocabulary file.
encoder_file (str) – Path to the encoder file.
max_seq_length (int, optional) – Maximum sequence length. Defaults to 2048.
Methods
check_valid_doc
encode – Tokenize and encode the doc for text summarization.
get_token_id – Get the token ID for the given token.
prepend_prefix – Prepends prefixes to prompt ids and completion ids and manages the beginning of sentence (BOS) tokens.
tokenize_prefix_tags
tokenize_text – Tokenize the provided text.
- __init__(params, tokenizer, eos_id, pad_id)
Initialize the SummarizationTokenGenerator class.
- Parameters
vocab_file (str) – Path to the vocabulary file.
encoder_file (str) – Path to the encoder file.
max_seq_length (int, optional) – Maximum sequence length. Defaults to 2048.
- prepend_prefix(prompt_ids: List[int], completion_ids: List[int], index: int) -> Tuple[List[int], List[int]]
Prepends prefixes to prompt ids and completion ids and manages the beginning of sentence (BOS) tokens.
- Parameters
prompt_ids (List[int]) – A list of integer IDs representing the prompt.
completion_ids (List[int]) – A list of integer IDs representing the completion.
index (int) – The index indicating the position of the sequence being processed.
- Returns
A tuple of two lists: the updated prompt_ids and completion_ids.
- Return type
Tuple[List[int], List[int]]
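The following is a minimal sketch of what prefix prepending with BOS handling could look like, not the Model Zoo implementation. The function name `prepend_prefix_sketch`, the `prompt_prefix_ids`, `completion_prefix_ids`, and `bos_id` arguments, and the rule that BOS is added only for the first pair (`index == 0`) are all assumptions made for illustration.

```python
from typing import List, Optional, Tuple


def prepend_prefix_sketch(
    prompt_ids: List[int],
    completion_ids: List[int],
    index: int,
    prompt_prefix_ids: Optional[List[int]] = None,      # hypothetical argument
    completion_prefix_ids: Optional[List[int]] = None,  # hypothetical argument
    bos_id: Optional[int] = None,                       # hypothetical argument
) -> Tuple[List[int], List[int]]:
    """Illustrative only: prepend prefix IDs and place a single BOS token."""
    # Build new lists so the caller's inputs are not mutated.
    new_prompt = list(prompt_prefix_ids or []) + list(prompt_ids)
    new_completion = list(completion_prefix_ids or []) + list(completion_ids)

    # Assumption: BOS appears once, at the start of the first (prompt, completion) pair.
    if bos_id is not None and index == 0:
        if not new_prompt or new_prompt[0] != bos_id:
            new_prompt.insert(0, bos_id)
    return new_prompt, new_completion


first = prepend_prefix_sketch(
    [11, 12], [21, 22], index=0,
    prompt_prefix_ids=[5], completion_prefix_ids=[6], bos_id=1,
)
later = prepend_prefix_sketch(
    [13], [23], index=1,
    prompt_prefix_ids=[5], completion_prefix_ids=[6], bos_id=1,
)
print(first)  # ([1, 5, 11, 12], [6, 21, 22])
print(later)  # ([5, 13], [6, 23])
```

In this sketch only the first pair receives a BOS token, so a multi-turn document concatenated from several pairs would start with exactly one BOS.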
- tokenize_text(text: str) -> List[int]
Tokenize the provided text.
- Parameters
text (str) – Text to tokenize.
- Returns
List of token IDs.
- Return type
List[int]
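To make the `str` to `List[int]` contract concrete, here is a toy stand-in. `TOY_VOCAB` and `tokenize_text_sketch` are hypothetical names; the real method presumably delegates to the tokenizer supplied at construction time, which this sketch does not attempt to reproduce.

```python
from typing import Dict, List

# Hypothetical toy vocabulary, used only for illustration.
TOY_VOCAB: Dict[str, int] = {"<unk>": 0, "summarize": 1, "this": 2, "text": 3}


def tokenize_text_sketch(text: str) -> List[int]:
    """Map whitespace-separated words to IDs, falling back to <unk>."""
    unk_id = TOY_VOCAB["<unk>"]
    return [TOY_VOCAB.get(word, unk_id) for word in text.lower().split()]


ids = tokenize_text_sketch("Summarize this long text")
print(ids)  # [1, 2, 0, 3]
```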
- encode(doc: List[Tuple]) -> Tuple[List[numpy.ndarray], Dict]
Tokenize and encode the doc for text summarization.
- Parameters
doc (List[Tuple]) – List of (prompt, completion) pairs to encode.
- Returns
Tuple of encoded features for text summarization and dataset stats
- Return type
Tuple[List[np.ndarray], Dict]
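As a rough sketch of the shape of this method's input and output, the example below encodes already-tokenized (prompt, completion) pairs into a padded sequence plus a stats dictionary. Everything here is an assumption for illustration: the name `encode_sketch`, the constants `MAX_SEQ_LENGTH`, `EOS_ID`, and `PAD_ID`, the stats keys, the completion-only loss mask, and the choice to discard over-long documents. Plain Python lists stand in for `numpy.ndarray` to keep the sketch dependency-free.

```python
from typing import Dict, List, Tuple

MAX_SEQ_LENGTH = 8  # hypothetical; kept small for illustration
EOS_ID = 2          # hypothetical end-of-sequence ID
PAD_ID = 0          # hypothetical padding ID


def encode_sketch(
    doc: List[Tuple[List[int], List[int]]],
) -> Tuple[List[List[List[int]]], Dict[str, int]]:
    """Illustrative only: pack (prompt, completion) ID pairs into one sample."""
    input_ids: List[int] = []
    loss_mask: List[int] = []
    for prompt_ids, completion_ids in doc:
        input_ids += prompt_ids + completion_ids
        # Assumption: loss is computed on completion tokens only.
        loss_mask += [0] * len(prompt_ids) + [1] * len(completion_ids)
    input_ids.append(EOS_ID)
    loss_mask.append(1)

    stats = {"processed": 0, "discarded": 0, "num_pad_tokens": 0}
    if len(input_ids) > MAX_SEQ_LENGTH:
        # Assumption: documents that do not fit are dropped and counted.
        stats["discarded"] = 1
        return [], stats

    num_pad = MAX_SEQ_LENGTH - len(input_ids)
    input_ids += [PAD_ID] * num_pad
    loss_mask += [0] * num_pad
    stats["processed"] = 1
    stats["num_pad_tokens"] = num_pad
    return [[input_ids, loss_mask]], stats


features, stats = encode_sketch([([5, 6, 7], [8, 9])])
print(features)  # [[[5, 6, 7, 8, 9, 2, 0, 0], [0, 0, 0, 1, 1, 1, 0, 0]]]
print(stats)     # {'processed': 1, 'discarded': 0, 'num_pad_tokens': 2}
```

Returning the stats alongside the features lets a caller aggregate counts such as discarded documents across a whole dataset without a second pass.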