cerebras.modelzoo.data_preparation.nlp.chunk_data_processing.summarization_data_token_generator.SummarizationTokenGenerator#

class cerebras.modelzoo.data_preparation.nlp.chunk_data_processing.summarization_data_token_generator.SummarizationTokenGenerator[source]#

Bases: object

Initialize the SummarizationTokenGenerator class.

Parameters
  • params (dict) – Configuration parameters for data processing.

  • tokenizer – Tokenizer instance used to encode text.

  • eos_id (int) – Token ID marking the end of a sequence.

  • pad_id (int) – Token ID used for padding.

Methods

check_valid_doc

encode

Tokenize and encode the doc for text summarization.

get_token_id

Get the token ID for the given token.

prepend_prefix

Prepends prefixes to prompt ids and completion ids and manages the beginning of sentence (BOS) tokens.

tokenize_prefix_tags

tokenize_text

Tokenize the provided text.

__init__(params, tokenizer, eos_id, pad_id)[source]#

Initialize the SummarizationTokenGenerator class.

Parameters
  • params (dict) – Configuration parameters for data processing.

  • tokenizer – Tokenizer instance used to encode text.

  • eos_id (int) – Token ID marking the end of a sequence.

  • pad_id (int) – Token ID used for padding.

prepend_prefix(prompt_ids: List[int], completion_ids: List[int], index: int) Tuple[List[int], List[int]][source]#

Prepends prefixes to prompt ids and completion ids and manages the beginning of sentence (BOS) tokens.

Parameters
  • prompt_ids – A list of integer IDs representing the prompt.

  • completion_ids – A list of integer IDs representing the completion.

  • index – The index indicating the position of the sequence being processed.

Returns

A tuple of two lists: the updated prompt_ids and completion_ids.
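A hypothetical sketch of prepend_prefix-style BOS handling (not the library's actual implementation): prepend an assumed BOS token ID only to the first sequence in a concatenated stream, so later sequences are not given duplicate BOS tokens.

```python
from typing import List, Tuple

BOS_ID = 1  # assumed BOS token ID, for illustration only

def prepend_prefix(
    prompt_ids: List[int], completion_ids: List[int], index: int
) -> Tuple[List[int], List[int]]:
    # Only the sequence at position 0 receives the BOS token.
    if index == 0:
        prompt_ids = [BOS_ID] + prompt_ids
    return prompt_ids, completion_ids

first = prepend_prefix([5, 6], [7, 8], index=0)   # BOS prepended
later = prepend_prefix([5, 6], [7, 8], index=1)   # unchanged
```

The real method may also manage prefix tags around prompt and completion; this sketch shows only the index-gated BOS behavior.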

tokenize_text(text: str) List[int][source]#

Tokenize the provided text.

Parameters

text (str) – Text to tokenize.

Returns

List of token IDs.

Return type

List[int]
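As a toy stand-in for tokenize_text (the real class delegates to its configured tokenizer), the following maps whitespace-split words to IDs through a small assumed vocabulary:

```python
from typing import Dict, List

# Assumed toy vocabulary; the real token IDs come from the tokenizer.
vocab: Dict[str, int] = {"summarize": 10, "this": 11, "article": 12}

def tokenize_text(text: str) -> List[int]:
    # Unknown words fall back to a reserved <unk> ID of 0.
    return [vocab.get(word, 0) for word in text.lower().split()]

ids = tokenize_text("Summarize this article")
```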

encode(doc: List[Tuple]) Tuple[List[numpy.ndarray], Dict][source]#

Tokenize and encode the doc for text summarization.

Parameters

doc (List[Tuple]) – List of (prompt, completion) pairs to encode.

Returns

Tuple of encoded features for text summarization and dataset stats

Return type

Tuple[List[np.ndarray], Dict]
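An illustrative sketch, not the library's implementation, of what an encode of this shape might do: tokenize each (prompt, completion) pair with a stand-in tokenizer, truncate and pad to a fixed length, and accumulate simple dataset statistics. The tokenizer, max length, pad ID, and stat names here are all assumptions.

```python
from typing import Dict, List, Tuple
import numpy as np

MAX_SEQ_LENGTH = 8  # assumed; the class defaults to 2048
PAD_ID = 0          # assumed padding token ID

def toy_tokenize(text: str) -> List[int]:
    # Stand-in tokenizer: one ID per word (its character length).
    return [len(word) for word in text.split()]

def encode(doc: List[Tuple[str, str]]) -> Tuple[List[np.ndarray], Dict]:
    features: List[np.ndarray] = []
    stats: Dict = {"num_pad_tokens": 0, "num_sequences": 0}
    for prompt, completion in doc:
        ids = (toy_tokenize(prompt) + toy_tokenize(completion))[:MAX_SEQ_LENGTH]
        pad = MAX_SEQ_LENGTH - len(ids)
        stats["num_pad_tokens"] += pad
        stats["num_sequences"] += 1
        features.append(np.array(ids + [PAD_ID] * pad, dtype=np.int32))
    return features, stats

feats, stats = encode([("sum this", "ok done")])
```

The actual method produces richer features (e.g. input IDs, labels, and masks) and stats; this only mirrors the return shape Tuple[List[np.ndarray], Dict].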

get_token_id(token: str) int[source]#

Get the token ID for the given token.

Parameters

token (str) – Token for which the ID is needed.

Returns

Token ID.

Return type

int
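A minimal sketch of get_token_id-style lookup under assumed names: query a vocabulary mapping for the token's ID, falling back to a reserved unknown-token ID when the token is missing.

```python
from typing import Dict

# Assumed toy vocabulary with reserved special tokens.
vocab: Dict[str, int] = {"<pad>": 0, "<unk>": 1, "hello": 42}

def get_token_id(token: str) -> int:
    # Fall back to the <unk> ID for out-of-vocabulary tokens.
    return vocab.get(token, vocab["<unk>"])

pad_id = get_token_id("<pad>")
missing_id = get_token_id("goodbye")
```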