modelzoo.transformers.data_processing.scripts.chunk_preprocessing.summarization_data_token_generator.SummarizationTokenGenerator#

class modelzoo.transformers.data_processing.scripts.chunk_preprocessing.summarization_data_token_generator.SummarizationTokenGenerator[source]#

Bases: object

Initialize the SummarizationTokenGenerator class.

Parameters
  • params (dict) – Configuration parameters for data preprocessing.

  • tokenizer – Tokenizer used to encode the text.

  • eos_id (int) – Token ID that marks the end of a sequence.

  • pad_id (int) – Token ID used to pad sequences to the maximum length.

Methods

encode

Tokenize and encode the doc for text summarization.

get_token_id

Get the token ID for the given token.

tokenize_text

Tokenize the provided text.

__init__(params, tokenizer, eos_id, pad_id)[source]#

Initialize the SummarizationTokenGenerator class.

Parameters
  • params (dict) – Configuration parameters for data preprocessing.

  • tokenizer – Tokenizer used to encode the text.

  • eos_id (int) – Token ID that marks the end of a sequence.

  • pad_id (int) – Token ID used to pad sequences to the maximum length.

encode(doc: tuple) → List[numpy.ndarray][source]#

Tokenize and encode the doc for text summarization.

Parameters

doc (tuple) – Tuple containing the prompt and completion to encode.

Returns

Encoded features for text summarization.

Return type

List[np.ndarray]
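The exact feature layout is implementation-specific, but the flow of encode can be sketched with a minimal stand-in: join the tokenized prompt and completion, append the EOS token, truncate to the maximum sequence length, and pad the remainder. The vocabulary, token IDs, and 8-token limit below are illustrative assumptions, not the real defaults.

```python
import numpy as np

MAX_SEQ_LENGTH = 8  # tiny value for illustration; the real default is 2048
EOS_ID = 2          # assumed end-of-sequence ID
PAD_ID = 0          # assumed padding ID

# Hypothetical vocabulary standing in for the real tokenizer.
VOCAB = {"<pad>": 0, "<unk>": 1, "<eos>": 2, "summarize": 3, "the": 4,
         "text": 5, "short": 6, "summary": 7}

def tokenize_text(text):
    """Map whitespace-split tokens to IDs, falling back to <unk>."""
    return [VOCAB.get(tok, VOCAB["<unk>"]) for tok in text.split()]

def encode(doc):
    """Encode a (prompt, completion) pair into a fixed-length feature array."""
    prompt, completion = doc
    # Concatenate prompt and completion tokens, then mark the end of sequence.
    token_ids = tokenize_text(prompt) + tokenize_text(completion) + [EOS_ID]
    # Truncate to the maximum length, then pad the remainder.
    token_ids = token_ids[:MAX_SEQ_LENGTH]
    padding = [PAD_ID] * (MAX_SEQ_LENGTH - len(token_ids))
    return [np.array(token_ids + padding, dtype=np.int32)]

features = encode(("summarize the text", "short summary"))
# features[0] → array of shape (8,): tokens, EOS, then padding
```

The real generator typically emits several aligned arrays (e.g. input IDs, labels, and masks); this sketch keeps only the token-ID array to show the truncate-and-pad logic.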

get_token_id(token: str) → int[source]#

Get the token ID for the given token.

Parameters

token (str) – Token for which the ID is needed.

Returns

Token ID.

Return type

int
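A token-to-ID lookup of this kind usually resolves through the underlying tokenizer's vocabulary, with an unknown-token fallback. The dict-based vocabulary and IDs below are assumptions for illustration only.

```python
# Hypothetical vocabulary; the real generator resolves IDs via its tokenizer.
VOCAB = {"<pad>": 0, "<unk>": 1, "<eos>": 2, "hello": 3}

def get_token_id(token):
    """Return the ID for a token, falling back to the <unk> ID if absent."""
    if token in VOCAB:
        return VOCAB[token]
    return VOCAB["<unk>"]

hello_id = get_token_id("hello")    # known token
missing_id = get_token_id("world")  # unknown token maps to <unk>
```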

tokenize_text(text: str) → List[int][source]#

Tokenize the provided text.

Parameters

text (str) – Text to tokenize.

Returns

List of token IDs.

Return type

List[int]
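The contract here is simply text in, token IDs out. A whitespace tokenizer over an assumed toy vocabulary is enough to show the shape of the result; the real class delegates to the tokenizer passed at construction.

```python
# Toy vocabulary for illustration; unknown words map to <unk>.
VOCAB = {"<unk>": 1, "the": 2, "quick": 3, "fox": 4}

def tokenize_text(text):
    """Split on whitespace and map each token to its vocabulary ID."""
    return [VOCAB.get(tok, VOCAB["<unk>"]) for tok in text.split()]

ids = tokenize_text("the quick brown fox")  # "brown" is out of vocabulary
```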