modelzoo.transformers.data_processing.scripts.chunk_preprocessing.summarization_data_token_generator.SummarizationTokenGenerator#

class modelzoo.transformers.data_processing.scripts.chunk_preprocessing.summarization_data_token_generator.SummarizationTokenGenerator[source]#

Bases: object

Initialize the SummarizationTokenGenerator class.

Parameters
  • params (dict) – Configuration parameters for data preprocessing.

  • tokenizer – Tokenizer used to encode the text.

  • eos_id (int) – Token ID that marks the end of a sequence.

  • pad_id (int) – Token ID used to pad sequences to the maximum length.

Methods

encode

Tokenize and encode the doc for text summarization.

get_token_id

Get the token ID for the given token.

tokenize_text

Tokenize the provided text.

__init__(params, tokenizer, eos_id, pad_id)[source]#

Initialize the SummarizationTokenGenerator class.

Parameters
  • params (dict) – Configuration parameters for data preprocessing.

  • tokenizer – Tokenizer used to encode the text.

  • eos_id (int) – Token ID that marks the end of a sequence.

  • pad_id (int) – Token ID used to pad sequences to the maximum length.

encode(doc: tuple) → List[numpy.ndarray][source]#

Tokenize and encode the doc for text summarization.

Parameters

doc (tuple) – Tuple containing the prompt and completion to encode.

Returns

Encoded features for text summarization.

Return type

List[np.ndarray]
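The exact feature layout is implementation-specific, but the flow of encode can be sketched with a minimal stand-in: join the tokenized prompt and completion, append the EOS token, truncate to the maximum sequence length, and pad the remainder. The vocabulary, token IDs, and 8-token limit below are illustrative assumptions, not the real defaults.

```python
import numpy as np

MAX_SEQ_LENGTH = 8  # tiny value for illustration; the real default is 2048
EOS_ID = 2          # assumed end-of-sequence ID
PAD_ID = 0          # assumed padding ID

# Hypothetical vocabulary standing in for the real tokenizer.
VOCAB = {"<pad>": 0, "<unk>": 1, "<eos>": 2, "summarize": 3, "the": 4,
         "text": 5, "short": 6, "summary": 7}

def tokenize_text(text):
    """Map whitespace-split tokens to IDs, falling back to <unk>."""
    return [VOCAB.get(tok, VOCAB["<unk>"]) for tok in text.split()]

def encode(doc):
    """Encode a (prompt, completion) pair into a fixed-length feature array."""
    prompt, completion = doc
    # Concatenate prompt and completion tokens, then mark the end of sequence.
    token_ids = tokenize_text(prompt) + tokenize_text(completion) + [EOS_ID]
    # Truncate to the maximum length, then pad the remainder.
    token_ids = token_ids[:MAX_SEQ_LENGTH]
    padding = [PAD_ID] * (MAX_SEQ_LENGTH - len(token_ids))
    return [np.array(token_ids + padding, dtype=np.int32)]

features = encode(("summarize the text", "short summary"))
# features[0] → array of shape (8,): tokens, EOS, then padding
```

The real generator typically emits several aligned arrays (e.g. input IDs, labels, and masks); this sketch keeps only the token-ID array to show the truncate-and-pad logic.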

get_token_id(token: str) → int[source]#

Get the token ID for the given token.

Parameters

token (str) – Token for which the ID is needed.

Returns

Token ID.

Return type

int
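A token-to-ID lookup of this kind usually resolves through the underlying tokenizer's vocabulary, with an unknown-token fallback. The dict-based vocabulary and IDs below are assumptions for illustration only.

```python
# Hypothetical vocabulary; the real generator resolves IDs via its tokenizer.
VOCAB = {"<pad>": 0, "<unk>": 1, "<eos>": 2, "hello": 3}

def get_token_id(token):
    """Return the ID for a token, falling back to the <unk> ID if absent."""
    if token in VOCAB:
        return VOCAB[token]
    return VOCAB["<unk>"]

hello_id = get_token_id("hello")    # known token
missing_id = get_token_id("world")  # unknown token maps to <unk>
```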

tokenize_text(text: str) → List[int][source]#

Tokenize the provided text.

Parameters

text (str) – Text to tokenize.

Returns

List of token IDs.

Return type

List[int]
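The contract here is simply text in, token IDs out. A whitespace tokenizer over an assumed toy vocabulary is enough to show the shape of the result; the real class delegates to the tokenizer passed at construction.

```python
# Toy vocabulary for illustration; unknown words map to <unk>.
VOCAB = {"<unk>": 1, "the": 2, "quick": 3, "fox": 4}

def tokenize_text(text):
    """Split on whitespace and map each token to its vocabulary ID."""
    return [VOCAB.get(tok, VOCAB["<unk>"]) for tok in text.split()]

ids = tokenize_text("the quick brown fox")  # "brown" is out of vocabulary
```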