modelzoo.transformers.data_processing.scripts.chunk_preprocessing.summarization_data_token_generator.SummarizationTokenGenerator
- class modelzoo.transformers.data_processing.scripts.chunk_preprocessing.summarization_data_token_generator.SummarizationTokenGenerator
Bases: object
Initialize the SummarizationTokenGenerator class.
- Parameters
vocab_file (str) – Path to the vocabulary file.
encoder_file (str) – Path to the encoder file.
max_seq_length (int, optional) – Maximum sequence length. Defaults to 2048.
Methods
Tokenize and encode the doc for text summarization.
Get the token ID for the given token.
Tokenize the provided text.
- __init__(params, tokenizer, eos_id, pad_id)
Initialize the SummarizationTokenGenerator class.
- Parameters
vocab_file (str) – Path to the vocabulary file.
encoder_file (str) – Path to the encoder file.
max_seq_length (int, optional) – Maximum sequence length. Defaults to 2048.
- encode(doc: tuple) → List[numpy.ndarray]
Tokenize and encode the doc for text summarization.
- Parameters
doc (tuple) – Tuple containing the prompt and completion data to encode.
- Returns
Encoded features for text summarization.
- Return type
List[np.ndarray]
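The following is a minimal sketch of the kind of transformation an `encode`-style method might perform on a (prompt, completion) tuple. It does not import or reproduce the modelzoo implementation. The `ToyTokenizer` class, the `encode_pair` function, the feature layout (input IDs, loss mask, labels), and the padding and truncation rules are all assumptions made for illustration. Only the tuple input, the `eos_id` and `pad_id` names, and the `List[np.ndarray]` return type come from the reference above.

```python
from typing import List, Tuple

import numpy as np


class ToyTokenizer:
    """Hypothetical whitespace tokenizer standing in for a real one."""

    def __init__(self) -> None:
        self.vocab = {}

    def encode(self, text: str) -> List[int]:
        # IDs 0 and 1 are reserved for padding and end-of-sequence here.
        return [self.vocab.setdefault(w, len(self.vocab) + 2) for w in text.split()]


def encode_pair(
    doc: Tuple[str, str],
    tokenizer: ToyTokenizer,
    eos_id: int,
    pad_id: int,
    max_seq_length: int,
) -> List[np.ndarray]:
    """Assumed layout: one (3, max_seq_length) array of input IDs, loss mask, labels."""
    prompt, completion = doc
    prompt_ids = tokenizer.encode(prompt)
    completion_ids = tokenizer.encode(completion)

    # Concatenate, terminate with EOS, and keep one extra token for the label shift.
    tokens = (prompt_ids + completion_ids + [eos_id])[: max_seq_length + 1]
    input_ids = tokens[:-1]
    labels = tokens[1:]

    # Only positions whose label is a completion token (or EOS) count toward the loss.
    first_target = max(len(prompt_ids) - 1, 0)
    loss_mask = [1 if i >= first_target else 0 for i in range(len(labels))]

    # Right-pad every row to the fixed sequence length.
    pad = max_seq_length - len(input_ids)
    input_ids = input_ids + [pad_id] * pad
    labels = labels + [pad_id] * pad
    loss_mask = loss_mask + [0] * pad

    return [np.array([input_ids, loss_mask, labels], dtype=np.int32)]


features = encode_pair(
    ("summarize this text", "short summary"),
    ToyTokenizer(),
    eos_id=1,
    pad_id=0,
    max_seq_length=8,
)
print(features[0].shape)
```

Masking the prompt positions is a common choice for prompt/completion data, since it restricts the training loss to the summary tokens. Whether the actual class does this depends on its configuration, which this page does not document.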