modelzoo.transformers.data_processing.tokenizers.Tokenization.WordPieceTokenizer
- class modelzoo.transformers.data_processing.tokenizers.Tokenization.WordPieceTokenizer
Bases: modelzoo.transformers.data_processing.tokenizers.Tokenization.BaseTokenizer
Class for tokenization of a piece of text into its word pieces.
:param str vocab_file: File containing the vocabulary, one token per line
:param str unknown_token: Token for words not in the vocabulary
:param int max_input_chars_per_word: Max length of word for splitting
:param bool do_lower: Specifies whether to convert to lower case for data processing
Methods
Tokenize a piece of text into its word pieces. This uses a greedy longest-match-first algorithm to perform tokenization using the given vocabulary.
For example:
input = "unaffable"
output = ["un", "##aff", "##able"]
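The greedy longest-match-first idea can be illustrated with a minimal standalone sketch. This is not the modelzoo implementation: the function name `wordpiece_tokenize` is hypothetical, the defaults `"[UNK]"` and `200` are assumptions borrowed from common BERT-style tokenizers, and the behavior for over-long or unmatched words (returning the unknown token) is likewise assumed rather than taken from this page.

```python
def wordpiece_tokenize(word, vocab, unknown_token="[UNK]",
                       max_input_chars_per_word=200):
    """Split one word into word pieces (illustrative sketch only).

    Defaults and fallback behavior are assumptions, not the library's
    documented values.
    """
    # Assumed behavior: words that are too long map to the unknown token.
    if len(word) > max_input_chars_per_word:
        return [unknown_token]

    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, shrinking from the right.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                # Pieces that continue a word carry the "##" prefix.
                candidate = "##" + candidate
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        # Assumed behavior: if no prefix matches, the whole word is unknown.
        if match is None:
            return [unknown_token]
        pieces.append(match)
        start = end
    return pieces


# Toy vocabulary containing just the pieces from the example above.
vocab = {"un", "##aff", "##able"}
print(wordpiece_tokenize("unaffable", vocab))  # ['un', '##aff', '##able']
```

Because the longest candidate is tried first at every position, the sketch prefers `"##able"` over any shorter piece such as `"##a"` that might also be in a larger vocabulary.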