
Tokenization classes and functions



Class for base tokenization of a piece of text. Handles text clean-up operations such as stripping accents, checking for Chinese characters in the text, and splitting on punctuation and control characters.
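
The operations listed above can be sketched in a few small functions. This is a minimal illustration, not the class's actual implementation: the names `strip_accents`, `is_chinese_char`, `is_punctuation`, `is_control` and `basic_tokenize` are hypothetical, only the main CJK Unified Ideographs block is checked, and dropping control characters (rather than splitting on them) is an assumption.

```python
import unicodedata


def strip_accents(text):
    """Remove combining marks (accents) after NFD normalization."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


def is_chinese_char(ch):
    """True for code points in the main CJK Unified Ideographs block (assumed range)."""
    return 0x4E00 <= ord(ch) <= 0x9FFF


def is_punctuation(ch):
    """Treat ASCII symbols and Unicode 'P*' categories as punctuation."""
    cp = ord(ch)
    if 33 <= cp <= 47 or 58 <= cp <= 64 or 91 <= cp <= 96 or 123 <= cp <= 126:
        return True
    return unicodedata.category(ch).startswith("P")


def is_control(ch):
    """True for control characters, excluding common whitespace."""
    if ch in ("\t", "\n", "\r"):
        return False
    return unicodedata.category(ch).startswith("C")


def basic_tokenize(text, do_lower=True):
    """Split text into tokens on whitespace, punctuation and Chinese characters."""
    if do_lower:
        text = text.lower()
    text = strip_accents(text)
    spaced = []
    for ch in text:
        if is_control(ch):
            continue  # assumption: control characters are simply dropped
        if is_chinese_char(ch) or is_punctuation(ch):
            spaced.append(f" {ch} ")  # isolate the character as its own token
        else:
            spaced.append(ch)
    return "".join(spaced).split()
```

For example, `basic_tokenize("Héllo, wörld!")` yields `["hello", ",", "world", "!"]` under these assumptions.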


Class for full tokenization of a piece of text. Calls BaseTokenizer and the WordPiece tokenizer to perform basic grammar operations and wordpiece splits.

:param str vocab_file: File containing the vocabulary, one token per line
:param bool do_lower: Whether to convert the text to lower case during processing
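
As an illustration of the `vocab_file` format described above (one token per line), the vocabulary could be read into a token-to-index mapping as follows. The helper name `load_vocab` and the choice to use the line number as the token id are assumptions made for this sketch, not something the module is known to do.

```python
def load_vocab(vocab_file):
    """Read a vocabulary file with one token per line into a dict.

    Each token maps to its zero-based line number (an assumed convention).
    """
    vocab = {}
    with open(vocab_file, encoding="utf-8") as handle:
        for index, line in enumerate(handle):
            token = line.strip()
            if token:  # skip blank lines
                vocab[token] = index
    return vocab
```

The resulting mapping can then be used for membership checks when splitting words into pieces.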


Class for tokenization of a piece of text into its word pieces.

:param str vocab_file: File containing the vocabulary, one token per line
:param str unknown_token: Token used for words not in the vocabulary
:param int max_input_chars_per_word: Maximum length of a word that will be split
:param bool do_lower: Whether to convert the text to lower case during processing
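
A common way to split a word into word pieces is greedy longest-match-first lookup against the vocabulary. The sketch below shows that technique under stated assumptions: the function name `wordpiece_tokenize`, the `##` continuation prefix, and the default values `"[UNK]"` and `100` follow the usual BERT convention and are not confirmed by this module; the vocabulary is passed in as a set rather than read from `vocab_file`.

```python
def wordpiece_tokenize(word, vocab, unknown_token="[UNK]",
                       max_input_chars_per_word=100, do_lower=True):
    """Split a single word into word pieces by greedy longest-match-first."""
    if do_lower:
        word = word.lower()
    if len(word) > max_input_chars_per_word:
        return [unknown_token]  # overly long words are not split

    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink from the right.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # assumed continuation marker
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unknown_token]  # no piece matches: whole word is unknown
        pieces.append(match)
        start = end
    return pieces
```

With a vocabulary of `{"un", "##aff", "##able"}`, the word `"unaffable"` splits into `["un", "##aff", "##able"]`, while a word with no matching pieces maps to the unknown token.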