cerebras.modelzoo.data_preparation.nlp.tokenizers.Tokenization

Tokenization classes and functions

Classes

BaseTokenizer

Class for base tokenization of a piece of text. Handles normalization operations such as stripping accents, checking for Chinese characters in text, and splitting on punctuation and control characters.
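These normalization steps follow the standard BERT-style basic tokenizer. Below is a minimal sketch of accent stripping and punctuation splitting for illustration only; the helper names are hypothetical and not this class's actual API:

```python
import unicodedata

def strip_accents(text: str) -> str:
    # Decompose characters (NFD) and drop combining marks (category "Mn").
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")

def split_on_punctuation(text: str) -> list:
    # Emit every punctuation character (Unicode category "P*") as its own token.
    tokens, current = [], []
    for ch in text:
        if unicodedata.category(ch).startswith("P"):
            if current:
                tokens.append("".join(current))
                current = []
            tokens.append(ch)
        else:
            current.append(ch)
    if current:
        tokens.append("".join(current))
    return tokens

print(split_on_punctuation(strip_accents("naïve, right?")))
# ['naive', ',', ' right', '?']
```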

FullTokenizer

Class for full tokenization of a piece of text. Calls BaseTokenizer and WordPieceTokenizer to perform basic normalization and WordPiece splits.

Parameters:
- vocab_file (str): File containing the vocabulary, one token per line.
- do_lower (bool): Whether to convert text to lower case during data processing.
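A hedged usage sketch, assuming the class follows the standard BERT-style interface; only the constructor arguments above are documented here, so the tokenize and convert_tokens_to_ids method names and the vocabulary path are assumptions:

```python
from cerebras.modelzoo.data_preparation.nlp.tokenizers.Tokenization import (
    FullTokenizer,
)

# "vocab.txt" is a hypothetical path: one token per line, e.g. "[UNK]", "un", "##aff".
tokenizer = FullTokenizer(vocab_file="vocab.txt", do_lower=True)

# Assumed BERT-style flow: basic normalization, then WordPiece splits.
tokens = tokenizer.tokenize("Unaffable weather, isn't it?")
ids = tokenizer.convert_tokens_to_ids(tokens)
print(tokens, ids)
```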

WordPieceTokenizer

Class for tokenization of a piece of text into its word pieces.

Parameters:
- vocab_file (str): File containing the vocabulary, one token per line.
- unknown_token (str): Token used for words not in the vocabulary.
- max_input_chars_per_word (int): Maximum length of a word eligible for splitting.
- do_lower (bool): Whether to convert text to lower case during data processing.
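The standard WordPiece scheme splits a word greedily into the longest vocabulary matches, prefixing continuation pieces with "##" and falling back to the unknown token when no match exists. A self-contained sketch of that algorithm, shown for illustration rather than as this class's exact implementation:

```python
def wordpiece_split(word, vocab, unknown_token="[UNK]", max_input_chars_per_word=200):
    # Words beyond the length cap map to the unknown token (standard behavior).
    if len(word) > max_input_chars_per_word:
        return [unknown_token]
    pieces, start = [], 0
    while start < len(word):
        # Greedily take the longest substring present in the vocabulary;
        # non-initial pieces carry the "##" continuation prefix.
        end = len(word)
        while end > start:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece
            if piece in vocab:
                pieces.append(piece)
                break
            end -= 1
        else:
            return [unknown_token]  # no vocabulary match at this position
        start = end
    return pieces

vocab = {"un", "##aff", "##able"}
print(wordpiece_split("unaffable", vocab))  # ['un', '##aff', '##able']
```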