cerebras.modelzoo.data_preparation.nlp.tokenizers.Tokenization.BaseTokenizer#

class cerebras.modelzoo.data_preparation.nlp.tokenizers.Tokenization.BaseTokenizer[source]#

Bases: object

Class for base tokenization of a piece of text. Handles grammar operations such as stripping accents, checking for Chinese characters, and splitting on punctuation and control characters. Also builds the tokenizer for converting tokens to ids and ids back to tokens, and stores the vocabulary for the dataset.

Parameters:
    vocab_file (str) – File containing the vocabulary, one token per line.
    do_lower_case (bool) – Whether to convert text to lower case during data processing.

Methods

tokenize

Tokenizes a piece of text.

__init__(vocab_file, do_lower_case=True)[source]#

tokenize(text)[source]#

Tokenizes a piece of text. Does not convert tokens to ids.
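
A minimal usage sketch (the vocabulary path and the printed output are illustrative assumptions, not taken from the library's documentation):

    from cerebras.modelzoo.data_preparation.nlp.tokenizers.Tokenization import BaseTokenizer

    # Hypothetical vocabulary file: one token per line, e.g. a WordPiece-style vocab.
    tokenizer = BaseTokenizer("/path/to/vocab.txt", do_lower_case=True)

    # tokenize() returns a list of string tokens; it does not convert them to ids.
    tokens = tokenizer.tokenize("Hello, World!")
    print(tokens)  # e.g. ['hello', ',', 'world', '!'] if these tokens are in the vocab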