cerebras.modelzoo.data_preparation.nlp.tokenizers.Tokenization.FullTokenizer#

class cerebras.modelzoo.data_preparation.nlp.tokenizers.Tokenization.FullTokenizer[source]#

Bases: object

Class for full tokenization of a piece of text. Calls BaseTokenizer and the WordPiece tokenizer to perform basic grammar operations and wordpiece splits. :param str vocab_file: File containing the vocabulary, one token per line :param bool do_lower_case: Specifies whether to convert text to lower case during data processing

Methods

convert_ids_to_tokens

Converts a list of ids to a list of tokens. All inputs are shifted by 1 because the id-to-token dictionary formed by the Keras Tokenizer starts at index 1 instead of 0.

convert_tokens_to_ids

Converts a list of tokens to a list of ids. All outputs are shifted by 1 because the dictionary formed by the Keras Tokenizer starts at index 1 instead of 0.

get_vocab_words

Returns a list of the words in the vocabulary.

tokenize

Performs basic tokenization followed by WordPiece tokenization on a piece of text.

__init__(vocab_file, do_lower_case=True)[source]#
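
A minimal usage sketch, assuming a WordPiece-style vocabulary file (the path below is hypothetical; any file with one token per line works):

>>> from cerebras.modelzoo.data_preparation.nlp.tokenizers.Tokenization import FullTokenizer
>>> # hypothetical vocabulary file, one token per line
>>> tokenizer = FullTokenizer("/path/to/vocab.txt", do_lower_case=True)
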
convert_tokens_to_ids(text)[source]#

Converts a list of tokens to a list of ids. All outputs are shifted by 1 because the dictionary formed by the Keras Tokenizer starts at index 1 instead of 0.
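
An illustrative sketch; the returned ids depend entirely on the vocabulary file, so the values shown are placeholders:

>>> tokens = ["hello", "world"]
>>> ids = tokenizer.convert_tokens_to_ids(tokens)
>>> ids  # placeholder values; actual ids depend on the vocabulary
[7592, 2088]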

convert_ids_to_tokens(text)[source]#

Converts a list of ids to a list of tokens. All inputs are shifted by 1 because the id-to-token dictionary formed by the Keras Tokenizer starts at index 1 instead of 0.
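
Continuing the sketch above, ids round-trip back to tokens:

>>> tokenizer.convert_ids_to_tokens(ids)
['hello', 'world']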

tokenize(text)[source]#

Performs basic tokenization followed by WordPiece tokenization on a piece of text. Does not convert tokens to ids.
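
A sketch of the output shape; the exact splits depend on the vocabulary, with ## marking WordPiece continuation pieces (assuming do_lower_case=True and a vocabulary that lacks "unaffable" as a whole word):

>>> tokenizer.tokenize("He was unaffable.")
['he', 'was', 'un', '##aff', '##able', '.']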

get_vocab_words()[source]#

Returns a list of the words in the vocabulary.
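
A small sketch for vocabulary inspection and membership checks (the size shown is a placeholder):

>>> vocab = tokenizer.get_vocab_words()
>>> "hello" in vocab  # assuming the vocabulary contains this token
True
>>> len(vocab)  # placeholder; size depends on the vocabulary file
30522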