cerebras.modelzoo.data_preparation.nlp.tokenizers.HFTokenizer.HFTokenizer

class cerebras.modelzoo.data_preparation.nlp.tokenizers.HFTokenizer.HFTokenizer[source]

Bases: object

Designed to integrate HF's Tokenizer library.

Parameters:

    vocab_file (str) – A vocabulary file to create the tokenizer from.

    special_tokens (str or list) – A list or a string representing the special tokens that are to be added to the tokenizer.
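Since `special_tokens` accepts either a single string or a list of strings, a caller (or wrapper code) typically normalizes it to a list before adding the tokens. A minimal sketch of that normalization, assuming the documented "list or a string" contract (the helper name is hypothetical, not part of the class):

```python
def normalize_special_tokens(special_tokens):
    """Accept a single token string, a list of token strings, or None,
    and return a list. Hypothetical helper that mirrors the documented
    'list or a string' contract of HFTokenizer.__init__."""
    if special_tokens is None:
        return []
    if isinstance(special_tokens, str):
        return [special_tokens]
    return list(special_tokens)

# Both call styles yield a uniform list of tokens to register.
print(normalize_special_tokens("<|endoftext|>"))
print(normalize_special_tokens(["<eos>", "<pad>"]))
```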

Methods

add_special_tokens

add_token

decode

encode

get_token

get_token_from_tokenizer_config

This API extracts token information from the tokenizer config JSON file.

get_token_id

set_eos_pad_tokens

Attributes

eos

pad

__init__(vocab_file, special_tokens=None)[source]
get_token_from_tokenizer_config(json_data, token)[source]

This API extracts token information from the tokenizer config JSON file. The token data is assumed to be in one of two formats: either a plain string or a dictionary.
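A sketch of handling those two formats follows. It assumes the common Hugging Face `tokenizer_config.json` convention, where a token entry is either a bare string or a dictionary with a `"content"` key; the function name and the `"content"` key are assumptions, not the class's exact implementation:

```python
import json

def get_token_from_config(json_data, token):
    """Extract a token's string value from parsed tokenizer-config data.
    The entry may be a plain string or a dict holding the string under a
    "content" key (assumed layout, following the HF convention)."""
    entry = json_data.get(token)
    if entry is None:
        return None
    if isinstance(entry, str):
        return entry
    if isinstance(entry, dict):
        return entry.get("content")
    raise ValueError(f"Unsupported format for token {token!r}")

# Example config mixing both formats.
config = json.loads("""
{
  "eos_token": "</s>",
  "pad_token": {"content": "<pad>", "lstrip": false}
}
""")
print(get_token_from_config(config, "eos_token"))  # → </s>
print(get_token_from_config(config, "pad_token"))  # → <pad>
```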