cerebras.modelzoo.data_preparation.nlp.tokenizers.BPETokenizer

Byte pair encoding/decoding utilities

Modified from the GPT-2 codebase: https://github.com/openai/gpt-2

Functions

bytes_to_unicode

Returns a list of UTF-8 bytes and a corresponding list of Unicode strings.
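The mapping can be illustrated with a minimal sketch modeled on the GPT-2 reference code that this module says it is modified from. The name `bytes_to_unicode_sketch` is hypothetical, and the assumption that the Cerebras function behaves the same way (including returning the two lists zipped into a dict) is not confirmed by this page.

```python
def bytes_to_unicode_sketch():
    """Hypothetical re-creation of a GPT-2 style byte-to-character table."""
    # Bytes that are already printable, non-whitespace characters
    # keep their own code point.
    printable = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    byte_values = printable[:]
    code_points = printable[:]

    # Every remaining byte (whitespace, control characters, ...) is shifted
    # to an unused code point starting at 256, so all 256 bytes end up with
    # a distinct, visible character.
    shift = 0
    for b in range(256):
        if b not in printable:
            byte_values.append(b)
            code_points.append(256 + shift)
            shift += 1

    return dict(zip(byte_values, (chr(c) for c in code_points)))


mapping = bytes_to_unicode_sketch()
# Encode arbitrary text as a string of "safe" characters, byte by byte.
encoded = "".join(mapping[b] for b in "hi there".encode("utf-8"))
```

Under this scheme a space (byte 32) is not printable, so it is remapped to U+0120, which is why GPT-2 style vocabularies show tokens beginning with that character.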

get_pairs

Returns the set of symbol pairs in a word.
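As a rough illustration, the sketch below collects the pairs of adjacent symbols in a word, which is how the GPT-2 reference code defines this helper. The name `get_pairs_sketch` is hypothetical, and representing the word as a tuple of variable-length strings is an assumption carried over from GPT-2, not something this page states.

```python
def get_pairs_sketch(word):
    """Return the set of adjacent symbol pairs in `word`.

    `word` is assumed to be a sequence of symbols, for example a tuple
    of variable-length strings.
    """
    # zip pairs each symbol with its right-hand neighbor; the set
    # drops duplicates.
    return set(zip(word, word[1:]))


pairs = get_pairs_sketch(("h", "e", "l", "l", "o"))
```

In a BPE merge loop, a set like this is typically ranked against the learned merge table to choose which pair to merge next.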

Classes

BPETokenizer