cerebras.modelzoo.data_preparation.data_preprocessing.custom_tokenizer_example.CustomLlama3Tokenizer.CustomLlama3Tokenizer#

class cerebras.modelzoo.data_preparation.data_preprocessing.custom_tokenizer_example.CustomLlama3Tokenizer.CustomLlama3Tokenizer(pretrained_model_name_or_path, eos_token_id=None, pad_token_id=None, **kwargs)[source]#

Bases: object

Custom implementation of the Llama3 tokenizer. It overrides compute_offsets of the HuggingFace tokenizer, whose offset computation is buggy (see https://github.com/huggingface/tokenizers/issues/1553).

Parameters
  • pretrained_model_name_or_path (str) – The pretrained model name or path.

  • eos_token_id (Union[int, None], optional) – The ID of the end-of-sequence token. Defaults to None.

  • pad_token_id (Union[int, None], optional) – The ID of the padding token. Defaults to None.

  • **kwargs (Any) – Additional keyword arguments to be passed to AutoTokenizer.
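
Example

A minimal construction sketch, assuming a Llama3 checkpoint identifier accepted by AutoTokenizer; the model name and token IDs below are illustrative, not prescribed by this class:

  from cerebras.modelzoo.data_preparation.data_preprocessing.custom_tokenizer_example.CustomLlama3Tokenizer import CustomLlama3Tokenizer

  # Illustrative checkpoint and token IDs; substitute your own.
  tokenizer = CustomLlama3Tokenizer(
      "meta-llama/Meta-Llama-3-8B",
      eos_token_id=128001,
      pad_token_id=128001,
  )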

tokenizer#

The AutoTokenizer instance for the given pretrained model.

Type

AutoTokenizer

eos_token_id#

The ID of the end-of-sequence token.

Type

int

pad_token_id#

The ID of the padding token.

Type

int

Methods

compute_offsets

Compute offsets for the given encoded input.

compute_offsets(encoded, return_offsets_mapping=False)[source]#

Compute offsets for the given encoded input.

Parameters
  • encoded (Dict[str, Any]) – The encoded input containing ‘input_ids’ and ‘offset_mapping’.

  • return_offsets_mapping (bool, optional) – Whether to return the offsets mapping. Defaults to False.

Returns

A list of tuples representing the start and end offsets for each token.

Return type

List[Tuple[int, int]]
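
Example

A sketch of calling compute_offsets directly, assuming a tokenizer constructed as above and an encoding produced with return_offsets_mapping=True so that 'offset_mapping' is present in the input:

  encoded = tokenizer("Hello, world!", return_offsets_mapping=True)

  # Recompute the corrected (start, end) character span for each token.
  offsets = tokenizer.compute_offsets(encoded, return_offsets_mapping=True)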

__call__(text, **kwargs)[source]#

Encode the given text into tokens and optionally return the offsets mapping.

Parameters
  • text (str) – The input text to tokenize.

  • **kwargs (Any) – Additional keyword arguments for tokenization.

Returns

The encoded result containing ‘input_ids’, ‘attention_mask’, and optionally ‘offset_mapping’.

Return type

Dict[str, Any]
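
Example

A sketch of a typical call, assuming keyword arguments are forwarded to the underlying AutoTokenizer in the usual HuggingFace style and that a single string yields flat lists:

  text = "The quick brown fox"
  encoded = tokenizer(text, return_offsets_mapping=True)

  input_ids = encoded["input_ids"]
  offsets = encoded["offset_mapping"]  # corrected offsets from compute_offsets

  # Map each token back to its source text span.
  spans = [text[start:end] for (start, end) in offsets]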