cerebras.modelzoo.data_preparation.nlp.chunk_data_processing.lm_data_token_generator.LMDataTokenGenerator

class cerebras.modelzoo.data_preparation.nlp.chunk_data_processing.lm_data_token_generator.LMDataTokenGenerator

Bases: object

Initialize the LMDataTokenGenerator class.

Parameters
  • vocab_file (str) – Path to the vocabulary file.

  • encoder_file (str) – Path to the encoder file.

  • max_seq_length (int, optional) – Maximum sequence length. Defaults to 2048.

Methods

encode

Tokenize and encode the data for auto-regressive language modeling.

encode_leftover_prefix

Splits the leftover prefix, a list of token ndarrays, into chunks based on the maximum sequence length.

get_token_id

Get the token ID for the given token.

process_chunks

Processes chunks of tokenized text and returns processed features along with the total padding added.

tokenize_text

Tokenize the provided text.

tokenize_text_auto_lm

Tokenize the text and create features for auto-regressive language modeling.

__init__(params, tokenizer, eos_id, pad_id)

Initialize the LMDataTokenGenerator class.

Parameters
  • vocab_file (str) – Path to the vocabulary file.

  • encoder_file (str) – Path to the encoder file.

  • max_seq_length (int, optional) – Maximum sequence length. Defaults to 2048.

tokenize_text(text: str) → List[int]

Tokenize the provided text.

Parameters

text (str) – Text to tokenize.

Returns

List of token IDs.

Return type

List[int]

process_chunks(tokenized_text_chunks: List[List[int]]) → Tuple[List[Any], Dict]

Processes chunks of tokenized text and returns processed features along with the total padding added.

Parameters

tokenized_text_chunks (List[List[int]]) – A list of tokenized text chunks, where each chunk is represented as a list of integers.

Returns

A tuple containing a list of processed results and dataset stats.

Return type

Tuple[List[Any], Dict]
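The input and output shapes of process_chunks can be illustrated in plain Python. This is a minimal sketch under stated assumptions, not the ModelZoo implementation: the helper name pad_chunks, its max_seq_length and pad_id arguments, and the stats key num_pad_tokens are all hypothetical.

```python
from typing import Any, Dict, List, Tuple


def pad_chunks(
    tokenized_text_chunks: List[List[int]],
    max_seq_length: int,
    pad_id: int,
) -> Tuple[List[Any], Dict]:
    """Hypothetical stand-in: pad each chunk and count the padding added."""
    results = []
    stats = {"num_pad_tokens": 0}  # assumed stats key, for illustration only
    for chunk in tokenized_text_chunks:
        chunk = chunk[:max_seq_length]  # truncate chunks that are too long
        num_pad = max_seq_length - len(chunk)
        results.append(chunk + [pad_id] * num_pad)
        stats["num_pad_tokens"] += num_pad
    return results, stats


features, stats = pad_chunks([[5, 6, 7], [8, 9]], max_seq_length=4, pad_id=0)
print(features)
print(stats)
```

The real method may produce richer features (for example masks and labels) and different statistics; the sketch only shows the general pattern of returning processed chunks together with a stats dictionary.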

tokenize_text_auto_lm(text: str) → Tuple[List[numpy.ndarray], Dict]

Tokenize the text and create features for auto-regressive language modeling.

Parameters

text (str) – Text to tokenize.

Returns

Tuple of encoded features for auto-regressive language modeling and dataset stats.

Return type

Tuple[List[np.ndarray], Dict]
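As a rough illustration of what features for auto-regressive language modeling usually look like, the sketch below builds next-token input/label pairs from a flat list of token IDs. It is an assumption-laden example, not the ModelZoo code: the function name make_auto_lm_features, the feature keys, and the stats keys are hypothetical, and plain lists are used in place of numpy arrays.

```python
from typing import Dict, List, Tuple


def make_auto_lm_features(
    token_ids: List[int], max_seq_length: int, pad_id: int
) -> Tuple[List[Dict[str, List[int]]], Dict]:
    """Hypothetical sketch: build shifted input/label pairs from token IDs."""
    features = []
    stats = {"num_tokens": 0, "num_pad_tokens": 0}  # assumed stats keys
    # Each window holds up to max_seq_length + 1 tokens so that every input
    # position has a label: the token that follows it.
    for start in range(0, len(token_ids) - 1, max_seq_length):
        window = token_ids[start : start + max_seq_length + 1]
        input_ids = window[:-1]
        labels = window[1:]
        num_real = len(input_ids)
        num_pad = max_seq_length - num_real
        features.append(
            {
                "input_ids": input_ids + [pad_id] * num_pad,
                "labels": labels + [pad_id] * num_pad,
                "attention_mask": [1] * num_real + [0] * num_pad,
            }
        )
        stats["num_tokens"] += num_real
        stats["num_pad_tokens"] += num_pad
    return features, stats


features, stats = make_auto_lm_features([1, 2, 3, 4, 5, 6], max_seq_length=4, pad_id=0)
for feature in features:
    print(feature)
print(stats)
```

Consecutive windows overlap by one token so that the last token of one window serves as the first input of the next.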

encode(data: str) → Tuple[List[numpy.ndarray], Dict]

Tokenize and encode the data for auto-regressive language modeling.

Parameters

data (str) – Text data to encode.

Returns

Tuple of encoded features for auto-regressive language modeling and dataset stats.

Return type

Tuple[List[np.ndarray], Dict]

encode_leftover_prefix(prefix: List[numpy.ndarray]) → Tuple[List[numpy.ndarray], Dict]

Splits the leftover prefix, a list of token ndarrays, into chunks based on the maximum sequence length.

The last chunk is handled separately if it is shorter than the maximum sequence length; if it has fewer than two tokens, it is discarded.

Parameters

prefix (List[np.ndarray]) – The prefix list of token arrays to process.

Returns

A tuple containing the processed token chunks as a list of ndarrays and the dataset stats.

Return type

Tuple[List[np.ndarray], Dict]
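The chunking rule described for encode_leftover_prefix can be sketched as follows. This is a simplified illustration, not the library's code: chunk_leftover is a hypothetical helper, the prefix is assumed to be already flattened into a single list of token IDs, and no dataset stats are produced.

```python
from typing import List


def chunk_leftover(tokens: List[int], max_seq_length: int) -> List[List[int]]:
    """Hypothetical sketch: split leftover tokens, dropping a too-short tail."""
    chunks = [
        tokens[i : i + max_seq_length]
        for i in range(0, len(tokens), max_seq_length)
    ]
    # Discard a final chunk that has fewer than two tokens.
    if chunks and len(chunks[-1]) < 2:
        chunks.pop()
    return chunks


print(chunk_leftover(list(range(9)), max_seq_length=4))
print(chunk_leftover(list(range(6)), max_seq_length=4))
```

In the first call the trailing single-token chunk is dropped; in the second the two-token tail is kept and would still need padding up to the maximum sequence length.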

get_token_id(token: str) → int

Get the token ID for the given token.

Parameters

token (str) – Token for which the ID is needed.

Returns

Token ID.

Return type

int