cerebras.modelzoo.data_preparation.nlp.chunk_data_processing.lm_vsl_data_token_generator.VSLLMDataTokenGenerator#

class cerebras.modelzoo.data_preparation.nlp.chunk_data_processing.lm_vsl_data_token_generator.VSLLMDataTokenGenerator[source]#

Bases: cerebras.modelzoo.data_preparation.nlp.chunk_data_processing.lm_data_token_generator.LMDataTokenGenerator

Processes tokenized text data, specifically for variable sequence length language modeling (VSL LM). Extends LMDataTokenGenerator by handling text tokenization and feature creation, and by optimizing the representation of tokenized data for language modeling tasks.

use_vsl#

Whether to use variable sequence length (VSL) logic.

Type

bool

fold_long_doc#

Whether to fold long documents.

Type

bool

position_ids_dtype#

Data type for position IDs in tokenized output.

Type

str

Parameters
  • params (dict) – Parameters for the dataset and model.

  • tokenizer – Tokenizer instance for text tokenization.

  • eos_id (int) – End-of-sequence ID.

  • pad_id (int) – Padding ID.

Initialize VSLLMDataTokenGenerator with dataset parameters, tokenizer, and token IDs.

Methods

append_within_max_length

Optimizes the representation of tokenized data by merging shorter sequences within the specified maximum sequence length.

encode

Tokenize and encode the data for auto-regressive language modeling.

encode_leftover_prefix

Processes the leftover prefix, which is a list of ndarray tokens, into chunks based on the maximum sequence length.

get_token_id

Get the token ID for the given token.

process_chunks

Processes chunks of tokenized text and returns processed features along with statistics about padding and tokens.

tokenize_text

Tokenize the provided text.

tokenize_text_auto_lm

Tokenizes the given text and creates features suitable for auto-regressive language modeling.

Attributes

use_vsl

__init__(params: dict, tokenizer, eos_id: int, pad_id: int)[source]#

Initialize VSLLMDataTokenGenerator with dataset parameters, tokenizer, and token IDs.
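A minimal construction sketch, assuming a Hugging Face tokenizer and a params layout with "processing" and "dataset" sections; the specific key names shown here are assumptions and may differ from the real config schema used by this pipeline:

# Hedged construction sketch. The params keys ("processing", "dataset",
# "max_seq_length", "use_vsl", "fold_long_doc") are assumptions, not taken
# from this page.
from transformers import AutoTokenizer

from cerebras.modelzoo.data_preparation.nlp.chunk_data_processing.lm_vsl_data_token_generator import (
    VSLLMDataTokenGenerator,
)

tokenizer = AutoTokenizer.from_pretrained("gpt2")

params = {
    "processing": {
        "max_seq_length": 2048,  # assumed key: target packed sequence length
    },
    "dataset": {
        "use_vsl": True,         # assumed key: enable variable-sequence-length packing
        "fold_long_doc": True,   # assumed key: fold documents longer than max_seq_length
    },
}

generator = VSLLMDataTokenGenerator(
    params=params,
    tokenizer=tokenizer,
    eos_id=tokenizer.eos_token_id,
    pad_id=tokenizer.eos_token_id,  # GPT-2 has no pad token; reuse EOS here
)

The later usage sketches on this page reuse this generator object.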

process_chunks(tokenized_data: List[List[Any]]) → Tuple[List[Any], dict][source]#

Processes chunks of tokenized text and returns processed features along with statistics about padding and tokens.

Parameters

tokenized_data (List[List[Any]]) – Tokenized text chunks as a list.

Returns

Processed results and statistics.

Return type

Tuple[List[Any], dict]
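A hedged usage sketch, continuing from the construction sketch above. The exact element layout of the tokenized chunks is an assumption about the upstream pipeline and is not documented on this page:

import numpy as np

# Toy illustration only; real chunks come from earlier pipeline stages.
tokenized_chunks = [
    [np.array([1, 2, 3, 4]), np.array([5, 6])],  # one chunk, two short sequences
]

results, stats = generator.process_chunks(tokenized_chunks)
print(stats)  # padding / token statistics (exact keys may vary)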

tokenize_text_auto_lm(text: str) → Tuple[List[numpy.ndarray], dict][source]#

Tokenizes the given text and creates features suitable for auto-regressive language modeling. Handles end-of-sequence token insertion, sequence length adjustments, and folding of long documents.

Parameters

text (str) – The text to tokenize.

Returns

Tokenized and processed text features and statistics.

Return type

Tuple[List[np.ndarray], dict]
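A brief usage sketch using the generator constructed above; the per-sample feature layout (e.g. input IDs, labels, attention or position information) is an assumption based on typical auto-regressive LM features and is not spelled out on this page:

features, stats = generator.tokenize_text_auto_lm(
    "Variable sequence length packing keeps short documents from wasting pad tokens."
)
for sample in features:
    print(sample.shape)  # each entry is a NumPy array of model-ready features
print(stats)             # tokenization statistics for this text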

append_within_max_length(tokenized_data: List[List[List[Any]]]) → List[List[List[Any]]][source]#

Optimizes the representation of tokenized data by merging shorter sequences within the specified maximum sequence length. Converts the 3D list into a modified 3D structure in which each innermost list is treated as a separate 2D list, then merges these 2D lists when their combined length fits within the maximum sequence length.

Parameters

tokenized_data (List[List[List[Any]]]) – 3D list of tokenized text data.

Returns

Optimized 3D list after merging shorter sequences.

Return type

List[List[List[Any]]]
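For intuition, a simplified, self-contained sketch of the merging idea described above. This is not the library's implementation, just a greedy first-fit illustration of packing short sequences under a length budget:

from typing import Any, List

def pack_short_sequences(
    sequences: List[List[Any]], max_seq_length: int
) -> List[List[List[Any]]]:
    """Greedily merge short sequences into bins of at most max_seq_length tokens."""
    bins: List[List[List[Any]]] = []
    bin_lengths: List[int] = []
    for seq in sequences:
        placed = False
        for i, used in enumerate(bin_lengths):
            if used + len(seq) <= max_seq_length:
                bins[i].append(seq)
                bin_lengths[i] += len(seq)
                placed = True
                break
        if not placed:
            bins.append([seq])
            bin_lengths.append(len(seq))
    return bins

# Example: three short sequences packed under a budget of 8 tokens.
print(pack_short_sequences([[1, 2, 3], [4, 5, 6, 7], [8, 9]], max_seq_length=8))
# -> [[[1, 2, 3], [4, 5, 6, 7]], [[8, 9]]]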

encode(data: str) → Tuple[List[numpy.ndarray], Dict]#

Tokenize and encode the data for auto-regressive language modeling.

Parameters

data (str) – Text data to encode.

Returns

Tuple of encoded features for auto-regressive language modeling and dataset stats.

Return type

Tuple[List[np.ndarray], Dict]
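A hedged end-to-end sketch with the generator constructed above: encode ties tokenization and feature creation together, so a typical call simply feeds raw text and collects features plus running dataset statistics.

text = "A single document to be tokenized and packed for auto-regressive LM."
features, dataset_stats = generator.encode(text)
print(len(features), dataset_stats)  # number of encoded samples and their stats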

encode_leftover_prefix(prefix: List[numpy.ndarray]) → Tuple[List[numpy.ndarray], Dict]#

Processes the leftover prefix, which is a list of ndarray tokens, into chunks based on the maximum sequence length.

The last chunk is handled separately if it is shorter than the maximum sequence length. If the last chunk has fewer than two tokens, it is discarded.

Parameters

prefix (List[np.ndarray]) – The prefix list of token arrays to process.

Returns

A tuple containing the processed token chunks as a list of ndarrays and the dataset stats.

Return type

Tuple[List[np.ndarray], Dict]
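A hedged sketch of flushing a leftover prefix at the end of a file or shard, again using the generator from the construction sketch; the prefix contents here are toy arrays, since the real prefix would be carried over from earlier encode calls.

import numpy as np

# Toy leftover tokens; in practice these come from earlier encode() calls.
leftover = [np.array([11, 12, 13, 14, 15], dtype=np.int32)]

chunks, stats = generator.encode_leftover_prefix(leftover)
print(len(chunks), stats)  # chunks of at most max_seq_length tokens, plus stats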

get_token_id(token: str) → int#

Get the token ID for the given token.

Parameters

token (str) – Token for which the ID is needed.

Returns

Token ID.

Return type

int

tokenize_text(text: str) → List[int]#

Tokenize the provided text.

Parameters

text (str) – Text to tokenize.

Returns

List of token IDs.

Return type

List[int]
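A tiny usage sketch covering both tokenize_text and get_token_id with the generator constructed earlier; the choice of the EOS token string as the lookup argument is an assumption for illustration:

ids = generator.tokenize_text("hello world")        # list of token IDs
eos = generator.get_token_id(tokenizer.eos_token)   # ID of a single token string
print(ids, eos)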