cerebras.modelzoo.data_preparation.nlp.chunk_data_processing.lm_vsl_data_token_generator.VSLLMDataTokenGenerator#

class cerebras.modelzoo.data_preparation.nlp.chunk_data_processing.lm_vsl_data_token_generator.VSLLMDataTokenGenerator[source]#

Bases: cerebras.modelzoo.data_preparation.nlp.chunk_data_processing.lm_data_token_generator.LMDataTokenGenerator

Processes tokenized text data, specifically for variable sequence length language modeling (VSL LM). Extends LMDataTokenGenerator by handling text tokenization and feature creation, and by optimizing the representation of tokenized data for language modeling tasks.

use_vsl#

Whether to use variable sequence length (VSL) logic.

Type

bool

fold_long_doc#

Whether to fold long documents.

Type

bool

position_ids_dtype#

Data type for position IDs in tokenized output.

Type

str

Parameters
  • params (dict) – Parameters for the dataset and model.

  • tokenizer – Tokenizer instance for text tokenization.

  • eos_id (int) – End-of-sequence ID.

  • pad_id (int) – Padding ID.

Initialize VSLLMDataTokenGenerator with dataset parameters, tokenizer, and token IDs.

Methods

append_within_max_length

Optimizes the representation of tokenized data by merging shorter sequences within the specified maximum sequence length.

encode

Tokenize and encode the data for auto-regressive language modeling.

encode_leftover_prefix

Processes the leftover prefix, which is a list of ndarray tokens, into chunks based on the maximum sequence length.

get_token_id

Get the token ID for the given token.

process_chunks

Processes chunks of tokenized text and returns processed features along with statistics about padding and tokens.

tokenize_text

Tokenize the provided text.

tokenize_text_auto_lm

Tokenizes the given text and creates features suitable for auto-regressive language modeling.

Attributes

use_vsl

__init__(params: dict, tokenizer, eos_id: int, pad_id: int)[source]#

Initialize VSLLMDataTokenGenerator with dataset parameters, tokenizer, and token IDs.
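A minimal construction sketch, assuming a Hugging Face tokenizer and a params layout with "processing" and "dataset" sections; the specific key names shown here are assumptions and may differ from the real config schema used by this pipeline:

# Hedged construction sketch. The params keys ("processing", "dataset",
# "max_seq_length", "use_vsl", "fold_long_doc") are assumptions, not taken
# from this page.
from transformers import AutoTokenizer

from cerebras.modelzoo.data_preparation.nlp.chunk_data_processing.lm_vsl_data_token_generator import (
    VSLLMDataTokenGenerator,
)

tokenizer = AutoTokenizer.from_pretrained("gpt2")

params = {
    "processing": {
        "max_seq_length": 2048,  # assumed key: target packed sequence length
    },
    "dataset": {
        "use_vsl": True,         # assumed key: enable variable-sequence-length packing
        "fold_long_doc": True,   # assumed key: fold documents longer than max_seq_length
    },
}

generator = VSLLMDataTokenGenerator(
    params=params,
    tokenizer=tokenizer,
    eos_id=tokenizer.eos_token_id,
    pad_id=tokenizer.eos_token_id,  # GPT-2 has no pad token; reuse EOS here
)

The later usage sketches on this page reuse this generator object.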

process_chunks(tokenized_data: List[List[Any]]) → Tuple[List[Any], dict][source]#

Processes chunks of tokenized text and returns processed features along with statistics about padding and tokens.

Parameters

tokenized_data (List[List[Any]]) – Tokenized text chunks as a list.

Returns

Processed results and statistics.

Return type

Tuple[List[Any], dict]
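A hedged usage sketch, continuing from the construction sketch above. The exact element layout of the tokenized chunks is an assumption about the upstream pipeline and is not documented on this page:

import numpy as np

# Toy illustration only; real chunks come from earlier pipeline stages.
tokenized_chunks = [
    [np.array([1, 2, 3, 4]), np.array([5, 6])],  # one chunk, two short sequences
]

results, stats = generator.process_chunks(tokenized_chunks)
print(stats)  # padding / token statistics (exact keys may vary)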

tokenize_text_auto_lm(text: str) → Tuple[List[numpy.ndarray], dict][source]#

Tokenizes the given text and creates features suitable for auto-regressive language modeling. Handles end-of-sequence token insertion, sequence length adjustments, and folding of long documents.

Parameters

text (str) – The text to tokenize.

Returns

Tokenized and processed text features and statistics.

Return type

Tuple[List[np.ndarray], dict]
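A brief usage sketch using the generator constructed above; the per-sample feature layout (e.g. input IDs, labels, attention or position information) is an assumption based on typical auto-regressive LM features and is not spelled out on this page:

features, stats = generator.tokenize_text_auto_lm(
    "Variable sequence length packing keeps short documents from wasting pad tokens."
)
for sample in features:
    print(sample.shape)  # each entry is a NumPy array of model-ready features
print(stats)             # tokenization statistics for this text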

append_within_max_length(tokenized_data: List[List[List[Any]]]) → List[List[List[Any]]][source]#

Optimizes the representation of tokenized data by merging shorter sequences within the specified maximum sequence length. Converts the 3D list into a modified 3D structure in which each innermost list is treated as a separate 2D list, then merges these 2D lists when their combined length fits within the maximum sequence length.

Parameters

tokenized_data (List[List[List[Any]]]) – 3D list of tokenized text data.

Returns

Optimized 3D list after merging shorter sequences.

Return type

List[List[List[Any]]]
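For intuition, a simplified, self-contained sketch of the merging idea described above. This is not the library's implementation, just a greedy first-fit illustration of packing short sequences under a length budget:

from typing import Any, List

def pack_short_sequences(
    sequences: List[List[Any]], max_seq_length: int
) -> List[List[List[Any]]]:
    """Greedily merge short sequences into bins of at most max_seq_length tokens."""
    bins: List[List[List[Any]]] = []
    bin_lengths: List[int] = []
    for seq in sequences:
        placed = False
        for i, used in enumerate(bin_lengths):
            if used + len(seq) <= max_seq_length:
                bins[i].append(seq)
                bin_lengths[i] += len(seq)
                placed = True
                break
        if not placed:
            bins.append([seq])
            bin_lengths.append(len(seq))
    return bins

# Example: three short sequences packed under a budget of 8 tokens.
print(pack_short_sequences([[1, 2, 3], [4, 5, 6, 7], [8, 9]], max_seq_length=8))
# -> [[[1, 2, 3], [4, 5, 6, 7]], [[8, 9]]]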

encode(data: str) → Tuple[List[numpy.ndarray], Dict]#

Tokenize and encode the data for auto-regressive language modeling.

Parameters

data (str) – Text data to encode.

Returns

Tuple of encoded features for auto-regressive language modeling and dataset stats.

Return type

Tuple[List[np.ndarray], Dict]
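A hedged end-to-end sketch with the generator constructed above: encode ties tokenization and feature creation together, so a typical call simply feeds raw text and collects features plus running dataset statistics.

text = "A single document to be tokenized and packed for auto-regressive LM."
features, dataset_stats = generator.encode(text)
print(len(features), dataset_stats)  # number of encoded samples and their stats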

encode_leftover_prefix(prefix: List[numpy.ndarray]) → Tuple[List[numpy.ndarray], Dict]#

Processes the leftover prefix, which is a list of ndarray tokens, into chunks based on the maximum sequence length.

The last chunk is handled separately if it is shorter than the maximum sequence length. If the last chunk has fewer than two tokens, it is discarded.

Parameters

prefix (List[np.ndarray]) – The prefix list of token arrays to process.

Returns

A tuple containing the processed token chunks as a list of ndarrays and the dataset stats.

Return type

Tuple[List[np.ndarray], Dict]
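A hedged sketch of flushing a leftover prefix at the end of a file or shard, again using the generator from the construction sketch; the prefix contents here are toy arrays, since the real prefix would be carried over from earlier encode calls.

import numpy as np

# Toy leftover tokens; in practice these come from earlier encode() calls.
leftover = [np.array([11, 12, 13, 14, 15], dtype=np.int32)]

chunks, stats = generator.encode_leftover_prefix(leftover)
print(len(chunks), stats)  # chunks of at most max_seq_length tokens, plus stats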

get_token_id(token: str) → int#

Get the token ID for the given token.

Parameters

token (str) – Token for which the ID is needed.

Returns

Token ID.

Return type

int

tokenize_text(text: str) → List[int]#

Tokenize the provided text.

Parameters

text (str) – Text to tokenize.

Returns

List of token IDs.

Return type

List[int]
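A tiny usage sketch covering both tokenize_text and get_token_id with the generator constructed earlier; the choice of the EOS token string as the lookup argument is an assumption for illustration:

ids = generator.tokenize_text("hello world")        # list of token IDs
eos = generator.get_token_id(tokenizer.eos_token)   # ID of a single token string
print(ids, eos)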