cerebras.modelzoo.data_preparation.nlp.chunk_data_processing.lm_vsl_data_token_generator.VSLLMDataTokenGenerator#
- class cerebras.modelzoo.data_preparation.nlp.chunk_data_processing.lm_vsl_data_token_generator.VSLLMDataTokenGenerator[source]#
Processes tokenized text data for variable sequence length language modeling (VSLLM). Extends LMDataTokenGenerator with text tokenization, feature creation, and an optimized representation of tokenized data for language modeling tasks.
- use_vsl#
Whether variable sequence length (VSL) logic is used.
- Type
bool
- fold_long_doc#
Whether to fold long documents.
- Type
bool
- position_ids_dtype#
Data type for position IDs in tokenized output.
- Type
str
- Parameters
params (dict) – Parameters for the dataset and model.
tokenizer – Tokenizer instance for text tokenization.
eos_id (int) – End-of-sequence ID.
pad_id (int) – Padding ID.
Initialize VSLLMDataTokenGenerator with dataset parameters, tokenizer, and token IDs.
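A minimal construction sketch. The params layout below and the use of a Hugging Face tokenizer are assumptions for illustration; consult the dataset configuration for the real schema.

    from transformers import AutoTokenizer
    from cerebras.modelzoo.data_preparation.nlp.chunk_data_processing.lm_vsl_data_token_generator import (
        VSLLMDataTokenGenerator,
    )

    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    params = {
        # Hypothetical keys: the real schema comes from the dataset config.
        "dataset": {"use_vsl": True},
        "processing": {"max_seq_length": 2048},
    }
    generator = VSLLMDataTokenGenerator(
        params=params,
        tokenizer=tokenizer,
        eos_id=tokenizer.eos_token_id,
        pad_id=tokenizer.eos_token_id,  # GPT-2 has no pad token; EOS is reused
    )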
Methods
append_within_max_length: Optimizes representation of tokenized data by merging shorter sequences within the specified maximum sequence length.
encode: Tokenize and encode the data for auto-regressive language modeling.
encode_leftover_prefix: Processes the leftover prefix, a list of ndarray tokens, into chunks based on max sequence length.
get_token_id: Get the token ID for the given token.
process_chunks: Processes chunks of tokenized text and returns processed features along with statistics about padding and tokens.
tokenize_text: Tokenize the provided text.
tokenize_text_auto_lm: Tokenizes the given text and creates features suitable for auto-regressive language modeling.
Attributes
use_vsl, fold_long_doc, position_ids_dtype
- __init__(params: dict, tokenizer, eos_id: int, pad_id: int)[source]#
Initialize VSLLMDataTokenGenerator with dataset parameters, tokenizer, and token IDs.
- process_chunks(tokenized_data: List[List[Any]]) → Tuple[List[Any], dict][source]#
Processes chunks of tokenized text and returns processed features along with statistics about padding and tokens.
- Parameters
tokenized_data (List[List[Any]]) – Tokenized text chunks as a list.
- Returns
Processed results and statistics.
- Return type
Tuple[List[Any], dict]
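A hedged usage sketch, assuming the generator from the construction example above; wrapping the tokenized output in an outer list is an assumption about how chunks are batched:

    tokenized, _ = generator.tokenize_text_auto_lm("Some raw document text.")
    results, stats = generator.process_chunks([tokenized])  # one chunk per list entry
    # `stats` reports padding and token counts for the processed chunks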
- tokenize_text_auto_lm(text: str) → Tuple[List[numpy.ndarray], dict][source]#
Tokenizes the given text and creates features suitable for auto-regressive language modeling. Handles end-of-sequence addition, sequence length adjustments, and document folding for long documents.
- Parameters
text (str) – The text to tokenize.
- Returns
Tokenized and processed text features and statistics.
- Return type
Tuple[List[np.ndarray], dict]
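A short sketch of direct use; the exact feature layout inside the returned arrays is determined by the class and is not spelled out here:

    features, stats = generator.tokenize_text_auto_lm("A long document ...")
    # `features` is a list of numpy arrays; with fold_long_doc enabled, a long
    # document may be folded into several max-sequence-length pieces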
- append_within_max_length(tokenized_data: List[List[List[Any]]]) → List[List[List[Any]]][source]#
Optimizes representation of tokenized data by merging shorter sequences within the specified maximum sequence length. Converts the 3D list to a modified 3D structure in which each innermost list is treated as a separate 2D list; these 2D lists are then merged whenever their combined length fits within the max sequence length (see the packing sketch below).
- Parameters
tokenized_data (List[List[List[Any]]]) – 3D list of tokenized text data.
- Returns
Optimized 3D list after merging shorter sequences.
- Return type
List[List[List[Any]]]
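The merging step amounts to greedy packing: walk the sequences in order and group neighbours while the combined token count stays within the limit. A self-contained sketch of that idea (illustrative only, not the class's internal code):

    from typing import Any, List

    def greedy_pack(
        sequences: List[List[Any]], max_seq_length: int
    ) -> List[List[List[Any]]]:
        """Group sequences so each group's total length stays <= max_seq_length."""
        packed: List[List[List[Any]]] = []
        current: List[List[Any]] = []
        current_len = 0
        for seq in sequences:
            if current and current_len + len(seq) > max_seq_length:
                packed.append(current)  # current group is full; start a new one
                current, current_len = [], 0
            current.append(seq)
            current_len += len(seq)
        if current:
            packed.append(current)
        return packed

    print(greedy_pack([[1, 2, 3], [4, 5], [6, 7, 8, 9]], max_seq_length=6))
    # [[[1, 2, 3], [4, 5]], [[6, 7, 8, 9]]]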
- encode(data: str) → Tuple[List[numpy.ndarray], Dict]#
Tokenize and encode the data for auto-regressive language modeling.
- Parameters
data (str) – Text data to encode.
- Returns
Tuple of encoded features for auto-regressive language modeling and dataset stats.
- Return type
Tuple[List[np.ndarray], Dict]
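The typical entry point, again assuming the generator from the construction sketch:

    features, stats = generator.encode("The quick brown fox jumps over the lazy dog.")
    # `features`: model-ready numpy arrays; `stats`: dataset statistics such as
    # token and padding counts (exact keys depend on the implementation)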
- encode_leftover_prefix(prefix: List[numpy.ndarray]) → Tuple[List[numpy.ndarray], Dict]#
Processes the leftover prefix, a list of ndarray tokens, into chunks based on max sequence length.
The last chunk is handled specially if it is shorter than the max sequence length; if it has fewer than two tokens, it is discarded.
- Parameters
prefix (List[np.ndarray]) – The prefix list of token arrays to process.
- Returns
A tuple containing the processed token chunks as a list of ndarrays and the dataset stats.
- Return type
Tuple[List[np.ndarray], Dict]
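A sketch with a hand-made prefix; in practice the prefix would be carried over from a previous encoding pass:

    import numpy as np

    prefix = [np.array([11, 12, 13]), np.array([14, 15])]  # illustrative token arrays
    chunks, stats = generator.encode_leftover_prefix(prefix)
    # a final chunk holding fewer than two tokens would have been discarded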
- get_token_id(token: str) → int#
Get the token ID for the given token.
- Parameters
token (str) – Token for which the ID is needed.
- Returns
Token ID.
- Return type
int
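For example (the token string is tokenizer-dependent; "<|endoftext|>" is the GPT-2 convention):

    eos_id = generator.get_token_id("<|endoftext|>")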
- tokenize_text(text: str) → List[int]#
Tokenize the provided text.
- Parameters
text (str) – Text to tokenize.
- Returns
List of token IDs.
- Return type
List[int]
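Unlike encode, this returns plain token IDs with no auto-regressive features:

    ids = generator.tokenize_text("Hello world")
    # the IDs depend entirely on the tokenizer's vocabulary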