cerebras.modelzoo.data_preparation.data_preprocessing.multimodal_pretraining_token_generator.MultiModalPretrainingTokenGenerator#
- class cerebras.modelzoo.data_preparation.data_preprocessing.multimodal_pretraining_token_generator.MultiModalPretrainingTokenGenerator(params, tokenizer, eos_id, pad_id)[source]#
-
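A minimal construction sketch is shown below; the later method examples on this page reuse the generator object created here. The params layout and the choice of a Hugging Face tokenizer are illustrative assumptions, not taken from this page; in practice both come from your data preprocessing config.
from transformers import AutoTokenizer
from cerebras.modelzoo.data_preparation.data_preprocessing.multimodal_pretraining_token_generator import (
    MultiModalPretrainingTokenGenerator,
)

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # illustrative tokenizer choice
params = {"dataset": {}, "processing": {}}  # hypothetical layout; fill from your preprocessing config
generator = MultiModalPretrainingTokenGenerator(
    params=params,
    tokenizer=tokenizer,
    eos_id=tokenizer.eos_token_id,
    pad_id=tokenizer.eos_token_id,  # GPT-2 has no dedicated pad token, so EOS is reused here
)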
Methods
chop_doc_into_msl
clean_text – Clean the provided text.
encode – Tokenize and encode the doc for text summarization.
encode_leftover_prefix – Processes the leftover prefix, which is a list of ndarray tokens, into chunks based on max sequence length.
get_allowable_token_ids – Generate a list of token IDs that can be masked.
get_data_ranges – Get data ranges for the conversation data.
get_data_stats – Get data statistics from the sample.
get_segment_indices – Get segment indices for the data ranges.
mask_single_sequence – Masks tokens in a single sequence according to the MLM strategy.
parse_semantic_data_array
process_chunks – Processes chunks of tokenized text and returns processed features along with the total padding added.
process_chunks_mlm – Processes chunks of tokenized text and returns processed features along with the total padding added.
process_docs
tokenize_data
- get_data_ranges(semantic_regions, formatted_data)[source]#
Get data ranges for the conversation data.
- Parameters
semantic_regions (List[Dict[str, str]]) – List of semantic regions in the conversation data.
formatted_data (str) – Formatted conversation data.
- Returns
Ranges for system, user, and assistant data.
- Return type
Tuple[List[Tuple[int, int]], List[Tuple[int, int]], List[Tuple[int, int]]]
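An illustrative call, continuing from the construction sketch above. The shape of the semantic_regions entries is an assumption about what the format hook produces, not something documented on this page.
formatted_data = "<user>Describe the image.<assistant>A cat sitting on a mat."
semantic_regions = [  # hypothetical region dicts; match whatever your format hook emits
    {"role": "user", "content": "Describe the image."},
    {"role": "assistant", "content": "A cat sitting on a mat."},
]
system_ranges, user_ranges, assistant_ranges = generator.get_data_ranges(
    semantic_regions, formatted_data
)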
- get_segment_indices(formatted_data, tokenized_data, image_region_list)[source]#
Get segment indices for the data ranges.
- Parameters
data_ranges (Tuple[List[Tuple[int, int]], List[Tuple[int, int]], List[Tuple[int, int]]]) – Data ranges for system, user, and assistant.
offset_mapping (List[Tuple[int, int]]) – Offset mapping of the tokenized data.
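A hedged sketch, continuing from the example above. Treating tokenized_data as a tokenizer output that carries offset mappings, and passing an empty image_region_list, are assumptions rather than behaviour confirmed by this page.
tokenized_data = tokenizer(formatted_data, return_offsets_mapping=True)  # assumed input form
image_region_list = []  # no image regions in this toy example
segment_indices = generator.get_segment_indices(
    formatted_data, tokenized_data, image_region_list
)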
- clean_text(data)#
Clean the provided text.
- Parameters
data (str) – Text to clean.
- Returns
Cleaned text.
- Return type
str
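For example, continuing from the construction sketch; the exact cleaning rules are generator-defined, so the result is only indicative.
cleaned = generator.clean_text("  A caption\twith   stray whitespace \n")
print(cleaned)  # normalized text, returned as a str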
- encode(semantic_data_array)[source]#
Tokenize and encode the doc for text summarization.
- Parameters
semantic_data_array (Dict) – Semantic data dict returned from a format hook.
- Returns
Tuple of encoded features for text summarization and dataset stats
- Return type
Tuple[List[np.ndarray], Dict]
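A hedged sketch, continuing from the construction example. The semantic_data_array layout below is an assumption about what a format hook emits; substitute the structure your hook actually returns.
semantic_data_array = [  # hypothetical format-hook output
    {"type": "text", "content": [{"text": "An example caption for pretraining."}]},
    {"type": "image", "content": [{"image_path": "images/0001.jpg"}]},
]
features, stats = generator.encode(semantic_data_array)  # encoded features and dataset stats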
- encode_leftover_prefix(prefix)#
Processes the leftover prefix which is a list of ndarray tokens into chunks based on max sequence length.
The last chunk is handled separately if it is shorter than the max sequence length. If the last chunk has fewer than two tokens, it is discarded.
- Parameters
prefix (List[np.ndarray]) – The prefix list of token arrays to process.
- Returns
A tuple containing the processed token chunks as a list of ndarrays and the dataset stats.
- Return type
Tuple[Dict[str, Any], Dict[str, int]]
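A short sketch, continuing from the construction example; the token values are arbitrary.
import numpy as np

prefix = [np.array([101, 2023, 2003]), np.array([1037, 7099, 102])]  # leftover prefix tokens
chunks, stats = generator.encode_leftover_prefix(prefix)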
- get_allowable_token_ids()#
Generate a list of token IDs that can be masked.
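For example, continuing from the construction sketch:
allowed_ids = generator.get_allowable_token_ids()  # token IDs eligible for MLM masking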
- get_data_stats(sample, lvt=None)#
Get data statistics from the sample.
- Parameters
sample (np.ndarray) – Tokenized sample.
- Returns
Data statistics.
- Return type
Dict[str, int]
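A hedged example, continuing from the construction sketch; the sample contents are arbitrary and the exact stat keys are generator-defined.
import numpy as np

sample = np.array([101, 2023, 2003, 1037, 7099, 102])  # illustrative tokenized sample
stats = generator.get_data_stats(sample)
print(stats)  # e.g. token/padding counts; the exact keys depend on the generator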
- mask_single_sequence(input_ids)#
Masks tokens in a single sequence according to the MLM strategy. When self.mlm_with_gather is False, the returned labels satisfy len(labels) == len(input_ids); when self.mlm_with_gather is True, len(labels) == self.max_predictions.
- Parameters
input_ids (List[int]) – Original sequence of token IDs.
- Returns
input_ids: Modified sequence with masked tokens.
masked_lm_positions: Positions of the masked tokens, empty if not self.mlm_with_gather.
masked_lm_mask: Binary indicators (1s) for positions that were masked, empty if not self.mlm_with_gather.
labels: Original token IDs of the masked tokens for label purposes.
- Return type
Tuple[List[int], List[int], List[int], List[int]]
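A usage sketch, continuing from the construction example; the token IDs are arbitrary.
input_ids = [101, 2023, 2003, 1037, 7099, 102]  # arbitrary token IDs
masked_ids, masked_positions, masked_mask, labels = generator.mask_single_sequence(input_ids)
# With mlm_with_gather disabled, labels is the same length as input_ids and the
# position/mask lists are empty; with it enabled, their length is max_predictions.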
- process_chunks(tokenized_text_chunks)#
Processes chunks of tokenized text and returns processed features along with the total padding added.
- Parameters
tokenized_text_chunks (List[List[int]]) – A list of tokenized text chunks, where each chunk is represented as a list of integers.
- Returns
A tuple containing a list of processed results and dataset stats.
- Return type
Tuple[List[np.ndarray], Dict[str, int]]
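A hedged example, continuing from the construction sketch; the chunk contents are arbitrary.
tokenized_text_chunks = [[101, 2023, 2003, 102], [101, 1037, 7099, 102]]  # arbitrary chunks
results, stats = generator.process_chunks(tokenized_text_chunks)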
- process_chunks_mlm(tokenized_text_chunks)#
Processes chunks of tokenized text and returns processed features along with the total padding added.
- Parameters
tokenized_text_chunks (List[List[int]]) – A list of tokenized text chunks, where each chunk is represented as a list of integers.
- Returns
A tuple containing a list of processed results and dataset stats.
- Return type
Tuple[List[Any], Dict]
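The MLM variant accepts the same chunk structure; a sketch continuing from the construction example:
tokenized_text_chunks = [[101, 2023, 2003, 102], [101, 1037, 7099, 102]]  # arbitrary chunks
mlm_results, mlm_stats = generator.process_chunks_mlm(tokenized_text_chunks)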