cerebras.modelzoo.data_preparation.data_preprocessing.multimodal_pretraining_token_generator.MultiModalPretrainingTokenGenerator#

class cerebras.modelzoo.data_preparation.data_preprocessing.multimodal_pretraining_token_generator.MultiModalPretrainingTokenGenerator(params, tokenizer, eos_id, pad_id)[source]#

Bases: cerebras.modelzoo.data_preparation.data_preprocessing.pretraining_token_generator.PretrainingTokenGenerator
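
Example (a minimal construction sketch; the schema of params is not documented on this page, and the Hugging Face tokenizer is an illustrative stand-in):

    from transformers import AutoTokenizer

    from cerebras.modelzoo.data_preparation.data_preprocessing.multimodal_pretraining_token_generator import (
        MultiModalPretrainingTokenGenerator,
    )

    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    # params carries the dataset/processing configuration; its required keys
    # are defined by the preprocessing pipeline and are not listed on this page.
    params = {"dataset": {}, "processing": {}}  # illustrative placeholder only
    generator = MultiModalPretrainingTokenGenerator(
        params=params,
        tokenizer=tokenizer,
        eos_id=tokenizer.eos_token_id,
        pad_id=tokenizer.eos_token_id,  # GPT-2 has no pad token; EOS is reused
    )

The generator instance above is reused in the method sketches that follow.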

Methods

chop_doc_into_msl

Chop a document into chunks of at most the max sequence length (MSL).

clean_text

Clean the provided text.

encode

Tokenize and encode the document for multimodal pretraining.

encode_leftover_prefix

Processes the leftover prefix, a list of ndarray token arrays, into chunks based on the max sequence length.

get_allowable_token_ids

Generate a list of token IDs that can be masked.

get_data_ranges

Get data ranges for the conversation data.

get_data_stats

Get data statistics from the sample.

get_segment_indices

Get segment indices for the data ranges.

mask_single_sequence

Masks tokens in a single sequence according to the MLM strategy.

parse_semantic_data_array

process_chunks

Processes chunks of tokenized text and returns processed features along with the total padding added.

process_chunks_mlm

Processes chunks of tokenized text using the MLM strategy and returns processed features along with the total padding added.

process_docs

tokenize_data

get_data_ranges(semantic_regions, formatted_data)[source]#

Get data ranges for the conversation data.

Parameters
  • semantic_regions (List[Dict[str, str]]) – List of semantic regions in the conversation data.

  • formatted_data (str) – Formatted conversation data.

Returns

Ranges for system, user, and assistant data.

Return type

Tuple[List[Tuple[int, int]], List[Tuple[int, int]], List[Tuple[int, int]]]
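
Usage sketch, continuing from the construction example above. Both inputs come from earlier pipeline steps (the format hook produces formatted_data and its semantic regions), so they are shown here as placeholders; the spans are presumably character offsets into formatted_data:

    # semantic_regions and formatted_data are placeholders produced upstream
    system_ranges, user_ranges, assistant_ranges = generator.get_data_ranges(
        semantic_regions, formatted_data
    )
    # Each list holds (start, end) spans for the corresponding role.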

get_segment_indices(formatted_data, tokenized_data, image_region_list)[source]#

Get segment indices for the data ranges.

Parameters
  • formatted_data (str) – Formatted conversation data.

  • tokenized_data – Tokenized form of formatted_data, including the offset mapping of each token.

  • image_region_list – List of image regions within the formatted data.

clean_text(data)#

Clean the provided text.

Parameters

data (str) – Text to clean.

Returns

Cleaned text.

Return type

str
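
Continuing the construction sketch above (the exact normalization rules applied by clean_text are not documented on this page):

    cleaned = generator.clean_text("Some  raw\xa0text\n\nwith stray whitespace.")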

encode(semantic_data_array)[source]#

Tokenize and encode the document for multimodal pretraining.

Parameters

semantic_data_array (List[Dict]) – Semantic data dicts returned from a format hook.

Returns

Tuple of encoded features and dataset stats.

Return type

Tuple[List[np.ndarray], Dict]
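
The structure of a semantic data array comes from the format hook and is not specified on this page; the single-entry input below is a hypothetical sketch of that shape, reusing the generator from the construction example:

    semantic_data_array = [
        # hypothetical schema; real entries are produced by a format hook
        {"type": "text", "content": [{"text": "A caption describing the image."}]},
    ]
    encoded_features, dataset_stats = generator.encode(semantic_data_array)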

encode_leftover_prefix(prefix)#

Processes the leftover prefix, a list of ndarray token arrays, into chunks based on the max sequence length.

The last chunk is handled separately if it is shorter than the max sequence length; if it has fewer than two tokens, it is discarded.

Parameters

prefix (List[np.ndarray]) – The prefix list of token arrays to process.

Returns

A tuple containing the processed features and the dataset stats.

Return type

Tuple[Dict[str, Any], Dict[str, int]]
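
Usage sketch with toy token IDs, continuing from the construction example:

    import numpy as np

    leftover_prefix = [np.array([101, 2023, 2003]), np.array([1037, 7099, 102])]
    processed, stats = generator.encode_leftover_prefix(leftover_prefix)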

get_allowable_token_ids()#

Generate a list of token IDs that can be masked.

get_data_stats(sample, lvt=None)#

Get data statistics from the sample.

Parameters

sample (np.ndarray) – Tokenized sample.

Returns

Data statistics.

Return type

Dict[str, int]
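
Usage sketch with a toy tokenized sample (the lvt argument is undocumented here and left at its default):

    import numpy as np

    sample = np.array([101, 2023, 2003, 102, 0, 0])  # toy sample with trailing padding
    stats = generator.get_data_stats(sample)  # e.g. token/padding counts; exact keys not listed here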

mask_single_sequence(input_ids)#

Masks tokens in a single sequence according to the MLM strategy. When self.mlm_with_gather is False, the returned labels satisfy len(labels) == len(input_ids); when self.mlm_with_gather is True, len(labels) == self.max_predictions.

Parameters

input_ids (List[int]) – Original sequence of token IDs.

Returns

  • input_ids: Modified sequence with masked tokens.

  • masked_lm_positions: Positions of the masked tokens, empty if not self.mlm_with_gather.

  • masked_lm_mask: Binary indicators (1s) for positions that were masked, empty if not self.mlm_with_gather.

  • labels: Original token IDs of the masked tokens for label purposes.

Return type

Tuple[List[int], List[int], List[int], List[int]]
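
Usage sketch, continuing from the construction example; whether masked_lm_positions and masked_lm_mask are populated depends on self.mlm_with_gather, as described above:

    token_ids = [101, 2023, 2003, 1037, 7099, 102]  # toy sequence of token IDs
    input_ids, masked_positions, masked_mask, labels = generator.mask_single_sequence(token_ids)
    # mlm_with_gather=False: len(labels) == len(input_ids)
    # mlm_with_gather=True:  len(labels) == self.max_predictions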

process_chunks(tokenized_text_chunks)#

Processes chunks of tokenized text and returns processed features along with the total padding added.

Parameters

tokenized_text_chunks (List[List[int]]) – A list of tokenized text chunks, where each chunk is represented as a list of integers.

Returns

A tuple containing a list of processed results and dataset stats.

Return type

Tuple[List[np.ndarray], Dict[str, int]]
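
Usage sketch with two toy chunks, continuing from the construction example:

    tokenized_text_chunks = [[101, 2023, 2003, 102], [101, 1037, 7099, 102]]
    results, stats = generator.process_chunks(tokenized_text_chunks)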

process_chunks_mlm(tokenized_text_chunks)#

Processes chunks of tokenized text using the MLM strategy and returns processed features along with the total padding added.

Parameters

tokenized_text_chunks (List[List[int]]) – A list of tokenized text chunks, where each chunk is represented as a list of integers.

Returns

A tuple containing a list of processed results and dataset stats.

Return type

Tuple[List[Any], Dict]