cerebras.modelzoo.data_preparation.data_preprocessing.multimodal_finetuning_token_generator.MultiModalFinetuningTokenGenerator#
- class cerebras.modelzoo.data_preparation.data_preprocessing.multimodal_finetuning_token_generator.MultiModalFinetuningTokenGenerator(params, tokenizer, eos_id, pad_id)[source]#
-
Methods
Clean the provided text.
Tokenize and encode the doc for text summarization.
Get data ranges for the conversation data.
Get data statistics from the sample.
get_tokenized_semantic_regions
parse_semantic_data_array
tokenize_data
- clean_text(data)#
Clean the provided text.
- Parameters
data (str) – Text to clean.
- Returns
Cleaned text.
- Return type
str
- get_data_ranges(semantic_regions, formatted_data)#
Get data ranges for the conversation data.
- Parameters
conversation_data (List[Dict[str, str]]) – List of conversation data.
formatted_data (str) – Formatted conversation data.
- Returns
Ranges for system, user, and assistant data.
- Return type
Tuple[List[Tuple[int, int]], List[Tuple[int, int]], List[Tuple[int, int]]]
- get_data_stats(sample)#
Get data statistics from the sample.
- Parameters
sample (np.ndarray) – Tokenized sample.
- Returns
Data statistics.
- Return type
Dict[str, int]