cerebras.modelzoo.data_preparation.nlp.chunk_data_processing.mlm_data_token_generator#
MLMTokenGenerator Module
This module provides the MLMTokenGenerator class, which prepares tokenized data for training with the Masked Language Modeling (MLM) objective commonly used to pre-train transformers such as BERT. The class uses a tokenizer object (compatible with Hugging Face's transformers library) to tokenize text data and create the masked-token features essential for MLM training. It supports dynamic masking, in which a configurable fraction of tokens is masked at random, facilitating the training of deep learning models on tasks that require an understanding of context and word relationships.
The MLMTokenGenerator handles tokenization, applies the MLM masking strategy, and prepares appropriate outputs for model training, including the indices of masked tokens and their original values, which are used as labels in MLM.
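The dynamic-masking strategy described above can be sketched with the standard BERT-style rule (80% replaced by [MASK], 10% replaced by a random token, 10% left unchanged). This is an illustrative reimplementation under assumed defaults (`mask_id`, `vocab_size`, and the special-token IDs are placeholders), not the MLMTokenGenerator source:

```python
import random

def mask_tokens(token_ids, mlm_fraction=0.15, max_predictions=20,
                mask_id=103, vocab_size=30522,
                special_ids=frozenset({0, 101, 102}), seed=0):
    """Illustrative BERT-style dynamic masking (assumed defaults, not the
    MLMTokenGenerator implementation).

    Returns the masked sequence, the masked positions, and the original
    tokens at those positions (used as MLM labels).
    """
    rng = random.Random(seed)
    # Only non-special tokens are candidates for masking.
    candidates = [i for i, t in enumerate(token_ids) if t not in special_ids]
    rng.shuffle(candidates)
    num_to_mask = min(max_predictions,
                      max(1, round(len(candidates) * mlm_fraction)))
    positions = sorted(candidates[:num_to_mask])

    masked = list(token_ids)
    labels = []
    for pos in positions:
        labels.append(token_ids[pos])  # original token becomes the label
        p = rng.random()
        if p < 0.8:
            masked[pos] = mask_id           # 80%: replace with [MASK]
        elif p < 0.9:
            masked[pos] = rng.randrange(vocab_size)  # 10%: random token
        # else: 10%: keep the original token
    return masked, positions, labels
```

Keeping 10% of selected tokens unchanged forces the model to produce useful representations for every position, since it cannot rely on the presence of [MASK] to signal which tokens are corrupted.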
- Usage:

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
mlm_generator = MLMTokenGenerator(
    params={'dataset': {}, 'processing': {'max_seq_length': 512, 'mlm_fraction': 0.15}},
    tokenizer=tokenizer,
    eos_id=tokenizer.eos_token_id,
    pad_id=tokenizer.pad_token_id,
)
data = ["This is a sample input sentence", "This is another example"]
tokenized_data = tokenizer(data, padding=True, return_tensors='pt', max_length=512)
input_ids = tokenized_data['input_ids'].tolist()
masked_input_ids, masked_positions, masked_labels, original_labels = mlm_generator.mask_single_sequence(input_ids[0])
```
- cerebras.modelzoo.data_preparation.nlp.chunk_data_processing.mlm_data_token_generator.tokenizer#
A tokenizer object, instance of transformers.PreTrainedTokenizer.
- Type
PreTrainedTokenizer
- cerebras.modelzoo.data_preparation.nlp.chunk_data_processing.mlm_data_token_generator.eos_id#
Token ID used to signify the end of a sentence.
- Type
int
- cerebras.modelzoo.data_preparation.nlp.chunk_data_processing.mlm_data_token_generator.pad_id#
Token ID used to signify padding.
- Type
int
- cerebras.modelzoo.data_preparation.nlp.chunk_data_processing.mlm_data_token_generator.max_seq_length#
Maximum number of tokens in a sequence; longer sequences are truncated.
- Type
int
- cerebras.modelzoo.data_preparation.nlp.chunk_data_processing.mlm_data_token_generator.mlm_fraction#
Fraction of tokens in each sequence to be masked.
- Type
float
- cerebras.modelzoo.data_preparation.nlp.chunk_data_processing.mlm_data_token_generator.max_predictions#
Maximum number of tokens to mask in a sequence.
- Type
int
- cerebras.modelzoo.data_preparation.nlp.chunk_data_processing.mlm_data_token_generator.seed#
Random seed for reproducibility.
- Type
int
- cerebras.modelzoo.data_preparation.nlp.chunk_data_processing.mlm_data_token_generator.rng#
Random number generator for masking logic.
- Type
random.Random
- cerebras.modelzoo.data_preparation.nlp.chunk_data_processing.mlm_data_token_generator.excluded_tokens#
Tokens that should not be masked.
- Type
List[str]
- cerebras.modelzoo.data_preparation.nlp.chunk_data_processing.mlm_data_token_generator.allowable_token_ids#
Token IDs that can be masked.
- Type
List[int]
- cerebras.modelzoo.data_preparation.nlp.chunk_data_processing.mlm_data_token_generator.special_tokens_ids#
Token IDs of special tokens that should not be masked.
- Type
Set[int]
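A common convention (an assumption here, not verified against the MLMTokenGenerator source) is to bound the number of masked positions by the ceiling of `max_seq_length * mlm_fraction`:

```python
import math

max_seq_length = 512
mlm_fraction = 0.15

# Hypothetical derivation of max_predictions from the two attributes above.
max_predictions = math.ceil(max_seq_length * mlm_fraction)
# → 77
```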
Classes