cerebras.modelzoo.data_preparation.nlp.chunk_data_processing.mlm_data_token_generator#

MLMTokenGenerator Module

This module provides the MLMTokenGenerator class, which prepares tokenized data for training with the Masked Language Modeling (MLM) objective commonly used to pre-train transformers such as BERT. The class uses a tokenizer object (compatible with Hugging Face's transformers library) to tokenize text data and to create the masked-token features required for MLM training. It supports dynamic masking, in which a configurable fraction of tokens is masked at random, facilitating the training of models on tasks that require an understanding of context and word relationships.

The MLMTokenGenerator handles tokenization, applies the MLM masking strategy, and prepares the outputs needed for model training, including the indices of the masked tokens and their original values, which serve as the MLM labels.

Usage:

    from transformers import BertTokenizer

    tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
    mlm_generator = MLMTokenGenerator(
        params={'dataset': {}, 'processing': {'max_seq_length': 512, 'mlm_fraction': 0.15}},
        tokenizer=tokenizer,
        eos_id=tokenizer.sep_token_id,  # BERT defines no EOS token; [SEP] is the closest stand-in
        pad_id=tokenizer.pad_token_id,
    )

    data = ["This is a sample input sentence", "This is another example"]
    tokenized_data = tokenizer(data, padding=True, truncation=True, return_tensors='pt', max_length=512)
    input_ids = tokenized_data['input_ids'].tolist()
    masked_input_ids, masked_positions, masked_labels, original_labels = mlm_generator.mask_single_sequence(input_ids[0])
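
For context, the sketch below illustrates the classic BERT-style masking rule (of the tokens selected for prediction, roughly 80% are replaced with [MASK], 10% with a random token, and 10% left unchanged) that MLM token generators conventionally follow. It is a minimal illustration under those assumptions, not the actual mask_single_sequence implementation; mask_token_id and vocab_size are hypothetical parameters.

    import math
    import random

    def mask_sequence_sketch(input_ids, special_ids, mask_token_id, vocab_size,
                             mlm_fraction=0.15, max_predictions=20, seed=0):
        """Illustrative BERT-style masking; not the MLMTokenGenerator internals."""
        rng = random.Random(seed)
        # Only non-special positions are candidates for masking.
        candidates = [i for i, tok in enumerate(input_ids) if tok not in special_ids]
        num_to_mask = min(max_predictions, math.ceil(mlm_fraction * len(candidates)))
        positions = sorted(rng.sample(candidates, num_to_mask))

        masked_ids = list(input_ids)
        labels = []
        for pos in positions:
            labels.append(input_ids[pos])  # the original token is the MLM label
            roll = rng.random()
            if roll < 0.8:
                masked_ids[pos] = mask_token_id              # 80%: [MASK]
            elif roll < 0.9:
                masked_ids[pos] = rng.randrange(vocab_size)  # 10%: random token
            # final 10%: keep the original token in place
        return masked_ids, positions, labels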

cerebras.modelzoo.data_preparation.nlp.chunk_data_processing.mlm_data_token_generator.tokenizer#

A tokenizer object, instance of transformers.PreTrainedTokenizer.

Type

PreTrainedTokenizer

cerebras.modelzoo.data_preparation.nlp.chunk_data_processing.mlm_data_token_generator.eos_id#

Token ID used to signify the end of a sentence.

Type

int

cerebras.modelzoo.data_preparation.nlp.chunk_data_processing.mlm_data_token_generator.pad_id#

Token ID used to signify padding.

Type

int

cerebras.modelzoo.data_preparation.nlp.chunk_data_processing.mlm_data_token_generator.max_seq_length#

Maximum number of tokens in a sequence; sequences longer than this are truncated.

Type

int

cerebras.modelzoo.data_preparation.nlp.chunk_data_processing.mlm_data_token_generator.mlm_fraction#

Fraction of tokens in each sequence to be masked.

Type

float

cerebras.modelzoo.data_preparation.nlp.chunk_data_processing.mlm_data_token_generator.max_predictions#

Maximum number of tokens to mask in a sequence.

Type

int
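
Assuming max_predictions follows the common convention of being derived from mlm_fraction and max_seq_length (an assumption here, not a documented guarantee of this class), the configuration from the usage example gives:

    import math

    max_seq_length = 512
    mlm_fraction = 0.15
    # Common convention (assumed): cap predictions at ceil(fraction * length).
    max_predictions = math.ceil(mlm_fraction * max_seq_length)  # 77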

cerebras.modelzoo.data_preparation.nlp.chunk_data_processing.mlm_data_token_generator.seed#

Random seed for reproducibility.

Type

int

cerebras.modelzoo.data_preparation.nlp.chunk_data_processing.mlm_data_token_generator.rng#

Random number generator for masking logic.

Type

random.Random

cerebras.modelzoo.data_preparation.nlp.chunk_data_processing.mlm_data_token_generator.excluded_tokens#

Tokens that should not be masked.

Type

List[str]

cerebras.modelzoo.data_preparation.nlp.chunk_data_processing.mlm_data_token_generator.allowable_token_ids#

Token IDs that can be masked.

Type

List[int]

cerebras.modelzoo.data_preparation.nlp.chunk_data_processing.mlm_data_token_generator.special_tokens_ids#

Token IDs of special tokens that should not be masked.

Type

Set[int]
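
Together, excluded_tokens, allowable_token_ids, and special_tokens_ids determine which positions in a sequence may be masked. A minimal sketch of that filtering, assuming the attribute names above and a Hugging Face tokenizer; the class's actual selection logic may differ:

    # Hypothetical illustration of mask-eligibility filtering.
    excluded_ids = set(tokenizer.convert_tokens_to_ids(excluded_tokens))
    allowable = set(allowable_token_ids)
    eligible_positions = [
        i for i, tok_id in enumerate(input_ids)
        if tok_id not in special_tokens_ids  # never mask [CLS], [SEP], [PAD], ...
        and tok_id not in excluded_ids       # user-specified exclusions
        and tok_id in allowable              # restrict to maskable vocabulary
    ]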

Classes

MLMTokenGenerator