cerebras.modelzoo.data_preparation.nlp.chunk_data_processing.mlm_data_token_generator#

MLMTokenGenerator Module

This module provides the MLMTokenGenerator class, which prepares tokenized data for training with the Masked Language Modeling (MLM) objective commonly used to pre-train transformers such as BERT. The class uses a tokenizer object (compatible with Hugging Face's transformers library) to tokenize text data and to create the masked-token features required for MLM training. It supports dynamic masking, in which a configurable fraction of tokens is masked at random, facilitating the training of models on tasks that require an understanding of context and word relationships.

The MLMTokenGenerator handles tokenization, applies the MLM masking strategy, and prepares the outputs needed for model training, including the indices of the masked tokens and their original values, which serve as the MLM labels.

Usage:

    from transformers import BertTokenizer

    tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
    mlm_generator = MLMTokenGenerator(
        params={'dataset': {}, 'processing': {'max_seq_length': 512, 'mlm_fraction': 0.15}},
        tokenizer=tokenizer,
        eos_id=tokenizer.sep_token_id,  # BERT defines no EOS token; [SEP] is the closest stand-in
        pad_id=tokenizer.pad_token_id,
    )

    data = ["This is a sample input sentence", "This is another example"]
    tokenized_data = tokenizer(data, padding=True, truncation=True, return_tensors='pt', max_length=512)
    input_ids = tokenized_data['input_ids'].tolist()
    masked_input_ids, masked_positions, masked_labels, original_labels = mlm_generator.mask_single_sequence(input_ids[0])
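
For context, the sketch below illustrates the classic BERT-style masking rule (of the tokens selected for prediction, roughly 80% are replaced with [MASK], 10% with a random token, and 10% left unchanged) that MLM token generators conventionally follow. It is a minimal illustration under those assumptions, not the actual mask_single_sequence implementation; mask_token_id and vocab_size are hypothetical parameters.

    import math
    import random

    def mask_sequence_sketch(input_ids, special_ids, mask_token_id, vocab_size,
                             mlm_fraction=0.15, max_predictions=20, seed=0):
        """Illustrative BERT-style masking; not the MLMTokenGenerator internals."""
        rng = random.Random(seed)
        # Only non-special positions are candidates for masking.
        candidates = [i for i, tok in enumerate(input_ids) if tok not in special_ids]
        num_to_mask = min(max_predictions, math.ceil(mlm_fraction * len(candidates)))
        positions = sorted(rng.sample(candidates, num_to_mask))

        masked_ids = list(input_ids)
        labels = []
        for pos in positions:
            labels.append(input_ids[pos])  # the original token is the MLM label
            roll = rng.random()
            if roll < 0.8:
                masked_ids[pos] = mask_token_id              # 80%: [MASK]
            elif roll < 0.9:
                masked_ids[pos] = rng.randrange(vocab_size)  # 10%: random token
            # final 10%: keep the original token in place
        return masked_ids, positions, labels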

cerebras.modelzoo.data_preparation.nlp.chunk_data_processing.mlm_data_token_generator.tokenizer#

A tokenizer object, instance of transformers.PreTrainedTokenizer.

Type

PreTrainedTokenizer

cerebras.modelzoo.data_preparation.nlp.chunk_data_processing.mlm_data_token_generator.eos_id#

Token ID used to signify the end of a sentence.

Type

int

cerebras.modelzoo.data_preparation.nlp.chunk_data_processing.mlm_data_token_generator.pad_id#

Token ID used to signify padding.

Type

int

cerebras.modelzoo.data_preparation.nlp.chunk_data_processing.mlm_data_token_generator.max_seq_length#

Maximum number of tokens in a sequence; sequences longer than this are truncated.

Type

int

cerebras.modelzoo.data_preparation.nlp.chunk_data_processing.mlm_data_token_generator.mlm_fraction#

Fraction of tokens in each sequence to be masked.

Type

float

cerebras.modelzoo.data_preparation.nlp.chunk_data_processing.mlm_data_token_generator.max_predictions#

Maximum number of tokens to mask in a sequence.

Type

int
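
Assuming max_predictions follows the common convention of being derived from mlm_fraction and max_seq_length (an assumption here, not a documented guarantee of this class), the configuration from the usage example gives:

    import math

    max_seq_length = 512
    mlm_fraction = 0.15
    # Common convention (assumed): cap predictions at ceil(fraction * length).
    max_predictions = math.ceil(mlm_fraction * max_seq_length)  # 77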

cerebras.modelzoo.data_preparation.nlp.chunk_data_processing.mlm_data_token_generator.seed#

Random seed for reproducibility.

Type

int

cerebras.modelzoo.data_preparation.nlp.chunk_data_processing.mlm_data_token_generator.rng#

Random number generator for masking logic.

Type

random.Random

cerebras.modelzoo.data_preparation.nlp.chunk_data_processing.mlm_data_token_generator.excluded_tokens#

Tokens that should not be masked.

Type

List[str]

cerebras.modelzoo.data_preparation.nlp.chunk_data_processing.mlm_data_token_generator.allowable_token_ids#

Token IDs that can be masked.

Type

List[int]

cerebras.modelzoo.data_preparation.nlp.chunk_data_processing.mlm_data_token_generator.special_tokens_ids#

Token IDs of special tokens that should not be masked.

Type

Set[int]
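
Together, excluded_tokens, allowable_token_ids, and special_tokens_ids determine which positions in a sequence may be masked. A minimal sketch of that filtering, assuming the attribute names above and a Hugging Face tokenizer; the class's actual selection logic may differ:

    # Hypothetical illustration of mask-eligibility filtering.
    excluded_ids = set(tokenizer.convert_tokens_to_ids(excluded_tokens))
    allowable = set(allowable_token_ids)
    eligible_positions = [
        i for i, tok_id in enumerate(input_ids)
        if tok_id not in special_tokens_ids  # never mask [CLS], [SEP], [PAD], ...
        and tok_id not in excluded_ids       # user-specified exclusions
        and tok_id in allowable              # restrict to maskable vocabulary
    ]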

Classes

MLMTokenGenerator