cerebras.modelzoo.data_preparation.nlp.chunk_data_processing.mlm_data_token_generator.MLMTokenGenerator
- class cerebras.modelzoo.data_preparation.nlp.chunk_data_processing.mlm_data_token_generator.MLMTokenGenerator
Bases: object
Methods
Tokenize and encode the data for masked language modeling.
mask_single_sequence(input_ids): Masks tokens in a single sequence according to the MLM strategy.
- mask_single_sequence(input_ids: List[int]) → Tuple[List[int], List[int], List[int], List[int]]
Masks tokens in a single sequence according to the MLM strategy.
- Parameters
input_ids (List[int]) – Original sequence of token IDs.
- Returns
A tuple of four lists, in this order:
Modified sequence with masked tokens.
Positions of the masked tokens.
Binary indicators (1s) for positions that were masked.
Original token IDs of the masked tokens, for use as labels.
- Return type
Tuple[List[int], List[int], List[int], List[int]]
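The shape of the four-part return value can be illustrated with a small standalone sketch. This is not the Cerebras implementation: the function name `mask_single_sequence_sketch`, the `[MASK]` token ID, the vocabulary size, the masking rate, and the BERT-style 80/10/10 replacement rule are all assumptions made for illustration, since this page does not define the MLM strategy. The sketch also assumes the binary indicators are a list of 1s with one entry per masked position.

```python
import random
from typing import List, Tuple


def mask_single_sequence_sketch(
    input_ids: List[int],
    mask_token_id: int = 103,  # hypothetical [MASK] token ID
    vocab_size: int = 30522,   # hypothetical vocabulary size
    mask_prob: float = 0.15,   # hypothetical masking rate
    seed: int = 0,
) -> Tuple[List[int], List[int], List[int], List[int]]:
    """Illustrative BERT-style masking; not the library's own method."""
    rng = random.Random(seed)

    # Choose how many tokens to mask (at least one for a non-empty sequence).
    if input_ids:
        num_to_mask = min(len(input_ids), max(1, round(len(input_ids) * mask_prob)))
    else:
        num_to_mask = 0
    positions = sorted(rng.sample(range(len(input_ids)), num_to_mask))

    masked_ids = list(input_ids)
    labels: List[int] = []
    for pos in positions:
        # Record the original token so it can serve as the training label.
        labels.append(input_ids[pos])
        roll = rng.random()
        if roll < 0.8:
            # Assumed rule: 80% of selected tokens become [MASK].
            masked_ids[pos] = mask_token_id
        elif roll < 0.9:
            # Assumed rule: 10% are replaced by a random vocabulary ID.
            masked_ids[pos] = rng.randrange(vocab_size)
        # Remaining 10%: the token is left unchanged but is still predicted.

    # Assumed interpretation: one indicator (1) per masked position.
    indicators = [1] * len(positions)
    return masked_ids, positions, indicators, labels


masked_ids, positions, indicators, labels = mask_single_sequence_sketch(
    [101, 2023, 2003, 1037, 7099, 6251, 2005, 3231, 2075, 102]
)
```

In this sketch the last three lists always have the same length, and `labels[i]` is the original token found at `positions[i]`, which mirrors the four-element tuple described under Returns.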