cerebras.modelzoo.data_preparation.nlp.chunk_data_processing.mlm_data_token_generator.MLMTokenGenerator

class cerebras.modelzoo.data_preparation.nlp.chunk_data_processing.mlm_data_token_generator.MLMTokenGenerator

Bases: object

Methods

encode – Tokenize and encode the data for masked language modeling.

mask_single_sequence – Masks tokens in a single sequence according to the MLM strategy.

__init__(params: Dict[str, Any], tokenizer, eos_id: int, pad_id: int)

mask_single_sequence(input_ids: List[int]) → Tuple[List[int], List[int], List[int], List[int]]

Masks tokens in a single sequence according to the MLM strategy.

Parameters

input_ids (List[int]) – Original sequence of token IDs.

Returns

A tuple of four lists, in this order:

  • The modified sequence, with the selected tokens masked.

  • The positions of the masked tokens.

  • Binary indicators (1s) marking the positions that were masked.

  • The original token IDs of the masked tokens, for use as labels.

Return type

Tuple[List[int], List[int], List[int], List[int]]
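
The page does not spell out what "the MLM strategy" is. As an illustration only, the sketch below follows the common BERT-style convention: select roughly 15% of the non-special tokens, then replace 80% of those with a mask token, 10% with a random token, and leave 10% unchanged. This is not the MLMTokenGenerator implementation. The function name mask_single_sequence_sketch and the parameters mask_id, vocab_size, special_ids, mask_prob, and rng are all hypothetical, and the real method may differ (for example, it may pad the returned lists to a fixed length).

```python
import random
from typing import FrozenSet, List, Optional, Tuple


def mask_single_sequence_sketch(
    input_ids: List[int],
    mask_id: int,
    vocab_size: int,
    special_ids: FrozenSet[int] = frozenset(),
    mask_prob: float = 0.15,
    rng: Optional[random.Random] = None,
) -> Tuple[List[int], List[int], List[int], List[int]]:
    """Hypothetical BERT-style masking; NOT the library's implementation."""
    rng = rng or random.Random()

    # Only non-special tokens (assumption) are candidates for masking.
    candidates = [i for i, tok in enumerate(input_ids) if tok not in special_ids]
    if not candidates:
        return list(input_ids), [], [], []

    # Mask at least one token, and never more than the number of candidates.
    num_to_mask = min(len(candidates), max(1, round(len(candidates) * mask_prob)))
    positions = sorted(rng.sample(candidates, num_to_mask))

    masked_ids = list(input_ids)
    labels = [input_ids[p] for p in positions]  # original IDs, kept as labels
    for p in positions:
        roll = rng.random()
        if roll < 0.8:
            masked_ids[p] = mask_id  # 80%: replace with the mask token
        elif roll < 0.9:
            masked_ids[p] = rng.randrange(vocab_size)  # 10%: random token
        # remaining 10%: leave the original token in place

    weights = [1] * len(positions)  # binary indicators for the masked positions
    return masked_ids, positions, weights, labels
```

The four returned lists mirror the order documented above: modified sequence, masked positions, binary indicators, and original token IDs. Passing a seeded random.Random as rng makes the selection reproducible.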

encode(data: str) → Tuple[List[numpy.ndarray], Dict]

Tokenize and encode the data for masked language modeling.

Parameters

data (str) – Text data to encode.

Returns

Tuple of encoded features for masked language modeling and dataset stats.

Return type

Tuple[List[numpy.ndarray], Dict]