cerebras.modelzoo.data_preparation.data_preprocessing.pretraining_token_generator

PretrainingTokenGenerator Module

This module provides the PretrainingTokenGenerator class, which processes text data and creates features suitable for language modeling tasks.

Usage:

    token_generator = PretrainingTokenGenerator(dataset_params, max_sequence_length, tokenizer)
    tokenized_features = token_generator.encode("Sample text for processing.")
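A slightly fuller sketch of the same usage, assuming a Hugging Face tokenizer; the dataset_params keys and the "gpt2" tokenizer are placeholders, and the constructor arguments simply follow the positional order shown above, so the real parameter schema should be taken from the preprocessing configuration:

    from transformers import AutoTokenizer

    from cerebras.modelzoo.data_preparation.data_preprocessing.pretraining_token_generator import (
        PretrainingTokenGenerator,
    )

    # Placeholder parameters; the actual schema comes from the preprocessing config.
    dataset_params = {
        "processing": {"max_seq_length": 2048},
    }
    hf_tokenizer = AutoTokenizer.from_pretrained("gpt2")  # assumed tokenizer choice

    token_generator = PretrainingTokenGenerator(dataset_params, 2048, hf_tokenizer)
    tokenized_features = token_generator.encode("Sample text for processing.")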

Functions

create_features_auto_lm

Given a list of token_ids, generate the input sequence and labels.
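For context, autoregressive language-modeling features are conventionally built by shifting the token sequence one position, so each input token is trained to predict the token that follows it. A generic illustration of that convention (not necessarily this function's exact implementation):

    from typing import List, Tuple

    def make_lm_features(token_ids: List[int]) -> Tuple[List[int], List[int]]:
        # Generic shift-by-one construction for next-token prediction;
        # illustrative only, not the library's implementation.
        input_ids = token_ids[:-1]  # model input: tokens 0 .. n-2
        labels = token_ids[1:]      # targets:     tokens 1 .. n-1
        return input_ids, labels

    # Example: [5, 17, 42, 9] -> inputs [5, 17, 42], labels [17, 42, 9]
    inputs, labels = make_lm_features([5, 17, 42, 9])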

Classes

PretrainingTokenGenerator

Initialize the PretrainingTokenGenerator class.