cerebras.modelzoo.data_preparation.nlp.chunk_data_processing.lm_data_token_generator

LMDataTokenGenerator Module

This module provides the LMDataTokenGenerator class, which processes text data and creates features suitable for language modeling tasks.

Usage:

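    from cerebras.modelzoo.data_preparation.nlp.chunk_data_processing.lm_data_token_generator import LMDataTokenGenerator

    tokenizer = LMDataTokenGenerator(dataset_params, max_sequence_length, tokenizer)
    tokenized_features = tokenizer.encode("Sample text for processing.")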

Functions

create_features_auto_lm

Given a list of token_ids, generate input sequence and labels.
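
The summary above does not show the function's signature or return format, so the following is only a minimal sketch of the general idea behind autoregressive feature creation: labels are the inputs shifted by one token. The helper name make_lm_features_sketch, its parameters, the pad_id default, and the loss-mask output are all assumptions made for illustration and are not taken from the Cerebras implementation.

```python
from typing import List, Tuple


def make_lm_features_sketch(
    token_ids: List[int],
    max_sequence_length: int,
    pad_id: int = 0,  # assumed padding id, purely illustrative
) -> Tuple[List[int], List[int], List[int]]:
    """Hypothetical next-token feature creation (not the library's code)."""
    # The one-token shift consumes a token, so keep at most
    # max_sequence_length + 1 tokens.
    token_ids = token_ids[: max_sequence_length + 1]

    # Inputs are all tokens but the last; labels are all tokens but the first.
    input_ids = token_ids[:-1]
    labels = token_ids[1:]

    # Pad both sequences to a fixed length and mask out the padded positions.
    num_pad = max_sequence_length - len(input_ids)
    loss_mask = [1] * len(input_ids) + [0] * num_pad
    input_ids = input_ids + [pad_id] * num_pad
    labels = labels + [pad_id] * num_pad
    return input_ids, labels, loss_mask


input_ids, labels, loss_mask = make_lm_features_sketch(
    [5, 6, 7, 8], max_sequence_length=5
)
```

In this sketch each label at position i is the token that follows the input at position i, and padded positions carry a loss-mask value of 0 so they would not contribute to a training loss. The real create_features_auto_lm may handle padding, masking, and truncation differently.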

Classes

LMDataTokenGenerator

Initialize the LMDataTokenGenerator class.