cerebras.modelzoo.data_preparation.nlp.chunk_data_processing

chunk_data_preprocessor

This module implements a generic data preprocessor called ChunkDataPreprocessor.

create_hdf5_dataset

Script to generate an HDF5 dataset for GPT models.

data_reader

This module contains helper functions and classes that read data from different formats, process it, and save it in HDF5 format.
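As a rough illustration of the "read data from different formats" step only, the sketch below chooses a parser by file extension and yields one text document at a time. It is not the module's actual code: the function name `read_documents`, the `text_key` parameter, and the two supported extensions are assumptions made for this example, and the HDF5 writing step is left out.

```python
import json
from pathlib import Path
from typing import Iterator


def read_documents(path: str, text_key: str = "text") -> Iterator[str]:
    """Yield one text document per record, picking a parser by file extension.

    Hypothetical helper for illustration; not part of the documented module.
    """
    suffix = Path(path).suffix.lower()
    with open(path, encoding="utf-8") as handle:
        if suffix == ".jsonl":
            # One JSON object per line; the text lives under `text_key`.
            for line in handle:
                line = line.strip()
                if line:
                    yield json.loads(line)[text_key]
        elif suffix == ".txt":
            # Plain text: treat every non-empty line as one document.
            for line in handle:
                line = line.rstrip("\n")
                if line:
                    yield line
        else:
            raise ValueError(f"unsupported format: {suffix}")
```

Because the function is a generator, documents are streamed rather than loaded into memory at once, which is the usual reason for processing a large corpus in chunks.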

dpo_data_token_generator

fim_data_token_generator

FIMTokenGenerator Module

lm_data_token_generator

LMDataTokenGenerator Module

lm_vsl_data_token_generator

This module provides the VSLLMDataTokenGenerator class, extending LMDataTokenGenerator for advanced processing of tokenized text data tailored for variable-length sequence language modeling (VSLLM).
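To give a feel for what variable-length sequence processing can involve, here is a minimal sketch that assumes VSL means packing several short tokenized sequences into one sample of at most `max_seq_length` tokens. The function `pack_sequences` and its greedy first-fit strategy are illustrative assumptions, not the VSLLMDataTokenGenerator implementation.

```python
from typing import List


def pack_sequences(
    sequences: List[List[int]], max_seq_length: int
) -> List[List[List[int]]]:
    """Greedy first-fit packing of token sequences into fixed-capacity bins.

    Hypothetical helper for illustration. Each returned bin is a list of
    sequences whose combined length does not exceed `max_seq_length`.
    """
    bins: List[List[List[int]]] = []
    remaining: List[int] = []  # free token slots left in each bin

    for seq in sequences:
        # Truncate sequences that cannot fit in a single bin.
        if len(seq) > max_seq_length:
            seq = seq[:max_seq_length]
        for i, free in enumerate(remaining):
            if len(seq) <= free:
                # Place the sequence in the first bin with enough room.
                bins[i].append(seq)
                remaining[i] -= len(seq)
                break
        else:
            # No existing bin had room, so open a new one.
            bins.append([seq])
            remaining.append(max_seq_length - len(seq))
    return bins


packed = pack_sequences([[1, 2, 3], [4, 5], [6, 7, 8, 9], [10]], max_seq_length=5)
# packed == [[[1, 2, 3], [4, 5]], [[6, 7, 8, 9], [10]]]
```

Packing in this way reduces the padding needed per sample. A real pipeline would typically also record where each sequence starts and ends so that attention does not cross sequence boundaries.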

mlm_data_token_generator

MLMTokenGenerator Module

nlg_token_generator

summarization_data_token_generator

SummarizationTokenGenerator Module

summarization_vsl_data_token_generator

This module provides the VSLSummarizationTokenGenerator class, which extends the SummarizationTokenGenerator for processing tokenized text data specifically for variable-length sequence summarization (VSLS).

utils