cerebras.modelzoo.data_preparation.nlp.chunk_data_processing.data_reader

This module contains helper functions and classes that read data from different formats, process it, and save the result in HDF5 format. It supports JSONL, GZipped JSON, Parquet, ZST-compressed JSONL, and TAR archives of ZST-compressed JSONL files.

Classes:
DataFrame:

An object that holds and processes data and can serialize itself into HDF5 format.

Reader:

Provides a mechanism to read data from multiple file formats, process it, and yield it in manageable chunks.
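
The chunked-reading idea behind Reader can be illustrated with a minimal sketch. This is not the module's implementation: the helper name iter_jsonl_chunks, its max_chunk_bytes parameter, and the byte-budget flushing rule are all assumptions made for illustration, and only the JSONL case is shown.

```python
import io
import json
from typing import Iterable, Iterator, List


def iter_jsonl_chunks(
    lines: Iterable[str], max_chunk_bytes: int
) -> Iterator[List[dict]]:
    """Yield lists of parsed JSONL records, flushing a list before it would
    exceed max_chunk_bytes (hypothetical helper, not part of data_reader)."""
    chunk: List[dict] = []
    chunk_bytes = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        size = len(line.encode("utf-8"))
        # Flush the current chunk before this record would push it over budget.
        if chunk and chunk_bytes + size > max_chunk_bytes:
            yield chunk
            chunk, chunk_bytes = [], 0
        chunk.append(json.loads(line))
        chunk_bytes += size
    if chunk:
        yield chunk  # emit the final, possibly partial, chunk


# An in-memory stand-in for an open JSONL file.
sample = io.StringIO(
    '{"text": "first document"}\n'
    '{"text": "second document"}\n'
    '{"text": "third document"}\n'
)
chunks = list(iter_jsonl_chunks(sample, max_chunk_bytes=60))
```

Because the function is a generator, only one chunk is held in memory at a time, which is the property that matters when the input files are larger than available RAM.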

Functions

find_last_paragraph_or_sentence_end

Find the last end of a paragraph or sentence in the text.

get_data_size

Compute the size of the given data.

optional_lock

set_doc_idx

Set metadata for a given dataframe.

split_entry_by_paragraph_or_sentence

Split a large entry into chunks by sentence or paragraph end.
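
A rough sketch of boundary-aware splitting, under stated assumptions. The names find_last_boundary and split_by_boundary, the max_chars parameter, and the delimiters (a blank line for a paragraph end, a period followed by a space for a sentence end) are hypothetical. The module's actual signatures and delimiters are not shown on this page.

```python
from typing import List

# Assumed delimiters; the real module may use different ones.
PARAGRAPH_END = "\n\n"
SENTENCE_END = ". "


def find_last_boundary(text: str) -> int:
    """Return the index just past the last paragraph end in text, falling
    back to the last sentence end, or -1 if neither occurs."""
    for delimiter in (PARAGRAPH_END, SENTENCE_END):
        pos = text.rfind(delimiter)
        if pos != -1:
            return pos + len(delimiter)
    return -1


def split_by_boundary(text: str, max_chars: int) -> List[str]:
    """Split text into pieces of at most max_chars characters, cutting at a
    paragraph or sentence end where one exists inside the window."""
    if max_chars <= 0:
        raise ValueError("max_chars must be positive")
    pieces: List[str] = []
    while len(text) > max_chars:
        cut = find_last_boundary(text[:max_chars])
        if cut <= 0:
            cut = max_chars  # no boundary in the window: fall back to a hard cut
        pieces.append(text[:cut])
        text = text[cut:]
    if text:
        pieces.append(text)
    return pieces


entry = "One. Two. Three.\n\nFour. Five."
pieces_20 = split_by_boundary(entry, max_chars=20)
pieces_12 = split_by_boundary(entry, max_chars=12)
```

Paragraph ends are preferred over sentence ends in this sketch, and delimiters are kept attached to the preceding piece so that concatenating the pieces reproduces the original entry.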

Classes

DataFrame

Initialize the DataFrame object.

Reader

Initialize the Reader instance.