cerebras.modelzoo.data_preparation.nlp.slimpajama.preprocessing.datasets.Dataset

class cerebras.modelzoo.data_preparation.nlp.slimpajama.preprocessing.datasets.Dataset

Bases: abc.ABC

Methods

already_shuffled

Datasets where the source is already shuffled should override this to return True so that it isn't shuffled again.

dir_path

Path to the directory

documents

A generator producing all documents in the dataset.

name

Human-readable name of the dataset

num_docs

short_documents_path

Path to the file with short documents

size

Return an estimate of the dataset size.

dir_path()

Path to the directory

short_documents_path()

Path to the file with short documents

name()

Human-readable name of the dataset

documents(process_id, n_process, dup_sh, short_sh)

A generator producing all documents in the dataset.
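
To illustrate how a generator like this might divide work across processes, the sketch below assumes that `process_id` is a zero-based worker index and `n_process` is the total number of workers. This page does not confirm either meaning, and it does not document the types of `dup_sh` and `short_sh`, so those two parameters are left out. `round_robin_shard` is a hypothetical helper, not part of the library.

```python
from typing import Iterable, Iterator


def round_robin_shard(
    docs: Iterable[str], process_id: int, n_process: int
) -> Iterator[str]:
    """Yield every n_process-th document, starting at offset process_id.

    Hypothetical helper: assumes process_id is a zero-based worker index
    and n_process is the total worker count.
    """
    if not 0 <= process_id < n_process:
        raise ValueError("process_id must be in range(n_process)")
    for index, doc in enumerate(docs):
        # Each document index belongs to exactly one worker.
        if index % n_process == process_id:
            yield doc


# Usage: split a toy corpus between two workers.
corpus = ["doc-a", "doc-b", "doc-c", "doc-d", "doc-e"]
shards = [list(round_robin_shard(corpus, pid, 2)) for pid in range(2)]
```

Because the function is a generator, documents are produced lazily and each worker only materializes its own share.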

size()

Return an estimate of the dataset size. Implementations may use a faster, less accurate estimate.
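
One possible reading of "a faster, less accurate estimate" is to sum on-disk file sizes instead of reading every document. The sketch below takes that approach. It assumes the estimate is measured in bytes and that the data sits as regular files directly inside one directory; this page states neither. `estimate_size_bytes` is a hypothetical helper.

```python
import os
import tempfile


def estimate_size_bytes(dir_path: str) -> int:
    """Sum the on-disk sizes of regular files directly inside dir_path.

    Hypothetical helper: compressed files would make this an
    underestimate of the decoded text size, so it is only a rough guide.
    """
    total = 0
    with os.scandir(dir_path) as entries:
        for entry in entries:
            if entry.is_file():
                total += entry.stat().st_size
    return total


# Usage: two small files in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    for file_name, payload in [("a.txt", b"hello"), ("b.txt", b"world!")]:
        with open(os.path.join(tmp, file_name), "wb") as handle:
            handle.write(payload)
    estimated = estimate_size_bytes(tmp)
```

Reading file metadata avoids opening or decoding any file, which is the source of both the speed and the inaccuracy.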

already_shuffled()

Datasets where the source is already shuffled should override this to return True so that it isn’t shuffled again.
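
The override described above could look like the following sketch. `DatasetStandIn` is a hypothetical local stand-in so the example runs without the library installed; the default of `False` is inferred from the description, and which methods the real base class declares abstract is an assumption.

```python
from abc import ABC, abstractmethod


class DatasetStandIn(ABC):
    """Hypothetical stand-in for the documented base class."""

    @abstractmethod
    def name(self) -> str:
        """Human-readable name of the dataset."""

    def already_shuffled(self) -> bool:
        # Assumed default: the source is not pre-shuffled.
        return False


class PreShuffledCorpus(DatasetStandIn):
    """A source whose documents already arrive in random order."""

    def name(self) -> str:
        return "PreShuffledCorpus"

    def already_shuffled(self) -> bool:
        # Tell the pipeline to skip its own shuffle step.
        return True


class RawCorpus(DatasetStandIn):
    """A source that keeps the inherited default."""

    def name(self) -> str:
        return "RawCorpus"


# Usage: select the datasets that still need shuffling.
datasets = [PreShuffledCorpus(), RawCorpus()]
needs_shuffle = [ds.name() for ds in datasets if not ds.already_shuffled()]
```

Here only `RawCorpus` ends up in `needs_shuffle`, since `PreShuffledCorpus` reports that its source order is already random.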