cerebras.modelzoo.data_preparation.nlp.slimpajama.preprocessing.datasets.Dataset

class cerebras.modelzoo.data_preparation.nlp.slimpajama.preprocessing.datasets.Dataset

Bases: abc.ABC

Methods

`already_shuffled`	Datasets where the source is already shuffled should override this to return True so that it isn't shuffled again.
`dir_path`	Path to the directory
`documents`	A generator producing all documents in the dataset.
`name`	Human-readable name of the dataset
`num_docs`
`short_documents_path`	Path to the file with short documents
`size`	Return an estimate of the dataset size.

documents(process_id, n_process, dup_sh, short_sh): A generator producing all documents in the dataset.

size(): Return an estimate of the dataset size. Implementations may use a faster, less accurate estimate.

already_shuffled(): Datasets where the source is already shuffled should override this to return True so that it isn’t shuffled again.
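
The following is a minimal sketch of how a subclass might implement the three methods documented above. It does not import the Model Zoo package: the `Dataset` class here is a stand-in that only mirrors the documented method names, and `InMemoryDataset` is a hypothetical subclass. The round-robin use of `process_id` and `n_process`, the `False` default for `already_shuffled`, and the byte-count meaning of `size` are all assumptions made for illustration. `dup_sh` and `short_sh` are accepted but ignored because this reference does not describe them.

```python
from abc import ABC, abstractmethod
from typing import Iterator, List


class Dataset(ABC):
    """Stand-in mirroring the documented interface; NOT the Model Zoo source."""

    @abstractmethod
    def name(self) -> str:
        """Human-readable name of the dataset."""

    @abstractmethod
    def documents(self, process_id, n_process, dup_sh, short_sh) -> Iterator[str]:
        """A generator producing all documents in the dataset."""

    @abstractmethod
    def size(self) -> int:
        """Return an estimate of the dataset size."""

    def already_shuffled(self) -> bool:
        # Assumed default: a source is not pre-shuffled unless a subclass says so.
        return False


class InMemoryDataset(Dataset):
    """Hypothetical subclass backed by a list of strings."""

    def __init__(self, docs: List[str]):
        self._docs = docs

    def name(self) -> str:
        return "InMemory"

    def documents(self, process_id, n_process, dup_sh, short_sh):
        # Assumption: worker `process_id` takes every n_process-th document.
        # dup_sh and short_sh are unused in this sketch.
        for i in range(process_id, len(self._docs), n_process):
            yield self._docs[i]

    def size(self) -> int:
        # Assumption: size is measured in UTF-8 bytes.
        return sum(len(d.encode("utf-8")) for d in self._docs)

    def already_shuffled(self) -> bool:
        # Override, as the documentation suggests, to skip a second shuffle.
        return True


ds = InMemoryDataset(["a", "bb", "ccc", "dddd"])
shard0 = list(ds.documents(0, 2, None, None))
shard1 = list(ds.documents(1, 2, None, None))
```

With two workers, the two shards together cover every document exactly once. A real subclass would read from disk rather than from a list.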
