cerebras.modelzoo.data_preparation.nlp.slimpajama.preprocessing.datasets.RedPajamaReplication

class cerebras.modelzoo.data_preparation.nlp.slimpajama.preprocessing.datasets.RedPajamaReplication(datasets, duplicates, short_docs)

Bases: cerebras.modelzoo.data_preparation.nlp.slimpajama.preprocessing.datasets.Dataset

Methods

`already_shuffled`	Datasets where the source is already shuffled should override this to return True so that it isn't shuffled again.
`dir_path`	Path to the directory
`documents`
`name`
`num_docs`	Return an estimate of the number of documents in the dataset.
`sample_documents`
`short_documents_path`	Path to the file with short documents
`size`

num_docs(): Return an estimate of the number of documents in the dataset. Implementations may use a faster, less accurate estimate.
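The trade-off between speed and accuracy mentioned above can be illustrated with a minimal sketch. It does not reproduce the actual `num_docs` implementation of this class; the helper `estimate_num_docs`, its `sample_size` parameter, and the one-document-per-line file layout are all assumptions made for illustration. The idea is to count documents exactly in a few files and extrapolate by file size.

```python
import os
import tempfile


def estimate_num_docs(paths, sample_size=2):
    """Hypothetical helper: estimate the document count across `paths`.

    Assumes one document per line. Lines are counted exactly in the
    first `sample_size` files, and the resulting documents-per-byte
    ratio is scaled by the total size of all files.
    """
    paths = list(paths)
    if not paths:
        return 0
    sample = paths[:sample_size]
    sample_docs = 0
    for path in sample:
        with open(path, "rb") as f:
            sample_docs += sum(1 for _ in f)
    sample_bytes = sum(os.path.getsize(p) for p in sample)
    total_bytes = sum(os.path.getsize(p) for p in paths)
    if sample_bytes == 0:
        return 0
    return round(sample_docs * total_bytes / sample_bytes)


# Usage: four equally sized files with ten one-line documents each.
with tempfile.TemporaryDirectory() as tmp:
    file_paths = []
    for i in range(4):
        path = os.path.join(tmp, f"part_{i}.jsonl")
        with open(path, "wb") as f:
            f.write(b'{"text": "x"}\n' * 10)
        file_paths.append(path)
    estimate = estimate_num_docs(file_paths)

print(estimate)  # 40 here, because every file has the same size
```

The estimate is exact only when the sampled files are representative of the rest; with uneven document lengths it drifts, which is the accuracy cost of avoiding a full pass over the data.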

already_shuffled(): Datasets where the source is already shuffled should override this to return True so that it isn’t shuffled again.
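The override pattern described above can be sketched as follows. The classes `SketchDataset` and `PreShuffledSketchDataset` and the function `load_documents` are hypothetical stand-ins, not the real `Dataset` base class or the real preprocessing pipeline; they only show how a caller might consult `already_shuffled()` before shuffling.

```python
import random


class SketchDataset:
    """Hypothetical stand-in for a dataset base class (not the real one)."""

    def __init__(self, docs):
        self._docs = list(docs)

    def already_shuffled(self):
        # Default: the source order is not random, so callers should shuffle.
        return False

    def documents(self):
        return list(self._docs)


class PreShuffledSketchDataset(SketchDataset):
    def already_shuffled(self):
        # The source is already in random order, so skip the extra shuffle.
        return True


def load_documents(dataset, seed=0):
    """Return the documents, shuffling only when the dataset asks for it."""
    docs = dataset.documents()
    if not dataset.already_shuffled():
        random.Random(seed).shuffle(docs)
    return docs


# Usage: the pre-shuffled dataset keeps its original order.
kept = load_documents(PreShuffledSketchDataset(range(10)))
mixed = load_documents(SketchDataset(range(10)))
```

Returning a flag rather than shuffling inside each dataset keeps the shuffling logic in one place, so a subclass only has to declare a property of its source.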

dir_path(): Path to the directory

short_documents_path(): Path to the file with short documents
