modelzoo.transformers.data_processing.slimpajama.preprocessing.datasets.RedPajamaReplication
- class modelzoo.transformers.data_processing.slimpajama.preprocessing.datasets.RedPajamaReplication
Bases: modelzoo.transformers.data_processing.slimpajama.preprocessing.datasets.Dataset
Methods
Datasets where the source is already shuffled should override this to return True so that it isn't shuffled again.
Path to the directory
A generator producing all documents in the dataset.
Human-readable name of the dataset
Return an estimate of the number of documents in the dataset.
sample_documents
Path to the file with short documents
Return an estimate of the dataset size.
- already_shuffled()
Datasets where the source is already shuffled should override this to return True so that it isn’t shuffled again.
- dir_path()
Path to the directory
- num_docs()
Return an estimate of the number of documents in the dataset. Implementations may use a faster, less accurate estimate.
- short_documents_path()
Path to the file with short documents
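The override pattern described for already_shuffled() and num_docs() can be illustrated with a minimal, self-contained sketch. It does not import modelzoo: the Dataset class below is a simplified stand-in for the documented base class, and the documents() method name, the PreShuffledToy subclass, and the load_for_training helper are all hypothetical names chosen for illustration, not part of the documented API.

```python
import random
from abc import ABC, abstractmethod


class Dataset(ABC):
    """Simplified stand-in for the documented base class (assumption)."""

    @abstractmethod
    def documents(self):
        """Yield every document (method name is an assumption)."""

    def already_shuffled(self):
        # Default: the source is not shuffled, so callers should shuffle it.
        return False

    def num_docs(self):
        # Exact but slow fallback: iterate over every document once.
        return sum(1 for _ in self.documents())


class PreShuffledToy(Dataset):
    """Hypothetical dataset whose source is already shuffled."""

    def __init__(self, docs):
        self._docs = list(docs)

    def documents(self):
        yield from self._docs

    def already_shuffled(self):
        # Override so that consumers skip a redundant second shuffle.
        return True

    def num_docs(self):
        # Cheaper than iterating; a real subclass might return a
        # precomputed or approximate count here instead.
        return len(self._docs)


def load_for_training(dataset, seed=0):
    """Hypothetical consumer that shuffles only when it is needed."""
    docs = list(dataset.documents())
    if not dataset.already_shuffled():
        random.Random(seed).shuffle(docs)
    return docs


toy = PreShuffledToy(["doc a", "doc b", "doc c"])
docs = load_for_training(toy)
```

Because PreShuffledToy reports that it is already shuffled, load_for_training in this sketch returns the documents in their original order; a dataset that keeps the default would be shuffled with the given seed.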