cerebras.modelzoo.data_preparation.nlp.hdf5_preprocessing.hdf5_base_preprocessor.HDF5BasePreprocessor

class cerebras.modelzoo.data_preparation.nlp.hdf5_preprocessing.hdf5_base_preprocessor.HDF5BasePreprocessor(params)

Bases: abc.ABC

This abstract base class defines how to process a dataset, tokenize it, and write it out in HDF5 format.

Parameters: params (Dict) – Dictionary containing the parameters that configure the processing of the dataset.

Methods

`add_token`	Add a token to the tokenizer.
`create_dataset`	Creates HDF5 dataset from given parameters.
`file_read_generator`	Reads a file and generates its content.
`generate_sample`
`get_vocab_size`	Get the tokenizer vocabulary size.
`preprocessing_generator`	Takes in content read from files and generates samples.
`seed_runs`	Set seed for run based on user provided seed and rank.
`write_hdf5_file`	Write data to HDF5 file.
`write_hdf5_files`	Writes a list of files to HDF5.

abstract file_read_generator(file)

Reads a file and generates its content.

Parameters: file (str) – Path to the data file.

abstract preprocessing_generator(*doc_read_results)

Takes in content read from files and generates samples.

Parameters: doc_read_results (tuple) – Return results of file_read_generator.
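The two abstract methods above form the contract a subclass has to fill in: one method reads raw content from a file, the other turns that content into samples. The sketch below illustrates that division of labor only. `SketchPreprocessor`, `samples_from`, and `LineWhitespacePreprocessor` are hypothetical names, the base class is a stand-in rather than the real HDF5BasePreprocessor, and whitespace splitting stands in for real tokenization.

```python
from abc import ABC, abstractmethod
import os
import tempfile


class SketchPreprocessor(ABC):
    """Hypothetical stand-in for HDF5BasePreprocessor (not the real class)."""

    @abstractmethod
    def file_read_generator(self, file):
        """Read a file and yield its content."""

    @abstractmethod
    def preprocessing_generator(self, *doc_read_results):
        """Turn content read from files into samples."""

    def samples_from(self, file):
        # Hypothetical driver: pass each read result to the sample generator.
        for doc in self.file_read_generator(file):
            yield from self.preprocessing_generator(doc)


class LineWhitespacePreprocessor(SketchPreprocessor):
    """Toy subclass: one document per non-empty line, split on whitespace."""

    def file_read_generator(self, file):
        with open(file, encoding="utf-8") as handle:
            for line in handle:
                line = line.strip()
                if line:
                    yield line

    def preprocessing_generator(self, *doc_read_results):
        for doc in doc_read_results:
            yield doc.split()


def demo():
    """Write a small text file and return the samples generated from it."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "docs.txt")
        with open(path, "w", encoding="utf-8") as handle:
            handle.write("hello world\n\nhdf5 base preprocessor\n")
        return list(LineWhitespacePreprocessor().samples_from(path))
```

Because both methods are declared abstract, instantiating the stand-in base class directly raises `TypeError`; only a subclass that implements both can be created.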

add_token(token)

Add a token to the tokenizer.

Parameters: token (str) – Token to be added to the tokenizer.

get_vocab_size()

Get the tokenizer vocabulary size.

Returns: vocab_size (int) – Size of the tokenizer vocabulary.

seed_runs(rank=0)

Set seed for run based on user provided seed and rank.

Parameters: rank (int) – Rank to set, based on process number for execution. Defaults to 0.
Returns: Object of type random.Random, with seed set.
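As a rough illustration of per-rank seeding, the sketch below returns a `random.Random` instance whose seed depends on both a user seed and the rank. `seed_runs_sketch` is a hypothetical function, and combining the two values by addition is an assumption; the actual method may derive the seed differently.

```python
import random


def seed_runs_sketch(seed, rank=0):
    """Hypothetical stand-in: return a random.Random seeded from seed and rank."""
    # Assumption: seed and rank are combined by addition so that each
    # process gets a distinct but reproducible random stream.
    rng = random.Random()
    rng.seed(seed + rank)
    return rng
```

Calling the function twice with the same seed and rank yields generators that produce the same sequence, while different ranks produce different sequences, which is the property a multi-process preprocessing run needs for reproducible shuffling.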

write_hdf5_file(file_path, files, rng, n_examples, chunks, dtype='i4', compression='gzip')

Write data to HDF5 file.

Parameters

write_hdf5_files(files, start_number, write_remainder=False, process_number=None, rng=<random.Random object>)

Writes a list of files to HDF5.

Parameters

files (sequence) – List of lists containing tokenized data to write.
start_number (int) – Continual count of HDF5 files written out.
write_remainder (bool) – Write out remaining data from files, if files per record is not met. Defaults to False.
process_number (int) – Process number for execution. Defaults to None.
rng (random.Random obj) – Instance of random object, with states set. Defaults to new instance created for write.

Returns

start_number (int) – Continual count of HDF5 files written out.
remainder (list) – Remaining sequences not written out, if the length of files to write is greater than the files per record.
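The interplay between start_number, write_remainder, and the returned remainder can be sketched without any HDF5 I/O. In the sketch below, `write_files_sketch` and its `files_per_record` argument are hypothetical, and each "write" is simply recorded in a list; it shows one plausible reading of the description above, not the library's implementation.

```python
def write_files_sketch(files, start_number, files_per_record, write_remainder=False):
    """Hypothetical stand-in for the batching logic; no HDF5 file is written."""
    written = []    # (file_number, batch) pairs, one per would-be HDF5 file
    remainder = []  # sequences held back because the last batch was short
    for start in range(0, len(files), files_per_record):
        batch = files[start:start + files_per_record]
        if len(batch) == files_per_record or write_remainder:
            written.append((start_number, batch))
            start_number += 1  # continue the running count of output files
        else:
            remainder = batch  # assumption: the caller carries this forward
    return start_number, remainder, written
```

With five sequences, a start_number of 10, and two files per record, this sketch records two full batches, returns 12 as the next file number, and hands back the fifth sequence as the remainder. Passing write_remainder=True writes the short batch as well and returns an empty remainder.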

create_dataset(params)

Creates HDF5 dataset from given parameters.

Parameters

Returns

Dictionary containing the results of execution: the number of processed, discarded, and successful files, and the number of examples.
