cerebras.modelzoo.data_preparation.nlp.chunk_data_processing.data_reader.DataFrame#
- class cerebras.modelzoo.data_preparation.nlp.chunk_data_processing.data_reader.DataFrame[source]#
Bases: object
Initialize the DataFrame object.
- Parameters
keys (Dict) – Keys for the data entries.
Methods
Add an entry to the DataFrame.
Appends the examples in a DataFrame object to different HDF5 files.
Checks whether a document is corrupted (used for summarization tasks).
Clear the raw data after tokenizing.
Save the DataFrame object to an HDF5 file.
Tokenize the data values.
- __init__(keys: Optional[Dict] = None)[source]#
Initialize the DataFrame object.
- Parameters
keys (Dict) – Keys for the data entries.
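The per-key accumulation this class describes can be sketched with a toy stand-in. This is a hypothetical re-implementation for illustration only, not the actual `cerebras.modelzoo` class; the `SimpleDataFrame` name and the internal `raw_data` layout are assumptions.

```python
from typing import Any, Dict, List, Optional


class SimpleDataFrame:
    """Toy stand-in for the chunk_data_processing DataFrame (illustrative only)."""

    def __init__(self, keys: Optional[Dict] = None):
        # Field names for the data entries, e.g. {"text": ...} for plain LM data.
        self.keys = list(keys or {})
        # One accumulator list of raw values per key (assumed internal layout).
        self.raw_data: Dict[str, List[Any]] = {k: [] for k in self.keys}

    def add(self, value: Dict[str, Any]) -> None:
        # Route each field of the incoming entry to its per-key list.
        for k in self.keys:
            if k in value:
                self.raw_data[k].append(value[k])

    def clear(self) -> None:
        # Drop the raw data once it has been tokenized.
        self.raw_data = {k: [] for k in self.keys}


df = SimpleDataFrame(keys={"text": None})
df.add({"text": "hello world"})
df.add({"text": "second document"})
print(len(df.raw_data["text"]))  # 2
```

Keeping one list per key lets a later tokenization pass iterate over all raw values for a field in one sweep, after which `clear` releases the raw text.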
- save_to_hdf5(h5file: Any, write_in_batch: bool, dtype: str = 'i4', compression: str = 'gzip') None [source]#
Save the DataFrame object to an HDF5 file.
- Parameters
h5file – An open HDF5 file handle to write into.
write_in_batch (bool) – Whether to write the examples in a single batched write.
dtype (str) – Data type string for the stored datasets; defaults to 'i4' (32-bit integers).
compression (str) – HDF5 compression filter to apply; defaults to 'gzip'.
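The batched-versus-incremental distinction that `write_in_batch` suggests can be sketched as follows. This is an assumption about the flag's semantics; a plain Python list stands in for an HDF5 dataset, and `save_examples` is an illustrative helper, not the real method.

```python
from typing import Any, List


def save_examples(dataset: List[Any], examples: List[Any], write_in_batch: bool) -> None:
    if write_in_batch:
        # One bulk write for the whole DataFrame: fewer, larger I/O operations.
        dataset.extend(examples)
    else:
        # One write per example: more calls, smaller writes.
        for ex in examples:
            dataset.append(ex)


store: List[Any] = []
save_examples(store, [[1, 2, 3], [4, 5, 6]], write_in_batch=True)
print(len(store))  # 2
```

Either path yields the same stored data; batching mainly trades many small writes for one large one.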
- append_to_hdf5(output_dir, total_chunks, pid, chunk_locks, dtype='i4', compression='gzip')[source]#
Appends the examples in a DataFrame object to different HDF5 files. This API is called when online shuffling is used.
- Parameters
output_dir – Output directory where the HDF5 data is written.
total_chunks – Estimated total number of output chunks.
pid – Process ID of the writer process.
chunk_locks – List of file-specific chunk locks used while appending to an output file.
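The interplay of `total_chunks`, `pid`, and `chunk_locks` can be sketched with in-memory lists standing in for the output files. The chunk-selection policy shown (index plus pid, modulo the chunk count) is an assumption for illustration; only the lock-per-chunk pattern is what the parameters imply.

```python
import threading
from typing import Any, Dict, List

total_chunks = 4
# In-memory stand-ins for the per-chunk HDF5 output files.
chunk_files: Dict[int, List[Any]] = {i: [] for i in range(total_chunks)}
# One lock per chunk file, so concurrent writers never interleave appends.
chunk_locks = [threading.Lock() for _ in range(total_chunks)]


def append_example(example: Any, example_idx: int, pid: int) -> int:
    # Pick a target chunk from the example index and writer pid (assumed policy),
    # then serialize the append with that chunk's lock.
    chunk_id = (example_idx + pid) % total_chunks
    with chunk_locks[chunk_id]:
        chunk_files[chunk_id].append(example)
    return chunk_id


for i in range(8):
    append_example({"tokens": [i]}, example_idx=i, pid=0)
print(sorted(len(v) for v in chunk_files.values()))  # [2, 2, 2, 2]
```

Spreading examples across chunk files this way is what enables online shuffling: each writer scatters its examples over many files rather than appending them to one file in order.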
- add(value: Dict[str, Any]) None [source]#
Add an entry to the DataFrame.
- Parameters
value (Dict[str, Any]) – Entry to be added.