cerebras.modelzoo.data_preparation.nlp.chunk_data_processing.data_reader.DataFrame#
- class cerebras.modelzoo.data_preparation.nlp.chunk_data_processing.data_reader.DataFrame[source]#
Bases: object
Initialize the DataFrame object.
- Parameters
keys (Dict) – Keys for the data entries.
Methods
Add an entry to the DataFrame.
Appends the examples in a DataFrame object to different HDF5 files.
Checks whether a document is corrupted (used for summarization tasks).
Clear the raw data after tokenizing.
Save the DataFrame object to an HDF5 file.
Tokenize the data values.
- __init__(keys: Optional[Dict] = None)[source]#
Initialize the DataFrame object.
- Parameters
keys (Dict) – Keys for the data entries.
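The per-key accumulation this class describes can be sketched with a toy stand-in. This is a hypothetical re-implementation for illustration only, not the actual `cerebras.modelzoo` class; the `SimpleDataFrame` name and the internal `raw_data` layout are assumptions.

```python
from typing import Any, Dict, List, Optional


class SimpleDataFrame:
    """Toy stand-in for the chunk_data_processing DataFrame (illustrative only)."""

    def __init__(self, keys: Optional[Dict] = None):
        # Field names for the data entries, e.g. {"text": ...} for plain LM data.
        self.keys = list(keys or {})
        # One accumulator list of raw values per key (assumed internal layout).
        self.raw_data: Dict[str, List[Any]] = {k: [] for k in self.keys}

    def add(self, value: Dict[str, Any]) -> None:
        # Route each field of the incoming entry to its per-key list.
        for k in self.keys:
            if k in value:
                self.raw_data[k].append(value[k])

    def clear(self) -> None:
        # Drop the raw data once it has been tokenized.
        self.raw_data = {k: [] for k in self.keys}


df = SimpleDataFrame(keys={"text": None})
df.add({"text": "hello world"})
df.add({"text": "second document"})
print(len(df.raw_data["text"]))  # 2
```

Keeping one list per key lets a later tokenization pass iterate over all raw values for a field in one sweep, after which `clear` releases the raw text.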
- save_to_hdf5(h5file: Any, write_in_batch: bool, dtype: str = 'i4', compression: str = 'gzip') None [source]#
Save the DataFrame object to an HDF5 file.
- Parameters
h5file – An open HDF5 file handle to write into.
write_in_batch (bool) – Whether to write the examples in a single batched write.
dtype (str) – Data type string for the stored datasets; defaults to 'i4' (32-bit integers).
compression (str) – HDF5 compression filter to apply; defaults to 'gzip'.
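The batched-versus-incremental distinction that `write_in_batch` suggests can be sketched as follows. This is an assumption about the flag's semantics; a plain Python list stands in for an HDF5 dataset, and `save_examples` is an illustrative helper, not the real method.

```python
from typing import Any, List


def save_examples(dataset: List[Any], examples: List[Any], write_in_batch: bool) -> None:
    if write_in_batch:
        # One bulk write for the whole DataFrame: fewer, larger I/O operations.
        dataset.extend(examples)
    else:
        # One write per example: more calls, smaller writes.
        for ex in examples:
            dataset.append(ex)


store: List[Any] = []
save_examples(store, [[1, 2, 3], [4, 5, 6]], write_in_batch=True)
print(len(store))  # 2
```

Either path yields the same stored data; batching mainly trades many small writes for one large one.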
- append_to_hdf5(output_dir, total_chunks, pid, chunk_locks, dtype='i4', compression='gzip')[source]#
Appends the examples in a DataFrame object to different HDF5 files. This API is called when online shuffling is used.
- Parameters
output_dir – Output directory where the HDF5 data is written.
total_chunks – Estimated total number of output chunks.
pid – Process ID of the writer process.
chunk_locks – List of file-specific chunk locks used while appending to an output file.
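The interplay of `total_chunks`, `pid`, and `chunk_locks` can be sketched with in-memory lists standing in for the output files. The chunk-selection policy shown (index plus pid, modulo the chunk count) is an assumption for illustration; only the lock-per-chunk pattern is what the parameters imply.

```python
import threading
from typing import Any, Dict, List

total_chunks = 4
# In-memory stand-ins for the per-chunk HDF5 output files.
chunk_files: Dict[int, List[Any]] = {i: [] for i in range(total_chunks)}
# One lock per chunk file, so concurrent writers never interleave appends.
chunk_locks = [threading.Lock() for _ in range(total_chunks)]


def append_example(example: Any, example_idx: int, pid: int) -> int:
    # Pick a target chunk from the example index and writer pid (assumed policy),
    # then serialize the append with that chunk's lock.
    chunk_id = (example_idx + pid) % total_chunks
    with chunk_locks[chunk_id]:
        chunk_files[chunk_id].append(example)
    return chunk_id


for i in range(8):
    append_example({"tokens": [i]}, example_idx=i, pid=0)
print(sorted(len(v) for v in chunk_files.values()))  # [2, 2, 2, 2]
```

Spreading examples across chunk files this way is what enables online shuffling: each writer scatters its examples over many files rather than appending them to one file in order.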
- add(value: Dict[str, Any]) None [source]#
Add an entry to the DataFrame.
- Parameters
value (Dict[str, Any]) – Entry to be added.