cerebras.modelzoo.data_preparation.nlp.chunk_data_processing.data_reader.DataFrame#

class cerebras.modelzoo.data_preparation.nlp.chunk_data_processing.data_reader.DataFrame[source]#

Bases: object

Initialize the DataFrame object.

Parameters

keys (Dict) – Keys for the data entries.

Methods

add

Add an entry to the DataFrame.

append_to_hdf5

Appends the examples in a DataFrame object to different HDF5 files.

check_valid_multi_turn_dialogue

Checks whether the document is corrupted, in the case of summarization tasks.

clear

Clear the raw data after tokenizing.

save_mlm_data_to_csv

Save the processed tokenized data to a CSV file.

save_to_hdf5

Save the DataFrame object to an HDF5 file.

tokenize

Tokenize the data values.

__init__(keys: Optional[Dict] = None)[source]#

Initialize the DataFrame object.

Parameters

keys (Dict) – Keys for the data entries.
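A minimal construction sketch; the key names ("prompt", "completion") and the exact layout of the keys dict are illustrative assumptions, not mandated by the API:

from cerebras.modelzoo.data_preparation.nlp.chunk_data_processing.data_reader import (
    DataFrame,
)

# Declare which fields each raw data entry carries (layout assumed for illustration).
df = DataFrame(keys={"prompt": "prompt", "completion": "completion"})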

save_to_hdf5(h5file: Any, write_in_batch: bool, dtype: str = 'i4', compression: str = 'gzip') → None[source]#

Save the DataFrame object to an HDF5 file.

Parameters
  • h5file – An HDF5 file handle.

  • write_in_batch (bool) – Whether to write the examples to the HDF5 file in batch.

  • dtype (str) – Data type for the stored data. Defaults to 'i4'.

  • compression (str) – Compression applied to the HDF5 datasets. Defaults to 'gzip'.
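A minimal sketch using an h5py file handle; the file name and the write_in_batch choice are illustrative, and the DataFrame is assumed to be tokenized already:

import h5py

# Open (or create) the output file and hand the handle to the DataFrame.
with h5py.File("out.h5", "w") as h5file:
    df.save_to_hdf5(h5file, write_in_batch=True, dtype="i4", compression="gzip")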

save_mlm_data_to_csv(csv_file_path)[source]#

Save the processed tokenized data to a CSV file.

Parameters

csv_file_path (str) – Path to the CSV file to write.
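A short sketch; the output path is illustrative and the DataFrame is assumed to hold processed MLM data:

# Write the tokenized MLM data to a CSV file at the given (illustrative) path.
df.save_mlm_data_to_csv("mlm_data.csv")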

append_to_hdf5(output_dir, total_chunks, pid, chunk_locks, dtype='i4', compression='gzip')[source]#

Appends the examples in a DataFrame object to different HDF5 files. This API is called when online shuffling is used.

Parameters
  • output_dir – Output directory where the HDF5 data will be written.

  • total_chunks – Total number of estimated output chunks.

  • pid – Process id of the writer process.

  • chunk_locks – List of file-specific chunk locks used while appending to an output file.
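A hedged sketch of a writer process appending with online shuffling; the output directory, chunk count, and one-lock-per-chunk layout are assumptions made for illustration:

import multiprocessing

total_chunks = 8
# One lock per output chunk file (assumed layout), shared with the other writer processes.
chunk_locks = [multiprocessing.Lock() for _ in range(total_chunks)]

df.append_to_hdf5(
    output_dir="./hdf5_out",
    total_chunks=total_chunks,
    pid=0,
    chunk_locks=chunk_locks,
    dtype="i4",
    compression="gzip",
)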

add(value: Dict[str, Any]) → None[source]#

Add an entry to the DataFrame.

Parameters

value (Dict[str, Any]) – Entry to be added.
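Continuing the construction sketch above, each entry is a dict keyed by the fields declared at construction time (field names remain illustrative):

# Add one raw, untokenized entry; tokenization happens later via tokenize().
df.add({"prompt": "What is HDF5?", "completion": "A hierarchical file format."})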

clear() → None[source]#

Clear the raw data after tokenizing.

check_valid_multi_turn_dialogue(doc)[source]#

Checks whether the document is corrupted, in the case of summarization tasks.

tokenize(dataset_processor: Any) → None[source]#

Tokenize the data values.

Parameters

dataset_processor – The dataset processor used to process the data.
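A hedged sketch of the tokenize-then-clear flow; processor is a placeholder for whichever dataset processor your pipeline constructs, not a specific class from this package:

# `processor` is hypothetical here; pass the dataset processor your pipeline uses.
df.tokenize(processor)  # convert the raw entries into tokenized form
df.clear()              # release the raw data once tokenization is done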