modelzoo.transformers.data_processing.scripts.chunk_preprocessing.data_reader.DataFrame#

class modelzoo.transformers.data_processing.scripts.chunk_preprocessing.data_reader.DataFrame[source]#

Bases: object

Initialize the DataFrame object.

Parameters

keys (List[str], optional) – Keys for the data entries. Defaults to ['text'].

Methods

add

Add an entry to the DataFrame.

append_to_hdf5

Appends the examples in a DataFrame object to different HDF5 files.

clear

Clear the data and reset size.

save_to_hdf5

Save the DataFrame object to an HDF5 file.

tokenize

Tokenize the data values.

__init__(keys: Optional[List[str]] = None)[source]#

Initialize the DataFrame object.

Parameters

keys (List[str], optional) – Keys for the data entries. Defaults to ['text'].
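A minimal construction sketch, assuming the module is importable from a modelzoo checkout; the 'meta' key is a hypothetical example, not a pipeline default:

    from modelzoo.transformers.data_processing.scripts.chunk_preprocessing.data_reader import (
        DataFrame,
    )

    # Default keys: ['text']
    df = DataFrame()

    # Hypothetical custom keys for entries that carry more than one field
    df_meta = DataFrame(keys=["text", "meta"])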

add(value: Dict[str, Any]) → None[source]#

Add an entry to the DataFrame.

Parameters

value (Dict[str, Any]) – Entry to be added.
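A short sketch of adding entries; the dict keys are expected to match the keys passed at construction, and the sample strings are illustrative:

    df = DataFrame(keys=["text"])

    # Each call appends one entry to the DataFrame.
    df.add({"text": "The quick brown fox jumps over the lazy dog."})
    df.add({"text": "A second sample document for chunk preprocessing."})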

append_to_hdf5(output_dir, total_chunks, pid, dtype='i4', compression='gzip')[source]#

Appends the examples in a DataFrame object to different HDF5 files. This API is called when online shuffling is used.

Parameters
  • output_dir – Output directory where the HDF5 data is written.

  • total_chunks – Total number of estimated output chunks.

  • pid – Process id of the writer process.

  • dtype – Data type for the stored tokens. Defaults to 'i4'.

  • compression – Compression used for the HDF5 datasets. Defaults to 'gzip'.
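A hedged sketch of a call; the output directory, chunk estimate, and process id below are hypothetical values, not pipeline defaults:

    df.append_to_hdf5(
        output_dir="./hdf5_out",  # hypothetical output directory
        total_chunks=8,           # hypothetical estimate of output chunks
        pid=0,                    # hypothetical writer process id
    )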

clear() → None[source]#

Clear the data and reset size.
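For completeness, a one-line sketch, reusing the df object from the earlier examples:

    df.clear()  # drop all stored entries and reset the internal size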

save_to_hdf5(h5file: Any, write_in_batch: bool, dtype: str = 'i4', compression: str = 'gzip') → None[source]#

Save the DataFrame object to an HDF5 file.

Parameters
  • h5file – An HDF5 file handle.

  • write_in_batch (bool) – Whether to write the data in a single batch.

  • dtype (str) – Data type for the stored tokens. Defaults to 'i4'.

  • compression (str) – Compression used for the HDF5 datasets. Defaults to 'gzip'.
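A sketch using an h5py file handle; the file name is hypothetical, the data is assumed to have been tokenized already, and 'i4'/'gzip' mirror the documented defaults:

    import h5py

    with h5py.File("./hdf5_out/data-0.h5", "w") as h5file:
        df.save_to_hdf5(h5file, write_in_batch=True, dtype="i4", compression="gzip")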

tokenize(dataset_processor: Any, pad_token_id: int) → None[source]#

Tokenize the data values.

Parameters
  • dataset_processor – Dataset processor used to tokenize the data.

  • pad_token_id (int) – ID of the pad token.
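A sketch of the call; make_dataset_processor() is a hypothetical stand-in for however the pipeline constructs its dataset processor, and pad_token_id=0 is illustrative:

    # Hypothetical factory; not part of this module's API.
    processor = make_dataset_processor()

    # Tokenize the stored data values (see the parameter docs above).
    df.tokenize(dataset_processor=processor, pad_token_id=0)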