modelzoo.transformers.data_processing.scripts.chunk_preprocessing.data_reader.DataFrame#
- class modelzoo.transformers.data_processing.scripts.chunk_preprocessing.data_reader.DataFrame[source]#
Bases:
object
Initialize the DataFrame object.
- Parameters
keys (List[str], optional) – Keys for the data entries. Defaults to [‘text’].
Methods
Add an entry to the DataFrame.
Appends the examples in a DataFrame object to different HDF5 files.
Clear the data and reset size.
Save the DataFrame object to an HDF5 file.
Tokenize the data values.
- __init__(keys: Optional[List[str]] = None)[source]#
Initialize the DataFrame object.
- Parameters
keys (List[str], optional) – Keys for the data entries. Defaults to [‘text’].
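A minimal construction sketch (not part of the source docs; the custom keys below are illustrative):

```python
from modelzoo.transformers.data_processing.scripts.chunk_preprocessing.data_reader import (
    DataFrame,
)

# Default keys (['text']).
df = DataFrame()

# Hypothetical custom keys for a prompt/completion style dataset.
df_custom = DataFrame(keys=["prompt", "completion"])
```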
- add(value: Dict[str, Any]) None [source]#
Add an entry to the DataFrame.
- Parameters
value (Dict[str, Any]) – Entry to be added.
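A hedged usage sketch for add, assuming each entry is a plain dictionary keyed by the DataFrame's keys:

```python
from modelzoo.transformers.data_processing.scripts.chunk_preprocessing.data_reader import (
    DataFrame,
)

df = DataFrame(keys=["text"])

# Each entry is a dict whose keys match the DataFrame's keys.
df.add({"text": "The quick brown fox jumps over the lazy dog."})
df.add({"text": "Another raw document, tokenized later via tokenize()."})
```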
- append_to_hdf5(output_dir, total_chunks, pid, dtype='i4', compression='gzip')[source]#
Appends the examples in a DataFrame object to different HDF5 files. This API is called when online shuffling is used.
- Parameters
output_dir – Output directory where the HDF5 data is dumped.
total_chunks – Total number of estimated output chunks.
pid – Process ID of the writer process.
dtype – Data type of the HDF5 datasets. Defaults to 'i4'.
compression – Compression used for the HDF5 datasets. Defaults to 'gzip'.
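A hedged sketch of how a writer process might call append_to_hdf5 when online shuffling is enabled; the output directory and chunk count below are illustrative:

```python
import os

from modelzoo.transformers.data_processing.scripts.chunk_preprocessing.data_reader import (
    DataFrame,
)

df = DataFrame(keys=["text"])
df.add({"text": "An example document."})
# In the real pipeline the DataFrame would be tokenized before writing.

output_dir = "./hdf5_shuffled"  # illustrative output directory
os.makedirs(output_dir, exist_ok=True)

# Scatter this DataFrame's examples across the estimated output chunks.
df.append_to_hdf5(output_dir=output_dir, total_chunks=64, pid=os.getpid())
```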
- save_to_hdf5(h5file: Any, write_in_batch: bool, dtype: str = 'i4', compression: str = 'gzip') None [source]#
Save the DataFrame object to an HDF5 file.
- Parameters
h5file – An HDF5 file handle to write into.
write_in_batch (bool) – Whether to write all examples to the HDF5 file in a single batch.
dtype (str) – Data type of the HDF5 datasets. Defaults to 'i4'.
compression (str) – Compression used for the HDF5 datasets. Defaults to 'gzip'.
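A minimal sketch using an h5py file handle; the file name and batching choice are assumptions, not prescribed by the source docs:

```python
import h5py

from modelzoo.transformers.data_processing.scripts.chunk_preprocessing.data_reader import (
    DataFrame,
)

df = DataFrame(keys=["text"])
df.add({"text": "An example document."})
# In the real pipeline tokenize() would run before saving (see below).

with h5py.File("chunk_0.h5", "w") as h5file:  # illustrative file name
    df.save_to_hdf5(h5file, write_in_batch=True)  # dtype='i4', compression='gzip' by default
```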
- tokenize(dataset_processor: Any, pad_token_id: int) None [source]#
Tokenize the data values.
- Parameters
dataset_processor – Dataset processor used to tokenize the data values.
pad_token_id (int) – ID of the pad token.
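A hedged end-to-end sketch; dataset_processor stands in for an already-configured processor supplied by the chunk-preprocessing pipeline, and the pad token ID is illustrative:

```python
from modelzoo.transformers.data_processing.scripts.chunk_preprocessing.data_reader import (
    DataFrame,
)

df = DataFrame(keys=["text"])
df.add({"text": "Raw text to be converted into token ids."})

# Placeholder: in the chunk-preprocessing pipeline this is an already-configured
# dataset processor passed in by the calling script, not constructed here.
dataset_processor = ...

df.tokenize(dataset_processor, pad_token_id=0)  # 0 is an illustrative pad token ID
# After tokenization the entries can be written out, e.g. via save_to_hdf5().
```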