modelzoo.transformers.data_processing.scripts.chunk_preprocessing.data_reader.DataFrame#
- class modelzoo.transformers.data_processing.scripts.chunk_preprocessing.data_reader.DataFrame[source]#
Bases:
object
Initialize the DataFrame object.
- Parameters
keys (List[str], optional) – Keys for the data entries. Defaults to [‘text’].
Methods
Add an entry to the DataFrame.
Appends the examples in a DataFrame object to different HDF5 files.
Clear the data and reset size.
Save the DataFrame object to an HDF5 file.
Tokenize the data values.
- __init__(keys: Optional[List[str]] = None)[source]#
Initialize the DataFrame object.
- Parameters
keys (List[str], optional) – Keys for the data entries. Defaults to [‘text’].
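A minimal construction sketch (not part of the source docs; the custom keys below are illustrative):

```python
from modelzoo.transformers.data_processing.scripts.chunk_preprocessing.data_reader import (
    DataFrame,
)

# Default keys (['text']).
df = DataFrame()

# Hypothetical custom keys for a prompt/completion style dataset.
df_custom = DataFrame(keys=["prompt", "completion"])
```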
- add(value: Dict[str, Any]) None [source]#
Add an entry to the DataFrame.
- Parameters
value (Dict[str, Any]) – Entry to be added.
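A hedged usage sketch for add, assuming each entry is a plain dictionary keyed by the DataFrame's keys:

```python
from modelzoo.transformers.data_processing.scripts.chunk_preprocessing.data_reader import (
    DataFrame,
)

df = DataFrame(keys=["text"])

# Each entry is a dict whose keys match the DataFrame's keys.
df.add({"text": "The quick brown fox jumps over the lazy dog."})
df.add({"text": "Another raw document, tokenized later via tokenize()."})
```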
- append_to_hdf5(output_dir, total_chunks, pid, dtype='i4', compression='gzip')[source]#
Appends the examples in a DataFrame object to different HDF5 files. This API is called when online shuffling is used.
- Parameters
output_dir – Output directory where the HDF5 data is dumped.
total_chunks – Total number of estimated output chunks.
pid – Process ID of the writer process.
dtype – Data type of the HDF5 datasets. Defaults to 'i4'.
compression – Compression used for the HDF5 datasets. Defaults to 'gzip'.
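A hedged sketch of how a writer process might call append_to_hdf5 when online shuffling is enabled; the output directory and chunk count below are illustrative:

```python
import os

from modelzoo.transformers.data_processing.scripts.chunk_preprocessing.data_reader import (
    DataFrame,
)

df = DataFrame(keys=["text"])
df.add({"text": "An example document."})
# In the real pipeline the DataFrame would be tokenized before writing.

output_dir = "./hdf5_shuffled"  # illustrative output directory
os.makedirs(output_dir, exist_ok=True)

# Scatter this DataFrame's examples across the estimated output chunks.
df.append_to_hdf5(output_dir=output_dir, total_chunks=64, pid=os.getpid())
```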
- save_to_hdf5(h5file: Any, write_in_batch: bool, dtype: str = 'i4', compression: str = 'gzip') None [source]#
Save the DataFrame object to an HDF5 file.
- Parameters
h5file – An HDF5 file handle to write into.
write_in_batch (bool) – Whether to write all examples to the HDF5 file in a single batch.
dtype (str) – Data type of the HDF5 datasets. Defaults to 'i4'.
compression (str) – Compression used for the HDF5 datasets. Defaults to 'gzip'.
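A minimal sketch using an h5py file handle; the file name and batching choice are assumptions, not prescribed by the source docs:

```python
import h5py

from modelzoo.transformers.data_processing.scripts.chunk_preprocessing.data_reader import (
    DataFrame,
)

df = DataFrame(keys=["text"])
df.add({"text": "An example document."})
# In the real pipeline tokenize() would run before saving (see below).

with h5py.File("chunk_0.h5", "w") as h5file:  # illustrative file name
    df.save_to_hdf5(h5file, write_in_batch=True)  # dtype='i4', compression='gzip' by default
```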
- tokenize(dataset_processor: Any, pad_token_id: int) None [source]#
Tokenize the data values.
- Parameters
dataset_processor – Dataset processor used to tokenize the data values.
pad_token_id (int) – ID of the pad token.
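A hedged end-to-end sketch; dataset_processor stands in for an already-configured processor supplied by the chunk-preprocessing pipeline, and the pad token ID is illustrative:

```python
from modelzoo.transformers.data_processing.scripts.chunk_preprocessing.data_reader import (
    DataFrame,
)

df = DataFrame(keys=["text"])
df.add({"text": "Raw text to be converted into token ids."})

# Placeholder: in the chunk-preprocessing pipeline this is an already-configured
# dataset processor passed in by the calling script, not constructed here.
dataset_processor = ...

df.tokenize(dataset_processor, pad_token_id=0)  # 0 is an illustrative pad token ID
# After tokenization the entries can be written out, e.g. via save_to_hdf5().
```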