cerebras.modelzoo.data.common.h5_map_dataset.readers.H5Reader

class cerebras.modelzoo.data.common.h5_map_dataset.readers.H5Reader

Bases: object

Class for reading individual sequences from HDF5 files stored on disk.

Supports two formats of data on disk:
  1. A rank-1 tensor of concatenated tokenized documents.

  2. A tensor of rank greater than 1 holding preprocessed samples, where the first axis (axis 0) of the data on disk indexes the samples.
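To make the two layouts concrete, the following sketch uses plain Python lists as stand-ins for HDF5 datasets. The names `corpus`, `samples`, `read_from_corpus`, and `read_from_samples` are hypothetical, and cutting the corpus into non-overlapping windows is an assumption made for illustration, not the reader's confirmed behavior.

```python
# Format 1: a rank-1 array of token IDs, with documents concatenated end to end.
corpus = [11, 12, 13, 14, 15, 16, 17, 18]

# Format 2: a rank-2 array whose first axis (axis 0) indexes samples.
samples = [
    [11, 12, 13, 14],
    [15, 16, 17, 18],
]


def read_from_corpus(tokens, index, sequence_length):
    """Return the index-th non-overlapping window of sequence_length tokens.

    Assumption: windows are laid out back to back, starting at token 0.
    """
    start = index * sequence_length
    return tokens[start:start + sequence_length]


def read_from_samples(data, index):
    """Return the index-th preprocessed sample (one row along axis 0)."""
    return data[index]


# With a corpus, the sample boundaries come from sequence_length.
window = read_from_corpus(corpus, 1, sequence_length=4)

# With preprocessed samples, the boundaries are already fixed on disk,
# which is why no sequence length is needed.
row = read_from_samples(samples, 1)
```

In this toy data both calls return the same four tokens; the difference is only in where the sample boundaries are decided.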

Creates a reader for an HDF5 corpus.

Parameters
  • data_dirs – Directories containing h5 files to read from.

  • extra_data_keys – Additional HDF5 keys containing data to read from.

  • sequence_length – The number of tokens per sample if reading from a corpus. Must be None if the data has already been preprocessed into samples.

  • read_extra_token – Whether to read and return one extra token after the end of the sequence. This can be useful for language modeling tasks where you want to construct the labels as a shifted version of the inputs. Setting this to True differs from increasing sequence_length by one in that the extra token returned due to this flag will also be included in some other sequence as its first token. Ignored if sequence_length is None.

  • data_subset – A string specifying the subset of the corpus to consider. E.g. if data_subset="0.0-0.75" is specified, only samples in the first 3/4 of the dataset will be considered and the last 1/4 of the dataset will be completely untouched. The self-reported length will be the length of the valid portion of the dataset (e.g. the first 3/4), and any attempt to access an element beyond this length will result in an exception.

  • sort – Whether to sort the file paths after reading them. This flag is included for backwards compatibility and should almost always be set to True. It will be removed in the future.

  • use_vsl – Flag to enable variable sequence length training. It requires the dataset to have two extra features: the attention_span of keys and the position_ids of tokens.
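The effect of read_extra_token described in the parameter list can be sketched in plain Python. This is a minimal illustration, not the library's implementation: `read_sequence` is a hypothetical helper, and the assumption is that sequences start every sequence_length tokens.

```python
def read_sequence(tokens, index, sequence_length, read_extra_token=False):
    """Return the index-th sequence from a flat token list.

    Assumption: sequences start every sequence_length tokens. With
    read_extra_token=True the window is one token longer, but the stride
    between sequences is unchanged, so consecutive windows overlap by one.
    """
    start = index * sequence_length
    end = start + sequence_length + (1 if read_extra_token else 0)
    return tokens[start:end]


tokens = list(range(10))

first = read_sequence(tokens, 0, 4, read_extra_token=True)
second = read_sequence(tokens, 1, 4, read_extra_token=True)

# Labels are the inputs shifted left by one position.
inputs, labels = first[:-1], first[1:]

# Contrast: increasing sequence_length by one also moves the stride,
# so the second window no longer begins with the first window's last token.
longer_second = read_sequence(tokens, 1, 5)
```

Here the last token of `first` is also the first token of `second`, which is the overlap the parameter description refers to.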
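The data_subset behavior can be illustrated with a small stand-in. The helper `subset_bounds` and the class `SubsetView` are hypothetical, only a single "low-high" range is handled, and truncating the boundaries with `int()` is an assumption; the actual reader may parse and round differently.

```python
def subset_bounds(data_subset, num_samples):
    """Convert a 'low-high' fraction string into (start, stop) sample indices."""
    low, high = (float(part) for part in data_subset.split("-"))
    if not 0.0 <= low < high <= 1.0:
        raise ValueError(f"invalid data_subset: {data_subset!r}")
    # Assumption: boundaries are truncated toward zero.
    return int(low * num_samples), int(high * num_samples)


class SubsetView:
    """Expose only the selected portion of an indexable dataset."""

    def __init__(self, data, data_subset):
        self._data = data
        self._start, self._stop = subset_bounds(data_subset, len(data))

    def __len__(self):
        # The reported length covers only the valid portion.
        return self._stop - self._start

    def __getitem__(self, index):
        # Accessing past the valid portion raises instead of reading on.
        if not 0 <= index < len(self):
            raise IndexError(f"index {index} out of range for subset")
        return self._data[self._start + index]


view = SubsetView(list(range(8)), "0.0-0.75")
```

With eight samples and "0.0-0.75", the view reports a length of six, and the last two samples are never touched.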

Methods

Attributes

by_sample

vdataset

__init__(data_dirs: Union[str, List[str]], extra_data_keys: Optional[List[str]] = None, sequence_length: Optional[int] = None, read_extra_token: bool = False, data_subset: Optional[str] = None, sort: bool = True, use_vsl: bool = False)
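A minimal construction sketch based only on the signature above. The directory path and the argument values are placeholders, and `build_reader` is a hypothetical wrapper; the import is deferred into the function so the snippet can be loaded on a machine without the Cerebras packages installed.

```python
# Placeholder arguments for reading fixed-length sequences from a rank-1 corpus.
reader_kwargs = dict(
    data_dirs=["/path/to/h5_files"],  # hypothetical directory
    sequence_length=2048,             # tokens per sample; None for preprocessed data
    read_extra_token=True,            # one extra token for building shifted labels
    data_subset="0.0-0.75",           # use only the first 3/4 of the corpus
    sort=True,                        # recommended setting per the parameter notes
)


def build_reader(**kwargs):
    """Construct an H5Reader from keyword arguments matching the signature."""
    # Deferred import: requires the cerebras.modelzoo package at call time.
    from cerebras.modelzoo.data.common.h5_map_dataset.readers import H5Reader

    return H5Reader(**kwargs)


# reader = build_reader(**reader_kwargs)  # run where the package and data exist
```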
