cerebras.modelzoo.data_preparation.raw_dataset_processor.RawDatasetProcessor.RawDatasetProcessor

class cerebras.modelzoo.data_preparation.raw_dataset_processor.RawDatasetProcessor.RawDatasetProcessor(*args, **kwargs)

Bases: torch.utils.data.IterableDataset

Methods

collate_fn

Collates a list of dictionaries into a batch.

create_dataloader

Classmethod to create the dataloader object.

get_next_item

Returns the next item in the iteration.

get_next_item()

Returns the next item in the iteration.

This function iterates over the data stream from the reader, tokenizes the data, and yields dictionaries containing features as keys and NumPy arrays as values.

Returns

An iterator yielding dictionaries with string keys and NumPy array values.

Return type

Iterator[Dict[str, np.ndarray]]
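
The read, tokenize, and yield pattern described above can be sketched as a plain Python generator. This is only an illustration under stated assumptions, not the library's implementation: the whitespace tokenizer `toy_tokenize`, the function `iter_features`, the `max_len` parameter, and the feature names `input_ids` and `attention_mask` are all hypothetical.

```python
from typing import Dict, Iterable, Iterator

import numpy as np


def toy_tokenize(text: str, vocab: Dict[str, int]) -> np.ndarray:
    """Hypothetical whitespace tokenizer; unknown words map to id 0."""
    return np.array([vocab.get(w, 0) for w in text.split()], dtype=np.int32)


def iter_features(
    reader: Iterable[str], vocab: Dict[str, int], max_len: int = 4
) -> Iterator[Dict[str, np.ndarray]]:
    """Yield one feature dictionary per raw text record."""
    for text in reader:
        # Tokenize, then truncate to the fixed sequence length.
        ids = toy_tokenize(text, vocab)[:max_len]
        pad = max_len - len(ids)
        yield {
            # Right-pad with zeros so every array has length max_len.
            "input_ids": np.pad(ids, (0, pad)),
            "attention_mask": np.pad(np.ones(len(ids), dtype=np.int32), (0, pad)),
        }


vocab = {"hello": 1, "world": 2}
items = list(iter_features(["hello world", "hello there big wide world"], vocab))
print(items[0]["input_ids"])       # [1 2 0 0]
print(items[0]["attention_mask"])  # [1 1 0 0]
```

Because the function is a generator, records are tokenized lazily as the consumer iterates, which matches the streaming behavior of an iterable-style dataset.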

collate_fn(batch)

Collates a list of dictionaries into a batch.

Parameters

batch (List[Dict[str, np.ndarray]]) – A list of dictionaries, where each dictionary contains string keys and NumPy array values.

Returns

The collated batch.

Return type

Any
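
One common way to collate such a batch is to stack the arrays that share a key along a new leading batch dimension. The sketch below assumes every sample has the same keys and that arrays under a given key have identical shapes; `stack_collate` is a hypothetical helper, and the actual collation logic (whose return type is documented only as Any) may differ.

```python
from typing import Dict, List

import numpy as np


def stack_collate(batch: List[Dict[str, np.ndarray]]) -> Dict[str, np.ndarray]:
    """Stack per-sample arrays key by key into arrays of shape (batch, ...)."""
    if not batch:
        raise ValueError("batch must contain at least one sample")
    # Assumption: all samples share the keys of the first sample.
    return {key: np.stack([sample[key] for sample in batch]) for key in batch[0]}


batch = [
    {"input_ids": np.array([1, 2, 0, 0])},
    {"input_ids": np.array([1, 0, 0, 0])},
]
collated = stack_collate(batch)
print(collated["input_ids"].shape)  # (2, 4)
```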

create_dataloader()

Classmethod to create the dataloader object.

Returns

A DataLoader object for the dataset.

Return type

DataLoader
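
To show how a classmethod can wire an iterable-style dataset and its collate function into a `torch.utils.data.DataLoader`, here is a minimal sketch. `ToyIterableDataset` and its constructor arguments (`num_samples`, `seq_len`, `batch_size`) are hypothetical stand-ins; the real `create_dataloader` is documented above without parameters, and its configuration is not shown in this page.

```python
import numpy as np
import torch
from torch.utils.data import DataLoader, IterableDataset


class ToyIterableDataset(IterableDataset):
    """Hypothetical stand-in for a dataset that streams feature dictionaries."""

    def __init__(self, num_samples: int, seq_len: int):
        super().__init__()
        self.num_samples = num_samples
        self.seq_len = seq_len

    def __iter__(self):
        # Each sample is a dict of NumPy arrays, as in get_next_item.
        for i in range(self.num_samples):
            yield {"input_ids": np.full(self.seq_len, i, dtype=np.int64)}

    def collate_fn(self, batch):
        # Stack per-key arrays, then convert them to tensors.
        return {
            key: torch.from_numpy(np.stack([sample[key] for sample in batch]))
            for key in batch[0]
        }

    @classmethod
    def create_dataloader(cls, num_samples=5, seq_len=3, batch_size=2):
        # Build the dataset and hand its own collate_fn to the DataLoader.
        dataset = cls(num_samples, seq_len)
        return DataLoader(dataset, batch_size=batch_size, collate_fn=dataset.collate_fn)


loader = ToyIterableDataset.create_dataloader()
shapes = [tuple(b["input_ids"].shape) for b in loader]
print(shapes)  # [(2, 3), (2, 3), (1, 3)]
```

With an `IterableDataset`, the `DataLoader` cannot shuffle or index samples; it simply groups consecutive items from the iterator into batches, so the last batch may be smaller than `batch_size`.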