data

Default Dataset and DataLoader implementations, similar to torch.utils.data in PyTorch.

You can also use those provided by PyTorch or huggingface/datasets.

Functions

subset_dataset(dataset, num_samples)

Reduces dataset to num_samples samples. Useful for testing and debugging.

Classes

`DataLoader`(dataset[, batch_size, shuffle, seed])	A simplified version of PyTorch DataLoader.
`Dataset`()	An abstract class representing a `Dataset`.
`Subset`(dataset, indices)	Subset of a dataset at specified indices.

class Dataset

Bases: Generic[T_co]

An abstract class representing a Dataset.

All datasets that represent a map from keys to data samples should subclass it. Every subclass should override __getitem__() so that it fetches the data sample for a given key. Subclasses may optionally override __len__(), which many Sampler implementations and the default options of DataLoader expect to return the size of the dataset. Subclasses may also implement __getitems__() to speed up batched sample loading; this method accepts a list of sample indices for a batch and returns the list of samples.

Note

DataLoader by default constructs an index sampler that yields integral indices. To make it work with a map-style dataset with non-integral indices/keys, a custom sampler must be provided.
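The subclassing contract described above can be sketched as follows. This is a minimal illustration, not the library source: the Dataset base class here is a stand-in defined locally so the example runs on its own, and QADataset with its question/answer rows is a hypothetical name invented for the example.

```python
from typing import Generic, List, Sequence, TypeVar

T_co = TypeVar("T_co", covariant=True)


class Dataset(Generic[T_co]):
    """Local stand-in for the abstract base class (assumption, not the library code)."""

    def __getitem__(self, index) -> T_co:
        raise NotImplementedError("Subclasses of Dataset should implement __getitem__.")


class QADataset(Dataset[dict]):
    """Hypothetical map-style dataset over in-memory question/answer rows."""

    def __init__(self, rows: Sequence[dict]) -> None:
        self._rows = list(rows)

    def __getitem__(self, index: int) -> dict:
        # Required: fetch one sample for a given (integer) key.
        return self._rows[index]

    def __len__(self) -> int:
        # Optional, but expected by loaders that need the dataset size.
        return len(self._rows)

    def __getitems__(self, indices: List[int]) -> List[dict]:
        # Optional: fetch a whole batch in one call.
        return [self._rows[i] for i in indices]


qa = QADataset(
    [
        {"q": "1+1", "a": "2"},
        {"q": "2+2", "a": "4"},
        {"q": "3+3", "a": "6"},
    ]
)
first_and_last = qa.__getitems__([0, 2])
```

Because the keys here are plain integers, a dataset like this should work with a default index sampler; string or tuple keys would need the custom sampler mentioned in the note.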

class Subset(dataset: Dataset[T_co], indices: Sequence[int])

Bases: Dataset[T_co]

Subset of a dataset at specified indices.

Parameters:

dataset (Dataset) – The whole dataset.
indices (sequence) – Indices in the whole dataset that are selected for the subset.

dataset: Dataset[T_co]

indices: Sequence[int]
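To illustrate how a subset view relates to its parent dataset, here is a minimal stand-in that keeps the documented dataset and indices attributes. It is a sketch under the assumption that positions in the subset are simply translated into indices of the whole dataset; the actual class may differ in details.

```python
from typing import Sequence


class Subset:
    """Stand-in sketch of a subset view (assumption, not the library source)."""

    def __init__(self, dataset, indices: Sequence[int]) -> None:
        self.dataset = dataset
        self.indices = indices

    def __getitem__(self, idx: int):
        # Translate a position in the subset into an index in the whole dataset.
        return self.dataset[self.indices[idx]]

    def __len__(self) -> int:
        return len(self.indices)


letters = ["a", "b", "c", "d", "e"]  # any indexable object works as the parent
vowels = Subset(letters, indices=[0, 4])
```

In this sketch no samples are copied: the subset stores only the index list and reads through to the parent on each access.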

class DataLoader(dataset, batch_size: int = 4, shuffle: bool = True, seed: int = 42)

Bases: object

A simplified version of PyTorch DataLoader.

The biggest difference from the PyTorch version is that it is not tied to tensors: it handles any type of data.
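The idea of batching arbitrary Python objects, with optional seeded shuffling, can be sketched as below. SimpleLoader is a hypothetical class written for this example; it mirrors the documented constructor arguments (dataset, batch_size, shuffle, seed) but is not the library's DataLoader, and it makes its own assumptions, such as reshuffling with the same seed on every pass and returning each batch as a plain list.

```python
import random
from typing import Any, Iterator, List, Sequence


class SimpleLoader:
    """Hypothetical loader mirroring the documented signature; not the library class."""

    def __init__(
        self,
        dataset: Sequence[Any],
        batch_size: int = 4,
        shuffle: bool = True,
        seed: int = 42,
    ) -> None:
        self.dataset = dataset
        self.batch_size = batch_size
        self.shuffle = shuffle
        self.seed = seed

    def __len__(self) -> int:
        # Number of batches, counting a final partial batch.
        return (len(self.dataset) + self.batch_size - 1) // self.batch_size

    def __iter__(self) -> Iterator[List[Any]]:
        indices = list(range(len(self.dataset)))
        if self.shuffle:
            # A private Random instance keeps the order reproducible for a given seed.
            random.Random(self.seed).shuffle(indices)
        for start in range(0, len(indices), self.batch_size):
            batch_indices = indices[start:start + self.batch_size]
            # Samples are returned as-is: dicts, strings, dataclasses, and so on.
            yield [self.dataset[i] for i in batch_indices]


samples = [{"id": i, "text": f"sample {i}"} for i in range(10)]
batches = list(SimpleLoader(samples, batch_size=4, shuffle=False))
```

Nothing in the sketch converts or stacks samples, so each batch stays a list of the original objects.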

set_max_steps(max_steps: int)

subset_dataset(dataset, num_samples: int)

Reduces dataset to num_samples samples. Useful for testing and debugging.