Datasets
Datasets are wrapped in a Dataset object.
The Dataset is often used together with utils.data.DataLoader to load data in batches.
DataLoader can also handle parallel data loading with multiple workers and apply data shuffling.
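The batching and shuffling described above can be illustrated with a minimal, standard-library-only sketch. It does not use utils.data.DataLoader itself; the function name iter_batches and its parameters are hypothetical and only show the general idea of iterating over a dataset in (optionally shuffled) batches.

```python
import random
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def iter_batches(
    dataset: Sequence[T],
    batch_size: int,
    shuffle: bool = False,
    seed: int = 0,
) -> Iterator[List[T]]:
    """Yield lists of up to batch_size items (hypothetical helper, not the AdalFlow API)."""
    indices = list(range(len(dataset)))
    if shuffle:
        # Shuffle the indices rather than the data, so the dataset itself is untouched.
        random.Random(seed).shuffle(indices)
    for start in range(0, len(indices), batch_size):
        yield [dataset[i] for i in indices[start:start + batch_size]]


samples = [f"question {i}" for i in range(5)]
batches = list(iter_batches(samples, batch_size=2))
shuffled_batches = list(iter_batches(samples, batch_size=2, shuffle=True))
```

Anything that supports indexing and len() can be batched this way, which is why a Dataset only needs to expose those two operations. The last batch may be smaller than batch_size when the dataset size is not a multiple of it.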
To be able to use your data, you need to:
1. Create a subclass of DataClass that defines the data structure, including a unique identifier and the input and output fields for LLM calls.
2. Create a subclass of utils.data.Dataset that defines how to load the data (local/cloud), split it, preprocess it, and convert it to your defined DataClass. Optionally, you can use PyTorch's dataset; the only caveat is that it often works with Tensor, so you will need to convert it back to normal data at some point.
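The two steps above can be sketched as follows. To stay runnable without AdalFlow installed, this sketch uses the standard library's dataclass as a stand-in for DataClass and a plain class with __getitem__ and __len__ as a stand-in for utils.data.Dataset. The names QAExample, ToyQADataset, and the in-memory _RAW records are hypothetical; a real dataset would subclass the AdalFlow classes and load from disk or the cloud.

```python
import uuid
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class QAExample:
    """Stand-in for a DataClass subclass: a unique id plus input and output fields."""

    id: str = field(default_factory=lambda: str(uuid.uuid4()))
    question: str = ""  # input field for the LLM call
    answer: str = ""    # expected output field


class ToyQADataset:
    """Stand-in for a utils.data.Dataset subclass (hypothetical example)."""

    # In-memory records standing in for a local or cloud download.
    _RAW: Dict[str, List[Tuple[str, str]]] = {
        "train": [(" What is 2 + 2? ", "4"), ("Capital of France?", " Paris")],
        "test": [("What is 3 * 3?", "9")],
    }

    def __init__(self, split: str = "train") -> None:
        if split not in self._RAW:
            raise ValueError(f"Unknown split: {split!r}")
        # Load, preprocess (strip whitespace), and convert to the data class.
        self.data = [
            QAExample(id=f"{split}-{i}", question=q.strip(), answer=a.strip())
            for i, (q, a) in enumerate(self._RAW[split])
        ]

    def __getitem__(self, index: int) -> QAExample:
        return self.data[index]

    def __len__(self) -> int:
        return len(self.data)


train = ToyQADataset(split="train")
first = train[0]
```

Keeping loading, splitting, and preprocessing inside the dataset class means downstream code only ever sees fully formed data class instances.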
By default, AdalFlow saves any downloaded datasets in the ~/.adalflow/cached_datasets directory.
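As a small illustration, the cache location can be resolved and inspected with the standard library. This is only a sketch: list_cached is a hypothetical helper, not part of AdalFlow, and the layout of entries inside the cache directory is not assumed.

```python
from pathlib import Path
from typing import List

# Default cache location stated above; "~" expands to the current user's home directory.
cache_dir = Path("~/.adalflow/cached_datasets").expanduser()


def list_cached(cache_root: Path) -> List[str]:
    """Return the sorted names of entries under cache_root, or [] if it does not exist."""
    if not cache_root.is_dir():
        return []
    return sorted(p.name for p in cache_root.iterdir())


cached = list_cached(cache_dir)
```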
You can see plenty of examples in the Datasets directory. Examples of DataClass can be found in types.