Datasets
Introduction
As an automatic RAG evaluation framework, ARES needs a few datasets to run.
ARES uses the following datasets:
- In-domain prompts dataset
- Unlabeled evaluation set
- Labeled evaluation set (optional, since ARES can create a labeled evaluation set from machine labels using prediction-powered inference, PPI)
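For context on the PPI note above, here is the general prediction-powered estimator of a mean score. It combines a judge's machine labels on a large unlabeled set with a small set of gold labels that measures and cancels the judge's bias. This is a minimal sketch of the textbook estimator, not ARES's internal code: the function name `ppi_mean` and its arguments are hypothetical, and labels are assumed to be 0/1 or scores in [0, 1].

```python
import math
from statistics import NormalDist, fmean, variance


def ppi_mean(unlabeled_preds, labeled_preds, labeled_gold, alpha=0.05):
    """Hypothetical helper: prediction-powered estimate of a mean with a (1 - alpha) CI.

    unlabeled_preds: judge (machine) labels on the large unlabeled set.
    labeled_preds:   judge labels on the small labeled set.
    labeled_gold:    gold labels for the same labeled examples.
    """
    if len(labeled_preds) != len(labeled_gold):
        raise ValueError("labeled_preds and labeled_gold must have the same length")
    n, big_n = len(labeled_gold), len(unlabeled_preds)
    # Rectifier: the judge's average error on examples where the truth is known.
    residuals = [p - y for p, y in zip(labeled_preds, labeled_gold)]
    estimate = fmean(unlabeled_preds) - fmean(residuals)
    # Normal-approximation standard error combining both sources of variance.
    se = math.sqrt(variance(unlabeled_preds) / big_n + variance(residuals) / n)
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return estimate, (estimate - z * se, estimate + z * se)
```

The rectifier term is what lets a few gold labels correct a biased judge: with a perfect judge it is zero, and the estimate reduces to the plain average of the machine labels.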
KILT Hugging Face Dataset
ARES can evaluate any unlabeled evaluation set. If you would like to test ARES further, we provide a filter that retrieves the KILT dataset from Hugging Face.
To load the dataset, use the following code:
Specify dataset name
Specify the name of the dataset you would like to use; the available options include "nq", "hotpotqa", "wow", and "fever".
from ares import ARES
# Instantiate ARES before calling its dataset helpers.
ares = ARES()
# Specify "nq", "hotpotqa", "wow", or "fever"
dataset = ares.KILT_Dataset("nq")
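The dataset is returned in several splits, each with a different ratio of testing data. Each ratio is the ground-truth accuracy of its split, so together the splits give you a range of known targets to evaluate ARES against.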
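To make the idea of a ratio concrete, the sketch below builds a split in which a chosen fraction of examples are positives; that fraction is then the score a perfect evaluator would report. The ARES loader prepares its splits for you, so this only illustrates the concept. The helpers `make_ratio_split` and `ground_truth_accuracy` are hypothetical, and each example is assumed to carry a binary label where 1 means positive.

```python
import random


def make_ratio_split(positives, negatives, ratio, size, seed=0):
    """Hypothetical helper: sample `size` examples, a `ratio` fraction of them positive.

    positives / negatives: lists of examples whose ground-truth label is 1 / 0.
    Returns a shuffled list of (example, label) pairs.
    """
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must be in [0, 1]")
    n_pos = round(ratio * size)
    n_neg = size - n_pos
    if n_pos > len(positives) or n_neg > len(negatives):
        raise ValueError("not enough examples for the requested split")
    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    split = [(x, 1) for x in rng.sample(positives, n_pos)]
    split += [(x, 0) for x in rng.sample(negatives, n_neg)]
    rng.shuffle(split)
    return split


def ground_truth_accuracy(split):
    """Fraction of positive labels, i.e. the score a perfect evaluator would report."""
    return sum(label for _, label in split) / len(split)
```

Sweeping `ratio` over several values yields a family of splits whose true scores are known in advance, which is what makes them useful for checking an evaluator.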
SuperGLUE Hugging Face Dataset
Furthermore, we provide a filter to retrieve the SuperGLUE dataset from Hugging Face.
To load the dataset, use the following code:
Specify dataset name
Specify the name of the dataset you would like to use; the available options include "record", "rte", "boolq", and "multirc".
from ares import ARES
# Instantiate ARES before calling its dataset helpers.
ares = ARES()
# Specify "record", "rte", "boolq", or "multirc"
dataset = ares.superGlue_dataset("boolq")
Remarks
You can use the provided datasets to run any tests you would like on ARES. The ratios represent the ground-truth accuracies of the datasets. We provide, and will continue to curate, tutorials here that use these datasets to showcase ARES's RAG evaluations.
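Because each split's ratio is its ground-truth accuracy, one way to test ARES is to compare the scores it reports for several splits against those ratios. The sketch below assumes you have already collected one score per split yourself; the helper names are hypothetical and not part of ARES. Kendall's tau (here the simple tau-a variant, with ties counted as neither agreement nor disagreement) checks whether the ranking of the splits is preserved, while mean absolute error checks how close the individual scores are.

```python
from itertools import combinations


def kendall_tau(xs, ys):
    """Kendall's tau-a between two equally long score lists (tied pairs count as 0)."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length >= 2")
    score = 0
    for i, j in combinations(range(len(xs)), 2):
        agreement = (xs[i] - xs[j]) * (ys[i] - ys[j])
        if agreement > 0:    # pair ordered the same way in both lists
            score += 1
        elif agreement < 0:  # pair ordered in opposite ways
            score -= 1
    n_pairs = len(xs) * (len(xs) - 1) // 2
    return score / n_pairs


def mean_absolute_error(truth, estimates):
    """Average absolute gap between ground-truth ratios and estimated scores."""
    if len(truth) != len(estimates) or not truth:
        raise ValueError("need two non-empty lists of equal length")
    return sum(abs(t - e) for t, e in zip(truth, estimates)) / len(truth)
```

A tau of 1.0 means the estimated scores order the splits exactly as their ratios do, which matters when the goal is to rank RAG systems rather than to match absolute numbers.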