[01] Full RAG Eval Walkthrough w/ PPI

This tutorial walks through a full ARES evaluation on an NQ dataset with a ground truth accuracy of 60%, showing how closely ARES's estimate tracks that value. If you haven't already, download the datasets needed to follow this tutorial before continuing.
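Before starting, it can help to confirm that the input files referenced by the configurations in this walkthrough are present in the working directory. The following is a minimal sketch, not part of ARES: `find_missing_files` and `REQUIRED_INPUTS` are hypothetical names introduced here, and the file list is simply copied from the configurations in the steps that follow.

```python
from pathlib import Path

# Input files named in this walkthrough's configurations.
# (Output files such as the synthetic queries and the checkpoint are
# produced by the pipeline, so they are not listed here.)
REQUIRED_INPUTS = [
    "nq_labeled_output.tsv",
    "nq_unlabeled_output.tsv",
    "nq_few_shot_prompt_for_synthetic_query_generation.tsv",
    "nq_few_shot_prompt_for_judge_scoring.tsv",
]


def find_missing_files(paths, base_dir="."):
    """Return the entries of `paths` that are not regular files under `base_dir`."""
    base = Path(base_dir)
    return [p for p in paths if not (base / p).is_file()]


if __name__ == "__main__":
    missing = find_missing_files(REQUIRED_INPUTS)
    if missing:
        print("Missing input files:", ", ".join(missing))
    else:
        print("All input files found.")
```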

[1] Synthetic Generation

The first step is to configure synthetic query generation. The code below defines that configuration.


from ares import ARES

synth_config = {
    "document_filepaths": ["nq_labeled_output.tsv"],
    "few_shot_prompt_filename": "nq_few_shot_prompt_for_synthetic_query_generation.tsv",
    "synthetic_queries_filenames": ["nq_0.6_synthetic_queries.tsv"],  # Output file; reused as training data in step 2
    "documents_sampled": 6189,
}

[2] Training Classifier

The second step is to train the classifier on the synthetic queries generated in step 1. The code below defines the training configuration.

from ares import ARES

classifier_config = {
    "training_dataset": ["nq_0.6_synthetic_queries.tsv"], 
    "validation_set": ["nq_labeled_output.tsv"], 
    "label_column": ["Context_Relevance_Label"], 
    "num_epochs": 10, 
    "patience_value": 3, 
    "learning_rate": 5e-6,
    "assigned_batch_size": 1,  # Change according to GPU memory
    "gradient_accumulation_multiplier": 32,  # Change according to GPU memory
}
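The two settings marked "Change according to GPU memory" interact. Assuming ARES follows the usual gradient-accumulation convention (an assumption; this walkthrough does not spell it out), the effective batch size is `assigned_batch_size * gradient_accumulation_multiplier`, which is 1 * 32 = 32 here. If your GPU can hold a larger per-step batch, you can raise one value and lower the other while keeping the product fixed. The helper below is a hypothetical sketch, not an ARES function, that enumerates such pairs.

```python
def accumulation_options(effective_batch_size, max_per_step_batch):
    """List (assigned_batch_size, gradient_accumulation_multiplier) pairs
    whose product equals `effective_batch_size`, for per-step batch sizes
    from 1 up to `max_per_step_batch`."""
    return [
        (batch, effective_batch_size // batch)
        for batch in range(1, max_per_step_batch + 1)
        if effective_batch_size % batch == 0
    ]


# The configuration above uses (1, 32); these pairs keep the same product.
for batch, multiplier in accumulation_options(32, 8):
    print(f"assigned_batch_size={batch}, gradient_accumulation_multiplier={multiplier}")
```

With a budget of up to 8 examples per step, the pairs are (1, 32), (2, 16), (4, 8), and (8, 4).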

[3] RAG Evaluation w/ ARES's PPI

The third step is to evaluate the unlabeled evaluation set using ARES's PPI (prediction-powered inference) together with the classifier trained in step 2. The code below defines the evaluation configuration.

from ares import ARES

ppi_config = {
    "evaluation_datasets": ["nq_unlabeled_output.tsv"],
    "few_shot_examples_filepath": "nq_few_shot_prompt_for_judge_scoring.tsv",
    # "date_time" is a placeholder; use the actual checkpoint filename saved in step 2
    "checkpoints": ["Context_Relevance_Label_nq_labeled_output_date_time.pt"],
    "labels": ["Context_Relevance_Label"],
    "gold_label_path": "nq_labeled_output.tsv",  # Samples 300 labeled examples
}
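To make the role of the labeled examples concrete, here is a minimal sketch of the standard prediction-powered estimate of a mean with a normal-approximation confidence interval. It is an illustration of the general PPI technique, not ARES's internal implementation, and `ppi_mean_ci` is a name chosen here for the example. The judge's predictions on the large unlabeled set give a first estimate, and the small labeled set supplies a correction (the "rectifier") for the judge's average error.

```python
import math


def ppi_mean_ci(preds_unlabeled, preds_labeled, gold_labeled, z=1.96):
    """Prediction-powered estimate of a mean with an approximate 95% CI.

    preds_unlabeled: judge predictions (0/1) on the unlabeled evaluation set
    preds_labeled:   judge predictions (0/1) on the labeled examples
    gold_labeled:    human labels (0/1) for those same labeled examples
    Each input needs at least two elements so variances are defined.
    """
    n_unl = len(preds_unlabeled)
    n_lab = len(gold_labeled)

    # Average judge prediction on the unlabeled set.
    mean_unl = sum(preds_unlabeled) / n_unl

    # Rectifier: average of (gold label - judge prediction) on the labeled set.
    rect = [y - f for y, f in zip(gold_labeled, preds_labeled)]
    mean_rect = sum(rect) / n_lab

    estimate = mean_unl + mean_rect

    # Sample variances of the two independent terms.
    var_unl = sum((f - mean_unl) ** 2 for f in preds_unlabeled) / (n_unl - 1)
    var_rect = sum((r - mean_rect) ** 2 for r in rect) / (n_lab - 1)

    half_width = z * math.sqrt(var_unl / n_unl + var_rect / n_lab)
    return estimate, (estimate - half_width, estimate + half_width)


# Toy data: the judge says "relevant" too often on the labeled examples,
# so the rectifier pulls the estimate down from 0.6 to 0.1.
estimate, (low, high) = ppi_mean_ci(
    preds_unlabeled=[1, 0, 1, 1, 0],
    preds_labeled=[1, 1, 1, 1],
    gold_labeled=[1, 0, 1, 0],
)
print(estimate, low, high)
```

In this walkthrough the unlabeled set corresponds to `nq_unlabeled_output.tsv` and the labeled examples to the 300 rows sampled from `gold_label_path`. The interval narrows as either set grows, which is why a few hundred annotations can be enough when the judge is reasonably accurate.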

[4] Run all configurations together

The final step is to run the entire pipeline. The code below combines the configurations from the previous steps and runs them together.

from ares import ARES

synth_config = {
    "document_filepaths": ["nq_labeled_output.tsv"],
    "few_shot_prompt_filename": "nq_few_shot_prompt_for_synthetic_query_generation.tsv",
    "synthetic_queries_filenames": ["nq_0.6_synthetic_queries.tsv"],
    "documents_sampled": 6189,
}

classifier_config = {
    "training_dataset": ["nq_0.6_synthetic_queries.tsv"], 
    "validation_set": ["nq_labeled_output.tsv"], 
    "label_column": ["Context_Relevance_Label"], 
    "num_epochs": 10, 
    "patience_value": 3, 
    "learning_rate": 5e-6,
    "assigned_batch_size": 1,  # Change according to GPU memory
    "gradient_accumulation_multiplier": 32,  # Change according to GPU memory
}

ppi_config = {
    "evaluation_datasets": ["nq_unlabeled_output.tsv"],
    "few_shot_examples_filepath": "nq_few_shot_prompt_for_judge_scoring.tsv",
    # "date_time" is a placeholder; use the actual checkpoint filename saved in step 2
    "checkpoints": ["Context_Relevance_Label_nq_labeled_output_date_time.pt"],
    "labels": ["Context_Relevance_Label"],
    "gold_label_path": "nq_labeled_output.tsv",  # Samples 300 labeled examples
}

ares = ARES(synthetic_query_generator=synth_config, classifier_model=classifier_config, ppi=ppi_config)
results = ares.run() 
print(results)
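Because each step consumes what the previous one produces, a quick consistency check on the three dictionaries can catch filename or label mismatches before a long run. The sketch below is a hypothetical helper, not part of ARES; the one-checkpoint-per-label rule is an assumption based on how this walkthrough pairs them. Only the keys relevant to the check are repeated here.

```python
def check_wiring(synth_config, classifier_config, ppi_config):
    """Return a list of human-readable problems; an empty list means the
    three configurations agree with each other."""
    problems = []

    # Step 1 output should be the step 2 training data.
    if synth_config["synthetic_queries_filenames"] != classifier_config["training_dataset"]:
        problems.append("training_dataset does not match synthetic_queries_filenames")

    # The label the classifier is trained on should be the label evaluated.
    if classifier_config["label_column"] != ppi_config["labels"]:
        problems.append("label_column does not match labels")

    # Assumption: one checkpoint per label, as in this walkthrough.
    if len(ppi_config["checkpoints"]) != len(ppi_config["labels"]):
        problems.append("number of checkpoints differs from number of labels")

    return problems


# Relevant keys copied from the configurations above.
synth_keys = {"synthetic_queries_filenames": ["nq_0.6_synthetic_queries.tsv"]}
classifier_keys = {
    "training_dataset": ["nq_0.6_synthetic_queries.tsv"],
    "label_column": ["Context_Relevance_Label"],
}
ppi_keys = {
    "checkpoints": ["Context_Relevance_Label_nq_labeled_output_date_time.pt"],
    "labels": ["Context_Relevance_Label"],
}

print(check_wiring(synth_keys, classifier_keys, ppi_keys) or "Configurations are consistent.")
```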

Results

The following are ARES's evaluation results on the NQ dataset (60% ground truth accuracy):

Context_Relevance_Label Scoring
ARES Ranking
ARES Prediction: [0.6116376385433233]
ARES Confidence Interval: [[0.554, 0.669]]
Number of Examples in Evaluation Set: [4421]
Ground Truth Performance: [0.6]
ARES LLM Judge Accuracy on Ground Truth Labels: [0.775]
Annotated Examples used for PPI: 300
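As a quick reading of these numbers, the short sketch below checks whether the reported confidence interval contains the ground truth value and how far the point estimate is from it. The values are copied from the output above; the variable names are chosen here for illustration.

```python
# Values copied from the results above.
prediction = 0.6116376385433233
ci_low, ci_high = 0.554, 0.669
ground_truth = 0.6

# Does the interval contain the true value?
covered = ci_low <= ground_truth <= ci_high

# Distance of the point estimate from the true value, and interval half-width.
abs_error = abs(prediction - ground_truth)
half_width = (ci_high - ci_low) / 2

print("Ground truth inside interval:", covered)
print("Absolute error of prediction:", round(abs_error, 4))
print("Interval half-width:", round(half_width, 4))
```

The interval contains the ground truth of 0.6, and the point estimate is off by about 0.012, well inside the half-width of roughly 0.058. This holds even though the LLM judge alone agrees with the ground truth labels only 77.5% of the time.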