Training Classifier Starter Guide

This page teaches you how to train high-precision classifiers that determine the relevance and faithfulness of RAG outputs.

Training Classifier Configuration

The classifier_config dictionary is a configuration object that sets up ARES for training a classifier on a given dataset. The configuration has the following form.

from ares import ARES

classifier_config = {
    "classification_dataset": [<classification_dataset_filepath>],
    "test_set_selection": <test_set_selection_filepath>, 
    "label_column": [<labels>], 
    "model_choice": "microsoft/deberta-v3-large", # Default model is "microsoft/deberta-v3-large"
    "num_epochs": 10, 
    "patience_value": 3, 
    "learning_rate": 5e-6
}

ares = ARES(classifier_model=classifier_config)
results = ares.train_classifier()
print(results)

Classification Dataset

Provide a list of file paths, or a single file path, to the labeled dataset used for training the classifier. This is typically the output of the ARES synthetic generator. The dataset should include text data and corresponding labels for supervised learning.

"classification_dataset": ["output/synthetic_queries_1.tsv"],

Test Set Selection

Provide the file path to your test set for evaluating the classifier's performance. This should be separate from the training data to ensure an unbiased assessment.

"test_set_selection": "/data/datasets_v2/nq/nq_ratio_0.6_.tsv"

An example of the test set selection file used here is available in the ARES GitHub repository.
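
One way to check that the test set is separate from the training data is to look for values that appear in both files. The following is a minimal sketch and not part of ARES: shared_values is a hypothetical helper, and the Query and Label column names and the in-memory rows are illustrative assumptions.

```python
import csv
import io

def shared_values(train_file, test_file, column):
    """Return the values of `column` that appear in both TSV files."""
    train = {row[column] for row in csv.DictReader(train_file, delimiter="\t")}
    test = {row[column] for row in csv.DictReader(test_file, delimiter="\t")}
    return train & test

# Illustrative in-memory data; a real check would open the two dataset files.
train = io.StringIO("Query\tLabel\nwho wrote hamlet\t1\ncapital of france\t0\n")
test = io.StringIO("Query\tLabel\ncapital of france\t1\nboiling point of water\t1\n")
print(shared_values(train, test, "Query"))  # {'capital of france'}
```

An empty result suggests the two files do not share any value in that column. A non-empty result points at rows worth removing from one of the two sets.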

Label Column(s)

List the column name(s) in your dataset that contain the label(s). These are the targets your classifier will predict.

"label_column": ["Context_Relevance_Label"],

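Because a misspelled column name is easy to miss, it can help to confirm that every label column exists in the dataset header before training. The sketch below is not an ARES function: missing_label_columns is a hypothetical helper, and the sample header and row are illustrative assumptions.

```python
import csv
import io

def missing_label_columns(tsv_file, label_columns):
    """Return the label columns that are absent from the TSV header row."""
    reader = csv.reader(tsv_file, delimiter="\t")
    header = next(reader, [])  # an empty file yields an empty header
    return [name for name in label_columns if name not in header]

# Illustrative in-memory data; a real check would open the dataset file.
sample = io.StringIO("Query\tDocument\tAnswer\tContext_Relevance_Label\nq1\td1\ta1\t1\n")
print(missing_label_columns(sample, ["Context_Relevance_Label"]))  # []
```

An empty list means every requested label column was found in the header.
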
Model Choice

Specifies the pre-trained language model to fine-tune for classification. By default, ARES uses "microsoft/deberta-v3-large". You can replace this with any Hugging Face model suitable for your task.

"model_choice": "google/flan-t5-xxl",

Num Epochs

Determines the number of training epochs, which is the number of times the learning algorithm will work through the entire training dataset.

"num_epochs": 10,

Patience Value

Used for early stopping, which helps prevent overfitting. Training stops after this many epochs with no improvement on the validation set.

"patience_value": 3,

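The patience rule can be sketched in a few lines. This is an illustration of the general early-stopping idea, not ARES's implementation: it assumes that improvement means a lower validation loss and that the counter resets whenever the loss improves. stopping_epoch and the loss values are hypothetical.

```python
def stopping_epoch(val_losses, patience):
    """Return the 1-based epoch at which early stopping halts training,
    or None if the patience limit is never reached."""
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            # New best validation loss: reset the counter.
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return None

# Made-up validation losses: the best value is reached at epoch 3.
losses = [0.90, 0.70, 0.65, 0.66, 0.67, 0.68, 0.60]
print(stopping_epoch(losses, patience=3))  # 6
```

With a patience of 3, training halts at epoch 6, so the improvement at epoch 7 is never seen. A larger patience value tolerates longer plateaus at the cost of more training time.
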
Learning Rate

Sets the initial learning rate for the optimizer. This hyperparameter controls how large each adjustment to the model weights is during training.

"learning_rate": 5e-6

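To make the role of the learning rate concrete, here is a sketch of a single plain gradient descent update. It is a simplification: this guide does not say which optimizer ARES uses, and adaptive optimizers rescale the step, so sgd_step and the numbers below are illustrative assumptions only.

```python
def sgd_step(weight, gradient, learning_rate):
    """Apply one plain gradient descent update to a single weight."""
    return weight - learning_rate * gradient

weight = 1.0
gradient = 2.0  # made-up gradient value for illustration

# Compare the small rate from this guide with a much larger one.
for learning_rate in (5e-6, 5e-3):
    print(learning_rate, sgd_step(weight, gradient, learning_rate))
```

The smaller rate barely moves the weight, which is usually the goal when fine-tuning a pre-trained model. The larger rate moves it a thousand times further for the same gradient.
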
Training Classifier Configuration: Full Example

from ares import ARES

classifier_config = {
    "classification_dataset": ["output/synthetic_queries_1.tsv"],
    "test_set_selection": "./datasets_v2/nq/ratio_0.5_reformatted_full_articles_False_validation_with_negatives.tsv",
    "label_column": ["Context_Relevance_Label"], 
    "model_choice": "microsoft/deberta-v3-large",
    "num_epochs": 10, 
    "patience_value": 3, 
    "learning_rate": 5e-6
}

ares = ARES(classifier_model=classifier_config)
results = ares.train_classifier()
print(results)