Capabilities and Parameters
This page provides an in-depth overview of the parameters and capabilities available for the training classifier, allowing users to fully customize the ARES training pipeline.
Training Classifier Configuration
The classifier_config dictionary is a configuration object that sets up ARES for training a classifier on a labeled dataset. Below is the layout of the training classifier configuration.
Training Classifier Parameters
In ARES, the values after learning_rate are optional and will fall back to their defaults if not provided.
Review the values of assigned_batch_size and gradient_accumulation_multiplier; appropriate settings depend on your system's available memory.
classifier_config = {
"classification_dataset": [<classification_dataset_filepath>],
"validation_set": <test_set_selection_filepath>,
"label_column": [<labels>],
"model_choice": "microsoft/deberta-v3-large",
"num_epochs": 10,
"patience_value": 3,
"learning_rate": 5e-6,
"training_dataset_path": "path/to/training/dataset.tsv",
"validation_dataset_path": "path/to/validation/dataset.tsv",
"validation_set_scoring": True,
"assigned_batch_size": 1,
"gradient_accumulation_multiplier": 32,
"number_of_runs": 1,
"num_warmup_steps": 100,
"training_row_limit": -1,
"validation_row_limit": -1
}
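Once the dictionary is defined, training is kicked off by passing it to the ARES class, as in the sketch below (this follows the usage pattern from the ARES repository; adjust file paths to your setup):

from ares import ARES

# Train the classifier using the configuration above
ares = ARES(classifier_model=classifier_config)
results = ares.train_classifier()
print(results)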
Classification Dataset
Provide a list of file paths, or a single file path, to the labeled dataset(s) used for training the classifier, typically generated by the ARES synthetic generator. The dataset should include text data and corresponding labels for supervised learning.
"classification_dataset": ["output/synthetic_queries_1.tsv"],
# of Training Datasets
Ensure the number of training datasets provided aligns with the number of validation datasets.
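Before training, it can help to sanity-check each classification dataset. The sketch below (assuming pandas and the example file above) confirms the file loads as a TSV and that the expected label column is present:

import pandas as pd

df = pd.read_csv("output/synthetic_queries_1.tsv", sep="\t")
print(df.columns.tolist())  # verify the label column (e.g. Context_Relevance_Label) exists
print(len(df))              # number of labeled rows available for training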
Validation Set
Provide the file path to your validation set for evaluating the classifier's performance. This should be separate from the training data to ensure an unbiased assessment.
"validation_set": "/data/datasets_v2/nq/nq_ratio_0.6_.tsv"
# of Validation Datasets
Ensure the number of validation datasets provided aligns with the number of training datasets.
See the ARES GitHub repository for an example of the test set selection file used here.
Label Column(s)
List the column name(s) in your dataset that contain the label(s). These are the targets your classifier will predict.
"label_column": ["Conmtext_Relevance_Label"],
Model Choice
Specifies the pre-trained language model to fine-tune for classification. By default, ARES uses "microsoft/deberta-v3-large". You can replace this with any Hugging Face model suitable for your task, for example:
"model_choice": "google/flan-t5-xxl",
Num Epochs
Determines the number of training epochs, which is the number of times the learning algorithm will work through the entire training dataset.
"num_epochs": 10,
Patience Value
This is used in early stopping to prevent overfitting. It's the number of epochs with no improvement on the validation set after which training will be stopped.
"patience_value": 3,
Learning Rate
Sets the initial learning rate for the optimizer. This is a crucial hyperparameter that controls the adjustment of model weights during training.
"learning_rate": 5e-6
Training Dataset Path
If more than one training dataset is provided, the classifier combines them into a single dataset and trains on all of them. In this case, provide the path where the combined training dataset should be saved.
"training_dataset_path": "path/to/training/dataset.tsv"
Validation Dataset Path
If more than one validation dataset is provided, the classifier combines them into a single dataset and uses all of them for validation. In this case, provide the path where the combined validation dataset should be saved.
"validation_dataset_path": "path/to/validation/dataset.tsv"
Validation Set Scoring
If True, the classifier will evaluate the model on the validation set after each epoch. If False, the classifier will only evaluate the model on the test set after the final epoch.
"validation_set_scoring": True,
Assigned Batch Size
The batch size for training, i.e., the number of samples processed in each forward/backward pass. On memory-constrained systems, keep this small and compensate with gradient_accumulation_multiplier.
"assigned_batch_size": 1,
Gradient Accumulation Multiplier
The number of micro-batches whose gradients are accumulated before performing an optimizer step. Combined with assigned_batch_size, this determines the effective batch size, letting you train with large batches on limited memory.
"gradient_accumulation_multiplier": 32,
Number of Runs
The number of times to run the full training process. Since results can vary with random initialization, multiple runs help gauge the stability of the trained classifier.
"number_of_runs": 1,
Num Warmup Steps
The number of optimizer steps over which the learning rate is gradually increased from zero to the configured learning_rate at the start of training, which helps stabilize early optimization.
"num_warmup_steps": 100,
Training Row Limit
Limits the training dataset to the given number of rows. Set to -1 to use the full dataset.
"training_row_limit": -1,
Validation Row Limit
Limits the validation dataset to the given number of rows. Set to -1 to use the full dataset.
"validation_row_limit": -1,