UES/IDP
This guide will walk you through the parameters for running a simple evaluation on an unlabeled evaluation set in conjunction with in-domain prompts (UES/IDP).
UES/IDP Configuration
Below is an example of a UES/IDP configuration. Values in angle brackets are placeholders.
from ares import ARES

ues_idp_config = {
    "in_domain_prompts_dataset": <few_shot_filepath>,
    "unlabeled_evaluation_set": <eval_dataset_filepath>,
    "context_relevance_system_prompt": context_relevance_system_prompt,
    "answer_relevance_system_prompt": answer_relevance_system_prompt,
    "answer_faithfulness_system_prompt": answer_faithfulness_system_prompt,
    "debug_mode": False,
    "documents": 0,
    "model_choice": "gpt-3.5-turbo-1106",
    "request_delay": 0,
    "vllm": False,
    "host_url": "None"
}
In-Domain Prompts
In-domain prompts are a set of few-shot examples drawn from the same domain as the evaluation set; they are supplied to the judge model as scoring demonstrations.
in_domain_prompts_dataset = <few_shot_filepath>
The nq_few_shot_prompt_for_judge_scoring.tsv file used in the full example below is an in-domain prompt dataset for the Natural Questions (NQ) dataset.
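If you want to confirm the layout of a few-shot file before running an evaluation, a minimal inspection sketch (assuming only that the file is tab-separated and that pandas is installed) might look like this:

import pandas as pd

# Hypothetical path; replace with your own few-shot file.
few_shot_path = "nq_few_shot_prompt_for_judge_scoring.tsv"

# The file is assumed to be tab-separated, based on the .tsv extension.
few_shot_df = pd.read_csv(few_shot_path, sep="\t")

# Print the column names and the first example to verify the file contains
# questions, documents, answers, and the judge labels you expect.
print(few_shot_df.columns.tolist())
print(few_shot_df.head(1))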
Unlabeled Evaluation Set
The unlabeled evaluation set is a set of unlabeled examples containing questions, documents, and answers, which will be evaluated for context relevance, answer relevance, and answer faithfulness.
unlabeled_evaluation_set = <eval_dataset_filepath>
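As a rough sketch, an unlabeled evaluation set can be assembled as a TSV file. The column names below (Query, Document, Answer) are an assumption for illustration; match them to the format your few-shot file actually uses.

import pandas as pd

# Assumed column names for illustration; align these with your few-shot file.
eval_df = pd.DataFrame({
    "Query": ["Who wrote the novel 1984?"],
    "Document": ["Nineteen Eighty-Four is a dystopian novel by George Orwell published in 1949."],
    "Answer": ["George Orwell wrote 1984."],
})

# Write a tab-separated file that can be passed as unlabeled_evaluation_set.
eval_df.to_csv("my_unlabeled_evaluation_set.tsv", sep="\t", index=False)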
Context Relevance System Prompt
The context relevance system prompt is the prompt used to judge whether each retrieved document is relevant to its question.
context_relevance_system_prompt = (
    "You are an expert dialogue agent. "
    "Your task is to analyze the provided document and determine whether "
    "it is relevant for responding to the dialogue. "
    "In your evaluation, you should consider the content of the document "
    "and how it relates to the provided dialogue. "
    "Output your final verdict by strictly following this format: \"[[Yes]]\" "
    "if the document is relevant and \"[[No]]\" "
    "if the document provided is not relevant. "
    "Do not provide any additional explanation for your decision.\n\n"
)
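The default prompt can be replaced with your own wording. The sketch below assumes only that the config key accepts an arbitrary string, as shown in the configuration above; the domain-specific wording is purely illustrative.

# A customized context relevance prompt. The [[Yes]]/[[No]] format is kept,
# since the default prompts instruct the judge to answer in that format.
custom_context_relevance_prompt = (
    "You are an expert judge for a medical question-answering system. "
    "Decide whether the provided document is relevant for answering the question. "
    "Output your final verdict by strictly following this format: \"[[Yes]]\" "
    "if the document is relevant and \"[[No]]\" if it is not. "
    "Do not provide any additional explanation for your decision.\n\n"
)

ues_idp_config["context_relevance_system_prompt"] = custom_context_relevance_prompt

The same approach works for answer_relevance_system_prompt and answer_faithfulness_system_prompt.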
Answer Relevance System Prompt
The answer relevance system prompt is the prompt used to judge whether each answer is relevant to its question.
answer_relevance_system_prompt = (
    "Given the following question, document, and answer, "
    "you must analyze the provided answer and document before determining "
    "whether the answer is relevant for the provided question. "
    "In your evaluation, you should consider whether the answer "
    "addresses all aspects of the question and provides only correct "
    "information from the document for answering the question. "
    "Output your final verdict by strictly following this format: "
    "\"[[Yes]]\" if the answer is relevant for the given question and "
    "\"[[No]]\" if the answer is not relevant for the given question. "
    "Do not provide any additional explanation for your decision.\n\n"
)
Answer Faithfulness System Prompt
The answer faithfulness system prompt is the prompt used to judge whether each answer is faithful to (grounded in) its document.
answer_faithfulness_system_prompt = (
    "Given the following question, document, and answer, "
    "you must analyze the provided answer and determine whether it is "
    "faithful to the contents of the document. "
    "The answer must not offer new information beyond the context "
    "provided in the document. "
    "The answer also must not contradict information provided in the document. "
    "Output your final verdict by strictly following this format: "
    "\"[[Yes]]\" if the answer is faithful to the document and "
    "\"[[No]]\" if the answer is not faithful to the document. "
    "Do not provide any additional explanation for your decision.\n\n"
)
Debug Mode
The debug_mode parameter is a boolean flag that determines whether the evaluation runs in debug mode.
debug_mode = False
Documents
The documents parameter limits how many documents are evaluated. The default is 0, which means all documents in the evaluation set are evaluated.
documents = 0
Model Choice
The model_choice parameter selects the model used as the judge for the evaluation. The default is gpt-3.5-turbo-1106 (GPT-3.5).
model_choice = "gpt-3.5-turbo-1106"
Request Delay
The request_delay parameter is the delay inserted between successive API requests, which can help avoid rate limits. The default is 0 (no delay).
request_delay = 0
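When first wiring up a configuration, it can be convenient to evaluate only a handful of documents and slow down the request rate. The sketch below combines the parameters above; the specific values are arbitrary illustrations, not recommendations.

# A quick smoke-test configuration: evaluate only a few documents,
# enable debug mode, and pause between API requests.
ues_idp_config["debug_mode"] = True
ues_idp_config["documents"] = 10      # limit the evaluation to 10 documents
ues_idp_config["request_delay"] = 1   # add a small delay between successive requests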
vLLM
The vllm parameter is a boolean flag indicating whether to send requests to a vLLM server instead of the default API. The default is False.
vllm = False
Host URL
The host_url parameter is the URL of the vLLM server used for the evaluation when vllm is True. The default is the string "None".
host_url = "None"
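For a locally served judge model, the two parameters above are used together. A minimal sketch, assuming a vLLM server is already running at the given address and that model_choice names the model it is serving (both the URL and the model name are assumptions for illustration):

# Assumed local vLLM endpoint and model name; replace with your own server details.
ues_idp_config["vllm"] = True
ues_idp_config["host_url"] = "http://localhost:8000/v1"
ues_idp_config["model_choice"] = "meta-llama/Llama-2-13b-chat-hf"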
UES/IDP Full Example
from ares import ARES

ues_idp_config = {
    "in_domain_prompts_dataset": "nq_few_shot_prompt_for_judge_scoring.tsv",
    "unlabeled_evaluation_set": "nq_unlabeled_output.tsv",
    "model_choice": "gpt-3.5-turbo-1106",
}

ares = ARES(ues_idp=ues_idp_config)
results = ares.ues_idp()
print(results)
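The printed results contain the evaluation scores for context relevance, answer relevance, and answer faithfulness over the unlabeled evaluation set.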