Synthetic Generation Starter Guide

This page shows you how to automatically create synthetic datasets that closely mimic real-world scenarios for robust RAG testing.

Synth Gen Configuration

The synth_config dictionary is a configuration object that sets up ARES for generating synthetic queries based on a given dataset. Below is how the synthetic generation configuration style.

from ares import ARES

synth_config = { 
    "document_filepaths": [<document_filepaths>],
    "few_shot_prompt_filename": few_shot_filepath,
    "synthetic_queries_filenames": [<synthetic_queries_filepaths>],
    "model_choice": <model_choice>,
    "documents_sampled": 10000
}

ares = ARES(synthetic_query_generator=synth_config)
results = ares.generate_synthetic_data()
print(results)

Document File Path(s)

A single or list of file paths to the document(s) you want to use for generating synthetic queries. If given a list of file paths, each file path should point to a file containing raw text from which ARES can derive context for the synthetic queries.

"document_filepaths": ["/data/datasets_v2/nq/nq_ratio_0.5_.tsv"],

Link to ARES Github Repo for document example file used.

Few-Shot Prompt File Path

This refers to the file paths for a few-shot prompt file that provide examples of queries and answers for ARES to learn from. Few-shot learning uses a small amount of labeled training data to guide the generation of synthetic queries.

"few_shot_prompt_filename": "data/datasets/multirc_few_shot_prompt_for_synthetic_query_generation_v1.tsv",

Link to ARES Github Repo for few-shot file example used.

Synthetic Queries Filepath

A list of file paths where the generated synthetic queries will be saved. These files will store the queries created by ARES for use in training or evaluation.

NOTE - List Size Verification

Ensure the synthetic queries file paths list matches the document file paths list in size for consistency.

"synthetic_queries_filenames": ["/output/synthetic_queries_1.tsv"],

Model Choice

Specifies the pre-trained language model to create the synthetic data. By default, ARES uses "google/flan-t5-xxl". You can replace this with any Hugging Face model suitable for your task.

 "model_choice": "google/flan-t5-xxl",

Documents Sampled

An integer indicating how many documents to sample from your dataset when generating synthetic queries. Sampling can help speed up processing and manage computational resources. Choose a value that represents a large enough sample to generate meaningful synthetic queries, but not so large as to make processing infeasible. ARES will automatically filter documents

NOTE - Document Filter

ARES will automatically filter documents less than 50 words

"documents_sampled": 10000,

Synthetic Generation Configuration: Full Example

from ares import ARES

synth_config = { 
    "document_filepaths": ["/data/datasets_v2/nq/nq_ratio_0.5_.tsv"],
    "few_shot_prompt_filename": "data/datasets/multirc_few_shot_prompt_for_synthetic_query_generation_v1.tsv",
    "synthetic_queries_filenames": ["/output/synthetic_queries_1.tsv"],
    "model_choice": "google/flan-t5-xxl",
    "documents_sampled": 10000
}

ares = ARES(synthetic_query_generator=synth_config)
results = ares.generate_synthetic_data()
print(results)