Overview
RARE is a comprehensive framework for evaluating Retrieval-Augmented Generation (RAG) systems. It provides both an automated pipeline for generating synthetic evaluation datasets and novel robustness evaluation metrics. Unlike existing benchmarks that rely on static, general-knowledge queries, RARE emphasizes data dynamism (time-sensitive content), query complexity (multi-hop reasoning), and content specialization (domain-specific technical queries), while rigorously testing system robustness against real-world noise, conflicting contexts, and retrieval failures.
RARE-Get: Dynamic Benchmark Generation Pipeline
RARE-Get automatically constructs challenging RAG evaluation datasets from domain-specific documents through four key stages: chunking documents while preserving semantic integrity, extracting knowledge graph triplets using LLMs, traversing graphs to identify query patterns, and generating quality-assured question-answer pairs.
The pipeline supports diverse query patterns through knowledge graph traversal: single-hop queries test direct fact lookup, chained queries require following linked triplets across chunks, star-shaped queries aggregate multiple facts about a focal entity, and inverted-star queries combine evidence toward a common target.
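As a concrete illustration, the sketch below (plain Python, not the released RARE-Get code) enumerates the four pattern types from a handful of extracted triplets; the `find_query_patterns` name, the `star_min` threshold, and the toy triplets are illustrative assumptions.

```python
from collections import defaultdict

def find_query_patterns(triplets, star_min=2):
    """Enumerate candidate evidence sets for each pattern type.

    triplets: list of (subject, relation, object) tuples, e.g. extracted
    per chunk by an LLM. `star_min` is an assumed threshold for how many
    facts a star / inverted-star pattern must aggregate.
    """
    out_edges = defaultdict(list)   # subject -> triplets it starts
    in_edges = defaultdict(list)    # object  -> triplets it ends
    for t in triplets:
        s, _, o = t
        out_edges[s].append(t)
        in_edges[o].append(t)

    patterns = {"single_hop": [], "chained": [], "star": [], "inverted_star": []}

    # Single-hop: one triplet answers the question directly.
    patterns["single_hop"] = [[t] for t in triplets]

    # Chained: the object of one triplet is the subject of the next.
    for t1 in triplets:
        for t2 in out_edges[t1[2]]:
            patterns["chained"].append([t1, t2])

    # Star: several facts share the same focal subject.
    for _, ts in out_edges.items():
        if len(ts) >= star_min:
            patterns["star"].append(ts)

    # Inverted-star: several facts converge on the same target object.
    for _, ts in in_edges.items():
        if len(ts) >= star_min:
            patterns["inverted_star"].append(ts)

    return patterns

# Toy triplets for illustration only.
triplets = [
    ("AcmeCorp", "reported_revenue", "$3.1B"),
    ("AcmeCorp", "acquired", "BetaLtd"),
    ("BetaLtd", "headquartered_in", "Dublin"),
    ("GammaInc", "acquired", "BetaLtd"),
]
print({k: len(v) for k, v in find_query_patterns(triplets).items()})
```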
This automation enables dynamic benchmark evolution—as source documents update with new information, the pipeline automatically regenerates questions reflecting the latest facts, ensuring continued relevance for time-sensitive domains like finance and policy.
Figure: Examples of multi-hop query patterns generated through knowledge graph traversal.
RARE-Set: Large-Scale Domain-Specific Benchmark
RARE-Set comprises 48,295 questions across 527 time-sensitive documents spanning three critical domains: finance (199 SEC 10-K filings), economics (114 OECD surveys), and policy (214 HUD CAPER reports). Unlike general-knowledge benchmarks, these expert-level questions require advanced information synthesis and cannot be answered through memorization alone.
The benchmark features diverse complexity levels: single-hop queries (baseline retrieval), chained multi-hop queries (2-3 linked reasoning steps), star-shaped queries (aggregating diverse facts), and inverted-star queries (convergent evidence). All questions undergo rigorous LLM-based quality assessment across readability, clarity, and correctness dimensions.
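A hypothetical sketch of how such a quality gate could be wired is shown below; the prompt wording, the 1-to-5 scale, the threshold, and the `judge` callable are assumptions rather than the paper's actual rubric.

```python
from typing import Callable

# Illustrative judging prompt; the real rubric may differ.
QUALITY_PROMPT = (
    "Rate the question-answer pair from 1 to 5 on each dimension.\n"
    "Question: {question}\nAnswer: {answer}\n"
    "Return three integers: readability clarity correctness."
)

def passes_quality_check(qa: dict, judge: Callable[[str], str], min_score: int = 4) -> bool:
    """Keep a QA pair only if every dimension meets the threshold.

    `qa` has "question" and "answer" keys; `judge` wraps an LLM call that
    returns e.g. "5 4 5". A production version would parse the reply
    more defensively.
    """
    reply = judge(QUALITY_PROMPT.format(**qa))
    scores = [int(tok) for tok in reply.split()[:3]]
    return len(scores) == 3 and all(s >= min_score for s in scores)
```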
RARE-Met: Retrieval-Aware Robustness Metrics
RARE-Met introduces novel robustness metrics that account for whether a model already holds the parametric knowledge to answer without retrieval. The framework distinguishes two scenarios: if $g(q, \emptyset) = 1$ (the model can answer without retrieval), it must consistently give correct answers regardless of the retrieved content; if $g(q, \emptyset) = 0$ (it cannot answer without retrieval), it should answer correctly given valid context, or safely refuse when the retrieved context is incorrect or irrelevant.
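One plausible way to encode this case analysis as an evaluation predicate is sketched below; the callables `g` and `refuses` and the `d_is_valid` flag are assumed to be supplied by an evaluation harness, and this is a reading of the rule above rather than the paper's exact formulation.

```python
from typing import Callable, Optional

def single_case_robust(
    g: Callable[[str, Optional[str]], int],   # g(q, docs) -> 1 if the answer is correct
    refuses: Callable[[str, str], bool],      # did the model safely refuse on (q, d)?
    q: str,
    d: str,
    d_is_valid: bool,                         # does d actually support the answer?
) -> bool:
    """Single-case robustness for one (question, retrieved-context) pair."""
    if g(q, None) == 1:
        # Parametric knowledge suffices: the answer must stay correct
        # no matter what is retrieved.
        return g(q, d) == 1
    if d_is_valid:
        # The model needs retrieval and the context is valid: it must answer correctly.
        return g(q, d) == 1
    # Retrieval is incorrect or irrelevant: a correct answer or a safe refusal counts.
    return g(q, d) == 1 or refuses(q, d)
```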
The framework evaluates three perturbation types: query perturbations (character-level typos, word substitutions, grammar variations, irrelevant information), document perturbations (lexical relevance vs. answer relevance variations), and real-world retrieval simulations using multiple state-of-the-art embedding models with re-ranking.
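For illustration, the sketch below implements two of the query perturbations in plain Python: character-level typos via adjacent-character swaps and word substitutions from a small synonym table. Both operations and the example synonym table are assumptions, not RARE-Met's exact perturbation generators.

```python
import random

# Toy synonym table; a real generator would use a larger lexical resource.
SYNONYMS = {"revenue": "turnover", "increase": "rise", "report": "filing"}

def typo_perturb(query: str, rate: float = 0.05, seed: int = 0) -> str:
    """Swap adjacent alphabetic characters at a small per-character rate."""
    rng = random.Random(seed)
    chars = list(query)
    for i in range(len(chars) - 1):
        if chars[i].isalpha() and chars[i + 1].isalpha() and rng.random() < rate:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)

def substitute_perturb(query: str) -> str:
    """Replace known words with synonyms while preserving the question's meaning."""
    return " ".join(SYNONYMS.get(w.lower(), w) for w in query.split())

q = "How did total revenue increase in the 2023 report?"
print(typo_perturb(q))
print(substitute_perturb(q))
```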
Robustness Metrics
RARE defines four robustness metrics. Each metric specifies what is held fixed and what is varied, and how the score is calculated over the resulting cases. Overall Robustness, for example, considers all questions in the test set and all variations of the retrieved documents $d \in D$, together with the model's behavior when no documents are retrieved at all.
Contact
For questions or collaborations, please reach out to the authors.