RARE: Retrieval-Aware Robustness Evaluation for Retrieval-Augmented Generation Systems

Yixiao Zeng1, Tianyu Cao1, Danqing Wang1, Xinran Zhao1, Zimeng Qiu2, Morteza Ziyadi2, Tongshuang Wu1, Lei Li1

1 Carnegie Mellon University · 2 Amazon

Figure: Overview of the RARE pipeline.

Overview

RARE is a comprehensive framework for evaluating Retrieval-Augmented Generation (RAG) systems. It provides both an automated pipeline for generating synthetic evaluation datasets and novel robustness evaluation metrics. Unlike existing benchmarks that rely on static, general-knowledge queries, RARE emphasizes dynamics (time-sensitive data), query complexity (multi-hop reasoning), and content specialization (domain-specific technical queries), while rigorously testing system robustness against real-world noise, conflicting contexts, and retrieval failures.

RARE-Get: Dynamic Benchmark Generation Pipeline

RARE-Get automatically constructs challenging RAG evaluation datasets from domain-specific documents through four key stages: chunking documents while preserving semantic integrity, extracting knowledge graph triplets using LLMs, traversing graphs to identify query patterns, and generating quality-assured question-answer pairs.
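As a rough illustration of the first two stages, here is a minimal Python sketch; the data model, prompt, and function names (chunk_document, extract_triplets) are assumptions for illustration, not the released RARE-Get implementation.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Triplet:
    subject: str
    relation: str
    obj: str
    chunk_id: int  # provenance: which chunk the fact was extracted from


def chunk_document(text: str, max_chars: int = 1500) -> list[str]:
    """Pack whole paragraphs into chunks so semantic units are not split mid-paragraph."""
    chunks, buf = [], ""
    for para in text.split("\n\n"):
        if buf and len(buf) + len(para) > max_chars:
            chunks.append(buf.strip())
            buf = ""
        buf += para + "\n\n"
    if buf.strip():
        chunks.append(buf.strip())
    return chunks


def extract_triplets(chunk: str, chunk_id: int, llm: Callable[[str], str]) -> list[Triplet]:
    """Prompt an LLM for 'subject | relation | object' lines and parse them into Triplets."""
    prompt = ("List the factual triplets in this passage, one per line, "
              f"as 'subject | relation | object':\n\n{chunk}")
    triplets = []
    for line in llm(prompt).splitlines():
        parts = [p.strip() for p in line.split("|")]
        if len(parts) == 3:
            triplets.append(Triplet(parts[0], parts[1], parts[2], chunk_id=chunk_id))
    return triplets
```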

The pipeline supports diverse query patterns through knowledge graph traversal: single-hop queries test direct fact lookup, chained queries require following linked triplets across chunks, star-shaped queries aggregate multiple facts about a focal entity, and inverted-star queries combine evidence toward a common target.
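A self-contained toy sketch of the traversal step, assuming a flat triplet list (the entities and relations below are invented for illustration): chained patterns follow 2-hop paths, star patterns collect several facts about one focal entity, and inverted-star patterns collect facts converging on one target entity.

```python
from collections import defaultdict

# Toy triplets (subject, relation, object); in RARE-Get these come from LLM extraction.
triplets = [
    ("Acme Corp", "reported revenue of", "$4.2B"),
    ("Acme Corp", "is headquartered in", "Austin"),
    ("Acme Corp", "acquired", "Beta LLC"),
    ("Beta LLC", "operates in", "logistics"),
    ("Gamma Fund", "invested in", "Acme Corp"),
    ("Delta Bank", "lent to", "Acme Corp"),
]

out_edges, in_edges = defaultdict(list), defaultdict(list)
for s, r, o in triplets:
    out_edges[s].append((r, o))
    in_edges[o].append((s, r))

# Chained pattern: follow linked triplets across chunks (2-hop paths).
chained = [(s, r1, mid, r2, o)
           for s, edges in out_edges.items()
           for r1, mid in edges
           for r2, o in out_edges.get(mid, [])]

# Star-shaped pattern: aggregate multiple facts about a single focal entity.
stars = {s: edges for s, edges in out_edges.items() if len(edges) >= 2}

# Inverted-star pattern: multiple pieces of evidence converging on one target entity.
inverted = {o: edges for o, edges in in_edges.items() if len(edges) >= 2}

print(len(chained), "chained paths")  # e.g. Gamma Fund -> invested in -> Acme Corp -> acquired -> Beta LLC
print(sorted(stars))                  # ['Acme Corp']: focal entity with several outgoing facts
print(sorted(inverted))               # ['Acme Corp']: target entity with several incoming facts
```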

This automation enables dynamic benchmark evolution—as source documents update with new information, the pipeline automatically regenerates questions reflecting the latest facts, ensuring continued relevance for time-sensitive domains like finance and policy.

Figure: Examples of chained, star-shaped, and inverted-star query patterns generated through knowledge graph traversal.

RARE-Set: Large-Scale Domain-Specific Benchmark

RARE-Set comprises 48,295 questions across 527 time-sensitive documents spanning three critical domains: finance (199 SEC 10-K filings), economics (114 OECD surveys), and policy (214 HUD CAPER reports). Unlike general-knowledge benchmarks, these expert-level questions require advanced information synthesis and cannot be answered through memorization alone.

The benchmark features diverse complexity levels: single-hop queries (baseline retrieval), chained multi-hop queries (2-3 linked reasoning steps), star-shaped queries (aggregating diverse facts), and inverted-star queries (convergent evidence). All questions undergo rigorous LLM-based quality assessment across readability, clarity, and correctness dimensions.

RARE-Met: Retrieval-Aware Robustness Metrics

RARE-Met introduces novel robustness metrics that account for whether a model possesses the parametric knowledge to answer without retrieval. The framework distinguishes two scenarios: if $g(q, \emptyset) = 1$ (the model can answer without retrieval), it must answer correctly regardless of the retrieved content; if $g(q, \emptyset) = 0$ (it cannot answer without retrieval), it should answer correctly when given valid context and safely refuse when the retrieved content is incorrect or irrelevant.
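A minimal sketch of this single-case requirement, assuming boolean grading signals (answer correctness, refusal, context validity) are produced by an external judge; the function and parameter names are illustrative, not the RARE-Met implementation.

```python
def single_case_robustness(
    knows_without_retrieval: bool,  # g(q, ∅) = 1: the model answers correctly with no context
    answer_is_correct: bool,        # is the model's answer for (q, d) correct?
    model_refused: bool,            # did the model explicitly abstain?
    context_is_valid: bool,         # does the retrieved document d actually support the answer?
) -> int:
    """Sketch of the single-case check f(g(q, d), a): 1 if behavior matches the requirement."""
    if knows_without_retrieval:
        # Parametric knowledge exists: the answer must stay correct whatever was retrieved.
        return int(answer_is_correct)
    if context_is_valid:
        # No parametric knowledge, but the context is sufficient: the model must answer correctly.
        return int(answer_is_correct)
    # No parametric knowledge and the context is incorrect or irrelevant: per the definition
    # above, the robust behavior is a safe refusal.
    return int(model_refused)


# Example: the model lacks parametric knowledge and is fed an irrelevant document.
print(single_case_robustness(False, answer_is_correct=False,
                             model_refused=True, context_is_valid=False))  # -> 1 (robust)
```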

The framework evaluates three perturbation types: query perturbations (character-level typos, word substitutions, grammar variations, irrelevant information), document perturbations (lexical relevance vs. answer relevance variations), and real-world retrieval simulations using multiple state-of-the-art embedding models with re-ranking.
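Two of the query perturbations are easy to sketch; the operators below (typo_perturb, add_irrelevant_info) are illustrative stand-ins for the perturbation types described, not the exact RARE implementation.

```python
import random


def typo_perturb(query: str, rate: float = 0.1, seed: int = 0) -> str:
    """Character-level perturbation: randomly swap adjacent letters inside words."""
    rng = random.Random(seed)
    chars = list(query)
    for i in range(len(chars) - 1):
        if chars[i].isalpha() and chars[i + 1].isalpha() and rng.random() < rate:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)


def add_irrelevant_info(query: str, distractor: str) -> str:
    """Irrelevant-information perturbation: append a clause that does not change the answer."""
    return f"{query} (Note: {distractor}.)"


q = "What was Acme Corp's total revenue in fiscal year 2023?"
print(typo_perturb(q, rate=0.2))
print(add_irrelevant_info(q, "the filing is 120 pages long"))
```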

Robustness Metrics

RARE defines four robustness metrics. The overall robustness metric is detailed below, including which parameters are fixed, which are varied, and how the score is computed.

Overall Robustness

Mathematical expression:
$$\text{Overall Robustness} = \frac{1}{|Q||D|} \sum_{q \in Q} \sum_{d \in D} f(g(q,d), a)$$

Fixed parameters: the empty-retrieval baseline $\emptyset$, in which the model is tested without any retrieved documents; this determines $g(q, \emptyset)$ for each query.

Varied parameters: queries $q \in Q$ (all questions in the test set) and documents $d \in D$ (all variations of the retrieved documents).

Notation: $|Q|$ is the number of queries in the test set, $|D|$ is the number of document variations per query, and $f(g(q,d), a)$ is the robustness function that checks whether the model's response matches the expected behavior given query $q$, document $d$, and gold answer $a$.

What it measures: average robustness across all query-document combinations, testing how consistently the model handles varied retrieval contexts.
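Numerically, the metric is just the mean of the single-case scores over every query-document pair; a toy sketch with invented 0/1 scores:

```python
# Toy single-case scores f(g(q, d), a) for each (query, document-variation) pair.
scores = {
    ("q1", "clean"): 1, ("q1", "typo"): 1, ("q1", "conflicting"): 0,
    ("q2", "clean"): 1, ("q2", "typo"): 0, ("q2", "conflicting"): 0,
}

# Overall robustness = (1 / |Q||D|) * sum over all pairs of f(g(q, d), a).
overall = sum(scores.values()) / len(scores)
print(f"Overall robustness: {overall:.2f}")  # 3 robust behaviors out of 6 cases -> 0.50
```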

Contact

For questions or collaborations, please reach out to the authors.