QAS: A Composite Query-Attributed Score for Evaluating Retrieval-Augmented Generation Systems
Abstract
Retrieval-Augmented Generation (RAG) systems have advanced knowledge-grounded QA, but evaluation remains challenging due to the competing demands of faithfulness to evidence, coverage of query-relevant information, and computational efficiency. We introduce QAS, a composite Query-Attributed Score for fine-grained, interpretable evaluation of RAG. QAS decomposes quality into five dimensions—grounding, retrieval coverage, answer faithfulness, context efficiency, and relevance—each computed with lightweight, task-agnostic metrics (token/entity attribution, n-gram overlap, factual consistency, redundancy penalties, and embedding similarity). A linear combination with tunable weights yields a unified score plus per-dimension diagnostics. Across five QA benchmarks (open-domain, biomedical, legal/regulatory, customer-support, and news), QAS aligns closely with human judgments at moderate cost. Ablations confirm each dimension’s necessity, establishing QAS as a transparent, practical framework for reliable RAG evaluation.
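The composite described above can be sketched as a weighted sum over the five dimensions. This is a minimal illustration only, assuming each dimension yields a score in [0, 1]; the function name, dimension keys, and weight values are illustrative, not the paper's implementation.

```python
# Hypothetical sketch of the QAS composite: five per-dimension scores
# in [0, 1] combined by a tunable weighted linear combination.
# Dimension names follow the abstract; weights are illustrative.

def qas_score(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted linear combination of per-dimension scores.

    Both dicts are keyed by the five QAS dimensions: grounding,
    retrieval_coverage, answer_faithfulness, context_efficiency,
    relevance.
    """
    total_w = sum(weights.values())
    # Normalize by the total weight so the composite stays in [0, 1].
    return sum(weights[d] * scores[d] for d in scores) / total_w

# Example with uniform weights; the per-dimension values in `scores`
# double as the interpretable diagnostics the abstract mentions.
dims = ["grounding", "retrieval_coverage", "answer_faithfulness",
        "context_efficiency", "relevance"]
weights = {d: 1.0 for d in dims}
scores = {"grounding": 0.9, "retrieval_coverage": 0.8,
          "answer_faithfulness": 0.85, "context_efficiency": 0.7,
          "relevance": 0.95}
composite = qas_score(scores, weights)
```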