BLUFF-1000: Measuring Uncertainty Expression in RAG
Abstract
Retrieval-augmented generation (RAG) systems often fail to modulate their linguistic certainty adequately when evidence deteriorates. This failure to respond appropriately to imperfect retrieval undermines the safety and reliability of real-world RAG deployments. To address this gap, we propose BLUFF-1000, a benchmark systematically designed to evaluate how large language models (LLMs) manage linguistic confidence under conflicting evidence that simulates poor retrieval. We construct a new dataset, introduce two novel metrics, and compute a comprehensive set of measures quantifying faithfulness, factuality, linguistic uncertainty, and calibration. Finally, we run controlled experiments on the generation components of RAG systems, evaluating seven LLMs on the benchmark and measuring both their uncertainty awareness and their general performance. While not definitive, our observations provide initial evidence of a misalignment between expressed uncertainty and source quality across all seven state-of-the-art models, underscoring the value of continued benchmarking in this space. We recommend that future RAG systems refine uncertainty-aware methods so that confidence is conveyed transparently throughout the system.