RefineBench: Evaluating Refinement Capability in Language Models
Abstract
We revisit the question of whether language models (LMs) can refine their responses. Extending beyond prior work, we explore both (1) guided refinement, where users explicitly provide natural language feedback on the unsatisfactory parts of a response, and (2) self-refinement, where LMs improve their responses without specific guidance. We test this with RefineBench, a new benchmark of 1,002 challenging problems across 11 domains paired with a controlled checklist-based evaluation framework. Each checklist item serves both as an evaluation criterion for accurate response assessment and as a probe of whether LMs can integrate targeted feedback on specific failure points. Our experiments show that even frontier LMs such as GPT-4.1 and DeepSeek-R1 achieve scores of only 17.23 and 7.94 points, respectively. More critically, we find that in self-refinement settings, performance consistently degrades with each successive turn over five iterations. In contrast, guided refinement is more successful: proprietary LMs and open-weight LMs larger than 70B can effectively incorporate natural language feedback about incorrect parts, refining their responses to near-perfect scores within five turns. However, open-weight LMs smaller than 70B show a persistent inability to incorporate feedback, even when explicitly told which parts are incorrect.