Instruction Following for Finance: Verifying language models’ ability to follow complex financial instructions
Abstract
Language Models (LMs) demonstrate an impressive ability to follow instructions, but the risk of hallucination when executing complex, interdependent commands has limited their effectiveness in domains like finance, where precision is critical. We introduce IFF, a high-difficulty benchmark designed to assess the instruction-following capabilities of LMs for finance. IFF provides 88 human-authored prompts that mirror financial analysis tasks and uses a verification system with chainable, verifiable constraints to provide fine-grained reward signals. We evaluate 53 models in a zero-shot setting, including leading proprietary, open-weight, and open-source systems. Our key findings reveal that open-weight models can match or surpass the instruction-following capabilities of proprietary systems. However, even the top-performing models fail to achieve perfect compliance and struggle with the IFF benchmark's complex requirements. We release our dataset and code as an open-source resource to promote research into Reinforcement Learning with Verifiable Rewards (RLVR) for the financial domain.
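To make the verification idea concrete, the sketch below shows one way chainable, programmatically verifiable constraints could be composed into a fine-grained reward signal. This is an illustrative assumption, not the released IFF verification system; the names (`Constraint`, `word_limit`, `must_contain`, `evaluate`) are hypothetical.

```python
# Minimal sketch (hypothetical, not the authors' released code): chaining
# deterministic constraint checks over a model response and scoring the
# fraction satisfied as a fine-grained reward.
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Constraint:
    name: str
    check: Callable[[str], bool]  # deterministic, programmatic verifier


def word_limit(max_words: int) -> Constraint:
    return Constraint(f"<= {max_words} words",
                      lambda text: len(text.split()) <= max_words)


def must_contain(term: str) -> Constraint:
    return Constraint(f"mentions '{term}'",
                      lambda text: term.lower() in text.lower())


def evaluate(response: str, constraints: List[Constraint]) -> float:
    """Return the fraction of constraints satisfied (fine-grained reward)."""
    passed = [c.check(response) for c in constraints]
    return sum(passed) / len(passed)


# Example: one prompt chaining several requirements on a single response.
constraints = [word_limit(150), must_contain("EBITDA"), must_contain("YoY")]
reward = evaluate("EBITDA grew 12% YoY on stable margins.", constraints)
print(f"reward = {reward:.2f}")  # 1.00 here; partial credit if some checks fail
```

Because each check is deterministic, such a scorer could in principle serve directly as a reward function for RLVR-style training, which is the use case the abstract motivates.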