LILA: A Unified Benchmark for Mathematical Reasoning
Swaroop Mishra · Matthew Finlayson · Pan Lu · Leonard Tang · Sean Welleck · Chitta Baral · Tanmay Rajpurohit · Oyvind Tafjord · Ashish Sabharwal · Peter Clark · Ashwin Kalyan

Mathematical reasoning skills are essential for general-purpose intelligent systems to perform tasks from grocery shopping to climate modeling. Towards evaluating and improving AI systems in this domain, we propose LILA, a unified mathematical reasoning benchmark consisting of 23 diverse tasks along four dimensions: (i) mathematical abilities, e.g., arithmetic, calculus; (ii) language format, e.g., question-answering, fill-in-the-blanks; (iii) language diversity, e.g., no language, simple language; and (iv) external knowledge, e.g., commonsense, physics. We construct our benchmark by extending 20 datasets, collecting task instructions and solutions in the form of Python programs, thereby obtaining explainable solutions in addition to the correct answer. We introduce two evaluation datasets to measure out-of-distribution performance and robustness to language perturbation. Finally, we introduce BHASKARA and its variants, a family of mathematical reasoning models fine-tuned on LILA. Importantly, we find that multi-tasking leads to significant improvements (an average relative improvement of 21.83% in F1 score over single-task models), while the best-performing model obtains only 60.40%, indicating room for improvement in general mathematical reasoning and understanding.
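To illustrate what a solution "in the form of a Python program" might look like, here is a minimal sketch. The instance, its field names (`instruction`, `question`, `program`, `answer`), and the helper functions are hypothetical and chosen for illustration only; they are not taken from the LILA data format.

```python
# Hypothetical task instance. The field names and contents are illustrative
# assumptions, not the actual LILA schema.
instance = {
    "instruction": "Write a Python program that answers the arithmetic question.",
    "question": (
        "A shopper buys 3 apples at 2 dollars each and pays with a "
        "10 dollar bill. How much change is returned?"
    ),
    # The solution is a program, so each intermediate step is inspectable.
    "program": (
        "def solution():\n"
        "    cost = 3 * 2      # total price of the apples\n"
        "    return 10 - cost  # change from the 10 dollar bill\n"
    ),
    "answer": 4,
}


def run_program(source: str):
    """Execute the program text and return the value of its solution() function."""
    namespace = {}
    exec(source, namespace)  # only safe for trusted program text
    return namespace["solution"]()


def is_correct(inst: dict) -> bool:
    """Check that executing the program reproduces the annotated answer."""
    return run_program(inst["program"]) == inst["answer"]


if __name__ == "__main__":
    print(is_correct(instance))
```

Representing the solution as a program rather than a bare number means the answer can be re-derived by execution and the reasoning steps can be read directly from the code.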

Author Information

Swaroop Mishra (Arizona State University)
Matthew Finlayson (AI2)

Matthew Finlayson is a pre-doctoral investigator at the Allen Institute for AI. He completed his bachelor's degree in Computer Science and Linguistics at Harvard in 2021.

Pan Lu (UCLA; AI2)
Leonard Tang (Harvard University)
Sean Welleck (University of Washington)
Chitta Baral (Arizona State University)
Tanmay Rajpurohit (Georgia Institute of Technology)
Oyvind Tafjord (Allen Institute for AI)
Ashish Sabharwal (Allen Institute for AI)
Peter Clark (Allen Institute for AI)
Ashwin Kalyan (AI2)
