Skip to yearly menu bar Skip to main content

Workshop: MATH-AI: The 3rd Workshop on Mathematical Reasoning and AI

SCIBENCH: Evaluating College-Level Scientific Problem-Solving Abilities of Large Language Models

Xiaoxuan Wang · Ziniu Hu · Pan Lu · Yanqiao Zhu · Jieyu Zhang · Satyen Subramaniam · Arjun Loomba · Shichang Zhang · Yizhou Sun · Wei Wang

Keywords: [ Large Language Model Benchmark ]


Recent advances in Large Language Models (LLMs) have demonstrated notable progress on many mathematical benchmarks. However, most of these benchmarks only contain problems grounded in junior and senior high school subjects, contain only multiple-choice questions, and are confined to a limited scope of elementary arithmetic operations.To address these issues, this paper introduces an expansive benchmark suite Scibench that aims to systematically examine the reasoning capabilities required for solving complex scientific problems. Scibench contains two datasets: an open set featuring a range of collegiate-level scientific problems, and a closed set comprising problems from undergraduate-level exams.Based on the two datasets, we conduct an in-depth benchmarking study of five representative LLMs with various prompting strategies. Furthermore, through a detailed user study, we show that no single prompting strategy significantly outperforms the others and some strategies that demonstrate improvements in certain problem-solving skills could result in declines in other skills.

Chat is not available.