16 Results
Workshop
|
Evaluating Language Models Planning Capabilities on Goal Ordering Challenges Eran Hirsch · Guy Uziel · Ateret Anaby Tavor |
||
Workshop
|
Analyzing Probabilistic Methods for Evaluating Agent Capabilities Axel Højmark · Govind Pimpale · Arjun Panickssery · Marius Hobbhahn · Jérémy Scheurer |
||
Workshop
|
AI Sandbagging: Language Models can Selectively Underperform on Evaluations Teun van der Weij · Felix Hofstätter · Oliver Jaffe · Samuel Brown · Francis Ward |
||
Poster
|
Fri 11:00 |
InfiBench: Evaluating the Question-Answering Capabilities of Code Large Language Models Linyi Li · Shijie Geng · Zhenwen Li · Yibo He · Hao Yu · Ziyue Hua · Guanghan Ning · Siwei Wang · Tao Xie · Hongxia Yang |
|
Workshop
|
Sandbag Detection through Model Impairment Cameron Tice · Philipp Kreer · Nathan Helm-Burger · Prithviraj Singh Shahani · Fedor Ryzhenkov · Teun van der Weij · Felix Hofstätter · Jacob Haimes |
||
Workshop
|
CausalBench: A Comprehensive Benchmark for Evaluating Causal Reasoning Capabilities of Large Language Models Zeyu Wang |
||
Workshop
|
The Elicitation Game: Stress-Testing Capability Elicitation Techniques Felix Hofstätter · Jayden Teoh · Teun van der Weij · Francis Ward |
||
Poster
|
Wed 16:30 |
ConvBench: A Multi-Turn Conversation Evaluation Benchmark with Hierarchical Ablation Capability for Large Vision-Language Models Shuo Liu · Kaining Ying · Hao Zhang · Yue Yang · Yuqi Lin · Tianle Zhang · Chuanhao Li · Yu Qiao · Ping Luo · Wenqi Shao · Kaipeng Zhang |
|
Workshop
|
Evaluating Interventional Reasoning Capabilities of Large Language Models Tejas Kasetty · Divyat Mahajan · Gintare Karolina Dziugaite · Alexandre Drouin · Dhanya Sridhar |
||
Workshop
|
Dimensions of Generative AI Evaluation Design Alex Dow · Jennifer Wortman Vaughan · Solon Barocas · Chad Atalla · Alexandra Chouldechova · Hanna Wallach |