34 Results
Type | Time | Title | Authors
Poster | Wed 16:30 | Embodied Agent Interface: Benchmarking LLMs for Embodied Decision Making | Manling Li · Shiyu Zhao · Qineng Wang · Kangrui Wang · Yu Zhou · Sanjana Srivastava · Cem Gokmen · Tony Lee · Erran Li Li · Ruohan Zhang · Weiyu Liu · Percy Liang · Fei-Fei Li · Jiayuan Mao · Jiajun Wu
Workshop | | FEABench: Evaluating Language Models on Real World Physics Reasoning Ability | Nayantara Mudur · Hao Cui · Subhashini Venugopalan · Paul Raccuglia · Michael Brenner · Peter Norgaard
Poster | Thu 16:30 | WhodunitBench: Evaluating Large Multimodal Agents via Murder Mystery Games | Junlin Xie · Ruifei Zhang · Zhihong Chen · Xiang Wan · Guanbin Li
Workshop | | InvestESG: A Multi-agent Reinforcement Learning Benchmark for Studying Climate Investment as a Social Dilemma | Xiaoxuan Hou · Jiayi Yuan · Natasha Jaques
Oral | Wed 15:50 | Embodied Agent Interface: Benchmarking LLMs for Embodied Decision Making | Manling Li · Shiyu Zhao · Qineng Wang · Kangrui Wang · Yu Zhou · Sanjana Srivastava · Cem Gokmen · Tony Lee · Erran Li Li · Ruohan Zhang · Weiyu Liu · Percy Liang · Fei-Fei Li · Jiayuan Mao · Jiajun Wu
Poster | Wed 11:00 | ZSC-Eval: An Evaluation Toolkit and Benchmark for Multi-agent Zero-shot Coordination | Xihuai Wang · Shao Zhang · Wenhao Zhang · Wentao Dong · Jingxiao Chen · Ying Wen · Weinan Zhang
Workshop | | An Efficient Open World Benchmark for Multi-Agent Reinforcement Learning | Eric Ye · Natasha Jaques
Poster | Thu 11:00 | OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments | Tianbao Xie · Danyang Zhang · Jixuan Chen · Xiaochuan Li · Siheng Zhao · Ruisheng Cao · Jing Hua Toh · Zhoujun Cheng · Dongchan Shin · Fangyu Lei · Yitao Liu · Yiheng Xu · Shuyan Zhou · Silvio Savarese · Caiming Xiong · Victor Zhong · Tao Yu
Workshop | Sat 15:45 | Benchmark Self-Evolving: A Multi-Agent Framework for Dynamic LLM Evaluation | Siyuan Wang · Zhuohan Long · Zhihao Fan · Xuanjing Huang · Zhongyu Wei
Poster | Wed 11:00 | SWT-Bench: Testing and Validating Real-World Bug-Fixes with Code Agents | Niels Mündler · Mark Müller · Jingxuan He · Martin Vechev
Workshop | | FEABench: Evaluating Language Models on Real World Physics Reasoning Ability | Nayantara Mudur · Hao Cui · Subhashini Venugopalan · Paul Raccuglia · Michael Brenner · Peter Norgaard
Workshop | | RefactorBench: Evaluating Stateful Reasoning In Language Agents Through Code | Dhruv Gautam · Spandan Garg · Jinu Jang · Neel Sundaresan · Roshanak Zilouchian Moghaddam