

Search All 2024 Events
 

34 Results

Poster
Wed 16:30 Embodied Agent Interface: Benchmarking LLMs for Embodied Decision Making
Manling Li · Shiyu Zhao · Qineng Wang · Kangrui Wang · Yu Zhou · Sanjana Srivastava · Cem Gokmen · Tony Lee · Erran Li Li · Ruohan Zhang · Weiyu Liu · Percy Liang · Fei-Fei Li · Jiayuan Mao · Jiajun Wu
Workshop
FEABench: Evaluating Language Models on Real World Physics Reasoning Ability
Nayantara Mudur · Hao Cui · Subhashini Venugopalan · Paul Raccuglia · Michael Brenner · Peter Norgaard
Poster
Thu 16:30 WhodunitBench: Evaluating Large Multimodal Agents via Murder Mystery Games
Junlin Xie · Ruifei Zhang · Zhihong Chen · Xiang Wan · Guanbin Li
Workshop
InvestESG: A Multi-agent Reinforcement Learning Benchmark for Studying Climate Investment as a Social Dilemma
Xiaoxuan Hou · Jiayi Yuan · Natasha Jaques
Oral
Wed 15:50 Embodied Agent Interface: Benchmarking LLMs for Embodied Decision Making
Manling Li · Shiyu Zhao · Qineng Wang · Kangrui Wang · Yu Zhou · Sanjana Srivastava · Cem Gokmen · Tony Lee · Erran Li Li · Ruohan Zhang · Weiyu Liu · Percy Liang · Fei-Fei Li · Jiayuan Mao · Jiajun Wu
Poster
Wed 11:00 ZSC-Eval: An Evaluation Toolkit and Benchmark for Multi-agent Zero-shot Coordination
Xihuai Wang · Shao Zhang · Wenhao Zhang · Wentao Dong · Jingxiao Chen · Ying Wen · Weinan Zhang
Workshop
An Efficient Open World Benchmark for Multi-Agent Reinforcement Learning
Eric Ye · Natasha Jaques
Poster
Thu 11:00 OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments
Tianbao Xie · Danyang Zhang · Jixuan Chen · Xiaochuan Li · Siheng Zhao · Ruisheng Cao · Jing Hua Toh · Zhoujun Cheng · Dongchan Shin · Fangyu Lei · Yitao Liu · Yiheng Xu · Shuyan Zhou · Silvio Savarese · Caiming Xiong · Victor Zhong · Tao Yu
Workshop
Sat 15:45 Benchmark Self-Evolving: A Multi-Agent Framework for Dynamic LLM Evaluation
Siyuan Wang · Zhuohan Long · Zhihao Fan · Xuanjing Huang · Zhongyu Wei
Poster
Wed 11:00 SWT-Bench: Testing and Validating Real-World Bug-Fixes with Code Agents
Niels Mündler · Mark Müller · Jingxuan He · Martin Vechev
Workshop
RefactorBench: Evaluating Stateful Reasoning In Language Agents Through Code
Dhruv Gautam · Spandan Garg · Jinu Jang · Neel Sundaresan · Roshanak Zilouchian Moghaddam