378 Results
Workshop
|
VideoPhy: Evaluating Physical Commonsense for Video Generation Hritik Bansal · Zongyu Lin · Tianyi Xie · Zeshun Zong · Michal Yarom · Yonatan Bitton · Chenfanfu Jiang · Yizhou Sun · Kai-Wei Chang · Aditya Grover |
||
Workshop
|
Report Cards: Qualitative Evaluation of LLMs Using Natural Language Summaries Blair Yang · Fuyang Cui · Keiran Paster · Jimmy Ba · Pashootan Vaezipoor · Silviu Pitis · Michael Zhang |
||
Workshop
|
Critical human-AI use scenarios and interaction modes for societal impact evaluations Lujain Ibrahim · Saffron Huang · Lama Ahmad · Markus Anderljung |
||
Workshop
|
Sun 12:00 |
Legendre-SNN on Loihi-2: Evaluation and Insights Ramashish Gaurav · Terrence Stewart · Yang Yi |
|
Workshop
|
A Cautionary Tale on the Evaluation of Differentially Private In-Context Learning Anjun Hu · Jiyang Guan · Philip Torr · Francesco Pinto |
||
Workshop
|
I Think, Therefore I am: Benchmarking Awareness of Large Language Models Using AwareBench Yuan Li · Yue Huang · Yuli Lin · Siyuan Wu · Yao Wan · Lichao Sun |
||
Workshop
|
Analyzing Probabilistic Methods for Evaluating Agent Capabilities Axel Højmark · Govind Pimpale · Arjun Panickssery · Marius Hobbhahn · Jérémy Scheurer |
||
Workshop
|
Measuring AI Agent Autonomy: Towards a Scalable Approach With Code Inspection Merlin Stein · Peter Cihon · Gagan Bansal · Sam Manning |
||
Workshop
|
Troubling taxonomies in GenAI evaluation Glen Berman · Ned Cooper · Wesley Deng · Ben Hutchinson |
||
Workshop
|
Towards Deliberating Agents: Evaluating the Ability of Large Language Models to Deliberate Arjun Karanam · Farnaz Jahanbakhsh · Sanmi Koyejo |
||
Workshop
|
Safe and Sound: Evaluating Language Models for Bias Mitigation and Understanding Shaina Raza · Deval Pandya · Shardul Ghuge · Nifemi |