38 Results
Workshop
|
Benchmark to Audit LLM Generated Clinical Notes for Disparities Arising from Biases and Stereotypes Hongyu Cai · Swetasudha Panda · Naveen Jafer Nizar · Qinlan Shen · Daeja Oxendine · Sumana Srivatsa · Krishnaram Kenthapadi |
||
Workshop
|
Sat 17:27 |
Benchmark to Audit LLM Generated Clinical Notes for Disparities Arising from Biases and Stereotypes Hongyu Cai · Swetasudha Panda · Naveen Jafer Nizar · Qinlan Shen · Daeja Oxendine · Sumana Srivatsa · Krishnaram Kenthapadi |
|
Poster
|
Wed 16:30 |
AgentBoard: An Analytical Evaluation Board of Multi-turn LLM Agents Ma Chang · Junlei Zhang · Zhihao Zhu · Cheng Yang · Yujiu Yang · Yaohui Jin · Zhenzhong Lan · Lingpeng Kong · Junxian He |
|
Workshop
|
GameBench: Evaluating Strategic Reasoning Abilities of LLM Agents Anthony Costarelli · Mat Allen · Roman Hauksson · Grace Sodunke · Suhas Hariharan · Carlson Cheng · Wenjie Li · Joshua Clymer · Arjun Yadav |
||
Workshop
|
Sat 15:45 |
Statistical Uncertainty Quantification for Aggregate Performance Metrics in Machine Learning Benchmarks Rachel Longjohn · Giri Gopalan · Emily Casleton |
|
Poster
|
Thu 16:30 |
AgentDojo: A Dynamic Environment to Evaluate Prompt Injection Attacks and Defenses for LLM Agents Edoardo Debenedetti · Jie Zhang · Mislav Balunovic · Luca Beurer-Kellner · Marc Fischer · Florian Tramer |
|
Workshop
|
Statistical Bias in Bias Benchmark Design Hannah Powers · Ioana Baldini · Dennis Wei · Kristin P Bennett |
||
Workshop
|
MaCBench: A multimodal chemistry and materials science benchmark Nawaf Alampara · Indrajeet Mandal · Pranav Khetarpal · Hargun Grover · Mara Schilling-Wilhelmi · N M Anoop Krishnan · Kevin Maik Jablonka |
||
Poster
|
Thu 16:30 |
Easy2Hard-Bench: Standardized Difficulty Labels for Profiling LLM Performance and Generalization Mucong Ding · Chenghao Deng · Jocelyn Choo · Zichu Wu · Aakriti Agrawal · Avi Schwarzschild · Tianyi Zhou · Tom Goldstein · John Langford · Animashree Anandkumar · Furong Huang |
|
Poster
|
Large Language Models' Expert-level Global History Knowledge Benchmark (HiST-LLM) Jakob Hauser · Dániel Kondor · Jenny Reddish · Majid Benam · Enrico Cioni · Federica Villa · James Bennett · Daniel Hoyer · Pieter Francois · Peter Turchin · R. Maria del Rio-Chanona |
||
Workshop
|
Sat 12:00 |
CPP-UT-Bench: Can LLMs Write Complex Unit Tests in C++? Vaishnavi Bhargava · Rajat Ghosh · Debojyoti Dutta |
|
Poster
|
Thu 16:30 |
EHRNoteQA: An LLM Benchmark for Real-World Clinical Practice Using Discharge Summaries Sunjun Kweon · Jiyoun Kim · Heeyoung Kwak · Dongchul Cha · Hangyul Yoon · Kwang Kim · Jeewon Yang · Seunghyun Won · Edward Choi |