38 Results
| Type | Time | Title and authors |
| Workshop | | Auto-Enhance: Towards a Meta-Benchmark to Evaluate AI Agents' Ability to Improve Other Agents Samuel Brown · Basil Labib · Codruta Lugoj · Sai Sasank Y |
| Poster | | SG-Bench: Evaluating LLM Safety Generalization Across Diverse Tasks and Prompt Types Yutao Mou · Shikun Zhang · Wei Ye |
| Workshop | Sat 12:00 | Towards Optimal Statistical Watermarking Baihe Huang · Hanlin Zhu · Banghua Zhu · Kannan Ramchandran · Michael Jordan · Jason Lee · Jiantao Jiao |
| Poster | | CLAVE: An Adaptive Framework for Evaluating Values of LLM Generated Responses Jing Yao · Xiaoyuan Yi · Xing Xie |
| Workshop | Sat 12:00 | A Statistical Approach to Quantifying LLM Human Alignment Harbin Hong · Liu Leqi · Sebastian Caldas |
| Poster | Thu 11:00 | ClashEval: Quantifying the tug-of-war between an LLM’s internal prior and external evidence Kevin Wu · Eric Wu · James Zou |
| Workshop | | Towards Optimizing SQL Generation via LLM Routing Mohammadhossein Malekpour · Nour Shaheen · Foutse Khomh · Amine Mhedhbi |
| Workshop | Sat 12:00 | Distribution-based sensitivity analysis for large language models Paulius Rauba · Qiyao Wei · Mihaela van der Schaar |
| Poster | Wed 11:00 | DetectRL: Benchmarking LLM-Generated Text Detection in Real-World Scenarios Junchao Wu · Runzhe Zhan · Derek Wong · Shu Yang · Xinyi Yang · Yulin Yuan · Lidia Chao |
| Workshop | Sat 12:00 | Skilling laws: scaling laws for LLM benchmark performance Felipe Maia Polo · Seamus Somerstep · Leshem Choshen · Yuekai Sun · Mikhail Yurochkin |
| Workshop | | Cheating Automatic LLM Benchmarks: Null Models Achieve High Win Rates Xiaosen Zheng · Tianyu Pang · Chao Du · Qian Liu · Jing Jiang · Min Lin |
| Workshop | | MMLU-Pro+: Evaluating Higher-Order Reasoning and Shortcut Learning in LLMs Saeid Asgari · Aliasghar Khani · Amir Khasahmadi |