378 Results
Poster
|
Wed 16:30 |
UnlearnCanvas: Stylized Image Dataset for Enhanced Machine Unlearning Evaluation in Diffusion Models Yihua Zhang · Chongyu Fan · Yimeng Zhang · Yuguang Yao · Jinghan Jia · Jiancheng Liu · Gaoyuan Zhang · Gaowen Liu · Ramana Kompella · Xiaoming Liu · Sijia Liu |
|
Poster
|
Wed 16:30 |
AgentBoard: An Analytical Evaluation Board of Multi-turn LLM Agents Ma Chang · Junlei Zhang · Zhihao Zhu · Cheng Yang · Yujiu Yang · Yaohui Jin · Zhenzhong Lan · Lingpeng Kong · Junxian He |
|
Poster
|
Wed 11:00 |
DART-Eval: A Comprehensive DNA Language Model Evaluation Benchmark on Regulatory DNA Aman Patel · Arpita Singhal · Austin Wang · Anusri Pampari · Maya Kasowski · Anshul Kundaje |
|
Poster
|
Fri 16:30 |
Paloma: A Benchmark for Evaluating Language Model Fit Ian Magnusson · Akshita Bhagia · Valentin Hofmann · Luca Soldaini · Ananya Harsh Jha · Oyvind Tafjord · Dustin Schwenk · Evan Walsh · Yanai Elazar · Kyle Lo · Dirk Groeneveld · Iz Beltagy · Hanna Hajishirzi · Noah Smith · Kyle Richardson · Jesse Dodge |
|
Poster
|
Thu 11:00 |
Bias and Volatility: A Statistical Framework for Evaluating Large Language Model's Stereotypes and the Associated Generation Inconsistency Yiran Liu · Ke Yang · Zehan Qi · Xiao Liu · Yang Yu · Cheng Xiang Zhai |
|
Poster
|
Thu 11:00 |
WikiContradict: A Benchmark for Evaluating LLMs on Real-World Knowledge Conflicts from Wikipedia Yufang Hou · Alessandra Pascale · Javier Carnerero-Cano · Tigran Tchrakian · Radu Marinescu · Elizabeth Daly · Inkit Padhi · Prasanna Sattigeri |
|
Affinity Event
|
Ontology Extraction and Evaluation for the Blue Amazon Vivian Magri Alcaldi Soares · Renata Wassermann |
|
Poster
|
Wed 11:00 |
GameTraversalBenchmark: Evaluating Planning Abilities Of Large Language Models Through Traversing 2D Game Maps Muhammad Umair Nasir · Steven James · Julian Togelius |
|
Poster
|
Fri 11:00 |
InfiBench: Evaluating the Question-Answering Capabilities of Code Large Language Models Linyi Li · Shijie Geng · Zhenwen Li · Yibo He · Hao Yu · Ziyue Hua · Guanghan Ning · Siwei Wang · Tao Xie · Hongxia Yang |
|
Poster
|
Fri 11:00 |
Evaluating language models as risk scores André F. Cruz · Moritz Hardt · Celestine Mendler-Dünner |
|
Poster
|
Wed 16:30 |
Diff-eRank: A Novel Rank-Based Metric for Evaluating Large Language Models Lai Wei · Zhiquan Tan · Chenghai Li · Jindong Wang · Weiran Huang |
|
Affinity Event
|
Evaluating Generative AI for Scenario Variation in Automated Driving Validation Manasa Mariam Mammen · Zafer Kayatas · Eva Zimmermann · Pavel Nedvědický |