NeurIPS 2025 San Diego Datasets & Benchmarks

Skip to yearly menu bar Skip to main content

Poster

TreeFinder: A US-Scale Benchmark Dataset for Individual Tree Mortality Monitoring Using High-Resolution Aerial Imagery

Zhihao Wang ⋅ Cooper Li ⋅ Ruichen Wang ⋅ Lei Ma ⋅ George Hurtt ⋅ Xiaowei Jia ⋅ Gengchen Mai ⋅ Zhili Li ⋅ Yiqun Xie

Dec 3, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Monitoring individual tree mortality at scale has been found to be crucial for understanding forest loss, ecosystem resilience, carbon fluxes, and climate-induced impacts. However, the fine-granularity monitoring faces major challenges on both the data and methodology sides because: (1) finding isolated individual-level tree deaths requires high-resolution remote sensing images with broad coverage, and (2) compared to regular geo-objects (e.g., buildings), dead trees often exhibit weaker contrast and high variability across tree types, landscapes and ecosystems. Existing datasets on tree mortality primarily rely on moderate-resolution satellite imagery (e.g., 30m resolution), which aims to detect large-patch wipe-outs but is unable to recognize individual-level tree mortality events. Several efforts have explored alternatives via very-high-resolution drone imagery. However, drone images are highly expensive and can only be collected at local scales, which are therefore not suitable for national-scale applications and beyond. To bridge the gaps,we introduce TreeFinder, the first high-resolution remote sensing benchmark dataset designed for individual-level tree mortality mapping across the Contiguous United States (CONUS). Specifically, the dataset uses NAIP imagery at 0.6m resolution that provides wall-to-wall coverage of the entire CONUS. TreeFinder contains images with pixel-level labels generated via extensive manual annotation that covers forested areas in 48 states with over 23,000 hectares. All annotations are rigorously validated using multi-temporal NAIP images and auxiliary vegetation indices from remote sensing imagery. Moreover, TreeFinder includes multiple evaluation scenarios to test the models' ability in generalizing across different geographic regions, climate zones, and forests with different plant function types. Finally, we develop benchmarks using a suite of semantic segmentation models, including both convolutional architectures and more recent foundation models based on vision transformers for general and remote sensing images. Our dataset and code are publicly available on Kaggle and GitHub: https://www.kaggle.com/datasets/zhihaow/tree-finder and https://github.com/zhwang0/treefinder.

View full details

Poster

Benchmarking Large Language Models with Integer Sequence Generation Tasks

Daniel O'Malley ⋅ Manish Bhattarai ⋅ Nishath Ranasinghe ⋅ Erick Draayer ⋅ Javier E. Santos

Dec 3, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

We present a novel benchmark designed to rigorously evaluate the capabilities of large language models (LLMs) in mathematical reasoning and algorithmic code synthesis tasks. The benchmark comprises integer sequence generation tasks sourced from the Online Encyclopedia of Integer Sequences (OEIS), testing LLMs' abilities to accurately and efficiently generate Python code to compute these sequences without using lookup tables. Our comprehensive evaluation includes leading models from OpenAI (including the specialized reasoning-focused o-series), Anthropic, Meta, and Google across a carefully selected set of 1000 OEIS sequences categorized as ``easy'' or ``hard.'' Half of these sequences are classical sequences from the early days of OEIS and half were recently added to avoid contamination with the models' training data. To prevent models from exploiting memorized sequence values, we introduce an automated cheating detection mechanism that flags usage of lookup tables, validated by comparison with human expert evaluations. Experimental results demonstrate that reasoning-specialized models (o3, o3-mini, o4-mini from OpenAI, and Gemini 2.5-pro from Google) achieve substantial improvements in accuracy over non-reasoning models, especially on more complex tasks. However, overall model performance on the hard sequences is poor, highlighting persistent challenges in algorithmic reasoning. Our benchmark provides important insights into the strengths and limitations of state-of-the-art LLMs, particularly emphasizing the necessity for further advancements to reliably solve complex mathematical reasoning tasks algorithmically.

View full details

Poster

Revisiting Reinforcement Learning for LLM Reasoning from A Cross-Domain Perspective

Jorge (Zhoujun) Cheng ⋅ Shibo Hao ⋅ Tianyang Liu ⋅ Fan Zhou ⋅ Yutao Xie ⋅ Feng Yao ⋅ Yuexin Bian ⋅ Nilabjo Dey ⋅ Yonghao Zhuang ⋅ Yuheng Zha ⋅ Yi Gu ⋅ Kun Zhou ⋅ Yuqi Wang ⋅ Yuan Li ⋅ Richard Fan ⋅ Jianshu She ⋅ Chengqian Gao ⋅ Abulhair Saparov ⋅ Taylor W. Killian ⋅ Haonan Li ⋅ Mikhail Yurochkin ⋅ Eric Xing ⋅ Zhengzhong Liu ⋅ Zhiting Hu

Dec 3, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Reinforcement learning (RL) has shown promise in enhancing large language model (LLM) reasoning, yet progress towards broader capabilities is limited by the availability of high-quality, multi-domain datasets. This work introduces \ours, a 92K RL-for-reasoning dataset designed to address this gap, covering six reasoning domains: Math, Code, Science, Logic, Simulation, and Tabular, each with corresponding verifiers. We build \ours via a careful data-curation pipeline, including sourcing, deduplication, reward design, and domain-specific and difficulty-based filtering, to facilitate the systematic investigation of cross-domain RL generalization. Our study using \ours suggests the efficacy of a simple mixed-domain RL training approach and reveals several key aspects affecting cross-domain transferability. We further train two models {\ours}-7B and {\ours}-32B purely with RL on our curated data and observe largely improved performance over leading open RL reasoning model baselines, with gains of 7.3\% and 7.8\% respectively on an extensive 17-task, six-domain evaluation suite. We are releasing our dataset, code, and evaluation suite to the community, aiming to support further research and development of more general RL-enhanced reasoning models.

View full details

Poster

Collaborating Vision, Depth, and Thermal Signals for Multi-Modal Tracking: Dataset and Algorithm

Xue-Feng Zhu ⋅ Tianyang Xu ⋅ Yifan Pan ⋅ Jinjie Gu ⋅ Xi Li ⋅ Jiwen Lu ⋅ Xiaojun Wu ⋅ Josef Kittler

Dec 3, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Existing multi-modal object tracking approaches primarily focus on dual-modal paradigms, such as RGB-Depth or RGB-Thermal, yet remain challenged in complex scenarios due to limited input modalities. To address this gap, this work introduces a novel multi-modal tracking task that leverages three complementary modalities, including visible RGB, Depth (D), and Thermal Infrared (TIR), aiming to enhance robustness in complex scenarios. To support this task, we construct a new multi-modal tracking dataset, coined RGBDT500, which consists of 500 videos with synchronised frames across the three modalities. Each frame provides spatially aligned RGB, depth, and thermal infrared images with precise object bounding box annotations.Furthermore, we propose a novel multi-modal tracker, dubbed RDTTrack.RDTTrack integrates tri-modal information for robust tracking by leveraging a pretrained RGB-only tracking model and prompt learning techniques.In specific, RDTTrack fuses thermal infrared and depth modalities under a proposed orthogonal projection constraint, then integrates them with RGB signals as prompts for the pre-trained foundation tracking model, effectively harmonising tri-modal complementary cues.The experimental results demonstrate the effectiveness and advantages of the proposed method, showing significant improvements over existing dual-modal approaches in terms of tracking accuracy and robustness in complex scenarios. The dataset and source code are publicly available at https://xuefeng-zhu5.github.io/RGBDT500.

View full details

Poster

Aeolus: A Multi-structural Flight Delay Dataset

Lin Xu ⋅ Xinyun Yuan ⋅ Yuxuan Liang ⋅ Suwan Yin ⋅ Yuankai Wu

Dec 3, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

We introduce Aeolus, a large-scale Multi-modal Flight Delay Dataset designed to advance research on flight delay prediction and support the development of foundation models for tabular data. Existing datasets in this domain are typically limited to flat tabular structures and fail to capture the spatiotemporal dynamics inherent in delay propagation. Aeolus addresses this limitation by providing three aligned modalities: (i) a tabular dataset with rich operational, meteorological, and airportlevel features for over 50 million flights; (ii) a flight chain module that models delay propagation along sequential flight legs, capturing upstream and downstream dependencies; and (iii) a flight network graph that encodes shared aircraft, crew, and airport resource connections, enabling cross-flight relational reasoning. The dataset is carefully constructed with temporal splits, comprehensive features, and strict leakage prevention to support realistic and reproducible machine learning evaluation. Aeolus supports a broad range of tasks, including regression, classification, temporal structure modeling, and graph learning, serving as a unified benchmark across tabular, sequential, and graph modalities. We release baseline experiments and preprocessing tools to facilitate adoption. Aeolus fills a key gap for both domain-specific modeling and general-purpose structured data research.Our source code and data can be accessed at https://github.com/Flnny/Delay-data

View full details

Poster

GTPBD: A Fine-Grained Global Terraced Parcel and Boundary Dataset

Zhiwei Zhang ⋅ Zi Ye ⋅ Yibin Wen ⋅ Shuai Yuan ⋅ Haohuan Fu ⋅ Huang Jianxi ⋅ Juepeng Zheng

Dec 3, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Agricultural parcels serve as basic units for conducting agricultural practices and applications, which is vital for land ownership registration, food security assessment, soil erosion monitoring, etc. However, existing agriculture parcel extraction studies only focus on mid-resolution mapping or regular plain farmlands while lacking representation of complex terraced terrains due to the demands of precision agriculture. In this paper, we introduce a more fine-grained terraced parcel dataset named GTPBD (Global Terraced Parcel and Boundary Dataset), which is the first fine-grained dataset covering major worldwide terraced regions with more than 200,000 complex terraced parcels with manually annotation. GTPBD comprises 47,537 high-resolution images with three-level labels, including pixel-level boundary labels, mask labels, and parcel labels. It covers seven major geographic zones in China and transcontinental climatic regions around the world. Compared to the existing datasets, the GTPBD dataset brings considerable challenges due to the: (1) terrain diversity; (2) complex and irregular parcel objects; and (3) multiple domain styles. Our proposed GTPBD dataset is suitable for four different tasks, including semantic segmentation, edge detection, terraced parcel extraction and unsupervised domain adaptation (UDA) tasks. Accordingly, we benchmark the GTPBD dataset on eight semantic segmentation methods, four edge extraction methods, three parcel extraction methods and five UDA methods, along with a multi-dimensional evaluation framework integrating pixel-level and object-level metrics. GTPBD fills a critical gap in terraced remote sensing research, providing a basic infrastructure for fine-grained agricultural terrain analysis and cross-scenario knowledge transfer. The code and data are available at https://github.com/Z-ZW-WXQ/GTPBD/.

View full details

Poster

IRRISIGHT: A Large-Scale Multimodal Dataset and Scalable Pipeline to Address Irrigation and Water Management in Agriculture

Nibir Chandra Mandal ⋅ Oishee Bintey Hoque ⋅ Mandy Wilson ⋅ Samarth Swarup ⋅ Sayjro Nouwakpo ⋅ Abhijin Adiga ⋅ Madhav Marathe

Dec 3, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

The lack of fine-grained, large-scale datasets on water availability presents a critical barrier to applying machine learning (ML) for agricultural water management. Since there are multiple natural and anthropogenic factors that influence water availability, incorporating diverse multimodal features can significantly improve modeling performance. However, integrating such heterogeneous data is challenging due to spatial misalignments, inconsistent formats, semantic label ambiguities, and class imbalances. To address these challenges, we introduce IRRISIGHT, a large-scale, multimodal dataset spanning 20 U.S. states. It consists of 1.4 million pixel-aligned 224×224 patches that fuse satellite imagery with rich environmental attributes. We develop a robust geospatial fusion pipeline that aligns raster, vector, and point-based data on a unified 10m grid, and employ domain-informed structured prompts to convert tabular attributes into natural language. With irrigation type classification as a representative problem, the dataset is AI-ready, offering a spatially disjoint train/test split and extensive benchmarking with both vision and vision–language models. Our results demonstrate that multimodal representations substantially improve model performance, establishing a foundation for future research on water availability.https://github.com/Nibir088/IRRISIGHThttps://huggingface.co/datasets/OBH30/IRRISIGHT

View full details

Poster

CPRet: A Dataset, Benchmark, and Model for Retrieval in Competitive Programming

Han Deng ⋅ Yuan Meng ⋅ SHIXIANG TANG ⋅ Wanli Ouyang ⋅ Xinzhu Ma

Dec 3, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Competitive programming is widely used to evaluate the coding and reasoning abilities of large language models. However, the growing presence of duplicate or highly similar problems raises concerns not only about competition fairness, but also about the validity of competitive programming as a benchmark for model evaluation. We introduce a retrieval-oriented benchmark suite for competitive programming, covering four retrieval tasks—two code-centric (Text-to-Code, Code-to-Code) and two newly proposed problem-centric tasks (Problem-to-Duplicate, Simplified-to-Full)—built from a combination of automatically crawled problem–solution data and manually curated annotations. Our contribution includes both high-quality training data and temporally separated test sets for reliable evaluation. We develop two task-specialized retrievers based on this dataset: CPRetriever-Code, trained with a novel Group-InfoNCE loss for problem–code alignment, and CPRetriever-Prob, fine-tuned for problem-level similarity. Both models achieve strong results and are open-sourced for local use. Finally, we analyze LiveCodeBench and find that high-similarity problems inflate model pass rates and reduce differentiation, underscoring the need for similarity-aware evaluation in future benchmarks.

View full details

Poster

MedicalNarratives: Connecting Medical Vision and Language with Localized Narratives

Wisdom Ikezogwo ⋅ Kevin M. Zhang ⋅ Saygin Seyfioglu

Dec 3, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Multi-modal models are data hungry. While datasets with natural images are abundant, medical image datasets can not afford the same luxury. To enable representation learning for medical images at scale, we turn to YouTube, a platform with a large reservoir of open-source medical pedagogical videos. We curate MedicalNarratives, a dataset 4.7M medical image-text pairs, with 1M samples containing dense annotations in the form of traces spatial traces (and bounding boxes), and 118K videos centered on the trace event (with aligned text), enabling spatiotemporal grounding beyond single frames. Similar to think-aloud studies where instructors speak while hovering their mouse cursor movements over relevant image regions, 1M images in MedicalNarratives contains localized mouse traces in image pixels, creating a spatial association between the text and pixels. To evaluate the utility of MedicalNarratives, we train GenMedClip with a CLIP-like objective using our dataset spanning 12 medical domains. GenMedClip outperforms previous state-of-the-art models on all 12 domains on a newly constructed medical imaging benchmark. Data, demo, code, and models will be made available.

View full details

Poster

MIRAGE: A Benchmark for Multimodal Information-Seeking and Reasoning in Agricultural Expert-Guided Conversations

Vardhan Dongre ⋅ Chi Gui ⋅ Shubham Garg ⋅ Hooshang Nayyeri ⋅ Gokhan Tur ⋅ Dilek Hakkani-Tur ⋅ Vikram Adve

Dec 3, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

We introduce MIRAGE, a new benchmark for multimodal expert-level reasoning and decision-making in consultative interaction settings. Designed for the domain of agriculture, MIRAGE captures the full complexity of expert consultations by combining natural user queries, expert-authored responses, and image-based context, offering a high-fidelity benchmark for evaluating models on grounded reasoning, clarification strategies, and long-form generation in a real-world, knowledge-intensive domain. Grounded in over 35,000 real user-expert interactions, and curated through a carefully designed multi-step pipeline, MIRAGE spans diverse crop health, pest diagnosis, and crop management scenarios. The benchmark includes more than 7,000 unique biological entities, covering plant species, pests, and diseases, making it one of the most taxonomically diverse benchmarks available for vision-language models in real-world expert-guided domains. Unlike existing benchmarks that rely on well-specified user inputs, MIRAGE features underspecified, context-rich scenarios, requiring models to infer latent knowledge gaps and either proactively guide the interaction or respond. Our benchmark comprises two core components. The Single-turn Challenge to reason over a single user turn and image set, identify relevant entities, infer causal explanations, and generate actionable recommendations; and a Multi-Turn challenge for dialogue state tracking, goal-driven generation, and expert-level conversational decision-making. We evaluate more than 20 closed and open-source frontier vision-language models (VLMs), using three reasoning language models as evaluators, highlighting the significant challenges posed by MIRAGE in both single-turn and multi-turn interaction settings. Even the advanced GPT4.1 and GPT4o models achieve 44.6% and 40.9% accuracy, respectively, indicating significant room for improvement.

View full details

Poster

ORBIT - Open Recommendation Benchmark for Reproducible Research with Hidden Tests

Jingyuan He ⋅ Jiongnan Liu ⋅ Vishan Oberoi ⋅ Bolin Wu ⋅ Mahima Jagadeesh Patel ⋅ Kangrui Mao ⋅ Chuning Shi ⋅ I-Ta Lee ⋅ Arnold Overwijk ⋅ Chenyan Xiong

Dec 3, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Recommender systems are among the most impactful AI applications, interacting with billions of users every day, guiding them to relevant products, services, or information tailored to their preferences.However, the research and development of recommender systems are hindered by existing datasets that fail to capture realistic user behaviors and inconsistent evaluation settings that lead to ambiguous conclusions.This paper introduces the \textbf{O}pen \textbf{R}ecommendation \textbf{B}enchmark for Reproducible Research with H\textbf{I}dden \textbf{T}ests (\textbf{ORBIT}), a unified benchmark for consistent and realistic evaluation of recommendation models. ORBIT offers a standardized evaluation framework of public datasets with reproducible splits and transparent settings for its public leaderboard. Additionally, ORBIT introduces a new webpage recommendation task, ClueWeb-Reco, featuring web browsing sequences from 87 million public, high-quality webpages. ClueWeb-Reco is a synthetic dataset derived from real, user-consented, and privacy-guaranteed browsing data. It aligns with modern recommendation scenarios and is reserved as the hidden test part of our leaderboard to challenge recommendation models' generalization ability. ORBIT measures 12 representative recommendation models on its public benchmark and introduces a prompted LLM baseline on the ClueWeb-Reco hidden test.Our benchmark results reflect general improvements of recommender systems on the public datasets, with variable individual performances.The results on the hidden test reveal the limitations of existing approaches in large-scale webpage recommendation and highlight the potential for improvements with LLM integrations.ORBIT benchmark, leaderboard, and codebase are available at \url{https://www.open-reco-bench.ai}.

View full details

Poster

Meta-World+: An Improved, Standardized, RL Benchmark

Reginald McLean ⋅ Evangelos Chatzaroulas ⋅ Luc McCutcheon ⋅ Frank Röder ⋅ Tianhe Yu ⋅ Zhanpeng He ⋅ K.R. Zentner ⋅ Ryan Julian ⋅ J Terry ⋅ Isaac Woungang ⋅ Nariman Farsad ⋅ Pablo Samuel Castro

Dec 3, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Meta-World is widely used for evaluating multi-task and meta-reinforcement learning agents, which are challenged to master diverse skills simultaneously. Since its introduction however, there have been numerous undocumented changes which inhibit a fair comparison of algorithms. This work strives to disambiguate these results from the literature, while also leveraging the past versions of Meta-World to provide insights into multi-task and meta-reinforcement learning benchmark design. Through this process we release an open-source version of Meta-World that has full reproducibility of past results, is more technically ergonomic, and gives users more control over the tasks that are included in a task set.

View full details

Poster

MyoChallenge 2024: A New Benchmark for Physiological Dexterity and Agility in Bionic Humans

Huiyi Wang ⋅ Chun Kwang Tan ⋅ Balint Hodossy ⋅ Shirui Lyu ⋅ Pierre Schumacher ⋅ James Heald ⋅ Kai Biegun ⋅ Samo Hromadka ⋅ Maneesh Sahani ⋅ Gunwoo Park ⋅ Beomsoo Shin ⋅ JongHyeon Park ⋅ Seungbum Koo ⋅ Chenhui Zuo ⋅ Chengtian Ma ⋅ Yanan Sui ⋅ Nick Hansen ⋅ Stone Tao ⋅ Yuan Gao ⋅ Hao Su ⋅ Seungmoon Song ⋅ Letizia Gionfrida ⋅ Massimo Sartori ⋅ Guillaume Durandau ⋅ Vikash Kumar ⋅ Vittorio Caggiano

Dec 3, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Recent advancements in bionic prosthetic technology offer transformative opportunities to restore mobility and functionality for individuals with missing limbs. Users of bionic limbs, or bionic humans, learn to seamlessly integrate prosthetic extensions into their motor repertoire, regaining critical motor abilities. The remarkable movement generalization and environmental adaptability demonstrated by these individuals highlight motor intelligence capabilities unmatched by current artificial intelligence systems. Addressing these limitations, MyoChallenge '24 at NeurIPS 2024 established a benchmark for human-robot coordination with an emphasis on joint control of both biological and mechanical limbs. The competition featured two distinct tracks: a manipulation task utilizing the myoMPL model, integrating a virtual biological arm and the Modular Prosthetic Limb (MPL) for a passover task; and a locomotion task using the novel myoOSL model, combining a bilateral virtual biological leg with a trans-femoral amputation and the Open Source Leg (OSL) to navigate varied terrains. Marking the third iteration of the MyoChallenge, the event attracted over 50 teams with more than 290 submissions all around the globe, with diverse participants ranging from independent researchers to high school students. The competition facilitated the development of several state-of-the-art control algorithms for bionic musculoskeletal systems, leveraging techniques such as imitation learning, muscle synergy, and model-based reinforcement learning that significantly surpassed our proposed baseline performance by a factor of 10. By providing the open-source simulation framework of MyoSuite, standardized tasks, and physiologically realistic models, MyoChallenge serves as a reproducible testbed and benchmark for bridging ML and biomechanics. The competition website is featured here: https://sites.google.com/view/myosuite/myochallenge/myochallenge-2024.

View full details

Poster

STEER-ME: Assessing the Microeconomic Reasoning of Large Language Models

Narun Raman ⋅ Taylor Lundy ⋅ Thiago Amin ⋅ Kevin Leyton-Brown ⋅ Jesse Perla

Dec 3, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Large language models (LLMs) are increasingly being asked to make economically rational decisions and indeed are already being applied to economic tasks like stock picking and financial analysis. Existing LLM benchmarks tend to focus on specific applications, making them insufficient for characterizing economic reasoning more broadly. In previous work, we offered a blueprint for comprehensively benchmarking $\textit{strategic}$ decision-making Raman et al. 2024. However, this work did not engage with the even larger microeconomic literature on $\textit{non-strategic}$ settings. We address this gap here, taxonomizing microeconomic reasoning into $58$ distinct elements, each grounded in up to $10$ distinct domains, $5$ perspectives, and $3$ types. The generation of benchmark data across this combinatorial space is powered by a novel LLM-assisted data generation protocol that we dub auto-STEER, which generates a set of questions by adapting handwritten templates to target new domains and perspectives. By generating fresh questions for each element, auto-STEER induces diversity which could help to reduce the risk of data contamination. We use this benchmark to evaluate $27$ LLMs spanning a range of scales and adaptation strategies, comparing performance across multiple formats—multiple-choice and free-text question answering—and scoring schemes. Our results surface systematic limitations in current LLMs' ability to generalize economic reasoning across types, formats, and textual perturbations, and establish a foundation for evaluating and improving economic competence in foundation models.

View full details

Poster

MedMax: Mixed-Modal Instruction Tuning for Training Biomedical Assistants

Hritik Bansal ⋅ Daniel Israel ⋅ Siyan Zhao ⋅ Shufan Li ⋅ Tung Nguyen ⋅ Aditya Grover

Dec 3, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Recent advancements in mixed-modal generative have opened new avenues for developing unified biomedical assistants capable of analyzing biomedical images, answering complex questions about them, and generating multimodal patient reports. However, existing datasets face challenges such as small sizes, limited coverage of biomedical tasks and domains, and a reliance on narrow sources. To address these gaps, we present MedMax, a large-scale multimodal biomedical instruction-tuning dataset for mixed-modal foundation models. With 1.47 million instances, MedMax encompasses a diverse range of tasks, including interleaved image-text generation, biomedical image captioning and generation, visual chat, and report understanding. These tasks span knowledge across diverse biomedical domains, including radiology and histopathology, grounded in medical papers and YouTube videos. Subsequently, we fine-tune a mixed-modal foundation model on the MedMax dataset, achieving significant performance improvements: a 26% gain over the Chameleon model and an 18.3% improvement over GPT-4o across 12 downstream biomedical visual question-answering tasks. Finally, we introduce a unified evaluation suite for biomedical tasks to guide the development of mixed-modal biomedical AI assistants. We release the code, data, and model at https://mint-medmax.github.io/.

View full details

Poster

TimE: A Multi-level Benchmark for Temporal Reasoning of LLMs in Real-World Scenarios

Shaohang Wei ⋅ Wei Li ⋅ Feifan Song ⋅ Wen Luo ⋅ Tianyi Zhuang ⋅ Haochen Tan ⋅ Zhijiang Guo ⋅ Houfeng Wang

Dec 3, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Temporal reasoning is pivotal for Large Language Models (LLMs) to comprehend the real world. However, existing works neglect the real-world challenges for temporal reasoning: (1) intensive temporal information, (2) fast-changing event dynamics, and (3) complex temporal dependencies in social interactions. To bridge this gap, we propose a multi-level benchmark TimE, designed for temporal reasoning in real-world scenarios. TimE consists of 38,522 QA pairs, covering 3 levels with 11 fine-grained sub-tasks. This benchmark encompasses 3 sub-datasets reflecting different real-world challenges: TimE-Wiki, TimE-News, and TimE-Dial. We conduct extensive experiments on reasoning models and non-reasoning models. And we conducted an in-depth analysis of temporal reasoning performance across diverse real-world scenarios and tasks, and summarized the impact of test-time scaling on temporal reasoning capabilities. Additionally, we release TimE-Lite, a human-annotated subset to foster future research and standardized evaluation in temporal reasoning.

View full details

Poster

Absence Bench: Language Models Can’t See What’s Missing

Harvey Yiyun Fu ⋅ Aryan Shrivastava ⋅ Jared Moore ⋅ Peter West ⋅ Chenhao Tan ⋅ Ari Holtzman

Dec 3, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Large language models (LLMs) are increasingly capable of processing long inputs and locating specific information within them, as evidenced by their performance on the Needle in a Haystack (NIAH) test. However, while models excel at recalling surprising information, they still struggle to identify clearly omitted information. We introduce AbsenceBench to assesses LLMs' capacity to detect missing information across three domains: numerical sequences, poetry, and GitHub pull requests. AbsenceBench asks models to identify which pieces of a document were deliberately removed, given access to both the original and edited contexts. Despite the apparent straightforwardness of these tasks, our experiments reveal that even state-of-the-art models like Claude-3.7-Sonnet achieve only 69.6% F1-score with a modest average context length of 5K tokens. Our analysis suggests this poor performance stems from a fundamental limitation: Transformer attention mechanisms cannot easily attend to "gaps" in documents since these absences don't correspond to any specific keys that can be attended to. Overall, our results and analysis provide a case study of the close proximity of tasks where models are already superhuman (NIAH) and tasks where models breakdown unexpectedly (AbsenceBench).

View full details

Poster

DrivAerStar: An Industrial-Grade CFD Dataset for Vehicle Aerodynamic Optimization

Jiyan Qiu ⋅ Lyulin Kuang ⋅ Guan Wang ⋅ Yichen Xu ⋅ Leiyao Cui ⋅ Shaotong Fu ⋅ Yixin Zhu ⋅ Rita Zhang

Dec 3, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Vehicle aerodynamics optimization has become critical for automotive electrification, where drag reduction directly determines electric vehicle range and energy efficiency. Traditional approaches face an intractable trade-off: computationally expensive Computational Fluid Dynamics (CFD) simulations requiring weeks per design iteration, or simplified models that sacrifice production-grade accuracy. While machine learning offers transformative potential, existing datasets exhibit fundamental limitations -- inadequate mesh resolution, missing vehicle components, and validation errors exceeding 5% -- preventing deployment in industrial workflows. We present DrivAerStar, comprising 12,000 industrial-grade automotive CFD simulations generated using STAR-CCM+${}^{\textregistered}$ software. The dataset systematically explores three vehicle configurations through 20 Computer Aided Design (CAD) parameters via Free Form Deformation (FFD) algorithms, including complete engine compartments and cooling systems with realistic internal airflow. DrivAerStar achieves wind tunnel validation accuracy below 1.04% -- a five-fold improvement over existing datasets -- through refined mesh strategies with strict wall $y^+$ control. Benchmarks demonstrate that models trained on this data achieve production-ready accuracy while reducing computational costs from weeks to minutes. This represents the first dataset bridging academic machine learning research and industrial CFD practice, establishing a new standard for data-driven aerodynamic optimization in automotive development. Beyond automotive applications, DrivAerStar demonstrates a paradigm for integrating high-fidelity physics simulations with Artificial Intelligence (AI) across engineering disciplines where computational constraints currently limit innovation.

View full details

Poster

PAC Bench: Do Foundation Models Understand Prerequisites for Executing Manipulation Policies?

Atharva Gundawar ⋅ Som Sagar ⋅ Ransalu Senanayake

Dec 3, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Vision-Language Models (VLMs) are increasingly pivotal for generalist robot manipulation, enabling tasks such as physical reasoning, policy generation, and failure detection. However, their proficiency in these high-level applications often assumes a deep understanding of low-level physical prerequisites, a capability that is largely unverified. To perform actions reliably, robots must comprehend intrinsic object properties (e.g., material, weight), action affordances (e.g., graspable, stackable), and physical constraints (e.g., stability, reachability, or an object's state like being closed). Despite their ubiquitous use in manipulation, we argue that off-the-shelf VLMs may lack this granular, physically-grounded understanding, as these specific prerequisites are often overlooked during training. Addressing this critical gap, we introduce PAC Bench, a comprehensive benchmark designed to systematically evaluate VLMs on their understanding of these core Properties, Affordances, and Constraints (PAC) from a task executability perspective. PAC Bench features a diverse dataset with more than 30,000 annotations, comprising 673 real-world images (115 object classes, 15 property types, 1–3 affordances defined per object class), 100 real-world humanoid view scenarios, and 120 unique simulated constraint scenarios across four tasks. Our evaluations reveal significant gaps in the ability of VLMs to grasp fundamental physical concepts, underscoring their current limitations for reliable robot manipulation and pointing to key areas that require targeted research. PAC Bench also serves as a standardized benchmark for rigorously evaluating the physical reasoning capabilities of VLMs guiding the development of more robust and physically grounded models for robot manipulation.

View full details

Poster

SciArena: An Open Evaluation Platform for Non-Verifiable Scientific Literature-Grounded Tasks

Yilun Zhao ⋅ Kaiyan Zhang ⋅ Tiansheng Hu ⋅ Sihong Wu ⋅ Ronan Le Bras ⋅ Yixin Liu ⋅ Robert Tang ⋅ Joseph Chee Chang ⋅ Jesse Dodge ⋅ Jonathan Bragg ⋅ Chen Zhao ⋅ Hanna Hajishirzi ⋅ Doug Downey ⋅ Arman Cohan

Dec 3, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

We present SciArena, an open and collaborative platform for evaluating foundation models on scientific literature-grounded tasks. Unlike traditional benchmarks for scientific literature understanding and synthesis, SciArena engages the research community directly, following the Chatbot Arena evaluation approach of community voting on model comparisons.By leveraging collective intelligence, SciArena offers a community-driven evaluation of model performance on open-ended scientific tasks that demand literature-grounded, long-form responses.The platform currently supports 44 open-source and proprietary foundation models and has collected over 19,000 votes from human researchers across diverse scientific domains. Our analysis of the data collected so far confirms its high quality.We discuss the results and insights based on the model ranking leaderboard.To further promote research in building model-based automated evaluation systems for literature tasks, we release SciArena-Eval, a meta-evaluation benchmark based on our collected preference data. The benchmark measures the accuracy of models in judging answer quality by comparing their pairwise assessments with human votes. Our experiments highlight the benchmark’s challenges and emphasize the need for more reliable automated evaluation methods.

View full details

Poster

DrVD-Bench: Do Vision-Language Models Reason Like Human Doctors in Medical Image Diagnosis?

Tianhong Zhou ⋅ xu yin ⋅ Yingtao Zhu ⋅ Chuxi Xiao ⋅ Haiyang Bian ⋅ Lei Wei ⋅ Xuegong Zhang

Dec 3, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Vision–language models (VLMs) exhibit strong zero-shot generalization on natural images and show early promise in interpretable medical image analysis. However, existing benchmarks do not systematically evaluate whether these models truly reason like human clinicians or merely imitate superficial patterns. To address this gap, we propose DrVD-Bench, the first multimodal benchmark for clinical visual reasoning. DrVD-Bench consists of three modules: Visual Evidence Comprehension, Reasoning Trajectory Assessment, and Report Generation Evaluation, comprising a total of 7,789 image–question pairs. Our benchmark covers 20 task types, 17 diagnostic categories, and five imaging modalities—CT, MRI, ultrasound, radiography, and pathology. DrVD-Bench is explicitly structured to reflect the clinical reasoning workflow from modality recognition to lesion identification and diagnosis. We benchmark 19 VLMs, including general-purpose and medical-specific, open-source and proprietary models, and observe that performance drops sharply as reasoning complexity increases. While some models begin to exhibit traces of human-like reasoning, they often still rely on shortcut correlations rather than grounded visual understanding. DrVD-Bench offers a rigorous and structured evaluation framework to guide the development of clinically trustworthy VLMs.

View full details

Poster

STSBench: A Spatio-temporal Scenario Benchmark for Multi-modal Large Language Models in Autonomous Driving

Christian Fruhwirth-Reisinger ⋅ Dušan Malić ⋅ Wei Lin ⋅ David Schinagl ⋅ Samuel Schulter ⋅ Horst Possegger

Dec 3, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

We introduce STSBench, a scenario-based framework to benchmark the holistic understanding of vision-language models (VLMs) for autonomous driving. The framework automatically mines predefined traffic scenarios from any dataset using ground-truth annotations, provides an intuitive user interface for efficient human verification, and generates multiple-choice questions for model evaluation. Applied to the nuScenes dataset, we present STSnu, the first benchmark that evaluates the spatio-temporal reasoning capabilities of VLMs based on comprehensive 3D perception. Existing benchmarks typically target off-the-shelf or fine-tuned VLMs for images or videos from a single viewpoint, focusing on semantic tasks such as object recognition, dense captioning, risk assessment, or scene understanding. In contrast, STSnu evaluates driving expert VLMs for end-to-end driving, operating on videos from multi-view cameras or LiDAR. It specifically assesses their ability to reason about both ego-vehicle actions and complex interactions among traffic participants, a crucial capability for autonomous vehicles. The benchmark features 43 diverse scenarios spanning multiple views and frames, resulting in 971 human-verified multiple-choice questions. A thorough evaluation uncovers critical shortcomings in existing models’ ability to reason about fundamental traffic dynamics in complex environments. These findings highlight the urgent need for architectural advancements that explicitly model spatio-temporal reasoning. By addressing a core gap in spatio-temporal evaluation, STSBench enables the development of more robust and explainable VLMs for autonomous driving.

View full details

Poster

KnowMol: Advancing Molecular Large Language Models with Multi-Level Chemical Knowledge

Zaifei Yang ⋅ Hong Chang ⋅ RuiBing Hou ⋅ Shiguang Shan ⋅ Xilin Chen

Dec 3, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

The molecular large language models have garnered widespread attention due to their promising potential on molecular applications. However, current molecular large language models face significant limitations in understanding molecules due to inadequate textual descriptions and suboptimal molecular representation strategies during pretraining. To address these challenges, we introduce KnowMol-100K, a large-scale dataset with 100K fine-grained molecular annotations across multiple levels, bridging the gap between molecules and textual descriptions. Additionally, we propose chemically-informative molecular representation, effectively addressing limitations in existing molecular representation strategies. Building upon these innovations, we develop KnowMol, a state-of-the-art multi-modal molecular large language model. Extensive experiments demonstrate that KnowMol achieves superior performance across molecular understanding and generation tasks.

View full details

Poster

RADAR: Benchmarking Language Models on Imperfect Tabular Data

Ken Gu ⋅ Zhihan Zhang ⋅ Kate Lin ⋅ Yuwei Zhang ⋅ Akshay Paruchuri ⋅ Hong Yu ⋅ Mehran Kazemi ⋅ Kumar Ayush ⋅ A. Ali Heydari ⋅ Max Xu ⋅ Yun Liu ⋅ Ming-Zher Poh ⋅ Yuzhe Yang ⋅ Mark Malhotra ⋅ Shwetak Patel ⋅ Hamid Palangi ⋅ Xuhai "Orson" Xu ⋅ Daniel McDuff ⋅ Tim Althoff ⋅ Xin Liu

Dec 3, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Language models (LMs) are increasingly being deployed to perform autonomous data analyses. However, their data awareness—the ability to recognize, reason over, and appropriately handle data artifacts such as missing values, outliers, and logical inconsistencies—remains underexplored. These artifacts are especially common in real-world tabular data and, if mishandled, can significantly compromise the validity of analytical conclusions. To address this gap, we present RADAR, a benchmark for systematically evaluating data-aware reasoning on tabular data. We develop a framework to simulate data artifacts via programmatic perturbations to enable targeted evaluation of model behavior. RADAR comprises 2,980 table-query pairs, grounded in real-world data spanning 9 domains and 5 data artifact types. In addition to evaluating artifact handling, RADAR systematically varies table size to study how reasoning performance holds when increasing table size. Our evaluation reveals that, despite decent performance on tables without data artifacts, frontier models degrade significantly when data artifacts are introduced, exposing critical gaps in their capacity for robust, data-aware analysis. Designed to be flexible and extensible, RADAR supports diverse perturbation types and controllable table sizes, offering a valuable resource for advancing tabular reasoning.

View full details

Poster

In the Eye of MLLM: Benchmarking Egocentric Video Intent Understanding with Gaze-Guided Prompting

Taiying Peng ⋅ Jiacheng Hua ⋅ Miao Liu ⋅ Feng Lu

Dec 3, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

The emergence of advanced multimodal large language models (MLLMs) has significantly enhanced AI assistants' ability to process complex information across modalities. Recently, egocentric videos, by directly capturing user focus, actions, and context in an unified coordinate, offer an exciting opportunity to enable proactive and personalized AI user experiences with MLLMs. However, existing benchmarks overlook the crucial role of gaze as an indicator of user intent. To address this gap, we introduce EgoGazeVQA, an egocentric gaze-guided video question answering benchmark that leverages gaze information to improve the understanding of longer daily-life videos. EgoGazeVQA consists of gaze-based QA pairs generated by MLLMs and refined by human annotators. Our experiments reveal that existing MLLMs struggle to accurately interpret user intentions using only global visual tokens. In contrast, our gaze-guided intent prompting methods significantly enhance performance by integrating spatial, temporal, and intent-related cues. We further conduct experiments on gaze-related fine-tuning and analyze how gaze estimation accuracy impacts prompting effectiveness. These results underscore the value of gaze for more personalized and effective AI assistants in egocentric settings.

View full details

Poster

ForensicHub: A Unified Benchmark & Codebase for All-Domain Fake Image Detection and Localization

Bo Du ⋅ Xuekang Zhu ⋅ Xiaochen Ma ⋅ Chenfan Qu ⋅ Kaiwen Feng ⋅ Zhe Yang ⋅ Chi-Man Pun ⋅ Jian liu ⋅ Ji-Zhe Zhou

Dec 3, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

The field of Fake Image Detection and Localization (FIDL) is highly fragmented, encompassing four domains: deepfake detection (Deepfake), image manipulation detection and localization (IMDL), artificial intelligence-generated image detection (AIGC), and document image manipulation localization (Doc). Although individual benchmarks exist in some domains, a unified benchmark for all domains in FIDL remains blank. The absence of a unified benchmark results in significant domain silos, where each domain independently constructs its datasets, models, and evaluation protocols without interoperability, preventing cross-domain comparisons and hindering the development of the entire FIDL field. To close the domain silo barrier, we propose ForensicHub, the first unified benchmark \& codebase for all-domain fake image detection and localization. Considering drastic variations on dataset, model, and evaluation configurations across all domains, as well as the scarcity of open-sourced baseline models and the lack of individual benchmarks in some domains, ForensicHub: i) proposes a modular and configuration-driven architecture that decomposes forensic pipelines into interchangeable components across datasets, transforms, models, and evaluators, allowing flexible composition across all domains; ii) fully implements 10 baseline models (3 of which are reproduced from scratch), 6 backbones, 2 new benchmarks for AIGC and Doc, and integrates 2 existing benchmarks of DeepfakeBench and IMDLBenCo through an adapter-based design; iii) establishes an image forensic fusion protocol evaluation mechanism that supports unified training and testing of diverse forensic models across tasks; iv) conducts indepth analysis based on the ForensicHub, offering 8 key actionable insights into FIDL model architecture, dataset characteristics, and evaluation standards. Specifically, ForensicHub includes 4 forensic tasks, 23 datasets, 42 baseline models, 6 backbones, 11 GPU-accelerated pixel- and image-level evaluation metrics, and realizes 16 kinds of cross-domain evaluations. ForensicHub represents a significant leap forward in breaking the domain silos in the FIDL field and inspiring future breakthroughs. Code is available at: https://github.com/scu-zjz/ForensicHub.

View full details

Poster

STEP: A Unified Spiking Transformer Evaluation Platform for Fair and Reproducible Benchmarking

Sicheng Shen ⋅ Dongcheng Zhao ⋅ Linghao Feng ⋅ Zeyang Yue ⋅ Jindong Li ⋅ Tenglong Li ⋅ Guobin Shen ⋅ Yi Zeng

Dec 3, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Spiking Transformers have recently emerged as promising architectures for combining the efficiency of spiking neural networks with the representational power of self-attention. However, the lack of standardized implementations, evaluation pipelines, and consistent design choices has hindered fair comparison and principled analysis. In this paper, we introduce \textbf{STEP}, a unified benchmark framework for Spiking Transformers that supports a wide range of tasks, including classification, segmentation, and detection across static, event-based, and sequential datasets. STEP provides modular support for diverse components such as spiking neurons, input encodings, surrogate gradients, and multiple backends (e.g., SpikingJelly, BrainCog). Using STEP, we reproduce and evaluate several representative models, and conduct systematic ablation studies on attention design, neuron types, encoding schemes, and temporal modeling capabilities. We also propose a unified analytical model for energy estimation, accounting for spike sparsity, bitwidth, and memory access, and show that quantized ANNs may offer comparable or better energy efficiency. Our results suggest that current Spiking Transformers rely heavily on convolutional frontends and lack strong temporal modeling, underscoring the need for spike-native architectural innovations. The full code is available at: https://github.com/Fancyssc/STEP.

View full details

Poster

MimeQA: Towards Socially-Intelligent Nonverbal Foundation Models

Hengzhi Li ⋅ Megan Tjandrasuwita ⋅ Yi R. (May) Fung ⋅ Armando Solar-Lezama ⋅ Paul Liang

Dec 3, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

As AI becomes more closely integrated with peoples' daily activities, socially intelligent AI that can understand and interact seamlessly with humans in daily lives is increasingly important. However, current works in AI social reasoning all rely on language-only or language-dominant approaches to benchmark and training models, resulting in systems that are improving in verbal communication but struggle with nonverbal social understanding. To address this limitation, we tap into a novel data source rich in nonverbal social interactions -- mime videos. Mimes refer to the art of expression through gesture and movement without spoken words, which presents unique challenges and opportunities in interpreting nonverbal social communication. We contribute a new dataset called MimeQA, obtained by sourcing ~8 hours of videos clips from YouTube and developing a comprehensive video question-answering benchmark comprising 806 carefully annotated and verified question-answer pairs, designed to probe nonverbal social reasoning capabilities. Using MimeQA, we evaluate state-of-the-art video large language models (VideoLLMs) and find that they achieve low accuracy, generally ranging from 20-30%, while humans score 86\%. Our analysis reveals that VideoLLMs often fail to ground imagined objects and over-rely on the text prompt while ignoring subtle nonverbal interactions. We hope to inspire future work in AI models that embody true social intelligence capable of interpreting non-verbal human interactions.

View full details

Poster

Comprehensive Assessment and Analysis for NSFW Content Erasure in Text-to-Image Diffusion models

Die Chen ⋅ Zhiwen Li ⋅ Cen Chen ⋅ Yuexiang Xie ⋅ Xiaodan Li ⋅ Jinyan Ye ⋅ Yingda Chen ⋅ Yaliang Li

Dec 3, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Text-to-image diffusion models have gained widespread application across various domains, demonstrating remarkable creative potential. However, the strong generalization capabilities of diffusion models can inadvertently lead to the generation of not-safe-for-work (NSFW) content, posing significant risks to their safe deployment. While several concept erasure methods have been proposed to mitigate the issue associated with NSFW content, a comprehensive evaluation of their effectiveness across various scenarios remains absent. To bridge this gap, we introduce a full-pipeline toolkit specifically designed for concept erasure and conduct the first systematic study of NSFW concept erasure methods. By examining the interplay between the underlying mechanisms and empirical observations, we provide in-depth insights and practical guidance for the effective application of concept erasure methods in various real-world scenarios, with the aim of advancing the understanding of content safety in diffusion models and establishing a solid foundation for future research and development in this critical area.

View full details

Poster

Improving Deep Learning for Accelerated MRI With Data Filtering

Kang Lin ⋅ Anselm Krainovic ⋅ Kun Wang ⋅ Reinhard Heckel

Dec 3, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Deep neural networks achieve state-of-the-art results for accelerated MRI reconstruction. Most research on deep learning based imaging focuses on improving neural network architectures trained and evaluated on fixed and homogeneous training and evaluation data. In this work, we investigate data curation strategies for improving MRI reconstruction. We assemble a large dataset of raw k-space data from 18 public sources consisting of 1.1M images and construct a diverse evaluation set comprising 48 test sets, capturing variations in anatomy, contrast, number of coils, and other key factors. We propose and study different data filtering strategies to enhance performance of current state-of-the-art neural networks for accelerated MRI reconstruction. Our experiments show that filtering the training data leads to consistent, albeit modest, performance gains. These performance gains are robust across different training set sizes and accelerations, and we find that filtering is particularly beneficial when the proportion of in-distribution data in the unfiltered training set is low.

View full details

Poster

Benchmarking Egocentric Multimodal Goal Inference for Assistive Wearable Agents

Vijay Veerabadran ⋅ Fanyi Xiao ⋅ Nitin Kamra ⋅ Pedro Matias ⋅ Joy Chen ⋅ Caley Drooff ⋅ Brett Roads ⋅ Riley J Williams ⋅ Ethan Henderson ⋅ Xuanyi Zhao ⋅ Kevin Carlberg ⋅ Joseph Tighe ⋅ Karl Ridgeway

Dec 3, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

There has recently been a surge of interest in Wearable Assistant Agents: agents embodied in a wearable form factor such as smart glasses, who can take actions toward a user’s stated goal — a high-level language-expressed command such as “where did I leave my keys?”, “Text Alice I will be late”, or “What’s the weather in Cancun?”. In this work, we consider the complementary problem of eliminating the effort required to interact with such an agent by proactively inferring the user’s goal from multimodal contextual observations. As vision-language models (VLMs) hold strong potential to ultimately solve this problem, our work focuses on creating a strong benchmark to measure progress toward this end. Given the limited prior work in this area, establishing the benchmark required collecting a novel multimodal goal-inference dataset; our dataset comprises ~30 hours of data from 363 participants across 3,482 recordings, featuring ground-truth reference goals alongside accompanying visual, audio, digital, and longitudinal contextual observations. We ran a human predictability study, where we found that humans set a strong baseline that comprises a de facto upper bound on model performance: they show multiple choice question (MCQ) accuracy of 93%, with the best VLM achieving about 84% accuracy. However, MCQ assesses discrimination, not the model’s ultimate task of generating the goal through open-ended text generation. Through a meta-evaluation, we find that a VLM judging the generated goals is as good as a human judge if it has access to a human-authored script of the video or a correct reference goal. Finally, we evaluate several families of modern vision-language models on the benchmark, showing that larger models have a significant performance advantage, but are still far from being practically useful, as they produce relevant goals only ~57% of the time. The best-performing smaller models—whose size makes them better suited to wearable applications—perform significantly worse than their counterparts, generating ~49% accuracy on the benchmark. Through a modality ablation, we show that models benefit from extra information in relevant modalities with minimal performance degradation from irrelevant modalities, but don’t gain as much when noisy modalities are included (e.g., in the case of digital context when most of the app state is irrelevant).

View full details

Poster

CAPability: A Comprehensive Visual Caption Benchmark for Evaluating Both Correctness and Thoroughness

Zhihang Liu ⋅ Chen-Wei Xie ⋅ Bin Wen ⋅ Feiwu Yu ⋅ JixuanChen ⋅ Pandeng Li ⋅ Boqiang Zhang ⋅ Nianzu Yang ⋅ YingluLi ⋅ Zuan Gao ⋅ Yun Zheng ⋅ Hongtao Xie

Dec 3, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Visual captioning benchmarks have become outdated with the emergence of modern multimodal large language models (MLLMs), as the brief ground-truth sentences and traditional metrics fail to assess detailed captions effectively. While recent benchmarks attempt to address this by focusing on keyword extraction or object-centric evaluation, they remain limited to vague-view or object-view analyses and incomplete visual element coverage. In this paper, we introduce CAPability, a comprehensive multi-view benchmark for evaluating visual captioning across 12 dimensions spanning six critical views. We curate nearly 11K human-annotated images and videos with visual element annotations to evaluate the generated captions. CAPability stably assesses both the correctness and thoroughness of captions with \textit{precision} and \textit{hit} metrics. By converting annotations to QA pairs, we further introduce a heuristic metric, \textit{know but cannot tell} ($K\bar{T}$), indicating a significant performance gap between QA and caption capabilities. Our work provides a holistic analysis of MLLMs' captioning abilities, as we identify their strengths and weaknesses across various dimensions, guiding future research to enhance specific aspects of their capabilities.

View full details

Poster

UVE: Are MLLMs Unified Evaluators for AI-Generated Videos?

Yuanxin Liu ⋅ Rui Zhu ⋅ Shuhuai Ren ⋅ Jiacong Wang ⋅ Haoyuan Guo ⋅ Xu Sun ⋅ Lu Jiang

Dec 3, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

With the rapid growth of video generative models (VGMs), it is essential to develop reliable and comprehensive automatic metrics for AI-generated videos (AIGVs). Existing methods either use off-the-shelf models optimized for other tasks or rely on human assessment data to train specialized evaluators. These approaches are constrained to specific evaluation aspects and are difficult to scale with the increasing demands for finer-grained and more comprehensive evaluations. To address this issue, this work investigates the feasibility of using multimodal large language models (MLLMs) as a unified evaluator for AIGVs, leveraging their strong visual perception and language understanding capabilities. To evaluate the performance of automatic metrics in unified AIGV evaluation, we introduce a benchmark called UVE-Bench. UVE-Bench collects videos generated by state-of-the-art VGMs and provides pairwise human preference annotations across 15 evaluation aspects. Using UVE-Bench, we extensively evaluate 18 MLLMs. Our empirical results suggest that while advanced MLLMs (e.g., Qwen2VL-72B and InternVL2.5-78B) still lag behind human evaluators, they demonstrate promising ability in unified AIGV evaluation, significantly surpassing existing specialized evaluation methods. Additionally, we conduct an in-depth analysis of key design choices that impact the performance of MLLM-driven evaluators, offering valuable insights for future research on AIGV evaluation.

View full details

Poster

LISAt: Language-Instructed Segmentation Assistant for Satellite Imagery

Jerome Quenum ⋅ Wen-Han Hsieh ⋅ Tsung-Han (Patrick) Wu ⋅ Ritwik Gupta ⋅ Trevor Darrell ⋅ David Chan

Dec 3, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Segmentation models can recognize a pre-defined set of objects in images. However, segmentation models capable of "reasoning" over complex user queries that implicitly refer to multiple objects of interest remain underexplored, especially in the geospatial domain. Recent advances in "reasoning segmentation"---generating segmentation masks from complex, implicit query text---demonstrate the potential of vision-language models (VLMs) to reason across an open domain of objects. Yet, our experiments reveal that these models struggle when applied to the unique challenges of remote-sensing imagery. To address this gap, we introduce a new dataset which consists of: GRES, a curated geospatial reasoning-segmentation dataset with 27,615 annotations across 9,205 images, and PreGRES, a collection of existing datasets to make up a large-scale multimodal pretraining corpus with over 1M question-answer pairs across 119,279 images. We propose an initial benchmark model, LISAt, a VLM for geospatial analysis that can describe complex remote-sensing scenes, answer detailed queries, and segment objects based on natural-language prompts. LISAt establishes a strong initial geospatial benchmark, outperforming prior foundation models such as RS-GPT4V by 10.04\% (BLEU-4) on visual description tasks and surpassing open-domain models on geospatial reasoning segmentation by 143.36\% (gIoU). Our model, dataset, and code are available on our project page: https://lisat-bair.github.io/LISAt/.

View full details

Poster

Artificial Hivemind: The Open-Ended Homogeneity of Language Models (and Beyond)

Liwei Jiang ⋅ Yuanjun Chai ⋅ Margaret Li ⋅ Mickel Liu ⋅ Raymond Fok ⋅ Nouha Dziri ⋅ Yulia Tsvetkov ⋅ Maarten Sap ⋅ Yejin Choi

Dec 3, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Large language models (LMs) often struggle to generate diverse, human-like creative content, raising concerns about the long-term homogenization of human thought through repeated exposure to similar outputs. Yet scalable methods for evaluating LM output diversity remain limited, especially beyond narrow tasks such as random number or name generation, or beyond repeated sampling from a single model. To address this gap, we introduce Infinity-Chat, a large-scale dataset of 26K diverse, real-world, open-ended user queries that admit a wide range of plausible answers with no single ground truth. We introduce the first comprehensive taxonomy for characterizing the full spectrum of open-ended prompts posed to LMs, comprising 6 top-level categories (e.g., creative content generation, brainstorm & ideation) that further breaks down to 17 subcategories. Using Infinity-Chat, we present a large-scale study of mode collapse in LMs, revealing a pronounced Artificial Hivemind effect in open-ended generation of LMs, characterized by (1) intra-model repetition, where a single model consistently generates similar responses, and more so (2) inter-model homogeneity, where different models produce strikingly similar outputs. Infinity-Chat also includes 31,250 human annotations, across absolute ratings and pairwise preferences, with 25 independent human annotations per example. This enables studying collective and individual-specific human preferences in response to open-ended queries. Our findings show that state-of-the-art LMs, reward models, and LM judges are less well calibrated to human ratings on model generations that elicit differing idiosyncratic annotator preferences, despite maintaining comparable overall quality. Overall, INFINITY-CHAT presents the first large-scale resource for systematically studying real-world open-ended queries to LMs, revealing critical insights to guide future research for mitigating long-term AI safety risks posed by the Artificial Hivemind.

View full details

Poster

Paper2Poster: Towards Multimodal Poster Automation from Scientific Papers

Wei Pang ⋅ Kevin Qinghong Lin ⋅ Xiangru Jian ⋅ Xi He ⋅ Philip Torr

Dec 3, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Academic poster generation is a crucial yet challenging task in scientific communication, requiring the compression of long-context interleaved documents into a single, visually coherent page. To address this challenge, we introduce Paper2Poster, the first benchmark and metric suite for poster generation, which pairs recent conference papers with author-designed posters and evaluates outputs on (i) Visual Quality—semantic alignment with human posters, (ii) Textual Coherence—language fluency, (iii) Holistic Assessment—six fine-grained aesthetic and informational criteria scored by a VLM-as-judge, and notably (iv) PaperQuiz—the poster’s ability to convey core paper content as measured by VLMs answering generated quizzes. Building on this benchmark, we propose PosterAgent, a top‐down, visual‐in‐the‐loop multi‐agent pipeline: the (a) Parser distills the paper into a structured asset library; the (b) Planner aligns text–visual pairs into a binary‐tree layout that preserves reading order and spatial balance; and the (c) Painter–Commenter loop refines each panel by executing rendering code and using VLM feedback to eliminate overflow and ensure alignment.In our comprehensive evaluation, we find that GPT‐4o outputs—though visually appealing at first glance—often exhibit noisy text and poor PaperQuiz scores; We find that reader engagement is the primary aesthetic bottleneck, as human‐designed posters rely largely on visual semantics to convey meaning.Our fully open‐source Paper2Poster pipeline outperforms GPT‐4o–based systems across nearly all metrics while consuming 87 \% fewer tokens. These findings chart clear directions for the next generation of fully automated poster‐generation models.

View full details

Poster

WritingBench: A Comprehensive Benchmark for Generative Writing

Yuning Wu ⋅ Jiahao Mei ⋅ Ming Yan ⋅ Chenliang Li ⋅ Shaopeng Lai ⋅ Yuran Ren ⋅ Zijia Wang ⋅ Ji Zhang ⋅ Mengyue Wu ⋅ Qin Jin ⋅ Fei Huang

Dec 3, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Recent advancements in large language models (LLMs) have significantly enhanced text generation capabilities, yet evaluating their performance in generative writing remains a challenge. Existing benchmarks primarily focus on generic text generation or limited in writing tasks, failing to capture the diverse requirements of high-quality written contents across various domains. To bridge this gap, we present WritingBench, a comprehensive benchmark designed to evaluate LLMs across 6 core writing domains and 100 subdomains. We further propose a query-dependent evaluation framework that empowers LLMs to dynamically generate instance-specific assessment criteria. This framework is complemented by a fine-tuned critic model for criteria-aware scoring, enabling evaluations in style, format and length. The framework's validity is further demonstrated by its data curation capability, which enables a 7B-parameter model to outperform the performance of GPT-4o in writing. We open-source the benchmark, along with evaluation tools and modular framework components, to advance the development of LLMs in writing.

View full details

Poster

CARES: Comprehensive Evaluation of Safety and Adversarial Robustness in Medical LLMs

Sijia Chen ⋅ Xiaomin Li ⋅ mengxue zhang ⋅ Eric Hanchen Jiang ⋅ Qingcheng Zeng ⋅ Chen-Hsiang Yu

Dec 3, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Large language models (LLMs) are increasingly deployed in medical contexts, raising critical concerns about safety, alignment, and susceptibility to adversarial manipulation. While prior benchmarks assess model refusal capabilities for harmful prompts, they often lack clinical specificity, graded harmfulness levels, and coverage of jailbreak-style attacks. We introduce CARES (Clinical Adversarial Robustness and Evaluation of Safety), a benchmark for evaluating LLM safety in healthcare. CARES includes over 18,000 prompts spanning eight medical safety principles, four harm levels, and four prompting styles—direct, indirect, obfuscated, and role-play—to simulate both malicious and benign use cases. We propose a three-way response evaluation protocol (Accept, Caution, Refuse) and a fine-grained Safety Score metric to assess model behavior. Our analysis reveals that many state-of-the-art LLMs remain vulnerable to jailbreaks that subtly rephrase harmful prompts, while also over-refusing safe but atypically phrased queries. Finally, we propose a mitigation strategy using a lightweight classifier to detect jailbreak attempts and steer models toward safer behavior via reminder-based conditioning. CARES provides a rigorous framework for testing and improving medical LLM safety under adversarial and ambiguous conditions.

View full details

Poster

SWE-smith: Scaling Data for Software Engineering Agents

John Yang ⋅ Kilian Lieret ⋅ Carlos Jimenez ⋅ Alexander Wettig ⋅ Kabir Khandpur ⋅ Yanzhe Zhang ⋅ Binyuan Hui ⋅ Ofir Press ⋅ Ludwig Schmidt ⋅ Diyi Yang

Dec 3, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Despite recent progress in Language Models (LMs) for software engineering, collecting training data remains a significant pain point.Existing datasets are small, with at most 1,000s of training instances from 11 or fewer GitHub repositories.The procedures to curate such datasets are often complex, necessitating hundreds of hours of human labor; companion execution environments also take up several terabytes of storage, severely limiting their scalability and usability.To address this pain point, we introduce SWE-smith, a novel pipeline for generating software engineering training data at scale.Given any Python codebase, SWE-smith constructs a corresponding execution environment, then automatically synthesizes 100s to 1,000s of task instances that break existing test(s) in the codebase.Using SWE-smith, we create a dataset of 50k instances sourced from 128 GitHub repositories, an order of magnitude larger than all previous works.We train SWE-agent-LM-32B, achieving 40.2% Pass@1 resolve rate on the SWE-bench Verified benchmark, state of the art among open source models.We open source SWE-smith (collection procedure, task instances, trajectories, models) to lower the barrier of entry for research in LM systems for automated software engineering.All assets available at \url{https://swesmith.com}.

View full details

Poster

AutoHood3D: A Multi‑Modal Benchmark for Automotive Hood Design and Fluid–Structure Interaction

Vansh Sharma ⋅ Harish Ganesh ⋅ Maryam Akram ⋅ Wanjiao Liu ⋅ Venkat Raman

Dec 3, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

This study presents a new high-fidelity multi-modal dataset containing 16000+ geometric variants of automotive hoods useful for machine learning (ML) applications such as engineering component design and process optimization, and multiphysics system surrogates. The dataset is centered on a practical multiphysics problem—hood deformation from fluid entrapment and inertial loading during rotary‑dip painting. Each hood is numerically modeled with a coupled Large-Eddy Simulation (LES)-Finite Element Analysis (FEA), using 1.2M cells in total to ensure spatial and temporal accuracy. The dataset provides time-resolved physical fields, along with STL meshes and structured natural language prompts for text-to-geometry synthesis. Existing datasets are either confined to 2D cases, exhibit limited geometric variations, or lack the multi‑modal annotations and data structures—shortcomings we address with AutoHood3D. We validate our numerical methodology, establish quantitative baselines across five neural architectures, and demonstrate systematic surrogate errors in displacement and force predictions. These findings motivate the design of novel approaches and multiphysics loss functions that enforce fluid–solid coupling during model training. By providing fully reproducible workflows, AutoHood3D enables physics‑aware ML development, accelerates generative‑design iteration, and facilitates the creation of new FSI benchmarks.

View full details

Poster

TaiwanVQA: Benchmarking and Enhancing Cultural Understanding in Vision-Language Models

Hsin Yi Hsieh ⋅ Shang-Wei Liu ⋅ Chang-Chih Meng ⋅ Chien-Hua Chen ⋅ Shuo-Yueh Lin ⋅ Hung-Ju Lin ⋅ Hen-Hsen Huang ⋅ I-Chen Wu

Dec 3, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Vision-language models (VLMs) often struggle with culturally specific content — a challenge largely overlooked by existing benchmarks that focus on dominant languages and globalized datasets. We introduce TᴀɪᴡᴀɴVQA, a VQA benchmark designed for Taiwanese culture to evaluate recognition and reasoning in regional contexts. TᴀɪᴡᴀɴVQA contains 2,736 images and 5,472 manually curated questions covering topics such as traditional foods, public signs, festivals, and landmarks. The official benchmark set includes 1,000 images and 2,000 questions for systematic assessment, with the remainder of the data used as training material. Evaluations on state-of-the-art VLMs reveal strong visual recognition but notable weaknesses in cultural reasoning. To address this, we propose a data augmentation strategy that combines human-annotated and synthesized dialogues to enhance cultural understanding. Fine-tuning yields significant gains on TᴀɪᴡᴀɴVQA while maintaining stable performance on other multimodal tasks. To further explore the models’ cultural understanding, we conducted an open-ended question answering experiment. The results indicate a notable decline in cultural knowledge generation ($\approx$10–20\%), suggesting challenges remain. TᴀɪᴡᴀɴVQA offers a scalable framework for building culturally grounded AI models in low-resource cultures, promoting diversity and fairness in multimodal AI. Our dataset and code are publicly available on [Hugging Face](https://huggingface.co/datasets/hhhuang/TaiwanVQA) and [GitHub](https://github.com/hhhuang/TaiwanVQA).

View full details

Poster

ChemPile: A 250 GB Diverse and Curated Dataset for Chemical Foundation Models

Adrian Mirza ⋅ Nawaf Alampara ⋅ Martiño Ríos-García ⋅ Mohamed Abdelalim ⋅ Jack Butler ⋅ Bethany Connolly ⋅ Tunca Dogan ⋅ Marianna Nezhurina ⋅ Bünyamin Şen ⋅ Santosh Tirunagari ⋅ Mark Worrall ⋅ Adamo Young ⋅ Philippe Schwaller ⋅ Michael Pieler ⋅ Kevin Maik Jablonka

Dec 3, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Foundation models have shown remarkable success across scientific domains, yet their impact in chemistry remains limited due to the absence of diverse, large-scale, high-quality datasets that reflect the field's multifaceted nature. We present the ChemPile, an open dataset containing over 75 billion tokens of curated chemical data, specifically built for training and evaluating general-purpose models in the chemical sciences. The dataset mirrors the human learning journey through chemistry---from educational foundations to specialized expertise---spanning multiple modalities and content types including structured data in diverse chemical representations (SMILES, SELFIES, IUPAC names, InChI, molecular renderings), scientific and educational text, executable code, and chemical images. ChemPile integrates foundational knowledge (textbooks, lecture notes), specialized expertise (scientific articles and language-interfaced data), visual understanding (molecular structures, diagrams), and advanced reasoning (problem-solving traces and code)---mirroring how human chemists develop expertise through diverse learning materials and experiences. Constructed through hundreds of hours of expert curation, the ChemPile captures both foundational concepts and domain-specific complexity. We provide standardized training, validation, and test splits, enabling robust benchmarking. ChemPile is openly released via HuggingFace with a consistent API, permissive license, and detailed documentation. We hope the ChemPile will serve as a catalyst for chemical AI, enabling the development of the next generation of chemical foundation models.

View full details

Poster

DeceptionBench: A Comprehensive Benchmark for AI Deception Behaviors in Real-world Scenarios

Yao Huang ⋅ Yitong Sun ⋅ Yichi Zhang ⋅ Ruochen Zhang ⋅ Yinpeng Dong ⋅ Xingxing Wei

Dec 3, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Despite the remarkable advances of Large Language Models (LLMs) across diverse cognitive tasks, the rapid enhancement of these capabilities also introduces emergent deception behaviors that may induce severe risks in high-stakes deployments. More critically, the characterization of deception across realistic real-world scenarios remains underexplored. To bridge this gap, we establish **DeceptionBench**, the first benchmark that systematically evaluates how deceptive tendencies manifest across different societal domains, what their intrinsic behavioral patterns are, and how extrinsic factors affect them. Specifically, on the static count, the benchmark encompasses 150 meticulously designed scenarios in five domains, *i.e.*, *Economy, Healthcare, Education, Social Interaction, and Entertainment*, with over 1,000 samples, providing sufficient empirical foundations for deception analysis. On the intrinsic dimension, we explore whether models exhibit self-interested egoistic tendencies or sycophantic behaviors that prioritize user appeasement. On the extrinsic dimension, we investigate how contextual factors modulate deceptive outputs under neutral conditions, reward-based incentivization, and coercive pressures. Moreover, we incorporate sustained multi-turn interaction loops to construct a more realistic simulation of real-world feedback dynamics. Extensive experiments across LLMs and Large Reasoning Models (LRMs) reveal critical vulnerabilities, particularly amplified deception under reinforcement dynamics, demonstrating that current models lack robust resistance to manipulative contextual cues and the urgent need for advanced safeguards against various deception behaviors. Code and resources are publicly available at https://github.com/Aries-iai/DeceptionBench.

View full details

Poster

Scaling Physical Reasoning with the PHYSICS Dataset

Shenghe Zheng ⋅ Qianjia Cheng ⋅ Junchi Yao ⋅ Mengsong Wu ⋅ haonan he ⋅ Ning Ding ⋅ Yu Cheng ⋅ Shuyue Hu ⋅ LEI BAI ⋅ Dongzhan Zhou ⋅ Ganqu Cui ⋅ Peng Ye

Dec 3, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Large Language Models (LLMs) have achieved remarkable progress on advanced reasoning tasks such as mathematics and coding competitions. Meanwhile, physics, despite being both reasoning-intensive and essential to real-world understanding, received limited academic and industrial attention. This paper introduces PHYSICS, a dataset containing 16,568 high-quality physics problems spanning subjects and difficulty levels, to facilitate this issue. Specifically, PHYSICS is curated with exercises from over 100 textbooks through a carefully designed pipeline for quality control. It covers five major physics domains: Mechanics, Electromagnetism, Thermodynamics, Optics, and Modern Physics. It also spans a wide range of difficulty levels, from high school to graduate-level physics courses. To utilize the data for improving and evaluating the model's physical reasoning capabilities, we split the dataset into training and test sets, and provide reasoning paths generated by powerful reasoning models for the training data to facilitate model training. In addition, for the evaluation part, we find that existing evaluation frameworks exhibit biases in aspects such as units, simplification, and precision in physics domain. To balance efficiency and accuracy, we introduce a Rule+Model evaluation framework tailored to physics problems. Our evaluations on current state-of-the-art open-source and proprietary models highlight the limitations of current models in handling physics-related tasks. We hope that our dataset and evaluation methodology will jointly advance the development of LLMs in the field of physics. The code and data can be found at: https://github.com/Zhengsh123/PHYSICS.

View full details

Poster

Open-Insect: Benchmarking Open-Set Recognition of Novel Species in Biodiversity Monitoring

Yuyan Chen ⋅ Nico Lang ⋅ B. Schmidt ⋅ Aditya Jain ⋅ Yves Basset ⋅ Sara Beery ⋅ Maxim Larrivee ⋅ David Rolnick

Dec 3, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Global biodiversity is declining at an unprecedented rate, yet little information isknown about most species and how their populations are changing. Indeed, some90% Earth’s species are estimated to be completely unknown. Machine learning hasrecently emerged as a promising tool to facilitate long-term, large-scale biodiversitymonitoring, including algorithms for fine-grained classification of species fromimages. However, such algorithms typically are not designed to detect examplesfrom categories unseen during training – the problem of open-set recognition(OSR) – limiting their applicability for highly diverse, poorly studied taxa such asinsects. To address this gap, we introduce Open-Insect, a large-scale, fine-graineddataset to evaluate unknown species detection across different geographic regionswith varying difficulty. We benchmark 38 OSR algorithms across three categories:post-hoc, training-time regularization, and training with auxiliary data, finding thatsimple post-hoc approaches remain a strong baseline. We also demonstrate how toleverage auxiliary data to improve species discovery in regions with limited data.Our results provide timely insights to guide the development of computer visionmethods for biodiversity monitoring and species discovery.

View full details

Poster

STARC-9: A Large-scale Dataset for Multi-Class Tissue Classification for CRC Histopathology

Barathi Subramanian ⋅ Rathinaraja Jeyaraj ⋅ Mitchell Peterson ⋅ Terry Guo ⋅ Nigam Shah ⋅ Curtis Langlotz ⋅ Andrew Ng ⋅ Jeanne Shen

Dec 3, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Multi-class tissue-type classification of colorectal cancer (CRC) histopathologic images is a significant step in the development of downstream machine learning models for diagnosis and treatment planning. However, publicly available CRC datasets used to build tissue classifiers often suffer from insufficient morphologic diversity, class imbalance, and low-quality image tiles, limiting downstream model performance and generalizability. To address this research gap, we introduce STARC-9 (STAnford coloRectal Cancer), a large-scale dataset for multi-class tissue classification. STARC-9 comprises 630,000 histopathologic image tiles uniformly sampled across nine clinically relevant tissue classes (each represented by 70,000 tiles), systematically extracted from hematoxylin & eosin-stained whole-slide images (WSI) from 200 CRC patients at the Stanford University School of Medicine. To construct STARC-9, we propose a novel framework, DeepCluster++, consisting of two primary steps to ensure diversity within each tissue class, followed by pathologist verification. First, an encoder from an autoencoder trained specifically on histopathologic images is used to extract feature vectors from all tiles within a given input WSI. Next, K-means clustering groups morphologically similar tiles, followed by an equal-frequency binning method to sample diverse patterns within each tissue class. Finally, the selected tiles are verified by expert gastrointestinal pathologists to ensure classification accuracy. This semi-automated approach significantly reduces the manual effort required for dataset curation while producing high-quality training examples. To validate the utility of STARC-9, we benchmarked baseline convolutional neural networks, transformers, and pathology-specific foundation models on downstream multi-class CRC tissue classification and segmentation tasks when trained on STARC-9 versus publicly available datasets, demonstrating superior generalizability of models trained on STARC-9. Although we demonstrate the utility of DeepCluster++ on CRC as a pilot use-case, it is a flexible framework that can be used for constructing high-quality datasets from large WSI repositories across a wide range of cancer and non-cancer applications.

View full details

Poster

Feel-Good Thompson Sampling for Contextual Bandits: a Markov Chain Monte Carlo Showdown

Emile Anand ⋅ Sarah Liaw

Dec 3, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Thompson Sampling (TS) is widely used to address the exploration/exploitation tradeoff in contextual bandits, yet recent theory shows that it does not explore aggressively enough in high-dimensional problems. Feel-Good Thompson Sampling (FG-TS) addresses this by adding an optimism bonus that biases toward high-reward models, and it achieves the asymptotically minimax-optimal regret in the linear setting when posteriors are exact. However, its performance with \emph{approximate} posteriors, common in large-scale or neural problems, has not been benchmarked. We provide the first systematic study of FG-TS and its smoothed variant (SFG-TS) across fourteen real-world and synthetic benchmarks. To evaluate their robustness, we compare performance across settings with exact posteriors (linear and logistic bandits) to approximate regimes produced by fast but coarse stochastic-gradient samplers. Ablations over preconditioning, bonus scale, and prior strength reveal a trade-off: larger bonuses help when posterior samples are accurate, but hurt when sampling noise dominates. FG-TS generally outperforms vanilla TS in linear and logistic bandits, but tends to be weaker in neural bandits. Nevertheless, because FG-TS and its variants are competitive and easy-to-use, we recommend them as baselines in modern contextual-bandit benchmarks. Finally, we provide source code for all our experiments in https://github.com/SarahLiaw/ctx-bandits-mcmc-showdown.

View full details

Poster

ExAct: A Video-Language Benchmark for Expert Action Analysis

Han Yi ⋅ Yulu Pan ⋅ Feihong He ⋅ Xinyu Liu ⋅ Benjamin Zhang ⋅ Oluwatumininu Oguntola ⋅ Gedas Bertasius

Dec 3, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

We present ExAct, a new video-language benchmark for expert-level understanding of skilled physical human activities. Our new benchmark contains 3,521 expert-curated video question-answer pairs spanning 11 physical activities in 6 domains: Sports, Bike Repair, Cooking, Health, Music, and Dance. ExAct requires the correct answer to be selected from five carefully designed candidate options, thus necessitating a nuanced, fine-grained, expert-level understanding of physical human skills. Evaluating the recent state-of-the-art VLMs on ExAct reveals a substantial performance gap relative to human expert performance. Specifically, the best-performing Gemini 2.5 Pro model achieves only 55.35% accuracy, well below the 82.02% attained by trained human experts. We believe that ExAct will be beneficial for developing and evaluating VLMs capable of precise understanding of human skills in various physical and procedural domains. Dataset and code are available at https://texaser.github.io/exact_project_page/.

View full details

Poster

SURDS: Benchmarking Spatial Understanding and Reasoning in Driving Scenarios with Vision Language Models

Xianda Guo ⋅ Ruijun Zhang ⋅ Yiqun Duan ⋅ Yuhang He ⋅ Dujun Nie ⋅ Wenke Huang ⋅ Chenming Zhang ⋅ Shuai Liu ⋅ Hao Zhao ⋅ Long Chen

Dec 3, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Accurate spatial reasoning in outdoor environments—covering geometry, object pose, and inter-object relationships—is fundamental to downstream tasks such as mapping, motion forecasting, and high-level planning in autonomous driving. We introduce SURDS, a large-scale benchmark designed to systematically evaluate the spatial reasoning capabilities of vision language models (VLMs). Built on the nuScenes dataset, SURDS comprises 41,080 vision–question–answer training instances and 9,250 evaluation samples, spanning six spatial categories: orientation, depth estimation, pixel-level localization, pairwise distance, lateral ordering, and front–behind relations. We benchmark leading general-purpose VLMs, including GPT, Gemini, and Qwen, revealing persistent limitations in fine-grained spatial understanding. To address these deficiencies, we go beyond static evaluation and explore whether alignment techniques can improve spatial reasoning performance. Specifically, we propose a reinforcement learning–based alignment scheme leveraging spatially grounded reward signals—capturing both perception-level accuracy (location) and reasoning consistency (logic). We further incorporate final-answer correctness and output-format rewards to guide fine-grained policy adaptation. Our GRPO-aligned variant achieves overall score of 40.80 in SURDS benchmark. Notably, it outperforms proprietary systems such as GPT-4o (13.30) and Gemini-2.0-flash (35.71). To our best knowledge, this is the first study to demonstrate that reinforcement learning–based alignment can significantly and consistently enhance the spatial reasoning capabilities of VLMs in real-world driving contexts. We release the SURDS benchmark, evaluation toolkit, and GRPO alignment code through: https://github.com/XiandaGuo/Drive-MLLM.

View full details

Poster

PatientSim: A Persona-Driven Simulator for Realistic Doctor-Patient Interactions

Daeun Kyung ⋅ Hyunseung Chung ⋅ Seongsu Bae ⋅ Jiho Kim ⋅ Jae Ho Sohn ⋅ Taerim Kim ⋅ Soo Kyung Kim ⋅ Edward Choi

Dec 3, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Doctor-patient consultations require multi-turn, context-aware communication tailored to diverse patient personas. Training or evaluating doctor LLMs in such settings requires realistic patient interaction systems. However, existing simulators often fail to reflect the full range of personas seen in clinical practice. To address this, we introduce PatientSim, a patient simulator that generates realistic and diverse patient personas for clinical scenarios, grounded in medical expertise. PatientSim operates using: 1) clinical profiles, including symptoms and medical history, derived from real-world data in the MIMIC-ED and MIMIC-IV datasets, and 2) personas defined by four axes: personality, language proficiency, medical history recall level, and cognitive confusion level, resulting in 37 unique combinations.We evaluate eight LLMs for factual accuracy and persona consistency. The top-performing open-source model, Llama 3.3 70B, is validated by four clinicians to confirm the robustness of our framework. As an open-source, customizable platform, PatientSim provides a reproducible and scalable solution that can be customized for specific training needs. Offering a privacy-compliant environment, it serves as a robust testbed for evaluating medical dialogue systems across diverse patient presentations and shows promise as an educational tool for healthcare. The code is available at https://github.com/dek924/PatientSim.

View full details

Poster

ProteinConformers: Benchmark Dataset for Simulating Protein Conformational Landscape Diversity and Plausibility

Yihang Zhou ⋅ Chen Wei ⋅ Minghao Sun ⋅ Jin Song ⋅ Yang Li ⋅ Lin Wang ⋅ Yang Zhang

Dec 3, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Understanding the conformational landscape of proteins is essential for elucidating protein function and facilitating drug design. However, existing protein conformation benchmarks fail to capture the full energy landscape, limiting their ability to evaluate the diversity and physical plausibility of AI-generated structures. We introduce ProteinConformers, a large-scale benchmark dataset comprising over 381,000 physically realistic conformations for 87 CASP targets. These were derived from more than 40,000 structural decoys via extensive all-atom molecular dynamics simulations totaling over 6 million CPU hours. Using this dataset, we propose novel metrics to evaluate conformational diversity and plausibility, and systematically benchmark six protein conformation generative models. Our results highlight that leveraging large-scale protein sequence data can enhance a model’s ability to explore conformational space, potentially reducing reliance on MD-derived data. Additionally, we find that PDB and MD datasets influence model performance differently, current models perform well on inter-atomic distance prediction but struggle with inter-residue orientation generation. Overall, our dataset, evaluation metrics, and benchmarking results provide the first comprehensive foundation for assessing generative models in protein conformational modeling. Dataset and instructions are available at https://huggingface.co/ datasets/Jim990908/ProteinConformers/tree/main. Codes are stored at https://github.com/auroua/ProteinConformers. An interactive website locates at https://zhanggroup.org/ProteinConformers.

View full details

Poster

ML4CO-Bench-101: Benchmark Machine Learning for Classic Combinatorial Problems on Graphs

Jiale Ma ⋅ Wenzheng Pan ⋅ Yang Li ⋅ Junchi Yan

Dec 3, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Combinatorial problems on graphs have attracted extensive efforts from the machine learning community over the past decade. Despite notable progress in this area under the umbrella of ML4CO, a comprehensive categorization, unified reproducibility, and transparent evaluation protocols are still lacking for the emerging and immense pool of neural CO solvers. In this paper, we establish a modular and streamlined framework benchmarking prevalent neural CO methods, dissecting their design choices via a tri-leveled "paradigm-model-learning'' taxonomy to better characterize different approaches. Further, we integrate their shared features and respective strengths to form 3 unified solvers representing global prediction (GP), local construction (LC), and adaptive expansion (AE) mannered neural solvers. We also collate a total of 65 datasets for 7 mainstream CO problems (including both edge-oriented tasks: TSP, ATSP, CVRP, as well as node-oriented: MIS, MCl, MVC, MCut) across scales to facilitate more comparable results among literature. Extensive experiments upon our benchmark reveal a fair and exact performance exhibition indicative of the raw contribution of the learning components in each method, rethinking and insisting that pre- and post-inference heuristic tricks are not supposed to compensate for sub-par capability of the data-driven counterparts. Under this unified benchmark, an up-to-date replication of typical ML4CO methods is maintained, hoping to provide convenient reference and insightful guidelines for both engineering development and academic exploration of the ML4CO community in the future. Code is available at https://github.com/Thinklab-SJTU/ML4CO-Bench-101, and the dataset is at https://huggingface.co/datasets/ML4CO/ML4CO-Bench-101-SL.

View full details

Poster

ML4CFD Competition: Results and Retrospective Analysis

Mouadh Yagoubi ⋅ David Danan ⋅ Milad LEYLI ABADI ⋅ Jocelyn Mazari ⋅ Jean-Patrick Brunet ⋅ Abbas Kabalan ⋅ Fabien Casenave ⋅ Yuxin Ma ⋅ Giovanni Catalani ⋅ Jean Fesquet ⋅ Jacob Helwig ⋅ Xuan Zhang ⋅ Haiyang Yu ⋅ Xavier BERTRAND ⋅ Frédéric TOST ⋅ Michaël Bauerheim ⋅ Joseph Morlier ⋅ Shuiwang Ji

Dec 3, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

The integration of machine learning (ML) into the physical sciences is reshaping computational paradigms, offering the potential to accelerate demanding simulations such as computational fluid dynamics (CFD). Yet, persistent challenges in accuracy, generalization, and physical consistency hinder the practical deployment of ML models in scientific domains. To address these limitations and systematically benchmark progress, we organized the ML4CFD competition, centered on surrogate modeling for aerodynamic simulations over two-dimensional airfoils. The competition attracted over 240 teams, who were provided with a curated dataset generated via OpenFOAM and evaluated through a multi-criteria framework encompassing predictive accuracy, physical fidelity, computational efficiency, and out-of-distribution generalization. This retrospective analysis reviews the competition outcomes, highlighting several approaches that outperformed baselines under our global evaluation score. Notably, the top entry exceeded the performance of the original OpenFOAM solver on aggregate metrics, illustrating the promise of ML based surrogates to outperform traditional solvers under tailored criteria. However, this does not imply that the winning solution could replace the OpenFOAM solver or that it was overall superior, even for this specific task. Drawing from these results, we analyze the key design principles of top submissions, assess the robustness of our evaluation framework, and offer guidance for future scientific ML challenges.

View full details

Poster

Enhancing Multilingual LLM Pretraining with Model-Based Data Selection

Bettina Messmer ⋅ Vinko Sabolčec ⋅ Martin Jaggi

Dec 3, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Dataset curation has become a basis for strong large language model (LLM) performance.While various rule-based filtering heuristics exist for English and multilingual datasets, model-based filtering techniques have primarily focused on English.To address the disparity stemming from limited research on non-English languages, we develop a model-based filtering framework for multilingual datasets that aims to identify a diverse set of structured and knowledge-rich samples.Our approach emphasizes transparency, simplicity, and efficiency, leveraging Transformer- and FastText-based classifiers to ensure the broad accessibility of our technique and data.We conduct comprehensive ablation studies on the FineWeb-2 web crawl dataset across diverse language families, scripts, and resource availability to demonstrate the effectiveness of our method.Training a 1B-parameter Llama model for 70B and 119B tokens, our approach can match the baseline MMLU score with as little as 15\% of the training tokens, while also improving across other benchmarks and mitigating the curse of multilinguality.These findings provide strong evidence for the generalizability of our approach to other languages. As a result, we extend our framework to 20 languages for which we release the refined pretraining datasets.

View full details

Poster

Words That Unite The World: A Unified Framework for Deciphering Central Bank Communications

Agam Shah ⋅ Siddhant Sukhani ⋅ Huzaifa Pardawala ⋅ Saketh Budideti ⋅ Riya Bhadani ⋅ Rudra Gopal ⋅ Siddhartha Somani ⋅ Rutwik Routu ⋅ Michael Galarnyk ⋅ Soungmin Lee ⋅ Arnav Hiray ⋅ Akshar Ravichandran ⋅ Eric Kim ⋅ Pranav Aluru ⋅ Joshua Zhang ⋅ Sebastian Jaskowski ⋅ Veer Guda ⋅ Meghaj Tarte ⋅ Liqin Ye ⋅ Spencer Gosden ⋅ Rachel Yuh ⋅ Sloka Chava ⋅ Sahasra Chava ⋅ Dylan Patrick Kelly ⋅ Aiden Chiang ⋅ Harsit Mittal ⋅ Sudheer Chava

Dec 3, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Central banks around the world play a crucial role in maintaining economic stability. Deciphering policy implications in their communications is essential, especially as misinterpretations can disproportionately impact vulnerable populations. To address this, we introduce the World Central Banks (WCB) dataset, the most comprehensive monetary policy corpus to date, comprising over 380k sentences from 25 central banks across diverse geographic regions, spanning 28 years of historical data. After uniformly sampling 1k sentences per bank (25k total) across all available years, we annotate and review each sentence using dual annotators, disagreement resolutions, and secondary expert reviews. We define three tasks: Stance Detection, Temporal Classification, and UncertaintyEstimation, with each sentence annotated for all three. We benchmark seven Pretrained Language Models (PLMs) and nine Large Language Models (LLMs) (Zero-Shot, Few-Shot, and with annotation guide) on these tasks, running 15,075 benchmarking experiments. We find that a model trained on aggregated data across banks significantly surpasses a model trained on an individual bank's data, confirming the principle *"the whole is greater than the sum of its parts."* Additionally, rigorous human evaluations, error analyses, and predictive tasks validate our framework's economic utility. Our artifacts are accessible through the HuggingFace and GitHub under the CC-BY-NC-SA 4.0 license.

View full details

Poster

NeuroRenderedFake: A Challenging Benchmark to Detect Fake Images Generated by Advanced Neural Rendering Methods

Chengdong Dong ⋅ B. V. K. Vijaya Kumar ⋅ Zhenyu Zhou ⋅ Ajay Kumar

Dec 3, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

The remarkable progress in neural-network-driven visual data generation, especially with neural rendering techniques like Neural Radiance Fields and 3D Gaussian splatting, offers a powerful alternative to GANs and diffusion models. These methods can generate high-fidelity images and lifelike avatars, highlighting the need for robust detection methods. However, the lack of any large dataset containing images from neural rendering methods becomes a bottleneck for the detection of such sophisticated fake images. To address this limitation, we introduce NeuroRenderedFake, a comprehensive benchmark for evaluating emerging fake image detection methods. Our key contributions are threefold: (1) A large-scale dataset of fake images synthesized using state-of-the-art neural rendering techniques, significantly expanding the scope of fake image detection beyond generative models; (2) A cross-domain evaluation protocol designed to assess the domain gap and common artifacts between generative and neural rendering-based fake images; and (3) An in-depth spectral energy analysis that reveals how frequency domain characteristics influence the performance of fake image detectors. We train representative detectors, based on spatial, spectral, and multimodal architectures, on fake images generated by both generative and neural rendering models. We evaluate these detectors on 15 groups of fake images synthesized by cutting-edge neural rendering models, generative models, and combined methods that can exhibit artifacts from both domains. Additionally, we provide insightful findings through detailed experiments on degraded fake image detection and the impact of spectral features, aiming to advance research in this critical area.

View full details

Poster

SWE-bench Goes Live!

Linghao Zhang ⋅ Shilin He ⋅ Chaoyun Zhang ⋅ Yu Kang ⋅ Bowen Li ⋅ Chengxing Xie ⋅ Junhao Wang ⋅ Maoquan Wang ⋅ Yufan Huang ⋅ Shengyu Fu ⋅ Elsie Nallipogu ⋅ Qingwei Lin ⋅ Yingnong Dang ⋅ Saravan Rajmohan ⋅ Dongmei Zhang

Dec 3, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

The issue-resolving task, where a model generates patches to fix real-world bugs, has emerged as a key benchmark for evaluating the capabilities of large language models (LLMs). While SWE-bench has become the dominant benchmark in this domain, it suffers from several limitations: it has not been updated since its release, is restricted to only 12 repositories, and relies heavily on manual effort for constructing test instances and setting up executable environments, significantly limiting its scalability. We present SWE-bench-Live, a live-updatable benchmark designed to address these limitations. SWE-bench-Live currently includes 1,890 tasks derived from real GitHub issues created since 2024, spanning 223 repositories. Each task is accompanied by a dedicated Docker image to ensure reproducible execution. Additionally, we introduce an automated curation pipeline that streamlines the entire process from instance creation to environment setup, removing manual bottlenecks and enabling scalability and continuous updates. We evaluate a range of state-of-the-art models and agent frameworks on SWE-bench-Live, offering detailed empirical insights into their real-world bug-fixing capabilities. By providing a fresh, diverse, and executable benchmark grounded in live repository activity, SWE-bench-Live supports reliable, large-scale assessment of code LLMs and code agents in realistic development settings.

View full details

Poster

ArchPower: Dataset for Architecture-Level Power Modeling of Modern CPU Design

Qijun Zhang ⋅ Yao Lu ⋅ Mengming Li ⋅ Shang Liu ⋅ Zhiyao Xie

Dec 3, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Power is the primary design objective of large-scale integrated circuits (ICs), especially for complex modern processors (i.e., CPUs). Accurate CPU power evaluation requires designers to go through the whole time-consuming IC implementation process, easily taking months. At the early design stage (e.g., architecture-level), classical power models are notoriously inaccurate. Recently, ML-based architecture-level power models have been proposed to boost accuracy, but the data availability is a severe challenge. Currently, there is no open-source dataset for this important ML application. A typical dataset generation process involves correct CPU design implementation and repetitive execution of power simulation flows, requiring significant design expertise, engineering effort, and execution time. Even private in-house datasets often fail to reflect realistic CPU design scenarios. In this work, we propose ArchPower, the first open-source dataset for architecture-level processor power modeling. We go through complex and realistic design flows to collect the CPU architectural information as features and the ground-truth simulated power as labels. Our dataset includes 200 CPU data samples, collected from 25 different CPU configurations when executing 8 different workloads. There are more than 100 architectural features in each data sample, including both hardware and event parameters. The label of each sample provides fine-grained power information, including the total design power and the power for each of the 11 components. Each power value is further decomposed into four fine-grained power groups: combinational logic power, sequential logic power, memory power, and clock power. ArchPower is available at https://github.com/hkust-zhiyao/ArchPower.

View full details

Poster

The Temporal Graph of Bitcoin Transactions

Vahid Jalili

Dec 3, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Since its 2009 genesis block, the Bitcoin network has processed >1.08 billion (B) transactions representing >8.72B BTC, offering rich potential for machine learning (ML); yet, its pseudonymity and obscured flow of funds inherent in its UTxO-based design, have rendered this data largely inaccessible for ML research. Addressing this gap, we present an ML-compatible graph modeling the Bitcoin's economic topology by reconstructing the flow of funds. This temporal, heterogeneous graph encompasses complete transaction history up to block 863000, consisting of >2.4B nodes and >39.72B edges. Additionally, we provide custom sampling methods yielding node and edge feature vectors of sampled communities, tools to load and analyze the Bitcoin graph data within specialized graph databases, and ready-to-use database snapshots. This comprehensive dataset and toolkit empower the ML community to tackle Bitcoin's intricate ecosystem at scale, driving progress in applications such as anomaly detection, address classification, market analysis, and large-scale graph ML benchmarking. Dataset and code available at https://github.com/B1AAB/EBA.

View full details

Poster

Quantifying Generalisation in Imitation Learning

Nathan Gavenski ⋅ Odinaldo Rodrigues

Dec 3, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Imitation learning benchmarks often lack sufficient variation between training and evaluation, limiting meaningful generalisation assessment. We introduce Labyrinth, a benchmarking environment designed to test generalisation with precise control over structure, start and goal positions, and task complexity.It enables verifiably distinct training, evaluation, and test settings.Labyrinth provides a discrete, fully observable state space and known optimal actions, supporting interpretability and fine-grained evaluation.Its flexible setup allows targeted testing of generalisation factors and includes variants like partial observability, key-and-door tasks, and ice-floor hazards.By enabling controlled, reproducible experiments, Labyrinth advances the evaluation of generalisation in imitation learning and provides a valuable tool for developing more robust agents.

View full details

Poster

Towards Better Dental AI: A Multimodal Benchmark and Instruction Dataset for Panoramic X-ray Analysis

Jing Hao ⋅ Yuxuan Fan ⋅ Yanpeng Sun ⋅ Kaixin Guo ⋅ Lin Lizhuo ⋅ Jinrong Yang ⋅ Qiyong Ai ⋅ Lun Wong ⋅ Hao Tang ⋅ Kuo Hung

Dec 3, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Recent advances in large vision-language models (LVLMs) have demonstrated strong performance on general-purpose medical tasks. However, their effectiveness in specialized domains such as dentistry remains underexplored. In particular, panoramic X-rays, a widely used imaging modality in oral radiology, pose interpretative challenges due to dense anatomical structures and subtle pathological cues, which are not captured by existing medical benchmarks or instruction datasets. To this end, we introduce MMOral, the first large-scale multimodal instruction dataset and benchmark tailored for panoramic X-ray interpretation. MMOral consists of 20,563 annotated images paired with 1.3 million instruction-following instances across diverse task types, including attribute extraction, report generation, visual question answering, and image-grounded dialogue. In addition, we present MMOral-Bench, a comprehensive evaluation suite covering five key diagnostic dimensions in dentistry. We evaluate 64 LVLMs on MMOral-Bench and find that even the best-performing model, i.e., GPT-4o, only achieves 43.31% accuracy, revealing significant limitations of current models in this domain. To promote the progress of this specific domain, we provide the supervised fine-tuning (SFT) process utilizing our meticulously curated MMOral instruction dataset. Remarkably, a single epoch of SFT yields substantial performance enhancements for LVLMs, e.g., Qwen2.5-VL-7B demonstrates a 24.73% improvement. MMOral holds significant potential as a critical foundation for intelligent dentistry and enables more clinically impactful multimodal AI systems in the dental field.

View full details

Poster

VideoCAD: A Dataset and Model for Learning Long‑Horizon 3D CAD UI Interactions from Video

King Yiu Brandon Man ⋅ Ghadi Nehme ⋅ Md Ferdous Alam ⋅ Faez Ahmed

Dec 3, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Computer-Aided Design (CAD) is a time-consuming and complex process, requiring precise, long-horizon user interactions with intricate 3D interfaces. While recent advances in AI-driven user interface (UI) agents show promise, most existing datasets and methods focus on short, low-complexity tasks in mobile or web applications, failing to capture the demands of professional engineering tools. In this work, we introduce VideoCAD, the first attempt to model UI interactions for precision engineering tasks. Specifically, VideoCAD is a large-scale synthetic dataset consisting of over 41K annotated video recordings of CAD operations, generated using an automated framework for collecting high-fidelity UI action data from human-made CAD designs.Compared to existing datasets, VideoCAD offers an order-of-magnitude increase in complexity for real-world engineering UI tasks, with time horizons up to $20\times$ longer than those in other datasets. We show two important downstream applications of VideoCAD:(1) learning UI interactions from professional 3D CAD tools for precision tasks and (2) a visual question-answering (VQA) benchmark designed to evaluate multimodal large language models (LLMs) on spatial reasoning and video understanding. To learn the UI interactions, we propose VideoCADFormer, a state-of-the-art model for learning CAD interactions directly from video, which outperforms existing behavior cloning baselines. Both VideoCADFormer and the VQA benchmark derived from VideoCAD reveal key challenges in the current state of video-based UI understanding, including the need for precise action grounding, multi-modal and spatial reasoning, and long-horizon dependencies. Dataset and code available at: https://github.com/ghadinehme/VideoCAD.

View full details

Poster

SentinelKilnDB: A Large-Scale Dataset and Benchmark for OBB Brick Kiln Detection in South Asia Using Satellite Imagery

Rishabh Mondal ⋅ Jeet Parab ⋅ Heer Kubadia ⋅ Shataxi Dubey ⋅ Shardul Junagade ⋅ Zeel Bharatkumar Patel ⋅ Nipun Batra

Dec 3, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Air pollution was responsible for 2.6 million deaths across South Asia in 2021 alone, with brick manufacturing contributing significantly to this burden. In particular, the Indo-Gangetic Plain; a densely populated and highly polluted region spanning northern India, Pakistan, Bangladesh, and parts of Afghanistan sees brick kilns contributing 8–14% of ambient air pollution. Traditional monitoring approaches, such as field surveys and manual annotation using tools like Google Earth Pro, are time and labor-intensive. Prior ML-based efforts for automated detection have relied on costly high-resolution commercial imagery and non-public datasets, limiting reproducibility and scalability. In this work, we introduce SENTINELKILNDB, a publicly available, hand-validated benchmark of 62,671 brick kilns spanning threekiln types Fixed Chimney Bull’s Trench Kiln (FCBK), Circular FCBK (CFCBK), and Zigzag kilns - annotated with oriented bounding boxes (OBBs) across 2.8 million km2 using free and globally accessible Sentinel-2 imagery. We benchmark state-of-the-art oriented object detection models and evaluate generalization across in-region, out-of-region, and super-resolution settings. SENTINELKILNDB enables rigorous evaluation of geospatial generalization and robustness for low-resolution object detection, and provides a new testbed for ML models addressing real-world environmental and remote sensing challenges at a continental scale. Datasets and code are available in SentinelKilnDB Dataset and SentinelKilnDB Bench-mark, under the Creative Commons Attribution–NonCommercial 4.0 International License.

View full details

Poster

What’s in Common? Multimodal Models Hallucinate When Reasoning Across Scenes

Candace Ross ⋅ Florian Bordes ⋅ Adina Williams ⋅ Polina Kirichenko ⋅ Mark Ibrahim

Dec 3, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Multimodal language models possess a remarkable ability to handle an open-vocabulary worth of objects. Yet the best models still suffer from hallucinations when reasoning about scenes in the real world, revealing a gap between their seemingly strong performance on existing perception benchmarks that are saturating and their reasoning in the real world. To address this gap, we build a novel benchmark of in-the-wild scenes that we call Common-O Bench with more than 10.5k examples using exclusively new images not found in web training data to avoid contamination, Common-O goes beyond just perception, inspired by cognitive tests for humans, to probe reasoning across scenes by asking ``what’s in common?''. We evaluate leading multimodal language models, including models specifically trained to reason. We find that perceiving objects in single images is easy for most models, yet reasoning across scenes is very challenging even for the best models, including reasoning models. Despite saturating many leaderboards focusing on perception, the best performing model only achieves 35\% on Common-O Bench---and on Common-O Complex, consisting of more complex scenes, the best model achieves only 1\%. Curiously, we find models are more prone to hallucinate when similar objects are present in the scene, suggesting models may be relying on object co-occurrence seen during training. Among the models we evaluated, we found scale can provide modest improvements while models explicitly trained with multi-image inputs show bigger improvements, suggesting scaled multi-image training may offer promise. We make our benchmark publicly available to spur research into the challenge of hallucination when reasoning across scenes.

View full details

Poster

WorldModelBench: Judging Video Generation Models As World Models

Dacheng Li ⋅ Yunhao Fang ⋅ Yukang Chen ⋅ Shuo Yang ⋅ Shiyi Cao ⋅ Justin Wong ⋅ Michael Luo ⋅ Xiaolong Wang ⋅ Hongxu Yin ⋅ Joseph Gonzalez ⋅ Ion Stoica ⋅ Song Han ⋅ Yao Lu

Dec 3, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Video generation models have rapidly progressed, positioning themselves as video world models capable of supporting decision-making applications like robotics and autonomous driving. However, current benchmarks fail to rigorously evaluate these claims, focusing only on general video quality, ignoring important factors to world models such as physics adherence.To bridge this gap, we propose WorldModelBench, a benchmark designed to evaluate the world modeling capabilities of video generation models in application-driven domains. WorldModelBench offers two key advantages: (1) Against to nuanced world modeling violations: By incorporating instruction-following and physics-adherence dimensions, WorldModelBench detects subtle violations, such as irregular changes in object size that breach the mass conservation law—issues overlooked by prior benchmarks. (2) Aligned with large-scale human preferences: We crowd-source 67K human labels to accurately measure 14 frontier models. Using our high-quality human labels, we further fine-tune an accurate judger to automate the evaluation procedure, achieving 9.9% lower error in predicting world modeling violations than GPT-4o with 2B parameters. In addition, we demonstrate that training to align human annotations by maximizing the rewards from the judger noticeably improve the world modeling capability. The dataset is hosted in HuggingFace at https://huggingface.co/datasets/Efficient-Large-Model/worldmodelbench. The code to run evaluation is available at https://github.com/WorldModelBench-Team/WorldModelBench.

View full details

Poster

ChartMuseum: Testing Visual Reasoning Capabilities of Large Vision-Language Models

Liyan Tang ⋅ Grace Kim ⋅ Xinyu Zhao ⋅ Thom Lake ⋅ Wenxuan Ding ⋅ Fangcong Yin ⋅ Prasann Singhal ⋅ Manya Wadhwa ⋅ Zeyu Liu ⋅ Zayne Sprague ⋅ Ramya Namuduri ⋅ Bodun Hu ⋅ Juan Rodriguez ⋅ Puyuan Peng ⋅ Greg Durrett

Dec 3, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Chart understanding presents a unique challenge for large vision-language models (LVLMs), as it requires the integration of sophisticated textual and visual reasoning capabilities. However, current LVLMs exhibit a notable imbalance between these skills, falling short on visual reasoning that is difficult to perform in text. We conduct a case study using a synthetic dataset solvable only through visual reasoning and show that model performance degrades significantly with increasing visual complexity, while human performance remains robust. We then introduce *ChartMuseum*, a new Chart Question Answering (QA) benchmark containing 1,162 expert-annotated questions spanning multiple reasoning types, curated from real-world charts across 184 sources, specifically built to evaluate complex visual and textual reasoning. Unlike prior chart understanding benchmarks---where frontier models perform similarly and near saturation---our benchmark exposes a substantial gap between model and human performance, while effectively differentiating model capabilities: although humans achieve 93% accuracy, the best-performing model Gemini-2.5-Pro attains only 63.0%, and the leading open-source LVLM Qwen2.5-VL-72B-Instruct achieves only 38.5%. Moreover, on questions requiring primarily visual reasoning, *all* models experience a 35%-55% performance drop from text-reasoning-heavy question performance. Lastly, our qualitative error analysis reveals specific categories of visual reasoning that are challenging for current LVLMs. Both ChartMuseum and the evaluation code are available at [https://github.com/Liyan06/ChartMuseum](https://github.com/Liyan06/ChartMuseum).

View full details

Poster

Seeing in the Dark: Benchmarking Egocentric 3D Vision with the Oxford Day-and-Night Dataset

Zirui Wang ⋅ Wenjing Bian ⋅ Xinghui Li ⋅ Yifu Tao ⋅ Jianeng Wang ⋅ Maurice Fallon ⋅ Victor Prisacariu

Dec 3, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

We introduce Oxford Day-and-Night, a large-scale, egocentric dataset for novel view synthesis (NVS) and visual relocalisation under challenging lighting conditions. Existing datasets often lack crucial combinations of features such as ground-truth 3D geometry, wide-ranging lighting variation, and full 6DoF motion. Oxford Day-and-Night addresses these gaps by leveraging Meta ARIA glasses to capture egocentric video and applying multi-session SLAM to estimate camera poses, reconstruct 3D point clouds, and align sequences captured under varying lighting conditions, including both day and night. The dataset spans over 30 km of recorded trajectories and covers an area of $40{,}000\mathrm{m}^2$, offering a rich foundation for egocentric 3D vision research. It supports two core benchmarks, NVS and relocalisation, providing a unique platform for evaluating models in realistic and diverse environments. Project page: https://oxdan.active.vision/

View full details

Poster

MM-OPERA: Benchmarking Open-ended Association Reasoning for Large Vision-Language Models

Zimeng Huang ⋅ Jinxin Ke ⋅ Xiaoxuan Fan ⋅ Yufeng Yang ⋅ Yang Liu ⋅ Liu Zhonghan ⋅ Zedi Wang ⋅ Junteng Dai ⋅ Haoyi Jiang ⋅ Yuyu Zhou ⋅ Keze Wang ⋅ Ziliang Chen

Dec 3, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Large Vision-Language Models (LVLMs) have exhibited remarkable progress. However, deficiencies remain compared to human intelligence, such as hallucination and shallow pattern matching. In this work, we aim to evaluate a fundamental yet underexplored intelligence: association, a cornerstone of human cognition for creative thinking and knowledge integration. Current benchmarks, often limited to closed-ended tasks, fail to capture the complexity of open-ended association reasoning vital for real-world applications. To address this, we present MM-OPERA, a systematic benchmark with 11,497 instances across two open-ended tasks: Remote-Item Association (RIA) and In-Context Association (ICA), aligning association intelligence evaluation with human psychometric principles. It challenges LVLMs to resemble the spirit of divergent thinking and convergent associative reasoning through free-form responses and explicit reasoning paths. We deploy tailored LLM-as-a-Judge strategies to evaluate open-ended outputs, applying process-reward-informed judgment to dissect reasoning with precision. Extensive empirical studies on state-of-the-art LVLMs, including sensitivity analysis of task instances, validity analysis of LLM-as-a-Judge strategies, and diversity analysis across abilities, domains, languages, cultures, etc., provide a comprehensive and nuanced understanding of the limitations of current LVLMs in associative reasoning, paving the way for more human-like and general-purpose AI. The dataset and code are available at https://github.com/MM-OPERA-Bench/MM-OPERA.

View full details

Poster

Results of the Big ANN: NeurIPS’23 competition

Harsha Vardhan simhadri ⋅ Martin Aumüller ⋅ Matthijs Douze ⋅ Dmitry Baranchuk ⋅ Amir Ingber ⋅ Edo Liberty ⋅ George Williams ⋅ Ben Landrum ⋅ Magdalen Manohar ⋅ Mazin Karjikar ⋅ Laxman Dhulipala ⋅ Meng Chen ⋅ Yue Chen ⋅ Rui Ma ⋅ Kai Zhang ⋅ Yuzheng Cai ⋅ Jiayang Shi ⋅ Weiguo Zheng ⋅ Yizhuo Chen ⋅ Jie Yin ⋅ Ben Huang

Dec 3, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

The 2023 Big ANN Challenge, held at NeurIPS 2023, focused on advancing the state-of-the-art in indexing data structures and search algorithms for practical variants of Approximate Nearest Neighbor (ANN) search that reflect its the growing complexity and diversity of workloads. Unlike prior challenges that emphasized scaling up classical ANN search (Simhadri et al., NeurIPS 2021), this competition addressed sparse, filtered, out-of-distribution, and streaming variants of ANNS. Participants developed and submitted innovative solutions that were evaluated on new standard datasets with constrained computational resources. The results showcased significant improvements in search accuracy and efficiency, with notable contributions from both academic and industrial teams. This paper summarizes the competition tracks, datasets, evaluation metrics, and the innovative approaches of the top-performing submissions, providing insights into the current advancements and future directions in the field of approximate nearest neighbor search.

View full details

Poster

Evaluating Generalization Capabilities of LLM-Based Agents in Mixed-Motive Scenarios Using Concordia

Chandler Smith ⋅ Marwa Abdulhai ⋅ Manfred Díaz ⋅ Marko Tesic ⋅ Rakshit Trivedi ⋅ Sasha Vezhnevets ⋅ Lewis Hammond ⋅ Jesse Clifton ⋅ Minsuk Chang ⋅ Edgar Duenez-Guzman ⋅ John Agapiou ⋅ Jayd Matyas ⋅ Danny Karmon ⋅ Beining Zhang ⋅ Jim Dilkes ⋅ Akash Kundu ⋅ Hieu Minh Nguyen ⋅ Emanuel Tewolde ⋅ Jebish Purbey ⋅ Ram Mohan Rao Kadiyala ⋅ Siddhant Gupta ⋅ Aliaksei Korshuk ⋅ Buyantuev Alexander ⋅ Ilya Makarov ⋅ Gang Zhao ⋅ Rolando Fernandez ⋅ Zhihan Wang ⋅ Caroline Wang ⋅ Jiaxun Cui ⋅ Lingyun Xiao ⋅ Di Shi ⋅ Yoonchang Sung ⋅ Muhammad Arrasy Rahman ⋅ Peter Stone ⋅ Yipeng Kang ⋅ Hyeonggeun Yun ⋅ Ananya Ananya ⋅ Taehun Cha ⋅ Zhiqiang Wu ⋅ Elizaveta Tennant ⋅ Olivia Macmillan-Scott ⋅ Marta Segura ⋅ Diana Riazi ⋅ Fuyang Cui ⋅ Sriram Ganapathi ⋅ Toryn Klassen ⋅ Nico Schiavone ⋅ Mogtaba Alim ⋅ Sheila McIlraith ⋅ Manuel Rios ⋅ Oswaldo Peña ⋅ Carlos Rojas ⋅ Manuela Chacon-Chamorro ⋅ Rubén Manrique ⋅ Luis Felipe Giraldo ⋅ Nicanor Quijano ⋅ Yiding Wang ⋅ Yuxuan Chen ⋅ Fangwei Zhong ⋅ Mengmeng Wang ⋅ Wenming Tu ⋅ Zhaowei Zhang ⋅ Ziang Chen ⋅ Zixia Jia ⋅ Xue Feng ⋅ Zilong Zheng ⋅ Chichen Lin ⋅ Weijian Fan ⋅ Chenao Liu ⋅ Sneheel Sarangi ⋅ Ziyan Wang ⋅ shuqing shi ⋅ Yali Du ⋅ Avinaash Anand Kulandaivel ⋅ Yang Liu ⋅ Wu Ruiyang ⋅ Chetan Talele ⋅ Sunjia Lu ⋅ Gema Parreno ⋅ Shamika Dhuri ⋅ Bain McHale ⋅ Tim Baarslag ⋅ Dylan Hadfield-Menell ⋅ Natasha Jaques ⋅ José Hernández-Orallo ⋅ Joel Leibo

Dec 3, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Large Language Model (LLM) agents have demonstrated impressive capabilities for social interaction and are increasingly being deployed in situations where they might engage with both human and artificial agents. These interactions represent a critical frontier for LLM-based agents, yet existing evaluation methods fail to measure how well these capabilities generalize to novel social situations. In this paper, we introduce a method for evaluating the ability of LLM-based agents to cooperate in zero-shot, mixed-motive environments using Concordia, a natural language multi-agent simulation environment. Our method measures general cooperative intelligence by testing an agent's ability to identify and exploit opportunities for mutual gain across diverse partners and contexts. We present empirical results from the NeurIPS 2024 Concordia Contest, where agents were evaluated on their ability to achieve mutual gains across a suite of diverse scenarios ranging from negotiation to collective action problems. Our findings reveal significant gaps between current agent capabilities and the robust generalization required for reliable cooperation, particularly in scenarios demanding persuasion and norm enforcement.

View full details

Poster

EngiBench: A Framework for Data-Driven Engineering Design Research

Florian Felten ⋅ Gabriel Apaza ⋅ Gerhard Bräunlich ⋅ Cashen Diniz ⋅ Xuliang Dong ⋅ Arthur Drake ⋅ Milad Habibi ⋅ Nathaniel Hoffman ⋅ Matthew Keeler ⋅ Soheyl Massoudi ⋅ Francis VanGessel ⋅ Mark Fuge

Dec 3, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Engineering design optimization seeks to automatically determine the shapes, topologies, or parameters of components that maximize performance under given conditions. This process often depends on physics-based simulations, which are difficult to install, computationally expensive, and require domain-specific expertise. To mitigate these challenges, we introduce EngiBench, the first open‐source library and datasets spanning diverse domains for data‐driven engineering design. EngiBench provides a unified API and a curated set of benchmarks---covering aeronautics, heat conduction, photonics, and more---that enable fair, reproducible comparisons of optimization and machine learning algorithms, such as generative or surrogate models. We also release EngiOpt, a companion library offering a collection of such algorithms compatible with the EngiBench interface. Both libraries are modular, letting users plug in novel algorithms or problems, automate end-to-end experiment workflows, and leverage built-in utilities for visualization, dataset generation, feasibility checks, and performance analysis. We demonstrate their versatility through experiments comparing state-of-the-art techniques across multiple engineering design problems, an undertaking that was previously prohibitively time-consuming to perform. Finally, we show that these problems pose significant challenges for standard machine learning methods due to highly sensitive and constrained design manifolds.

View full details

Poster

NS-Gym: A Comprehensive and Open-Source Simulation Framework for Non-Stationary Markov Decision Processes

Nathaniel S. Keplinger ⋅ Baiting Luo ⋅ Yunuo Zhang ⋅ Kyle H Wray ⋅ Aron Laszka ⋅ Abhishek Dubey ⋅ Ayan Mukhopadhyay

Dec 3, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Many real-world applications require decision-making where the environmental dynamics evolve over time. These non-stationary environments pose significant challenges to traditional decision-making models, which typically assume stationary dynamics. Non-stationary Markov decision processes (NS-MDPs) offer a framework to model and solve decision problems under such changing conditions. However, there are no standardized simulation frameworks for NS-MDPs, as opposed to widely popular frameworks for stationary problems. We present NS-Gym, the first simulation toolkit designed explicitly for NS-MDPs, integrated within the popular Gymnasium framework. In NS-Gym, we segregate the evolution of the environmental parameters that characterize non-stationarity from the agent’s decision-making module, allowing for modular and flexible adaptations to dynamic environments. We review prior work in this domain and present a toolkit encapsulating key problem characteristics and types in NS-MDPs. This toolkit is the first effort to develop a set of standardized interfaces and benchmark problems to enable consistent and reproducible evaluation of algorithms under non-stationary conditions. We also benchmark several algorithmic approaches from prior work on NS-MDPs using NS-Gym. We envision that NS-Gym will enable researchers to study decision-making under non-stationarity by providing standardized interfaces, simulation frameworks, and benchmark problems.

View full details

Poster

Benchmarking Retrieval-Augmented Multimomal Generation for Document Question Answering

Kuicai Dong ⋅ CHANG YUJING ⋅ Shijie Huang ⋅ Yasheng Wang ⋅ Ruiming Tang ⋅ Yong Liu

Dec 3, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Document Visual Question Answering (DocVQA) faces dual challenges in processing lengthy multimodal documents (text, images, tables) and performing cross-modal reasoning. Current document retrieval-augmented generation (DocRAG) methods remain limited by their text-centric approaches, frequently missing critical visual information. The field also lacks robust benchmarks for assessing multimodal evidence selection and integration. We introduce MMDocRAG, a comprehensive benchmark featuring 4,055 expert-annotated QA pairs with multi-page, cross-modal evidence chains. Our framework introduces innovative metrics for evaluating multimodal quote selection and enables answers that interleave text with relevant visual elements. Through large-scale experiments with 60 VLM/LLM models and 14 retrieval systems, we identify persistent challenges in multimodal evidence retrieval, selection, and integration. Key findings reveal that advanced proprietary LVMs show superior performance than open-sourced alternatives. Also, they show moderate advantages using multimodal inputs over text-only inputs, while open-source alternatives show significant performance degradation. Notably, fine-tuned LLMs achieve substantial improvements when using detailed image descriptions. MMDocRAG establishes a rigorous testing ground and provides actionable insights for developing more robust multimodal DocVQA systems.

View full details

Poster

Surprise3D: A Dataset for Spatial Understanding and Reasoning in Complex 3D Scenes

Jiaxin Huang ⋅ Ziwen Li ⋅ Hanlue Zhang ⋅ Runnan Chen ⋅ Zhengqing Gao ⋅ Xiao He ⋅ Yandong Guo ⋅ Wenping Wang ⋅ Tongliang Liu ⋅ Mingming Gong

Dec 3, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

The integration of language and 3D perception is critical for embodied AI and robotic systems to perceive, understand, and interact with the physical world. Spatial reasoning, a key capability for understanding spatial relationships between objects, remains underexplored in current 3D vision-language research. Existing datasets often mix semantic cues (e.g., object name) with spatial context, leading models to rely on superficial shortcuts rather than genuinely interpreting spatial relationships. To address this gap, we introduce Surprise3D, a novel dataset designed to evaluate language-guided spatial reasoning segmentation in complex 3D scenes. Surprise3D consists of more than 200k vision language pairs across 900+ detailed indoor scenes from ScanNet++ v2, including more than 2.8k unique object classes. The dataset contains 89k+ human-annotated spatial queries deliberately crafted without object name, thereby mitigating shortcut biases in spatial understanding. These queries comprehensively cover various spatial reasoning skills, such as relative position, narrative perspective, parametric perspective, and absolute distance reasoning. Initial benchmarks demonstrate significant challenges for current state-of-the-art expert 3D visual grounding methods and 3D-LLMs, underscoring the necessity of our dataset and the accompanying 3D Spatial Reasoning Segmentation (3D-SRS) benchmark suite. Surprise3D and 3D-SRS aim to facilitate advancements in spatially aware AI, paving the way for effective embodied interaction and robotic planning.

View full details

Poster

MMPerspective: Do MLLMs Understand Perspective? A Comprehensive Benchmark for Perspective Perception, Reasoning, and Robustness

Yunlong Tang ⋅ Pinxin Liu ⋅ Mingqian Feng ⋅ Zhangyun Tan ⋅ Rui Mao ⋅ Chao Huang ⋅ Jing Bi ⋅ Yunzhong Xiao ⋅ Susan Liang ⋅ Hang Hua ⋅ Ali Vosoughi ⋅ Luchuan Song ⋅ Zeliang Zhang ⋅ Chenliang Xu

Dec 3, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Understanding perspective is fundamental to human visual perception, yet the extent to which multimodal large language models (MLLMs) internalize perspective geometry remains unclear. We introduce MMPerspective, the first benchmark specifically designed to systematically evaluate MLLMs' understanding of perspective through 10 carefully crafted tasks across three complementary dimensions: Perspective Perception, Reasoning, and Robustness. Our benchmark comprises 2,711 real-world and synthetic image instances with 5,083 question-answer pairs that probe key capabilities, such as vanishing point perception and counting, perspective type reasoning, line relationship understanding in 3D space, invariance to perspective-preserving transformations, etc. Through a comprehensive evaluation of 43 state-of-the-art MLLMs, we uncover significant limitations: while models demonstrate competence on surface-level perceptual tasks, they struggle with compositional reasoning and maintaining spatial consistency under perturbations. Our analysis further reveals intriguing patterns between model architecture, scale, and perspective capabilities, highlighting both robustness bottlenecks and the benefits of chain-of-thought prompting. MMPerspective establishes a valuable testbed for diagnosing and advancing spatial understanding in vision-language systems. Resources are available at https://yunlong10.github.io/MMPerspective/

View full details

Poster

FGBench: A Dataset and Benchmark for Molecular Property Reasoning at Functional Group-Level in Large Language Models

Xuan Liu ⋅ Siru Ouyang ⋅ Xianrui Zhong ⋅ Jiawei Han ⋅ Huimin Zhao

Dec 3, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Large language models (LLMs) have gained significant attention in chemistry. However, most existing datasets center on molecular-level property prediction and overlook the role of fine-grained functional group (FG) information. Incorporating FG-level data can provide valuable prior knowledge that links molecular structures with textual descriptions, which can be used to build more interpretable, structure-aware LLMs for reasoning on molecule-related tasks. Moreover, LLMs can learn from such fine-grained information to uncover hidden relationships between specific functional groups and molecular properties, thereby advancing molecular design and drug discovery. Here, we introduce FGBench, a dataset comprising 625K molecular property reasoning problems with functional group information. Functional groups are precisely annotated and localized within the molecule, which ensures the dataset's interoperability thereby facilitating further multimodal applications. FGBench includes both regression and classification tasks on 245 different functional groups across three categories for molecular property reasoning: (1) single functional group impacts, (2) multiple functional group interactions, and (3) direct molecular comparisons. In the benchmark of state-of-the-art LLMs on 7K curated data, the results indicate that current LLMs struggle with FG-level property reasoning, highlighting the need to enhance reasoning capabilities in LLMs for chemistry tasks. We anticipate that the methodology employed in FGBench to construct datasets with functional group-level information will serve as a foundational framework for generating new question–answer pairs, enabling LLMs to better understand fine-grained molecular structure–property relationships. The dataset and evaluation code are available at this \href{https://github.com/xuanliugit/FGBench}{link}.

View full details

Poster

AnomalyCoT: A Multi-Scenario Chain-of-Thought Dataset for Multimodal Large Language Models

Jiaxi Cheng ⋅ Yuliang Xu ⋅ Shoupeng Wang ⋅ Tao Ma ⋅ Yuchen He ⋅ Jinghe Zhang ⋅ Sihang Cai ⋅ Jiawei Zhen ⋅ Jingyi Jia ⋅ Yao Wan ⋅ Yan Xia ⋅ Zhou Zhao

Dec 3, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Industrial Anomaly Detection (IAD) is an indispensable quality control technology in modern production processes. Recently, on account of the outstanding visual comprehension and cross-domain knowledge transfer capabilities of multimodal large language models (MLLMs), existing studies have explored the application of MLLMs in the IAD domain and established some multimodal IAD datasets. However, although the latest datasets contain various fundamental IAD tasks, they formulate tasks in a general question-and-answer format lacking a rigorous reasoning process, and they are relatively limited in the diversity of scenarios, which restricts their reliability in practical applications. In this paper, we propose AnomalyCoT, a multimodal Chain-of-Thought (CoT) dataset for multi-scenario IAD tasks. It consists of 37,565 IAD samples with the CoT data and is defined by challenging composite IAD tasks. Meanwhile, the CoT data for each sample provides precise coordinates of anomaly regions, thereby improving visual comprehension of defects across different types. AnomalyCoT is constructed through a systematic pipeline and involves multiple manual operations. Based on AnomalyCoT, we conducted a comprehensive evaluation of various mainstream MLLMs and fine-tuned representative models in different ways. The final results show that Gemini-2.0-flash achieved the best performance in the direct evaluation with an accuracy rate of 59.6\%, while Llama 3.2-Vision achieves the best performance after LoRA fine-tuning with an accuracy rate of 94.0\%. Among all the fine-tuned models, the average accuracy improvement reaches 36.5\%, demonstrating the potential of integrating CoT datasets in future applications within the IAD field. The code and data are available at \url{https://github.com/Zhaolutuan/AnomalyCoT}.

View full details

Poster

EmergentTTS-Eval: Evaluating TTS Models on Complex Prosodic, Expressiveness, and Linguistic Challenges Using Model-as-a-Judge

Ruskin Raj Manku ⋅ Yuzhi Tang ⋅ Xingjian Shi ⋅ Mu Li ⋅ Alexander Smola

Dec 3, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Text-to-Speech (TTS) benchmarks often fail to capture how well models handle nuanced and semantically complex text. Building on $\textit{EmergentTTS}$, we introduce $\textit{EmergentTTS-Eval}$, a comprehensive benchmark covering six challenging TTS scenarios: emotions, paralinguistics, foreign words, syntactic complexity, complex pronunciation (e.g. URLs, formulas), and questions. Crucially, our framework automates both test-case generation and evaluation, making the benchmark easily extensible. Starting from a small set of human-written seed prompts, we iteratively extend them using LLMs to target specific structural, phonetic and prosodic challenges, resulting in 1,645 diverse test samples. Moreover, we employ a model-as-a-judge approach, using a Large Audio Language Model (LALM) to assess the speech across multiple dimensions such as expressed emotion, prosodic, intonational, and pronunciation accuracy. We evaluate state-of-the-art open-source and proprietary TTS systems, such as 11Labs, Deepgram, and OpenAI's 4o-mini-TTS, on EmergentTTS-Eval, demonstrating its ability to reveal fine-grained performance differences. Results show that the model-as-a-judge approach offers robust TTS assessment and a high correlation with human preferences. We open-source the code and the dataset.

View full details

Poster

CineTechBench: A Benchmark for Cinematographic Technique Understanding and Generation

Xinran Wang ⋅ Songyu Xu ⋅ Shan Xiangxuan ⋅ Yuxuan Zhang ⋅ Muxi Diao ⋅ Xueyan Duan ⋅ Yanhua huang ⋅ Kongming Liang ⋅ Zhanyu Ma

Dec 3, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Cinematography is a cornerstone of film production and appreciation, shaping mood, emotion, and narrative through visual elements such as camera movement, shot composition, and lighting. Despite recent progress in multimodal large language models (MLLMs) and video generation models, the capacity of current models to grasp and reproduce cinematographic techniques remains largely uncharted, hindered by the scarcity of expert-annotated data. To bridge this gap, we present CineTechBench, a pioneering benchmark founded on precise, manual annotation by seasoned cinematography experts across key cinematography dimensions. Our benchmark covers seven essential aspects—shot scale, shot angle, composition, camera movement, lighting, color, and focal length—and includes over 600 annotated movie images and 120 movie clips with clear cinematographic techniques. For the understanding task, we design question–answer pairs and annotated descriptions to assess MLLMs’ ability to interpret and explain cinematographic techniques. For the generation task, we assess advanced video generation models on their capacity to reconstruct cinema-quality camera movements given conditions such as textual prompts or keyframes. We conduct a large-scale evaluation on 15+ MLLMs and 5+ video generation models. Our results offer insights into the limitations of current models and future directions for cinematography understanding and generation in automatical film production and appreciation. The code and benchmark can be accessed at \url{https://github.com/PRIS-CV/CineTechBench}.

View full details

Poster

WearVQA: A Visual Question Answering Benchmark for Wearables in Egocentric Authentic Real-world scenarios

Eun Chang ⋅ Zhuangqun Huang ⋅ Yiwei Liao ⋅ Sagar Bhavsar ⋅ Amogh Param ⋅ Tammy Stark ⋅ Adel Ahmadyan ⋅ Xiao Yang ⋅ Jiaqi Wang ⋅ Ahsan Abdullah ⋅ Giang Nguyen ⋅ Akil Iyer ⋅ David Hall ⋅ Elissa Li ⋅ Nicolas Scheffer ⋅ Ahmed Kirmani ⋅ Babak Damavandi ⋅ Rakesh Wanga ⋅ Anuj Kumar ⋅ Rohit Patel ⋅ Seungwhan Moon ⋅ Xin Luna Dong

Dec 3, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

We introduce WearVQA, the first benchmark specifically designed to evaluate the visual questionanswering (VQA) capabilities of multi-modal AI assistant on wearable devices like smart glasses. Unlikeprior benchmarks that focus on high-quality, third-person imagery, WearVQA reflects the unique chal-lenges of ego-centric interaction—where visual inputs may be occluded, poorly lit, unzoomed, or blurry,and questions are grounded in realistic wearable use cases. The benchmark comprises 2,500 carefullycurated image-question-answer triplets, spanning 7 diverse image domains including both text-centricand general scenes, 10 cognitive task types ranging from basic recognition to various forms of reasoning,and 6 common wearables-specific image quality issues. All questions are designed to be answerable usingonly the visual input and common senses. WearVQA is paired with a rigorous LLM-as-a-judge evaluationframework with 96% labeling accuracy. Open-source and proprietary multi-modal LLMs achieved a QAaccuracy as low as 24–52% on WearVQA, with substantial drops on lower-quality images and reasoning-heavy tasks. These observations position WearVQA as a comprehensive and challenging benchmark forguiding technicial advancement towards robust, real-world multi-modal wearables AI systems.

View full details

Poster

COCONut-PanCap: Joint Panoptic Segmentation and Grounded Captions for Fine-Grained Understanding and Generation

Xueqing Deng ⋅ Linjie Yang ⋅ Qihang Yu ⋅ Ali Athar ⋅ Chenglin Yang ⋅ Xiaojie Jin ⋅ Xiaohui Shen ⋅ Liang-Chieh Chen

Dec 3, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

This paper introduces the COCONut-PanCap dataset, created to enhance panoptic segmentation and grounded image captioning. Building upon the COCO dataset with advanced COCONut panoptic masks, this dataset aims to overcome limitations in existing image-text datasets that often lack detailed, scene-comprehensive descriptions. The COCONut-PanCap dataset incorporates fine-grained, region-level captions grounded in panoptic segmentation masks, ensuring consistency and improving the detail of generated captions.Through human-edited, densely annotated descriptions, COCONut-PanCap supports improved training of vision-language models (VLMs) for image understanding and generative models for text-to-image tasks.Experimental results demonstrate that COCONut-PanCap significantly boosts performance across understanding and generation tasks, offering complementary benefits to large-scale datasets. This dataset sets a new benchmark for evaluating models on joint panoptic segmentation and grounded captioning tasks, addressing the need for high-quality, detailed image-text annotations in multi-modal learning.

View full details

Poster

BRACE: A Benchmark for Robust Audio Caption Quality Evaluation

Tianyu Guo ⋅ Hongyu Chen ⋅ Hao Liang ⋅ Meiyi Qiang ⋅ Bohan Zeng ⋅ Linzhuang Sun ⋅ Bin CUI ⋅ Wentao Zhang

Dec 3, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Automatic audio captioning is essential for audio understanding, enabling applications such as accessibility and content indexing. However, evaluating the quality of audio captions remains a major challenge, especially in reference-free settings where high-quality ground-truth captions are unavailable. While CLAPScore is currently the most widely used reference-free Audio Caption Evaluation Metric(ACEM), its robustness under diverse conditions has not been systematically validated. To address this gap, we introduce BRACE, a new benchmark designed to evaluate audio caption alignment quality in a reference-free setting. BRACE is primarily designed for assessing ACEMs, and can also be extended to measure the modality alignment abilities of Large Audio Language Model(LALM). BRACE consists of two sub-benchmarks: BRACE-Main for fine-grained caption comparison and BRACE-Hallucination for detecting subtle hallucinated content. We construct these datasets through high-quality filtering, LLM-based corruption, and human annotation. Given the widespread adoption of CLAPScore as a reference-free ACEM and the increasing application of LALMs in audio-language tasks, we evaluate both approaches using the BRACE benchmark, testing CLAPScore across various CLAP model variants and assessing multiple LALMs. Notably, even the best-performing CLAP-based ACEM achieves only a 70.01 F1-score on the BRACE-Main benchmark, while the best LALM reaches just 63.19. By revealing the limitations of CLAP models and LALMs, our BRACE benchmark offers valuable insights into the direction of future research. Our evaluation code and benchmark dataset are released in https://github.com/HychTus/BRACE_Evaluation and https://huggingface.co/datasets/gtysssp/audio_benchmarks.

View full details

Poster

MMMG: A Massive, Multidisciplinary, Multi-Tier Generation Benchmark for Text-to-Image Reasoning

Yuxuan Luo ⋅ Ryan Yuan ⋅ Junwen Chen ⋅ Haonan Cai ⋅ Ziyi Yue ⋅ Yuwei Yang ⋅ Fatima Zohra Daha ⋅ Ji Li ⋅ Zhouhui Lian

Dec 3, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

In this paper, we introduce knowledge image generation as a new task, alongside the Massive Multi-Discipline Multi-Tier Knowledge-Image Generation Benchmark (MMMG) to probe the reasoning capability of image generation models.Knowledge images have been central to human civilization and to the mechanisms of human learning—a fact underscored by dual-coding theory and the picture-superiority effect.Generating such images is challenging, demanding multimodal reasoning that fuses world knowledge with pixel-level grounding into clear explanatory visuals.To enable comprehensive evaluation, MMMG offers $4,456$ expert-validated (knowledge) image-prompt pairs spanning $10$ disciplines, $6$ educational levels, and diverse knowledge formats such as charts, diagrams, and mind maps. To eliminate confounding complexity during evaluation, we adopt a unified Knowledge Graph (KG) representation. Each KG explicitly delineates a target image’s core entities and their dependencies.We further introduce MMMG-Score to evaluate generated knowledge images. This metric combines factual fidelity, measured by graph-edit distance between KGs, with visual clarity assessment.Comprehensive evaluations of $21$ state-of-the-art text-to-image generation models expose serious reasoning deficits—low entity fidelity, weak relations, and clutter—with GPT-4o achieving an MMMG-Score of only $50.20$, underscoring the benchmark’s difficulty.To spur further progress, we release FLUX-Reason (MMMG-Score of $34.45$), an effective and open baseline that combines a reasoning LLM with diffusion models and is trained on $16,000$ curated knowledge image–prompt pairs.

View full details

Poster

AstroVisBench: A Code Benchmark for Scientific Computing and Visualization in Astronomy

Sebastian Joseph ⋅ Syed M. Husain ⋅ Stella Offner ⋅ Stéphanie Juneau ⋅ Paul Torrey ⋅ Adam Bolton ⋅ Juan Farias ⋅ Niall Gaffney ⋅ Greg Durrett ⋅ Junyi Jessy Li

Dec 3, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Large Language Models (LLMs) are being explored for applications in scientific research, including their capabilities to synthesize literature, answer research questions, generate research ideas, and even conduct computational experiments.Ultimately, our goal is for these to help scientists derive novel scientific insights. In many areas of science, such insights often arise from processing and visualizing data to understand its patterns. However, evaluating whether an LLM-mediated scientific workflow produces outputs conveying the correct scientific insights is challenging to evaluate and has not been addressed in past work.We introduce AstroVisBench, the first benchmark for both scientific computing and visualization in the astronomy domain.AstroVisBench judges a language model’s ability to both (1) create astronomy-specific workflows to process and analyze data and (2) visualize the results of these workflows through complex plots.Our evaluation of visualizations uses a novel LLM-as-a-judge workflow, which is validated against annotation by five professional astronomers.Using AstroVisBench we present an evaluation of state-of-the-art language models, showing a significant gap in their ability to engage in astronomy research as useful assistants.This evaluation provides a strong end-to-end evaluation for AI scientists that offers a path forward for the development of visualization-based workflows, which are central to a broad range of domains from physics to biology.

View full details

Poster

Demystifying Network Foundation Models

Roman Beltiukov ⋅ Satyandra Guthula ⋅ Wenbo Guo ⋅ Walter Willinger ⋅ Arpit Gupta

Dec 3, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

This work presents a systematic investigation into the latent knowledge encoded within Network Foundation Models (NFMs). Different from existing efforts, we focus on hidden representations analysis rather than pure downstream task performance and analyze NFMs through a three-part evaluation: Embedding Geometry Analysis to assess representation space utilization, Metric Alignment Assessment to measure correspondence with domain-expert features, and Causal Sensitivity Testing to evaluate robustness to protocol perturbations. Using five diverse network datasets spanning controlled and real-world environments, we evaluate four state-of-the-art NFMs, revealing that they all exhibit significant anisotropy, inconsistent feature sensitivity patterns, an inability to separate the high-level context, payload dependency, and other properties. Our work identifies numerous limitations across all models and demonstrates that addressing them can significantly improve model performance (up to 0.35 increase in $F_1$ scores without architectural changes).

View full details

Poster

FEEL: Quantifying Heterogeneity in Physiological Signals for Generalizable Emotion Recognition

Pragya Singh ⋅ Ankush Gupta ⋅ Somay Jalan ⋅ Mohan Kumar ⋅ Pushpendra Singh

Dec 3, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Emotion recognition from physiological signals has substantial potential for applications in mental health and emotion-aware systems. However, the lack of standardized, large-scale evaluations across heterogeneous datasets limits progress and model generalization. We introduce FEEL (Framework for Emotion Evaluation), the first large-scale benchmarking study of emotion recognition usingelectrodermal activity (EDA) and photoplethysmography (PPG) signals across 19 publicly available datasets. We evaluate 16 architectures spanning traditional machine learning, deep learning, and self-supervised pretraining approaches, structured into four representative modeling paradigms. Our study includes both within-dataset and cross-dataset evaluations, analyzing generalization across variations in experimental settings, device types, and labeling strategies. Our results showed that fine-tuned contrastive signal-language pretraining (CLSP) models (71/114) achieve the highest F1 across arousal and valence classification tasks, while simpler models like Random Forests, LDA, and MLP remain competitive (36/114). Models leveraging handcrafted features (107/114) consistently outperform those trained on raw signal segments, underscoring the value of domain knowledge in low-resource, noisy settings. Further cross-dataset analyses reveal that models trained on real-life setting data generalize well to lab (F1 = 0.79) and constraint-based settings (F1 = 0.78). Similarly, models trained on expert-annotated data transfer effectively to stimulus-labeled (F1 = 0.72) and self-reported datasets (F1 = 0.76). Moreover, models trained on lab-based devices also demonstrated high transferability to both custom wearable devices (F1 = 0.81) and the Empatica E4 (F1 = 0.73), underscoring the influence of heterogeneity. Overall, FEEL provides a unified framework for benchmarking physiological emotion recognition, delivering insights to guide the development of generalizable emotion-aware technologies. Code implementationis available at https://github.com/alchemy18/FEEL. More information about FEEL can be found on our website https://alchemy18.github.io/FEEL_Benchmark/.

View full details

Poster

PANORAMA: A Dataset and Benchmarks Capturing Decision Trails and Rationales in Patent Examination

Hyunseung Lim ⋅ Sooyohn Nam ⋅ Sungmin Na ⋅ Ji Yong Cho ⋅ June Yong Yang ⋅ Hyungyu Shin ⋅ Yoonjoo Lee ⋅ Juho Kim ⋅ Moontae Lee ⋅ Hwajung Hong

Dec 3, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Patent examination remains an ongoing challenge in the NLP literature even after the advent of large language models (LLMs), as it requires an extensive yet nuanced human judgment on whether a submitted $\textit{claim}$ meets the statutory standards of $\textit{novelty}$ and $\textit{non-obviousness}$ against previously granted claims—$\textit{prior art}$—in expert domains. Previous NLP studies have approached this challenge as a prediction task (e.g., forecasting grant outcomes) with high-level proxies such as similarity metrics or classifiers trained on historical labels. However, this approach often overlooks the step-by-step evaluations that examiners must make with profound information, including rationales for the decisions provided in $\textit{office actions}$ documents, which also makes it harder to measure the current state of techniques in patent review processes. To fill this gap, we construct PANORAMA, a dataset of 8,143 U.S. patent examination records that preserves the full decision trails, including original applications, all cited references, $\textit{Non-Final Rejections}$, and $\textit{Notices of Allowance}$. Also, PANORAMA decomposes the trails into sequential benchmarks that emulate patent professionals' patent review processes and allow researchers to examine large language models' capabilities at each step of them. Our findings indicate that, although LLMs are relatively effective at retrieving relevant prior art and pinpointing the pertinent paragraphs, they struggle to assess the novelty and non-obviousness of patent claims. We discuss these results and argue that advancing NLP, including LLMs, in the patent domain requires a deeper understanding of real-world patent examination. Our dataset is openly available at https://huggingface.co/datasets/LG-AI-Research/PANORAMA.

View full details

Poster

scGeneScope: A Treatment-Matched Single Cell Imaging and Transcriptomics Dataset and Benchmark for Treatment Response Modeling

Joel Dapello ⋅ Marcel Nassar ⋅ Ridvan Eksi ⋅ Ban Wang ⋅ Jules Gagnon-Marchand ⋅ Kenneth Gao ⋅ akram Baharlouei ⋅ Kyra Thrush ⋅ Nina Riehs ⋅ Amy Peterson ⋅ Aniket Tolpadi ⋅ Abhejit Rajagopal ⋅ Henry Miller ⋅ Ashley Conard ⋅ David Alvarez-Melis ⋅ Rory Stark ⋅ Simone Bianco ⋅ Morgan Levine ⋅ Ava Amini ⋅ Alex X Lu ⋅ Nicolo Fusi ⋅ Ravi Pandya ⋅ Valentina Pedoia ⋅ Hana El-Samad

Dec 3, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Understanding cellular responses to chemical interventions is critical to the discovery of effective therapeutics. Because individual biological techniques often measure only one axis of cellular response at a time, high-quality multimodal datasets are needed to unlock a holistic understanding of how cells respond to treatments and to advance computational methods that integrate modalities. However, many techniques destroy cells and thus preclude paired measurements, and attempts to match disparate unimodal datasets are often confounded by data being generated in incompatible experimental settings. Here we introduce scGeneScope, a multimodal single‑cell RNA sequencing (scRNA-seq) and Cell Painting microscopy image dataset conditionally paired by chemical treatment, designed to facilitate the development and benchmarking of unimodal, multimodal, and multiple profile machine learning methods for cellular profiling. 28 chemicals, each acting on distinct biological pathways or mechanisms of action (MoAs), were applied to U2-OS cells in two experimental data generation rounds, creating paired sets of replicates that were then profiled independently by scRNA‑seq or Cell Painting. Using scGeneScope, we derive a replicate- and experiment-split treatment identification benchmark simulating MoA discovery under realistic laboratory variability conditions and evaluate unimodal, multimodal, and multiprofile models ranging in complexity from linear approaches to recent foundation models. Multiprofile integration improved performance in both the unimodal and multimodal settings, with gains more consistent in the former. Evaluation of unimodal models for MoA identification demonstrated that recent scRNA-seq foundation models deployed zero-shot were consistently outperformed by classic fit-to-data methods, underscoring the need for careful, realistic benchmarking in machine learning for biology. We release the scGeneScope dataset and benchmarking code to support further research.

View full details

Poster

CPSea: Large-scale cyclic peptide-protein complex dataset for machine learning in cyclic peptide design

Ziyi Yang ⋅ Hanyuan Xie ⋅ Yinjun Jia ⋅ Xiangzhe Kong ⋅ Jiqing Zheng ⋅ Ziting Zhang ⋅ Yang Liu ⋅ Lei Liu ⋅ Yanyan Lan

Dec 3, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Cyclic peptides exhibit better binding affinity and proteolytic stability compared to their linear counterparts. However, the development of cyclic peptide design models is hindered by the scarcity of data. To address this, we introduce **CPSea**(**C**yclic **P**eptide **Sea**), a dataset of 2.71 million cyclic peptide-receptor complexes, curated through systematic mining of the AlphaFold Database (AFDB). Our pipeline extracts compact domains from AFDB, identifies cyclization sites using the $\beta$-carbon (C$_\beta$) distance thresholds, and applies multi-stage filtering to ensure structure fidelity and binding compatibility. Compared with experimental data of cyclic peptides, CPSea shows similar distributions in metrics on structure fidelity and wet-lab compatibility. To our knowledge, CPSea is the largest cyclic peptide-receptor dataset to date, enabling end-to-end model training for the first time. The dataset also showcases the feasibility of simulating inter-chain interactions using intra-chain interactions, expanding available resources for machine-learning models on protein-protein interactions. The dataset and relevant scripts are accessible on GitHub ([https://github.com/YZY010418/CPSea](https://github.com/YZY010418/CPSea)).

View full details

Poster

CXReasonBench: A Benchmark for Evaluating Structured Diagnostic Reasoning in Chest X-rays

Hyungyung Lee ⋅ Geon Choi ⋅ Jung-Oh Lee ⋅ Hangyul Yoon ⋅ Hyuk Hong ⋅ Edward Choi

Dec 3, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Recent progress in Large Vision-Language Models (LVLMs) has enabled promising applications in medical tasks, such as report generation and visual question answering. However, existing benchmarks focus mainly on the final diagnostic answer, offering limited insight into whether models engage in clinically meaningful reasoning. To address this, we present CheXStruct and CXReasonBench, a structured pipeline and benchmark built on the publicly available MIMIC-CXR-JPG dataset. CheXStruct automatically derives a sequence of intermediate reasoning steps directly from chest X-rays, such as segmenting anatomical regions, deriving anatomical landmarks and diagnostic measurements, computing diagnostic indices, and applying clinical thresholds. CXReasonBench leverages this pipeline to evaluate whether models can perform clinically valid reasoning steps and to what extent they can learn from structured guidance, enabling fine-grained and transparent assessment of diagnostic reasoning.The benchmark comprises 18,988 QA pairs across 12 diagnostic tasks and 1,200 cases, each paired with up to 4 visual inputs, and supports multi-path, multi-stage evaluation including visual grounding via anatomical region selection and diagnostic measurements.Even the strongest of 12 evaluated LVLMs struggle with structured reasoning and generalization, often failing to link abstract knowledge with anatomically grounded visual interpretation. The code is available at https://github.com/ttumyche/CXReasonBench

View full details

Poster

GS2E: Gaussian Splatting is an Effective Data Generator for Event Stream Generation

Yuchen Li ⋅ Chaoran Feng ⋅ Zhenyu Tang ⋅ Kaiyuan Deng ⋅ Wangbo Yu ⋅ Yonghong Tian ⋅ Li Yuan

Dec 3, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

We introduce GS2E (Gaussian Splatting to Event Generation), a large-scale synthetic event dataset designed for high-fidelity event vision tasks, captured from real-world sparse multi-view RGB images. Existing event datasets are often synthesized from dense RGB videos, which typically suffer from limited viewpoint diversity and geometric inconsistency, or rely on expensive, hard-to-scale hardware setups. GS2E addresses these limitations by first reconstructing photorealistic static scenes using 3D Gaussian Splatting, followed by a novel, physically-informed event simulation pipeline. This pipeline integrates adaptive trajectory interpolation with physically-consistent event contrast threshold modeling. As a result, it generates temporally dense and geometrically consistent event streams under diverse motion and lighting conditions, while maintaining strong alignment with the underlying scene structure. Experimental results on event-based 3D reconstruction highlight GS2E’s superior generalization capabilities and its practical value as a benchmark for advancing event vision research.

View full details

Poster

CheMixHub: Datasets and Benchmarks for Chemical Mixture Property Prediction

Ella Miray Rajaonson ⋅ Mahyar Rajabi Kochi ⋅ Luis Martin Mejia Mendoza ⋅ Mohamad Moosavi ⋅ Benjamin Sanchez-Lengeling

Dec 3, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Developing improved predictive models for multi-molecular systems is crucial, as nearly every chemical product used results from a mixture of chemicals. While being a vital part of the industry pipeline, the chemical mixture space remains relatively unexplored by the Machine Learning (ML) community. In this paper, we introduce CheMixHub, a holistic benchmark for molecular mixtures spanning a corpus of 11 chemical mixtures property prediction tasks. With applications ranging from drug delivery formulations to battery electrolytes, CheMixHub currently totals approximately 500k data points gathered and curated from 7 publicly available datasets. We devise various data splitting techniques to assess context-specific generalization and model robustness, providing a foundation for the development of predictive models for chemical mixture properties. Furthermore, we map out the modelling space of deep learning models for chemical mixtures, establishing initial benchmarks for the community. This dataset has the potential to accelerate chemical mixture development, encompassing reformulation, optimization, and discovery. The dataset and code for the benchmarks can be found at: https://github.com/chemcognition-lab/chemixhub

View full details

Poster

Rethinking Evaluation of Infrared Small Target Detection

Youwei Pang ⋅ Xiaoqi Zhao ⋅ Lihe Zhang ⋅ Huchuan Lu ⋅ Georges Fakhri ⋅ Xiaofeng Liu ⋅ Shijian Lu

Dec 3, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

As an essential vision task, infrared small target detection (IRSTD) has seen significant advancements through deep learning. However, critical limitations in current evaluation protocols impede further progress. First, existing methods rely on fragmented pixel- and target-level specific metrics, which fails to provide a comprehensive view of model capabilities. Second, an excessive emphasis on overall performance scores obscures crucial error analysis, which is vital for identifying failure modes and improving real-world system performance. Third, the field predominantly adopts dataset-specific training-testing paradigms, hindering the understanding of model robustness and generalization across diverse infrared scenarios. This paper addresses these issues by introducing a hybrid-level metric incorporating pixel- and target-level performance, proposing a systematic error analysis method, and emphasizing the importance of cross-dataset evaluation. These aim to offer a more thorough and rational hierarchical analysis framework, ultimately fostering the development of more effective and robust IRSTD models. An open-source toolkit has be released to facilitate standardized benchmarking.

View full details

Poster

MMIG-Bench: Towards Comprehensive and Explainable Evaluation of Multi-Modal Image Generation Models

Hang Hua ⋅ Ziyun Zeng ⋅ Yizhi Song ⋅ Yunlong Tang ⋅ Liu He ⋅ Daniel Aliaga ⋅ Wei Xiong ⋅ Jiebo Luo

Dec 3, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Recent multimodal image generators such as GPT-4o, Gemini 2.0 Flash, and Gemini 2.5 Pro excel at following complex instructions, editing images and maintaining concept consistency. However, they are still evaluated by disjoint toolkits: text-to-image (T2I) benchmarks that lacks multi-modal conditioning, and customized image generation benchmarks that overlook compositional semantics and common knowledge. We propose **MMIG-Bench**, a comprehensive **M**ulti-**M**odal **I**mage **G**eneration **Bench**mark that unifies these tasks by pairing 4,850 richly annotated text prompts with 1,750 multi-view reference images across 380 subjects, spanning humans, animals, objects, and artistic styles. **MMIG-Bench** is equipped with a three-level evaluation framework: (1) low-level metrics for visual artifacts and identity preservation of objects; (2) novel Aspect Matching Score (AMS): a VQA-based mid-level metric that delivers fine-grained prompt-image alignment and shows strong correlation with human judgments; and (3) high-level metrics for aesthetics and human preference. Using **MMIG-Bench**, we benchmark 17 state-of-the-art models, including Gemini 2.5 Pro, FLUX, DreamBooth, and IP-Adapter, and validate our metrics with 32k human ratings, yielding in-depth insights into architecture and data design.

View full details

Poster

Hi3DEval: Advancing 3D Generation Evaluation with Hierarchical Validity

Yuhan Zhang ⋅ Long Zhuo ⋅ Ziyang Chu ⋅ Tong Wu ⋅ Zhibing Li ⋅ Liang Pan ⋅ Dahua Lin ⋅ Ziwei Liu

Dec 3, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Despite rapid advances in 3D content generation, quality assessment for the generated 3D assets remains challenging.Existing methods mainly rely on image-based metrics and operate solely at the object level, limiting their ability to capture spatial Despite rapid advances in 3D content generation, quality assessment for the generated 3D assets remains challenging.Existing methods mainly rely on image-based metrics and operate solely at the object level, limiting their ability to capture spatial coherence, material authenticity, and high-fidelity local details.1) To address these challenges, we introduce Hi3DEval, a hierarchical evaluation framework tailored for 3D generative content. It combines both object-level and part-level evaluation, enabling holistic assessments across multiple dimensions as well as fine-grained quality analysis. Additionally, we extend texture evaluation beyond aesthetic appearance by explicitly assessing material realism, focusing on attributes such as albedo, saturation, and metallicness. 2) To support this framework, we construct Hi3DBench, a large-scale dataset comprising diverse 3D assets and high-quality annotations, accompanied by a reliable multi-agent annotation pipeline.We further propose a 3D-aware automated scoring system based on hybrid 3D representations. Specifically, we leverage video-based representations for object-level and material-subject evaluations to enhance modeling of spatio-temporal consistency and employ pretrained 3D features for part-level perception.Extensive experiments demonstrate that our approach outperforms existing image-based metrics in modeling 3D characteristics and achieves superior alignment with human preference, providing a scalable alternative to manual evaluations.

View full details

Poster

HO-Cap: A Capture System and Dataset for 3D Reconstruction and Pose Tracking of Hand-Object Interaction

Jikai Wang ⋅ Qifan Zhang ⋅ Yu-Wei Chao ⋅ Bowen Wen ⋅ Xiaohu Guo ⋅ Yu Xiang

Dec 3, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

We introduce a data capture system and a new dataset, HO-Cap, for 3D reconstruction and pose tracking of hands and objects in videos. The system leverages multiple RGB-D cameras and a HoloLens headset for data collection, avoiding the use of expensive 3D scanners or motion capture systems. We propose a semiautomatic method for annotating the shape and pose of hands and objects in the collected videos, significantly reducing the annotation time and cost compared to manual labeling. With this system, we captured a video dataset of humans performing various single- and dual-hand manipulation tasks, including simple pick-and-place actions, handovers between hands, and using objects according to their affordance. This dataset can serve as human demonstrations for research in embodied AI and robot manipulation. Our capture setup and annotation framework will be made available to the community for reconstructing 3D shapes of objects and human hands, as well as tracking their poses in videos.

View full details

Poster

CoralVQA: A Large-Scale Visual Question Answering Dataset for Coral Reef Image Understanding

hongyong han ⋅ Wei Wang ⋅ Gaowei Zhang ⋅ Mingjie Li ⋅ Yi Wang

Dec 3, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Coral reefs are vital yet vulnerable ecosystems that require continuous monitoring to support conservation. While coral reef images provide essential information in coral monitoring, interpreting such images remains challenging due to the need for domain expertise. Visual Question Answering (VQA), powered by Large Vision-Language Models (LVLMs), has great potential in user-friendly interaction with coral reef images. However, applying VQA to coral imagery demands a dedicated dataset that addresses two key challenges: domain-specific annotations and multidimensional questions. In this work, we introduce CoralVQA, the first large-scale VQA dataset for coral reef analysis. It contains 12,805 real-world coral images from 67 coral genera collected from 3 oceans, along with 277,653 question-answer pairs that comprehensively assess ecological and health-related conditions. To construct this dataset, we develop a semi-automatic data construction pipeline in collaboration with marine biologists to ensure both scalability and professional-grade data quality. CoralVQA presents novel challenges and provides a comprehensive benchmark for studying vision-language reasoning in the context of coral reef images. By evaluating several state-of-the-art LVLMs, we reveal key limitations and opportunities. These insights form a foundation for future LVLM development, with a particular emphasis on supporting coral conservation efforts.

View full details

Poster

Can LLMs Correct Themselves? A Benchmark of Self-Correction in LLMs

Guiyao Tie ⋅ Zenghui Yuan ⋅ Zeli Zhao ⋅ Chaoran Hu ⋅ Tianhe Gu ⋅ Ruihang Zhang ⋅ Sizhe Zhang ⋅ Junran Wu ⋅ Xiaoyue Tu ⋅ Ming Jin ⋅ Qingsong Wen ⋅ Lixing Chen ⋅ Pan Zhou ⋅ Lichao Sun

Dec 3, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Self-correction of large language models (LLMs) emerges as a critical component for enhancing their reasoning performance. Although various self-correction methods have been proposed, a comprehensive evaluation of these methods remains largely unexplored, and the question of whether LLMs can truly correct themselves is a matter of significant interest and concern. In this study, we introduce **CorrectBench**, a benchmark developed to evaluate the effectiveness of self-correction strategies, including intrinsic, external, and fine-tuned approaches, across three tasks: commonsense reasoning, mathematical reasoning, and code generation. Our findings reveal that: 1) Self-correction methods can improve accuracy, especially for complex reasoning tasks; 2) Mixing different self-correction strategies yields further improvements, though it reduces efficiency; 3) Reasoning LLMs (e.g., DeepSeek-V3) have limited optimization under additional self-correction methods and have high time costs. Interestingly, a comparatively simple chain-of-thought (CoT) baseline demonstrates competitive accuracy and efficiency. These results underscore the potential of self-correction to enhance LLM's reasoning performance while highlighting the ongoing challenge of improving their efficiency. Consequently, we advocate for further research focused on optimizing the balance between reasoning capabilities and operational efficiency.

View full details

Poster

SoMi-ToM: Evaluating Multi-Perspective Theory of Mind in Embodied Social Interactions

Xianzhe Fan ⋅ Xuhui Zhou ⋅ Chuanyang Jin ⋅ Kolby Nottingham ⋅ Hao Zhu ⋅ Maarten Sap

Dec 3, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Humans continuously infer the states, goals, and behaviors of others by perceiving their surroundings in dynamic, real-world social interactions. However, most Theory of Mind (ToM) benchmarks only evaluate static, text-based scenarios, which have a significant gap compared to real interactions. We propose the SoMi-ToM benchmark, designed to evaluate multi-perspective ToM in embodied multi-agent complex social interactions. This benchmark is based on rich multimodal interaction data generated by the interaction environment SoMi, covering diverse crafting goals and social relationships. Our framework supports multi-level evaluation: (1) first-person evaluation provides multimodal (visual, dialogue, action, etc.) input from a first-person perspective during a task for real-time state inference, (2) third-person evaluation provides complete third-person perspective video and text records after a task for goal and behavior inference. This evaluation method allows for a more comprehensive examination of a model's ToM capabilities from both the subjective immediate experience and the objective global observation. We constructed a challenging dataset containing 35 third-person perspective videos, 363 first-person perspective images, and 1225 expert-annotated multiple-choice questions (three options). On this dataset, we systematically evaluated the performance of human subjects and several state-of-the-art large vision-language models (LVLMs). The results show that LVLMs perform significantly worse than humans on SoMi-ToM: the average accuracy gap between humans and models is 40.1% in first-person evaluation and 26.4% in third-person evaluation. This indicates that future LVLMs need to further improve their ToM capabilities in embodied, complex social interactions.

View full details

Poster

OverLayBench: A Benchmark for Layout-to-Image Generation with Dense Overlaps

Bingnan Li ⋅ Chen-Yu Wang ⋅ Haiyang Xu ⋅ Xiang Zhang ⋅ Ethan Armand ⋅ Divyansh Srivastava ⋅ Shan Xiaojun ⋅ Zeyuan Chen ⋅ Jianwen Xie ⋅ Zhuowen Tu

Dec 3, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Despite steady progress in layout-to-image generation, current methods still struggle with layouts containing significant overlap between bounding boxes. We identify two primary challenges: (1) large overlapping regions and (2) overlapping instances with minimal semantic distinction. Through both qualitative examples and quantitative analysis, we demonstrate how these factors degrade generation quality. To systematically assess this issue, we introduce OverLayScore, a novel metric that quantifies the complexity of overlapping bounding boxes. Our analysis reveals that existing benchmarks are biased toward simpler cases with low OverLayScore values, limiting their effectiveness in evaluating models under more challenging conditions. To reduce this gap, we present OverLayBench, a new benchmark featuring balanced OverLayScore distributions and high-quality annotations. As an initial step toward improved performance on complex overlaps, we also propose CreatiLayout-AM, a model trained on a curated amodal mask dataset. Together, our contributions establish a foundation for more robust layout-to-image generation under realistic and challenging scenarios.

View full details

Poster

EmoNet-Face: An Expert-Annotated Benchmark for Synthetic Emotion Recognition

Christoph Schuhmann ⋅ Robert Kaczmarczyk ⋅ Gollam Rabby ⋅ Maurice Kraus ⋅ Felix Friedrich ⋅ Huu Nguyen ⋅ Kalyan Sai Krishna ⋅ Kourosh Nadi ⋅ Kristian Kersting ⋅ Sören Auer

Dec 3, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Effective human-AI interaction relies on AI's ability to accurately perceive and interpret human emotions. Current benchmarks for vision and vision-language models are severely limited, offering a narrow emotional spectrum that overlooks nuanced states (e.g., bitterness, intoxication) and fails to distinguish subtle differences between related feelings (e.g., shame vs. embarrassment). Existing datasets also often use uncontrolled imagery with occluded faces and lack demographic diversity, risking significant bias. To address these critical gaps, we introduce EmoNet Face, a comprehensive benchmark suite. EmoNet Face features: (1) A novel 40-category emotion taxonomy, meticulously derived from foundational research to capture finer details of human emotional experiences. (2) Three large-scale, AI-generated datasets (EmoNet HQ, Binary, and Big) with explicit, full-face expressions and controlled demographic balance across ethnicity, age, and gender. (3) Rigorous, multi-expert annotations for training and high-fidelity evaluation. (4) We build Empathic Insight Face, a model achieving human-expert-level performance on our benchmark. The publicly released EmoNet Face suite—taxonomy, datasets, and model—provides a robust foundation for developing and evaluating AI systems with a deeper understanding of human emotions.

View full details

Poster

SeasonBench-EA: A Multi-Source Benchmark for Seasonal Prediction and Numerical Model Post-Processing in East Asia

Mengxuan Chen ⋅ Li ⋅ Zou Ziheng ⋅ Fang Wang ⋅ Jinxiao Zhang ⋅ Runmin Dong ⋅ Juepeng Zheng ⋅ Haohuan Fu

Dec 3, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Seasonal-scale climate prediction plays a critical role in supporting agricultural planning, disaster prevention, and long-term decision making. In particular, reliable forecasts issued 1-6 months in advance are essential for early warning of flood and drought risks associated with precipitation during the East Asian summer monsoon season. However, while the use of machine learning techniques has advanced rapidly in weather and subseasonal-to-seasonal forecasting, partly driven by the availability of benchmark datasets, their application to seasonal-scale prediction remains limited. Existing seasonal prediction primarily relies on ensemble forecasts from numerical models, which, while physically grounded, are subject to biases and uncertainties at long lead times. Motivated by these challenges, we propose SeasonBench-EA, a benchmark dataset for seasonal prediction in East Asia region. It features multi-resolution, multi-source data with both regional and global coverage, integrating ERA5 reanalysis data and ensemble forecasts from multiple leading forecast centers. Beyond key atmospheric fields, the dataset also includes boundary-related variables, such as ocean state, soil and solar radiation, that are essential for capturing seasonal-scale atmospheric variability. Two tasks are defined and evaluated: 1) machine learning-based seasonal prediction using ERA5 reanalysis, and 2) post-processing of seasonal forecasts from numerical model ensembles. A suite of deterministic and probabilistic metrics is provided for tasks evaluation, along with a hindcast assessment focused on precipitation during the East Asian summer monsoon, aligned with model evaluation protocols used in operations. By offering a unified data and evaluation framework, SeasonBench-EA aims to promote the development and application of data-driven methods for seasonal prediction, a challenging yet highly impactful task with board implications for society and public well-being. Our benchmark is available at https://github.com/SauryChen/SeasonBench-EA.

View full details

Poster

Measuring Scientific Capabilities of Language Models with a Systems Biology Dry Lab

Haonan Duan ⋅ Stephen Lu ⋅ Caitlin F Harrigan ⋅ Nishkrit Desai ⋅ Jiarui Lu ⋅ Michał Koziarski ⋅ Leonardo Cotta ⋅ Chris Maddison

Dec 3, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Designing experiments and result interpretations are core scientific competencies, particularly in biology, where researchers perturb complex systems to uncover the underlying systems. Recent efforts to evaluate the scientific capabilities of large language models (LLMs) fail to test these competencies because wet-lab experimentation is prohibitively expensive: in expertise, time and equipment. We introduce SciGym, a first-in-class benchmark that assesses LLMs' iterative experiment design and analysis abilities in open-ended scientific discovery tasks. SciGym overcomes the challenge of wet-lab costs by running a dry lab of biological systems. These models, encoded in Systems Biology Markup Language, are efficient for generating simulated data, making them ideal testbeds for experimentation on realistically complex systems. We evaluated six frontier LLMs on 137 small systems, and released a total of 350 systems at https://huggingface.co/datasets/h4duan/scigym-sbml. Our evaluation shows that while more capable models demonstrated superior performance, all models' performance declined significantly as system complexity increased, suggesting substantial room for improvement in the scientific capabilities of LLM agents.

View full details

Poster

OmniBench: Towards The Future of Universal Omni-Language Models

Yizhi Li ⋅ Ge Zhang ⋅ Yinghao Ma ⋅ Ruibin Yuan ⋅ Zhu ⋅ Hangyu Guo ⋅ Yiming Liang ⋅ Jiaheng Liu ⋅ Noah Wang ⋅ Jian Yang ⋅ Siwei Wu ⋅ Xingwei Qu ⋅ Jinjie Shi ⋅ Xinyue Zhang ⋅ Zhenzhu Yang ⋅ Yidan WEN ⋅ Yanghai Wang ⋅ Shihao Li ⋅ ZHAO-XIANG ZHANG ⋅ Ruibo Liu ⋅ Emmanouil Benetos ⋅ Wenhao Huang ⋅ Chenghua Lin

Dec 3, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Recent advancements in multimodal large language models (MLLMs) have focused on integrating multiple modalities, yet their ability to simultaneously process and reason across different inputs remains underexplored. We introduce OmniBench, a novel benchmark designed to evaluate models’ ability to recognize, interpret, and reason across visual, acoustic, and textual inputs simultaneously. We define language models capable of such tri-modal processing as omni-language models (OLMs). OmniBench features high-quality human annotations that require integrated understanding across all modalities. Our evaluation reveals that: i) open-source OLMs show significant limitations in instruction-following and reasoning in tri-modal contexts; and ii) most baseline models perform poorly (below 50% accuracy) even with textual alternatives to image/audio inputs. To address these limitations, we develop OmniInstruct, an 96K-sample instruction tuning dataset for training OLMs. We advocate for developing more robust tri-modal integration techniques and training strategies to enhance OLM performance. Codes and data could be found at https://m-a-p.ai/OmniBench/.

View full details

Poster

PSMBench: A Benchmark and Dataset for Evaluating LLMs Extraction of Protocol State Machines from RFC Specifications

Zilin Shen ⋅ Xinyu Luo ⋅ Imtiaz Karim ⋅ Elisa Bertino

Dec 3, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Accurately extracting protocol-state machines (PSMs) from the long, densely written Request-for-Comments (RFC) standards that govern Internet‐scale communication remains a bottleneck for automated security analysis and protocol testing. In this paper, we introduce RFC2PSM, the first large-scale dataset that pairs 1,580 pages of cleaned RFC text with 108 manually validated states and 297 transitions covering 14 widely deployed protocols spanning the data-link, transport, session, and application layers. Built on this corpus, we propose PsmBench, a benchmark that (i) feeds chunked RFC to an LLM, (ii) prompts the model to emit a machine-readable PSM, and (iii) scores the output with structure-aware, semantic fuzzy-matching metrics that reward partially correct graphs.A comprehensive baseline study of nine state-of-the-art open and commercial LLMs reveals a persistent state–transition gap: models identify many individual states (up to $0.82$ F1) but struggle to assemble coherent transition graphs ($\leq 0.38$ F1), highlighting challenges in long-context reasoning, alias resolution, and action/event disambiguation. We release the dataset, evaluation code, and all model outputs as open-sourced, providing a fully reproducible starting point for future work on reasoning over technical prose and generating executable graph structures. RFC2PSM and PsmBench aim to catalyze cross-disciplinary progress toward LLMs that can interpret and verify the protocols that keep the Internet safe.

View full details

Poster

egoEMOTION: Egocentric Vision and Physiological Signals for Emotion and Personality Recognition in Real-world Tasks

Matthias Jammot ⋅ Björn Braun ⋅ Paul Streli ⋅ Rafael Wampfler ⋅ Christian Holz

Dec 3, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Understanding affect is central to anticipating human behavior, yet current egocentric vision benchmarks largely ignore the person’s emotional states that shape their decisions and actions. Existing tasks in egocentric perception focus on physical activities, hand-object interactions, and attention modeling—assuming neutral affect and uniform personality. This limits the ability of vision systems to capture key internal drivers of behavior. In this paper, we present egoEMOTION, the first dataset that couples egocentric visual and physiological signals with dense self-reports of emotion and personality across controlled and real-world scenarios. Our dataset includes over 50 hours of recordings from 43 participants, captured using Meta’s Project Aria glasses. Each session provides synchronized eye-tracking video, head-mounted photoplethysmography, inertial motion data, and physiological baselines for reference. Participants completed emotion-elicitation tasks and naturalistic activities while self-reporting their affective state using the Circumplex Model and Mikels’ Wheel as well as their personality via the Big Five model. We define three benchmark tasks: (1) continuous affect classification (valence, arousal, dominance); (2) discrete emotion classification; and (3) trait-level personality inference. We show that a classical learning-based method, as a simple baseline in real-world affect prediction, produces better estimates from signals captured on egocentric vision systems than processing physiological signals. Our dataset establishes emotion and personality as core dimensions in egocentric perception and opens new directions in affect-driven modeling of behavior, intent, and interaction.

View full details

Poster

CLEVER: A Curated Benchmark for Formally Verified Code Generation

Amitayush Thakur ⋅ Jasper Lee ⋅ George Tsoukalas ⋅ Meghana Sistla ⋅ Matthew Zhao ⋅ Stefan Zetzsche ⋅ Greg Durrett ⋅ Yisong Yue ⋅ Swarat Chaudhuri

Dec 3, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

We introduce ${\rm C{\small LEVER}}$, a high-quality, manually curated benchmark of 161 problems for end-to-end verified code generation in Lean. Each problem consists of (1) the task of generating a specification that matches a held-out ground-truth specification, and (2) the task of generating a Lean implementation that provably satisfies this specification. Unlike prior benchmarks,${\rm C{\small LEVER}}$ avoids test-case supervision, LLM-generated annotations, and specifications that leak implementation logic or allow vacuous solutions. All outputs are verified post-hoc using Lean's type checker to ensure machine-checkable correctness. We use ${\rm C{\small LEVER}}$ to evaluate several few-shot and agentic approaches based on state-of-the-art language models. These methods all struggle to achieve full verification, establishing it as a challenging frontier benchmark for program synthesis and formal reasoning. Our benchmark can be found on [GitHub](https://github.com/trishullab/clever) as well as [HuggingFace](https://huggingface.co/datasets/amitayusht/clever). All our evaluation code is also available [online](https://github.com/trishullab/clever-prover).

View full details

Poster

CosmoBench: A Multiscale, Multiview, Multitask Cosmology Benchmark for Geometric Deep Learning

Teresa Huang ⋅ Richard Stiskalek ⋅ Jun-Young Lee ⋅ Adrian Bayer ⋅ Charles Margossian ⋅ Christian Kragh Jespersen ⋅ Lucia Perez ⋅ Lawrence Saul ⋅ Francisco Villaescusa

Dec 3, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Cosmological simulations provide a wealth of data in the form of point clouds and directed trees. A crucial goal is to extract insights from this data that shed light on the nature and composition of the Universe. In this paper we introduce CosmoBench, a benchmark dataset curated from state-of-the-art cosmological simulations whose runs required more than 41 million core-hours and generated over two petabytes of data. CosmoBench is the largest dataset of its kind: it contains 34 thousand point clouds from simulations of dark matter halos and galaxies at three different length scales, as well as 25 thousand directed trees that record the formation history of halos on two different time scales. The data in CosmoBench can be used for multiple tasks---to predict cosmological parameters from point clouds and merger trees, to predict the velocities of individual halos and galaxies from their collective positions, and to reconstruct merger trees on finer time scales from those on coarser time scales. We provide multiple baselines on these tasks, some based on established approaches from cosmological modeling and others rooted in machine learning. For the latter, we study different approaches---from simple linear models that are minimally constrained by symmetries to much larger and more computationally-demanding models in deep learning, such as graph neural networks. We find that least-squares fits with a handful of invariant features sometimes outperform deep architectures with many more parameters and far longer training times. Still there remains tremendous potential to improve these baselines by combining machine learning and cosmological modeling in a more principled way, one that fully exploits the structure in the data. CosmoBench sets the stage for bridging cosmology and geometric deep learning at scale. We invite the community to push the frontier of scientific discovery by engaging with this challenging, high-impact dataset. The data and code are available at [this URL](https://cosmobench.streamlit.app/).

View full details

Poster

EgoBlind: Towards Egocentric Visual Assistance for the Blind

Junbin Xiao ⋅ Nanxin Huang ⋅ Hao Qiu ⋅ Zhulin Tao ⋅ Xun Yang ⋅ Richang Hong ⋅ Meng Wang ⋅ Angela Yao

Dec 3, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

We present EgoBlind, the first egocentric VideoQA dataset collected from blind individuals to evaluate the assistive capabilities of contemporary multimodal large language models (MLLMs). EgoBlind comprises 1,392 first-person videos from the daily lives of blind and visually impaired individuals. It also features 5,311 questions directly posed or verified by the blind to reflect their in-situation needs for visual assistance. Each question has an average of 3 manually annotated reference answers to reduce subjectiveness.Using EgoBlind, we comprehensively evaluate 16 advanced MLLMs and find that all models struggle. The best performers achieve an accuracy near 60\%, which is far behind human performance of 87.4\%. To guide future advancements, we identify and summarize major limitations of existing MLLMs in egocentric visual assistance for the blind and explore heuristic solutions for improvement. With these efforts, we hope that EgoBlind will serve as a foundation for developing effective AI assistants to enhance the independence of the blind and visually impaired. Data and code are available at \url{https://github.com/doc-doc/EgoBlind}.

View full details

Poster

SMMILE: An expert-driven benchmark for multimodal medical in-context learning

Melanie Rieff ⋅ Maya Varma ⋅ Ossian Rabow ⋅ Subathra Adithan ⋅ Julie Kim ⋅ Ken Chang ⋅ Hannah Lee ⋅ Nidhi Rohatgi ⋅ Christian Bluethgen ⋅ Mohamed Muneer ⋅ Jean-Benoit Delbrouck ⋅ Michael Moor

Dec 3, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Multimodal in-context learning (ICL) remains underexplored despite significant potential for domains such as medicine. Clinicians routinely encounter diverse, specialized tasks requiring adaptation from limited examples, such as drawing insights from a few relevant prior cases or considering a constrained set of differential diagnoses. While multimodal large language models (MLLMs) have shown advances in medical visual question answering (VQA), their ability to learn multimodal tasks from context is largely unknown. We introduce SMMILE, the first expert-driven multimodal ICL benchmark for medical tasks. Eleven medical experts curated problems, each including a multimodal query and multimodal in-context examples as task demonstrations. SMMILE encompasses 111 problems (517 question-image-answer triplets) covering 6 medical specialties and 13 imaging modalities. We further introduce SMMILE++, an augmented variant with 1038 permuted problems. A comprehensive evaluation of 15 MLLMs demonstrates that most models exhibit moderate to poor multimodal ICL ability in medical tasks. In open-ended evaluations, ICL contributes only an 8% average improvement over zero-shot on SMMILE and 9.4% on SMMILE++. We observe a susceptibility for irrelevant in-context examples: even a single noisy or irrelevant example can degrade performance by up to 9.5%. Moreover, we observe that MLLMs are affected by a recency bias, where placing the most relevant example last can lead to substantial performance improvements of up to 71%. Our findings highlight critical limitations and biases in current MLLMs when learning multimodal medical tasks from context. SMMILE is available at https://smmile-benchmark.github.io.

View full details

Poster

Scientists' First Exam: Probing Cognitive Abilities of MLLM via Perception, Understanding, and Reasoning

Yuhao Zhou ⋅ Yiheng Wang ⋅ Xuming He ⋅ Ruoyao Xiao ⋅ Zhiwei Li ⋅ Qiantai Feng ⋅ Zijie Guo ⋅ Yuejin Yang ⋅ Hao Wu ⋅ Wenxuan Huang ⋅ Jiaqi Wei ⋅ Dan Si ⋅ YAO XIUQI ⋅ Jia Bu ⋅ Haiwen Huang ⋅ Tianfan Fu ⋅ SHIXIANG TANG ⋅ Ben Fei ⋅ Dongzhan Zhou ⋅ Fenghua Ling ⋅ Yan Lu ⋅ Siqi Sun ⋅ Chenhui Li ⋅ Guanjie Zheng ⋅ Jiancheng Lv ⋅ Wenlong Zhang ⋅ LEI BAI

Dec 3, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Scientific discoveries increasingly rely on complex multimodal reasoning based on information-intensive scientific data and domain-specific expertise. Empowered by expert-level scientific benchmarks, scientific Multimodal Large Language Models (MLLMs) hold the potential to significantly enhance this discovery process in realistic workflows. However, current scientific benchmarks mostly focus on evaluating the knowledge understanding capabilities of MLLMs, leading to an inadequate assessment of their perception and reasoning abilities. To address this gap, we present the Scientists’ First Exam (SFE) benchmark, designed to evaluate the scientific cognitive capacities of MLLMs through three interconnected levels: *scientific signal perception*, *scientific attribute understanding*, *scientific comparative reasoning*. Specifically, SFE comprises 830 expert-verified VQA pairs across three question types, spanning 66 multimodal tasks across five high-value disciplines. Extensive experiments reveal that current *state-of-the-art* GPT-o3 and InternVL-3 achieve only 34.08% and 26.52% on SFE, highlighting significant room for MLLMs to improve in scientific realms. We hope the insights obtained in SFE will facilitate further developments in AI-enhanced scientific discoveries.

View full details

Poster

SonoGym: High Performance Simulation for Challenging Surgical Tasks with Robotic Ultrasound

Yunke Ao ⋅ Masoud Moghani ⋅ Mayank Mittal ⋅ Manish Prajapat ⋅ Luohong Wu ⋅ Frederic Giraud ⋅ Fabio Carrillo ⋅ Andreas Krause ⋅ Philipp Fürnstahl

Dec 3, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Ultrasound (US) is a widely used medical imaging modality due to its real-time capabilities, non-invasive nature, and cost-effectiveness. By reducing operator dependency and enhancing access to complex anatomical regions, robotic ultrasound can help improve workflow efficiency. Recent studies have demonstrated the potential of deep reinforcement learning (DRL) and imitation learning (IL) to enable more autonomous and intelligent robotic ultrasound navigation. However, the application of learning-based robotic ultrasound to computer-assisted surgical tasks, such as anatomy reconstruction and surgical guidance, remains largely unexplored. A key bottleneck for this is the lack of realistic and efficient simulation environments tailored to these tasks. In this work, we present SonoGym, a scalable simulation platform for robotic ultrasound, enabling parallel simulation across tens to hundreds of environments. Our framework supports realistic and real-time simulation of US data from CT-derived 3D models of the anatomy through both a physics-based and a Generative Adversarial Network (GAN) approach. Our framework enables the training of DRL and recent IL agents (vision transformers and diffusion policies) for relevant tasks in robotic orthopedic surgery by integrating common robotic platforms and orthopedic end effectors. We further incorporate submodular DRL---a recent method that handles history-dependent rewards---for anatomy reconstruction and safe reinforcement learning for surgery. Our results demonstrate successful policy learning across a range of scenarios, while also highlighting the limitations of current methods in clinically relevant environments. We believe our simulation can facilitate research in robot learning approaches for such challenging robotic surgery applications. Dataset, codes and videos are publicly available at https://sonogym.github.io/.

View full details

Poster

GenSpace: Benchmarking Spatially-Aware Image Generation

Zehan Wang ⋅ Jiayang Xu ⋅ Ziang Zhang ⋅ Tianyu Pang ⋅ Chao Du ⋅ Hengshuang Zhao ⋅ Zhou Zhao

Dec 3, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Humans can intuitively compose and arrange scenes in the 3D space for photography. However, can advanced AI image generators plan scenes with similar 3D spatial awareness when creating images from text or image prompts? We present GenSpace, a novel benchmark and evaluation pipeline to comprehensively assess the spatial awareness of current image generation models. Furthermore, standard evaluations using general Vision-Language Models (VLMs) frequently fail to capture the detailed spatial errors. To handle this challenge, we propose a specialized evaluation pipeline and metric, which reconstructs 3D scene geometry using multiple visual foundation models and provides a more accurate and human-aligned metric of spatial faithfulness. Our findings show that while AI models create visually appealing images and can follow general instructions, they struggle with specific 3D details like object placement, relationships, and measurements. We summarize three core limitations in the spatial perception of current state-of-the-art image generation models: 1) Object Perspective Understanding, 2) Egocentric-Allocentric Transformation, and 3) Metric Measurement Adherence, highlighting possible directions for improving spatial intelligence in image generation.

View full details

Poster

RoboCerebra: A Large-scale Benchmark for Long-horizon Robotic Manipulation Evaluation

Songhao Han ⋅ Boxiang Qiu ⋅ Yue Liao ⋅ Siyuan Huang ⋅ Chen Gao ⋅ Shuicheng Yan ⋅ Si Liu

Dec 3, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Recent advances in vision-language models (VLMs) have enabled instruction-conditioned robotic systems with improved generalization. However, most existing work focuses on reactive System 1 policies, underutilizing VLMs’ strengths in semantic reasoning and long-horizon planning. These System 2 capabilities—characterized by deliberative, goal-directed thinking—remain underexplored due to the limited temporal scale and structural complexity of current benchmarks. To address this gap, we introduce RoboCerebra, a benchmark for evaluating high-level reasoning in long-horizon robotic manipulation. RoboCerebra includes: (1) a large-scale simulation dataset with extended task horizons and diverse subtask sequences in household environments; (2) a hierarchical framework combining a high-level VLM planner with a low-level vision-language-action (VLA) controller; and (3) an evaluation protocol targeting planning, reflection, and memory through structured System 1–System 2 interaction. The dataset is constructed via a top-down pipeline, where GPT generates task instructions and decomposes them into subtask sequences. Human operators execute the subtasks in simulation, yielding high-quality trajectories with dynamic object variations. Compared to prior benchmarks, RoboCerebra features significantly longer action sequences and denser annotations. We further benchmark state-of-the-art VLMs as System 2 modules and analyze their performance across key cognitive dimensions, advancing the development of more capable and generalizable robotic planners.

View full details

Poster

TheAgentCompany: Benchmarking LLM Agents on Consequential Real World Tasks

Frank (Fangzheng) Xu ⋅ Yufan Song ⋅ Boxuan Li ⋅ Yuxuan Tang ⋅ Kritanjali Jain ⋅ Mengxue Bao ⋅ Zora Wang ⋅ Xuhui Zhou ⋅ Zhitong Guo ⋅ Murong Cao ⋅ Mingyang Yang ⋅ Hao Yang Lu ⋅ Amaad Martin ⋅ Zhe Su ⋅ Leander Maben ⋅ Raj Mehta ⋅ Wayne Chi ⋅ Lawrence Jang ⋅ Yiqing Xie ⋅ Shuyan Zhou ⋅ Graham Neubig

Dec 3, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

We interact with computers on an everyday basis, be it in everyday life or work, and many aspects of work can be done entirely with access to a computer and the Internet. At the same time, thanks to improvements in large language models (LLMs), there has also been a rapid development in AI agents that interact with and affect change in their surrounding environments. But how performant are AI agents at helping to accelerate or even autonomously perform work-related tasks? The answer to this question has important implications for both industry looking to adopt AI into their workflows, and for economic policy to understand the effects that adoption of AI may have on the labor market. To measure the progress of these LLM agents' performance on performing real-world professional tasks, in this paper, we introduce TheAgentCompany, an extensible benchmark for evaluating AI agents that interact with the world in similar ways to those of a digital worker: by browsing the Web, writing code, running programs, and communicating with other coworkers. We build a self-contained environment with internal web sites and data that mimics a small software company environment, and create a variety of tasks that may be performed by workers in such a company. We test baseline agents powered by both closed API-based and open-weights language models (LMs), and find that with the most competitive agent, 30% of the tasks can be completed autonomously. This paints a nuanced picture on task automation with LM agents -- in a setting simulating a real workplace, a good portion of simpler tasks could be solved autonomously, but more difficult long-horizon tasks are still beyond the reach of current systems. For more information and demos, refer to https://the-agent-company.com.

View full details

Poster

Time-IMM: A Dataset and Benchmark for Irregular Multimodal Multivariate Time Series

Ching Chang ⋅ Jeehyun Hwang ⋅ Yidan Shi ⋅ Haixin Wang ⋅ Wei Wang ⋅ Wen-Chih Peng ⋅ Tien-Fu Chen

Dec 3, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Time series data in real-world applications such as healthcare, climate modeling, and finance are often irregular, multimodal, and messy, with varying sampling rates, asynchronous modalities, and pervasive missingness.However, existing benchmarks typically assume clean, regularly sampled, unimodal data, creating a significant gap between research and real-world deployment.We introduce Time-IMM, a dataset specifically designed to capture cause-driven irregularity in multimodal multivariate time series.Time-IMM represents nine distinct types of time series irregularity, categorized into trigger-based, constraint-based, and artifact-based mechanisms.Complementing the dataset, we introduce IMM-TSF, a benchmark library for forecasting on irregular multimodal time series, enabling asynchronous integration and realistic evaluation.IMM-TSF includes specialized fusion modules, including a timestamp-to-text fusion module and a multimodality fusion module, which support both recency-aware averaging and attention-based integration strategies.Empirical results demonstrate that explicitly modeling multimodality on irregular time series data leads to substantial gains in forecasting performance.Time-IMM and IMM-TSF provide a foundation for advancing time series analysis under real-world conditions.The dataset is publicly available at \url{https://github.com/blacksnail789521/Time-IMM}, and the benchmark library can be accessed at \url{https://github.com/blacksnail789521/IMM-TSF}.

View full details

Poster

Bridging Crypto with ML-based Solvers: the SAT Formulation and Benchmarks

Xinhao Zheng ⋅ Xinhao Song ⋅ Bolin Qiu ⋅ Yang Li ⋅ Zhongteng Gui ⋅ Junchi Yan

Dec 3, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

The Boolean Satisfiability Problem (SAT) plays a crucial role in cryptanalysis, enabling tasks like key recovery and distinguisher construction. Conflict-Driven Clause Learning (CDCL) has emerged as the dominant paradigm in modern SAT solving, and machine learning has been increasingly integrated with CDCL-based SAT solvers to tackle complex cryptographic problems. However, the lack of a unified evaluation framework, inconsistent input formats, and varying modeling approaches hinder fair comparison. Besides, cryptographic SAT instances also differ structurally from standard SAT problems, and the absence of standardized datasets further complicates evaluation. To address these issues, we introduce SAT4CryptoBench, the first comprehensive benchmark for assessing machine learning–based solvers in cryptanalysis. SAT4CryptoBench provides diverse SAT datasets in both Arithmetic Normal Form (ANF) and Conjunctive Normal Form (CNF), spanning various algorithms, rounds, and key sizes. Our framework evaluates three levels of machine learning integration: standalone distinguishers for instance classification, heuristic enhancement for guiding solving strategies, and hyperparameter optimization for adapting to specific problem distributions. Experiments demonstrate that ANF-based networks consistently achieve superior performance over CNF-based networks in learning cryptographic features. Nonetheless, current ML techniques struggle to generalize across algorithms and instance sizes, with computational overhead potentially offsetting benefits on simpler cases. Despite this, ML-driven optimization strategies notably improve solver efficiency on cryptographic SAT instances. Besides, we propose BASIN, a bitwise solver taking plaintext-ciphertext bitstrings as inputs. Crucially, its superior performance on high-round problems highlights the importance of input modeling and the advantage of direct input representations for complex cryptographic structures.

View full details

Poster

Fixing It in Post: A Comparative Study of LLM Post-Training Data Quality and Model Performance

Aladin Djuhera ⋅ Swanand Kadhe ⋅ Syed Zawad ⋅ Farhan Ahmed ⋅ Heiko Ludwig ⋅ Holger Boche

Dec 3, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Recent work on large language models (LLMs) has increasingly focused on post-training and alignment with datasets curated to enhance instruction following, world knowledge, and specialized skills. However, most post-training datasets used in leading open- and closed-source LLMs remain inaccessible to the public, with limited information about their construction process. This lack of transparency has motivated the recent development of open-source post-training corpora. While training on these open alternatives can yield performance comparable to that of leading models, systematic comparisons remain challenging due to the significant computational cost of conducting them rigorously at scale, and are therefore largely absent. As a result, it remains unclear how specific samples, task types, or curation strategies influence downstream performance when assessing data quality. In this work, we conduct the first comprehensive side-by-side analysis of two prominent open post-training datasets: Tulu-3-SFT-Mix and SmolTalk. Using the Magpie framework, we annotate each sample with detailed quality metrics, including turn structure (single-turn vs. multi-turn), task category, input quality, and response quality, and we derive statistics that reveal structural and qualitative similarities and differences between the two datasets. Based on these insights, we design a principled curation recipe that produces a new data mixture, **TuluTalk**, which contains 14% fewer samples than either source dataset while matching or exceeding their performance on key benchmarks. Our findings offer actionable insights for constructing more effective post-training datasets that improve model performance within practical resource limits. To support future research, we publicly release both the annotated source datasets and our curated TuluTalk mixture.

View full details

Poster

OrthoLoC: UAV 6-DoF Localization and Calibration Using Orthographic Geodata

Oussema Dhaouadi ⋅ Riccardo Marin ⋅ Johannes Meier ⋅ Jacques Kaiser ⋅ Daniel Cremers

Dec 3, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Accurate visual localization from aerial views is a fundamental problem with applications in mapping, large-area inspection, and search-and-rescue operations. In many scenarios, these systems require high-precision localization while operating with limited resources (e.g., no internet connection or GNSS/GPS support), making large image databases or heavy 3D models impractical. Surprisingly, little attention has been given to leveraging orthographic geodata as an alternative paradigm, which is lightweight and increasingly available through free releases by governmental authorities (e.g., the European Union). To fill this gap, we propose OrthoLoC, the first large-scale dataset comprising 16,425 UAV images from Germany and the United States with multiple modalities. The dataset addresses domain shifts between UAV imagery and geospatial data. Its paired structure enables fair benchmarking of existing solutions by decoupling image retrieval from feature matching, allowing isolated evaluation of localization and calibration performance. Through comprehensive evaluation, we examine the impact of domain shifts, data resolutions, and covisibility on localization accuracy. Finally, we introduce a refinement technique called AdHoP, which can be integrated with any feature matcher, improving matching by up to 95% and reducing translation error by up to 63%. The dataset and code are available at: https://deepscenario.github.io/OrthoLoC .

View full details

Poster

MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models

Chaoyou Fu ⋅ Peixian Chen ⋅ Yunhang Shen ⋅ Yulei Qin ⋅ Mengdan Zhang ⋅ Xu Lin ⋅ Jinrui Yang ⋅ Xiawu Zheng ⋅ Ke Li ⋅ Xing Sun ⋅ Yunsheng Wu ⋅ Rongrong Ji ⋅ Caifeng Shan ⋅ Ran He

Dec 3, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Multimodal Large Language Model (MLLM) relies on the powerful LLM to perform multimodal tasks, showing amazing emergent abilities in recent studies, such as writing poems based on an image. However, it is difficult for these case studies to fully reflect the performance of MLLM, lacking a comprehensive evaluation. In this paper, we fill in this blank, presenting the first comprehensive MLLM Evaluation benchmark MME. It measures both perception and cognition abilities on a total of 14 subtasks. In order to avoid data leakage that may arise from direct use of public datasets for evaluation, the annotations of instruction-answer pairs are all manually designed. The concise instruction design allows us to fairly compare MLLMs, instead of struggling in prompt engineering. Besides, with such an instruction, we can also easily carry out quantitative statistics. A total of 30 advanced MLLMs are comprehensively evaluated on our MME, which not only suggests that existing MLLMs still have a large room for improvement, but also reveals the potential directions for the subsequent model optimization. The data are released at the project page: https://github.com/BradyFU/Awesome-Multimodal-Large-Language-Models/tree/Evaluation.

View full details

Poster

OS-Harm: A Benchmark for Measuring Safety of Computer Use Agents

Thomas Kuntz ⋅ Agatha Duzan ⋅ Hao Zhao ⋅ Francesco Croce ⋅ Zico Kolter ⋅ Nicolas Flammarion ⋅ Maksym Andriushchenko

Dec 3, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Computer use agents are LLM-based agents that can directly interact with a graphical user interface, by processing screenshots or accessibility trees. While these systems are gaining popularity, their safety has been largely overlooked, despite the fact that evaluating and understanding their potential for harmful behavior is essential for widespread adoption. To address this gap, we introduce OS-Harm, a new benchmark for measuring safety of computer use agents. OS-Harm is built on top of the OSWorld environment (Xie et al., 2024) and aims to test models across three categories of harm: deliberate user misuse, prompt injection attacks, and model misbehavior. To cover these cases, we create 150 tasks that span several types of safety violations (harassment, copyright infringement, disinformation, data exfiltration, etc.) and require the agent to interact with a variety of OS applications (email client, code editor, browser, etc.). Moreover, we propose an automated judge to evaluate both accuracy and safety of agents that achieves high agreement with human annotations (0.76 and 0.79 F1 score). We evaluate computer use agents based on a range of frontier models—such as `o4-mini`, `Claude 3.7 Sonnet`, `Gemini 2.5 Pro`—and provide insights into their safety. In particular, all models tend to directly comply with many deliberate misuse queries, are relatively vulnerable to static prompt injections, and occasionally perform unsafe actions. The OS-Harm benchmark is available at https://github.com/tml-epfl/os-harm.

View full details

Poster

OceanBench: A Benchmark for Data-Driven Global Ocean Forecasting systems

Anass El Aouni ⋅ Quentin Gaudel ⋅ J. Emmanuel Johnson ⋅ REGNIER Charly ⋅ Julien Le Sommer ⋅ Simon van Gennip ⋅ ronan fablet ⋅ Marie Drevillon ⋅ Yann DRILLET ⋅ Pierre Le Traon

Dec 3, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Data-driven approaches, particularly those based on deep learning, are rapidly advancing Earth system modeling. However, their application to ocean forecasting remains limited despite the ocean's pivotal role in climate regulation and marine ecosystems. To address this gap, we present OceanBench, a benchmark designed to evaluate and accelerate global short-range (1–10 days) data-driven ocean forecasting.OceanBench is constructed from a curated dataset comprising first-guess trajectories, nowcasts, and atmospheric forcings from operational physical ocean models, typically unavailable in public datasets due to assimilation cycles. Matched observational data are also included, enabling realistic evaluation in an operational-like forecasting framework.The benchmark defines three complementary evaluation tracks: (i) Model-to-Reanalysis, where models are compared against the reanalysis dataset commonly used for training; (ii) Model-to-Analysis, assessing generalization to a higher-resolution physical analysis; and (iii) Model-to-Observations, Intercomparison and Validation (IV-TT) CLASS-4 evaluation against independent observational data. The first two tracks are further supported by process-oriented diagnostics to assess the dynamical consistency and physical plausibility of forecasts.OceanBench includes key ocean variables: sea surface height, temperature, salinity, and currents, along with standardized metrics grounded in physical oceanography. Baseline comparisons with operational systems and state-of-the-art deep learning models are provided. All data, code, and evaluation protocols are openly available at https://github.com/mercator-ocean/oceanbench, establishing OceanBench as a foundation for reproducible and rigorous research in data-driven ocean forecasting.

View full details

Poster

Situat3DChange: Situated 3D Change Understanding Dataset for Multimodal Large Language Model

Ruiping Liu ⋅ Junwei Zheng ⋅ Yufan Chen ⋅ Zirui Wang ⋅ Kunyu Peng ⋅ Kailun Yang ⋅ Jiaming Zhang ⋅ Marc Pollefeys ⋅ Rainer Stiefelhagen

Dec 3, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Physical environments and circumstances are fundamentally dynamic, yet current 3D datasets and evaluation benchmarks tend to concentrate on either dynamic scenarios or dynamic situations in isolation, resulting in incomplete comprehension. To overcome these constraints, we introduce Situat3DChange, an extensive dataset supporting three situation-aware change understanding tasks following the perception-action model: 121K question-answer pairs, 36K change descriptions for perception tasks, and 17K rearrangement instructions for the action task. To construct this large-scale dataset, Situat3DChange leverages 11K human observations of environmental changes to establish shared mental models and shared situational awareness for human-AI collaboration. These observations, enriched with egocentric and allocentric perspectives as well as categorical and coordinate spatial relations, are integrated using an LLM to support understanding of situated changes. To address the challenge of comparing pairs of point clouds from the same scene with minor changes, we propose SCReasoner, an efficient 3D MLLM approach that enables effective point cloud comparison with minimal parameter overhead and no additional tokens required for the language decoder. Comprehensive evaluation on Situat3DChange tasks highlights both the progress and limitations of MLLMs in dynamic scene and situation understanding. Additional experiments on data scaling and cross-domain transfer demonstrate the task-agnostic effectiveness of using Situat3DChange as a training dataset for MLLMs. The established dataset and source code are publicly available at: https://github.com/RuipingL/Situat3DChange.

View full details

Poster

NaturalReasoning: Reasoning in the Wild with 2.8M Challenging Questions

Weizhe Yuan ⋅ Jane Yu ⋅ Song Jiang ⋅ Karthik Padthe ⋅ Yang Li ⋅ Dong Wang ⋅ Ilia Kulikov ⋅ Kyunghyun Cho ⋅ Yuandong Tian ⋅ Jason Weston ⋅ Xian Li

Dec 3, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Scaling reasoning capabilities beyond traditional domains such as math and coding is hindered by the lack of diverse and high-quality questions. To overcome this limitation, we introduce a scalable approach for generating diverse and challenging reasoning questions, accompanied by reference answers. We present NaturalReasoning, a comprehensive dataset comprising 2.8 million questions that span multiple domains, including STEM fields (e.g., Physics, Computer Science), Economics, Social Sciences, and more. We demonstrate the utility of the questions in NaturalReasoning through knowledge distillation experiments which show that NaturalReasoning can effectively elicit and transfer reasoning capabilities from a strong teacher model. Furthermore, we demonstrate that NaturalReasoning is also effective for unsupervised self-training using external reward models or self-rewarding.

View full details

Poster

Intend to Move: A Multimodal Dataset for Intention-Aware Human Motion Understanding

Ryo Umagami ⋅ Liu Yue ⋅ Xuangeng Chu ⋅ Ryuto Fukushima ⋅ Tetsuya Narita ⋅ Yusuke Mukuta ⋅ Tomoyuki Takahata ⋅ Jianfei Yang ⋅ Tatsuya Harada

Dec 3, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Human motion is inherently intentional, yet most motion modeling paradigms focus on low-level kinematics, overlooking the semantic and causal factors that drive behavior. Existing datasets further limit progress: they capture short, decontextualized actions in static scenes, providing little grounding for embodied reasoning. To address these limitations, we introduce $\textit{Intend to Move (I2M)}$, a large-scale, multimodal dataset for intention-grounded motion modeling. I2M contains 10.1 hours of two-person 3D motion sequences recorded in dynamic realistic home environments, accompanied by multi-view RGB-D video, 3D scene geometry, and language annotations of each participant’s evolving intentions. Benchmark experiments reveal a fundamental gap in current motion models: they fail to translate high-level goals into physically and socially coherent motion. I2M thus serves not only as a dataset but as a benchmark for embodied intelligence, enabling research on models that can reason about, predict, and act upon the ``why'' behind human motion.

View full details

Poster

LawShift: Benchmarking Legal Judgment Prediction Under Statute Shifts

Zhuo Han ⋅ Yi Yang ⋅ Yi Feng ⋅ Wanhong Huang ⋅ Ding Xuxing ⋅ Chuanyi Li ⋅ Jidong Ge ⋅ Vincent Ng

Dec 3, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Legal Judgment Prediction (LJP) seeks to predict case outcomes given available case information, offering practical value for both legal professionals and laypersons. However, a key limitation of existing LJP models is their limited adaptability to statutory revisions. Current SOTA models are neither designed nor evaluated for statutory revisions. To bridge this gap, we introduce LawShift, a benchmark dataset for evaluating LJP under statutory revisions. Covering 31 fine-grained change types, LawShift enables systematic assessment of SOTA models' ability to handle legal changes. We evaluate five representative SOTA models on LawShift, uncovering significant limitations in their response to legal updates. Our findings show that model architecture plays a critical role in adaptability, offering actionable insights and guiding future research on LJP in dynamic legal contexts.

View full details

Poster

MLE-Dojo: Interactive Environments for Empowering LLM Agents in Machine Learning Engineering

Rushi Qiang ⋅ Yuchen Zhuang ⋅ Yinghao Li ⋅ Dingu Sagar V K ⋅ Rongzhi Zhang ⋅ ChangHao Li ⋅ Ian Wong ⋅ Sherry Yang ⋅ Percy Liang ⋅ Chao Zhang ⋅ Bo Dai

Dec 3, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

We introduce MLE-Dojo, a Gym-style framework for systematically reinforcement learning, evaluating, and improving autonomous large language model (LLM) agents in iterative machine learning engineering (MLE) workflows. Unlike existing benchmarks that primarily rely on static datasets or single-attempt evaluations, MLE-Dojo provides an interactive environment enabling agents to iteratively experiment, debug, and refine solutions through structured feedback loops. Built upon 200+ real-world Kaggle challenges, MLE-Dojo covers diverse, open-ended MLE tasks carefully curated to reflect realistic engineering scenarios such as data processing, architecture search, hyperparameter tuning, and code debugging. Its fully executable environment supports comprehensive agent training via both supervised fine-tuning and reinforcement learning, facilitating iterative experimentation, realistic data sampling, and real-time outcome verification. Extensive evaluations of eight frontier LLMs reveal that while current models achieve meaningful iterative improvements, they still exhibit significant limitations in autonomously generating long-horizon solutions and efficiently resolving complex errors. Furthermore, MLE-Dojo’s flexible and extensible architecture seamlessly integrates diverse data sources, tools, and evaluation protocols, uniquely enabling model-based agent tuning and promoting interoperability, scalability, and reproducibility. We open-source our framework and benchmarks to foster community-driven innovation towards next-generation MLE agents.

View full details

Poster

DCcluster-Opt: Benchmarking Dynamic Multi-Objective Optimization for Geo-Distributed Data Center Workloads

Antonio Guillen-Perez ⋅ Avisek Naug ⋅ Vineet Gundecha ⋅ Sahand Ghorbanpour ⋅ Ricardo Luna Gutierrez ⋅ Ashwin Ramesh Babu ⋅ Munther Salim ⋅ Shubhanker Banerjee ⋅ Eoin Essink ⋅ Damien Fay ⋅ Soumyendu Sarkar

Dec 3, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

The increasing energy demands and carbon footprint of large-scale AI require intelligent workload management in globally distributed data centers. Yet progress is limited by the absence of benchmarks that realistically capture the interplay of time-varying environmental factors (grid carbon intensity, electricity prices, weather), detailed data center physics (CPUs, GPUs, memory, HVAC energy), and geo-distributed network dynamics (latency and transmission costs). To bridge this gap, we present DCcluster-Opt: an open-source, high-fidelity simulation benchmark for sustainable, geo-temporal task scheduling. DCcluster-Opt combines curated real-world datasets, including AI workload traces, grid carbon intensity, electricity markets, weather across 20 global regions, cloud transmission costs, and empirical network delay parameters with physics-informed models of data center operations, enabling rigorous and reproducible research in sustainable computing. It presents a challenging scheduling problem where a top-level coordinating agent must dynamically reassign or defer tasks that arrive with resource and service-level agreement requirements across a configurable cluster of data centers to optimize multiple objectives. The environment also models advanced components such as heat recovery. A modular reward system enables an explicit study of trade-offs among carbon emissions, energy costs, service level agreements, and water use. It provides a Gymnasium API with baseline controllers, including reinforcement learning and rule-based strategies, to support reproducible ML research and a fair comparison of diverse algorithms. By offering a realistic, configurable, and accessible testbed, DCcluster-Opt accelerates the development and validation of next-generation sustainable computing solutions for geo-distributed data centers.

View full details

Poster

LabUtopia: High-Fidelity Simulation and Hierarchical Benchmark for Scientific Embodied Agents

Rui Li ⋅ Zixuan Hu ⋅ Wenxi Qu ⋅ Jinouwen Zhang ⋅ Zhenfei Yin ⋅ Sha Zhang ⋅ Xuantuo Huang ⋅ Hanqing Wang ⋅ Tai WANG ⋅ Jiangmiao Pang ⋅ Wanli Ouyang ⋅ LEI BAI ⋅ Wangmeng Zuo ⋅ LINGYU DUAN ⋅ Dongzhan Zhou ⋅ SHIXIANG TANG

Dec 3, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Scientific embodied agents play a crucial role in modern laboratories by automating complex experimental workflows.Compared to typical household environments, laboratory settings impose significantly higher demands on perception of physical-chemical transformations and long-horizon planning, making them an ideal testbed for advancing embodied intelligence.However, its development has been long hampered by the lack of suitable simulator and benchmarks.In this paper, we address this gap by introducing LabUtopia, a comprehensive simulation and benchmarking suite designed to facilitate the development of generalizable, reasoning-capable embodied agents in laboratory settings. Specifically, it integrates i) LabSim, a high-fidelity simulator supporting multi-physics and chemically meaningful interactions; ii) LabScene, a scalable procedural generator for diverse scientific scenes; and iii) LabBench, a hierarchical benchmark spanning five levels of complexity from atomic actions to long-horizon mobile manipulation. LabUtopia supports 30 distinct tasks and includes more than 200 scene and instrument assets, enabling large-scale training and principled evaluation in high-complexity environments.We demonstrate that LabUtopia offers a powerful platform for advancing the integration of perception, planning, and control in scientific-purpose agents and provides a rigorous testbed for exploring the practical capabilities and generalization limits of embodied intelligence in future research. Project web page: https://rui-li023.github.io/labutopia-site/

View full details

Poster

GraphLand: Evaluating Graph Machine Learning Models on Diverse Industrial Data

Gleb Bazhenov ⋅ Oleg Platonov ⋅ Liudmila Prokhorenkova

Dec 3, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Although data that can be naturally represented as graphs is widespread in real-world applications across diverse industries, popular graph ML benchmarks for node property prediction only cover a surprisingly narrow set of data domains, and graph neural networks (GNNs) are often evaluated on just a few academic citation networks. This issue is particularly pressing in light of the recent growing interest in designing graph foundation models. These models are supposed to be able to transfer to diverse graph datasets from different domains, and yet the proposed graph foundation models are often evaluated on a very limited set of datasets from narrow applications. To alleviate this issue, we introduce GraphLand: a benchmark of 14 diverse graph datasets for node property prediction from a range of different industrial applications. GraphLand allows evaluating graph ML models on a wide range of graphs with diverse sizes, structural characteristics, and feature sets, all in a unified setting. Further, GraphLand allows investigating such previously underexplored research questions as how realistic temporal distributional shifts under transductive and inductive settings influence graph ML model performance. To mimic realistic industrial settings, we use GraphLand to compare GNNs with gradient-boosted decision trees (GBDT) models that are popular in industrial applications and show that GBDTs provided with additional graph-based input features can sometimes be very strong baselines. Further, we evaluate currently available general-purpose graph foundation models and find that they fail to produce competitive results on our proposed datasets.

View full details

Poster

GSO: Challenging Software Optimization Tasks for Evaluating SWE-Agents

Manish Shetty ⋅ Naman Jain ⋅ Jinjian Liu ⋅ Vijay Kethanaboyina ⋅ Koushik Sen ⋅ Ion Stoica

Dec 3, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Developing high-performance software is a complex task that requires specialized expertise. We introduce GSO, a benchmark for evaluating language models' capabilities in developing high-performance software.We develop an automated pipeline that generates and executes performance tests to analyze repository commit histories to identify 102 challenging optimization tasks across 10 codebases, spanning diverse domains and programming languages.An agent is provided with a codebase and performance test as a precise specification, and tasked to improve the runtime efficiency, which is measured against the expert developer optimization.Our quantitative evaluation reveals that leading SWE-Agents struggle significantly, achieving less than 5% success rate, with limited improvements even with inference-time scaling.Our qualitative analysis identifies key failure modes, including difficulties with low-level languages, practicing lazy optimization strategies, and challenges in accurately localizing bottlenecks.We release the code and artifacts of our benchmark along with agent trajectories to enable future research.

View full details

Poster

WASP: Benchmarking Web Agent Security Against Prompt Injection Attacks

Ivan Evtimov ⋅ Arman Zharmagambetov ⋅ Aaron Grattafiori ⋅ Chuan Guo ⋅ Kamalika Chaudhuri

Dec 3, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Autonomous UI agents powered by AI have tremendous potential to boost human productivity by automating routine tasks such as filing taxes and paying bills. However, a major challenge in unlocking their full potential is security, which is exacerbated by the agent's ability to take action on their user's behalf. Existing tests for prompt injections in web agents either over-simplify the threat by testing unrealistic scenarios or giving the attacker too much power, or look at single-step isolated tasks. To more accurately measure progress for secure web agents, we introduce WASP – a new publicly available benchmark for end-to-end evaluation of Web Agent Security against Prompt Injection attacks. Evaluating with WASP shows that even top-tier AI models, including those with advanced reasoning capabilities, can be deceived by simple, low-effort human-written injections in very realistic scenarios. Our end-to-end evaluation reveals a previously unobserved insight: while attacks partially succeed in up to 86% of the case, even state-of-the-art agents often struggle to fully complete the attacker goals – highlighting the current state of security by incompetence. Code and data are available at https://github.com/facebookresearch/wasp.

View full details

Poster

PartNeXt: A Next-Generation Dataset for Fine-Grained and Hierarchical 3D Part Understanding

Penghao Wang ⋅ Yiyang He ⋅ Xin Lv ⋅ Yukai Zhou ⋅ Lan Xu ⋅ Jingyi Yu ⋅ Jiayuan Gu

Dec 3, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Understanding objects at the level of their constituent parts is fundamental to advancing computer vision, graphics, and robotics. While datasets like PartNet have driven progress in 3D part understanding, their reliance on untextured geometries and expert-dependent annotation limits scalability and usability. We introduce PartNeXt, a next-generation dataset addressing these gaps with over 23000 high-quality, textured 3D models annotated with fine-grained, hierarchical part labels across 50 categories. We benchmark PartNeXt on two tasks: (1) class-agnostic part segmentation, where state-of-the-art methods (e.g., PartField, SAMPart3D) struggle with fine-grained and leaf-level parts, and (2) 3D part-centric question answering, a new benchmark for 3D-LLMs that reveals significant gaps in open-vocabulary part grounding. Additionally, training Point-SAM on PartNeXt yields substantial gains over PartNet, underscoring the dataset’s superior quality and diversity. By combining scalable annotation, texture-aware labels, and multi-task evaluation, PartNeXt opens new avenues for research in structured 3D understanding.

View full details

Poster

EgoVid-5M: A Large-Scale Video-Action Dataset for Egocentric Videos Generation

Xiaofeng Wang ⋅ Kang Zhao ⋅ Feng Liu ⋅ Jiayu Wang ⋅ Guosheng Zhao ⋅ Xiaoyi Bao ⋅ Zheng Zhu ⋅ Yingya Zhang

Dec 3, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Video generation has emerged as a promising tool for world simulation, leveraging visual data to replicate real-world environments. Within this context, egocentric video generation, which centers on the human perspective, holds significant potential for enhancing applications in virtual reality, augmented reality, and gaming. However, the generation of egocentric videos presents substantial challenges due to the dynamic nature of first-person viewpoints, the intricate diversity of actions, and the complex variety of scenes encountered. Existing datasets are inadequate for addressing these challenges effectively. To bridge this gap, we present EgoVid-5M, the first high-quality dataset specifically curated for egocentric video generation. EgoVid-5M encompasses over 5 million egocentric video clips and is enriched with detailed action annotations, including fine-grained kinematic control and high-level textual descriptions. To ensure the integrity and usability of the dataset, we implement a sophisticated data cleansing pipeline designed to maintain frame consistency, action coherence, and motion smoothness under egocentric conditions. Furthermore, we introduce EgoDreamer, which is capable of generating egocentric videos driven simultaneously by action descriptions and kinematic control signals. The EgoVid-5M dataset, associated action annotations, and all data cleansing metadata will be released for the advancement of research in egocentric video generation.

View full details

Poster

VMDT: Decoding the Trustworthiness of Video Foundation Models

Yujin Potter ⋅ Zhun Wang ⋅ Nicholas Crispino ⋅ Kyle Montgomery ⋅ Alexander Xiong ⋅ Ethan Chang ⋅ Francesco Pinto ⋅ Yuqi Chen ⋅ Rahul Gupta ⋅ Morteza Ziyadi ⋅ Christos Christodoulopoulos ⋅ Bo Li ⋅ Chenguang Wang ⋅ Dawn Song

Dec 3, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

As foundation models become more sophisticated, ensuring their trustworthiness becomes increasingly critical; yet, unlike text and image, the video modality still lacks comprehensive trustworthiness benchmarks. We introduce VMDT (Video-Modal DecodingTrust), the first unified platform for evaluating text-to-video (T2V) and video-to-text (V2T) models across five key trustworthiness dimensions: safety, hallucination, fairness, privacy, and adversarial robustness. Through our extensive evaluation of 7 T2V models and 19 V2T models using VMDT, we uncover several significant insights. For instance, all open-source T2V models evaluated fail to recognize harmful queries and often generate harmful videos, while exhibiting higher levels of unfairness compared to image modality models. In V2T models, unfairness and privacy risks rise with scale, whereas hallucination and adversarial robustness improve---though overall performance remains low. Uniquely, safety shows no correlation with model size, implying that factors other than scale govern current safety levels. Our findings highlight the urgent need for developing more robust and trustworthy video foundation models, and VMDT provides a systematic framework for measuring and tracking progress toward this goal. The code is available at https://sunblaze-ucb.github.io/VMDT-page/.

View full details

Poster

AgMMU: A Comprehensive Agricultural Multimodal Understanding Benchmark

Aruna Gauba ⋅ Irene Pi ⋅ Yunze Man ⋅ Ziqi Pang ⋅ Vikram Adve ⋅ Yu-Xiong Wang

Dec 3, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

We present **AgMMU**, a challenging real‑world benchmark for evaluating and advancing vision-language models (VLMs) in the knowledge‑intensive domain of agriculture. Unlike prior datasets that rely on crowdsourced prompts, AgMMU is distilled from 116,231 authentic dialogues between everyday growers and USDA-authorized Cooperative Extension experts. Through a three‑stage pipeline: automated knowledge extraction, QA generation, and human verification, we construct (i) AgMMU, an evaluation set of 746 multiple‑choice questions (MCQs) and 746 open‑ended questions (OEQs), and (ii) AgBase, a development corpus of 57,079 multimodal facts covering five high-stakes agricultural topics: insect identification, species identification, disease categorization, symptom description, and management instruction. AgMMU has three key advantages:- **Authentic \& Expert‑Verified**: All facts, images, and answers originate from real farmer and gardener inquiries answered by credentialed specialists, ensuring high‑fidelity agricultural knowledge.- **Complete Development Suite**: AgMMU uniquely couples a dual‑format evaluation benchmark (MCQ and OEQ) with AgBase, a large‑scale training set, enabling both rigorous assessment and targeted improvement of VLMs.- **Knowledge‑intensive Challenge**: Our tasks demand the synergy of nuanced visual perception and domain expertise, exposing fundamental limitations of current general‑purpose models and charting a path toward robust, application‑ready agricultural AI.Benchmarking 12 leading VLMs reveals pronounced gaps in fine‑grained perception and factual grounding. Open‑sourced models trail after proprietary ones by a wide margin. Simple fine‑tuning on AgBase boosts open-sourced model performance on challenging OEQs for up to 11.6\% on average, narrowing this gap and also motivating future research to propose better strategies in knowledge extraction and distillation from AgBase. We hope AgMMU stimulates research on domain‑specific knowledge integration and trustworthy decision support in agriculture AI development.

View full details

Poster

AGC-Drive: A Large-Scale Dataset for Real-World Aerial-Ground Collaboration in Driving Scenarios

Yunhao Hou ⋅ Bochao Zou ⋅ Min Zhang ⋅ 燃陈 ⋅ Shangdong Yang ⋅ Yanmei Zhang ⋅ Junbao Zhuo ⋅ Siheng Chen ⋅ Jiansheng Chen ⋅ Huimin Ma

Dec 3, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

By sharing information across multiple agents, collaborative perception helps autonomous vehicles mitigate occlusions and improve overall perception accuracy. While most previous work focus on vehicle-to-vehicle and vehicle-to-infrastructure collaboration, with limited attention to aerial perspectives provided by UAVs, which uniquely offer dynamic, top-down views to alleviate occlusions and monitor large-scale interactive environments. A major reason for this is the lack of high-quality datasets for aerial-ground collaborative scenarios. To bridge this gap, we present AGC-Drive, the first large-scale real-world dataset for Aerial-Ground Cooperative 3D perception. The data collection platform consists of two vehicles, each equipped with five cameras and one LiDAR sensor, and one UAV carrying a forward-facing camera and a LiDAR sensor, enabling comprehensive multi-view and multi-agent perception. Consisting of approximately 80K LiDAR frames and 360K images, the dataset covers 14 diverse real-world driving scenarios, including urban roundabouts, highway tunnels, and on/off ramps. Notably, 17\% of the data comprises dynamic interaction events, including vehicle cut-ins, cut-outs, and frequent lane changes. AGC-Drive contains 350 scenes, each with approximately 100 frames and fully annotated 3D bounding boxes covering 13 object categories. We provide benchmarks for two 3D perception tasks: vehicle-to-vehicle collaborative perception and vehicle-to-UAV collaborative perception. Additionally, we release an open-source toolkit, including spatiotemporal alignment verification tools, multi-agent visualization systems, and collaborative annotation utilities. The dataset and code are available at https://github.com/PercepX/AGC-Drive.

View full details

Poster

BOOM: Benchmarking Out-Of-distribution Molecular Property Predictions of Machine Learning Models

Evan Antoniuk ⋅ Shehtab Zaman ⋅ Tal Ben-Nun ⋅ Peggy Li ⋅ James Diffenderfer ⋅ Busra Sahin ⋅ Obadiah Smolenski ⋅ Everett Grethel ⋅ Tim Hsu ⋅ Anna Hiszpanski ⋅ Kenneth Chiu ⋅ Bhavya Kailkhura ⋅ Brian Van Essen

Dec 3, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Data-driven molecular discovery leverages artificial intelligence/machine learning (AI/ML) and generative modeling to filter and design novel molecules. Discovering novel molecules requires accurate out-of-distribution (OOD) predictions, but ML models struggle to generalize OOD. Currently, no systematic benchmarks exist for molecular OOD prediction tasks. We present BOOM, $\textbf{b}$enchmarks for $\textbf{o}$ut-$\textbf{o}f$-$\textbf{d}$istribution $\textbf{m}$olecular property predictions: a chemically-informed benchmark for OOD performance on common molecular property prediction tasks. We evaluate over 150 model-task combinations to benchmark deep learning models on OOD performance. Overall, we find that no existing model achieves strong generalization across all tasks: even the top-performing model exhibited an average OOD error 3$\times$ higher than in-distribution. Current chemical foundation models do not show strong OOD extrapolation, while models with high inductive bias can perform well on OOD tasks with simple, specific properties. We perform extensive ablation experiments, highlighting how data generation, pre-training, hyperparameter optimization, model architecture, and molecular representation impact OOD performance. Developing models with strong OOD generalization is a new frontier challenge in chemical ML. This open-source benchmark is available at https://github.com/FLASK-LLNL/BOOM

View full details

Poster

Time Travel is Cheating: Going Live with DeepFund for Real-Time Fund Investment Benchmarking

Changlun Li ⋅ Yao SHI ⋅ Chen Wang ⋅ Qiqi Duan ⋅ Runke RUAN ⋅ Weijie Huang ⋅ Haonan Long ⋅ Lijun Huang ⋅ Nan Tang ⋅ Yuyu Luo

Dec 3, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Large Language Models (LLMs) have demonstrated notable capabilities across financial tasks, including financial report summarization, earnings call transcript analysis, and asset classification. However, their real-world effectiveness in managing complex fund investment remains inadequately assessed. A fundamental limitation of existing benchmarks for evaluating LLM-driven trading strategies is their reliance on historical back-testing, inadvertently enabling LLMs to "time travel"—leveraging future information embedded in their training corpora, thus resulting in possible information leakage and overly optimistic performance estimates. To address this issue, we introduce DeepFund, a live fund benchmark tool designed to rigorously evaluate LLM in real-time market conditions. Utilizing a multi-agent architecture, DeepFund connects directly with real-time stock market data—specifically data published after each model’s pretraining cutoff—to ensure fair and leakage-free evaluations. Empirical tests on nine flagship LLMs from leading global institutions across multiple investment dimensions—including ticker-level analysis, investment decision-making, portfolio management, and risk control—reveal significant practical challenges. Notably, even cutting-edge models such as DeepSeek-V3 and Claude-3.7-Sonnet incur net trading losses within DeepFund real-time evaluation environment, underscoring the present limitations of LLMs for active fund management. Our code is available at https://github.com/HKUSTDial/DeepFund.

View full details

Poster

LiveCodeBench Pro: How Do Olympiad Medalists Judge LLMs in Competitive Programming?

Zihan Zheng ⋅ Zerui Cheng ⋅ Zeyu Shen ⋅ Shang Zhou ⋅ Kaiyuan Liu ⋅ Hansen He ⋅ Dongruixuan Li ⋅ Stanley Wei ⋅ Hangyi Hao ⋅ Jianzhu Yao ⋅ Peiyao Sheng ⋅ Zixuan Wang ⋅ Wenhao Chai ⋅ Aleksandra Korolova ⋅ Peter Henderson ⋅ Sanjeev Arora ⋅ Pramod Viswanath ⋅ Jingbo Shang ⋅ Saining Xie

Dec 3, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Recent reports claim that large language models (LLMs) now outperform elite humans in competitive programming. Drawing on knowledge from a group of medalists in international algorithmic contests, we revisit this claim, examining how LLMs differ from human experts and where limitations still remain. We introduce LiveCodeBench Pro, a benchmark composed of problems from Codeforces, ICPC, and IOI that are continuously updated to reduce the likelihood of data contamination. A team of Olympiad medalists annotates every problem for algorithmic categories and conducts a line-by-line analysis of failed model-generated submissions. Using this new data and benchmark, we find that frontier models still have significant limitations: without external tools, the best model achieves only 53\% pass@1 on medium-difficulty problems and 0\% on hard problems, domains where expert humans still excel. We also find that LLMs succeed at implementation-heavy problems but struggle with nuanced algorithmic reasoning and complex case analysis, often generating confidently incorrect justifications. High performance appears largely driven by implementation precision and tool augmentation, not superior reasoning. LiveCodeBench Pro thus highlights the significant gap to human grandmaster levels, while offering fine-grained diagnostics to steer future improvements in code-centric LLM reasoning.

View full details

Poster

REAL: Benchmarking Autonomous Agents on Deterministic Simulations of Real Websites

Div Garg ⋅ Diego Caples ⋅ Andis Draguns ⋅ Nikil Ravi ⋅ Pranav Putta ⋅ Naman Garg ⋅ Prannay Hebbar ⋅ Youngchul Joo ⋅ Jindong Gu ⋅ Charles London ⋅ Christian Schroeder de Witt ⋅ Sumeet Motwani

Dec 3, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

We introduce REAL, a benchmark and framework for multi-turn agent evaluations on deterministic simulations of real-world websites. REAL comprises high-fidelity, deterministic replicas of 11 widely-used websites across domains such as e-commerce, travel, communication, and professional networking. We also release a benchmark consisting of 112 practical tasks that mirror everyday complex user interactions requiring both accurate information retrieval and state-changing actions. All interactions occur within this fully controlled setting, eliminating safety risks and enabling robust, reproducible evaluation of agent capability and reliability. Our novel evaluation framework combines programmatic checks of website state for action-based tasks with rubric-guided LLM-based judgments for information retrieval. The framework supports both open-source and proprietary agent systems through a flexible evaluation harness that accommodates black-box commands within browser environments, allowing research labs to test agentic systems without modification. Our empirical results show that frontier language models achieve at most a 41% success rate on REAL, highlighting critical gaps in autonomous web navigation and task completion capabilities. Our framework supports easy integration of new tasks, reproducible evaluation, and scalable post-training data generation, marking a significant step forward in evaluating and advancing agent capabilities.

View full details

Poster

CogPhys: Assessing Cognitive Load via Multimodal Remote and Contact-based Physiological Sensing

Anirudh Bindiganavale Harish ⋅ Peikun Guo ⋅ Bhargav Ghanekar ⋅ Diya Gupta ⋅ Akilesh Rajavenkatanarayan ⋅ MANOJ SHARMA ⋅ Maureen August ⋅ Akane Sano ⋅ Ashok Veeraraghavan

Dec 3, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Remote physiological sensing is an evolving area of research. As systems approach clinical precision, there is increasing focus on complex applications such as cognitive state estimation. Hence, there is a need for large datasets that facilitate research into complex downstream tasks such as remote cognitive load estimation. A first-of-its-kind, our paper introduces an open-source multimodal multi-vital sign dataset consisting of concurrent recordings from RGB, NIR (near-infrared), thermal, and RF (radio-frequency) sensors alongside contact-based physiological signals, such as pulse oximeter and chest bands, providing a benchmark for cognitive state assessment. By adopting a multimodal approach to remote health sensing, our dataset and its associated hardware system excel at modeling the complexities of cognitive load. Here, cognitive load is defined as the mental effort exerted during tasks such as reading, memorizing, and solving math problems. By using the NASA-TLX survey, we set personalized thresholds for defining high/low cognitive levels, enabling a more reliable benchmark. Our benchmarking scheme bridges the gap between existing remote sensing strategies and cognitive load estimation techniques by using vital signs (such as photoplethysmography (PPG) and respiratory waveforms) and physiological signals (blink waveforms) as an intermediary. Through this paper, we focus on replacing the need for intrusive contact-based physiological measurements with more user-friendly remote sensors. Our benchmarking demonstrates that multimodal fusion significantly improves remote vital sign estimation, with our fusion model achieving $<3~BPM$ (beats per minute) error for vital sign estimation. For cognitive load classification, the combination of remote PPG, remote respiratory signals, and blink markers achieves $86.49$% accuracy, approaching the performance of contact-based sensing ($87.5$%) and validating the feasibility of non-intrusive cognitive monitoring.

View full details

Poster

LCDB 1.1: A Database Illustrating Learning Curves Are More Ill-Behaved Than Previously Thought

Cheng Yan ⋅ Felix Mohr ⋅ Tom Viering

Dec 3, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Sample-wise learning curves plot performance versus training set size. They are useful for studying scaling laws and speeding up hyperparameter tuning and model selection. Learning curves are often assumed to be well-behaved: monotone (i.e. improving with more data) and convex. By constructing the Learning Curves Database 1.1 (LCDB 1.1), a large-scale database with high-resolution learning curves including more modern learners (CatBoost, TabNet, RealMLP, and TabPFN), we show that learning curves are less often well-behaved than previously thought. Using statistically rigorous methods, we observe significant ill-behavior in approximately 15% of the learning curves, almost twice as much as in previous estimates. We also identify which learners are to blame and show that specific learners are more ill-behaved than others. Additionally, we demonstrate that different feature scalings rarely resolve ill-behavior. We evaluate the impact of ill-behavior on downstream tasks, such as learning curve fitting and model selection, and find it poses significant challenges, underscoring the relevance and potential of LCDB 1.1 as a challenging benchmark for future research.

View full details

Poster

Massive Sound Embedding Benchmark (MSEB)

Georg Heigold ⋅ Ehsan Variani ⋅ Tom Bagby ⋅ Cyril Allauzen ⋅ Ji Ma ⋅ Shankar Kumar ⋅ Michael D Riley

Dec 3, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Audio is a critical component of multimodal perception, and any truly intelligent system must demonstrate a wide range of auditory capabilities. These capabilities include transcription, classification, retrieval, reasoning, segmentation, clustering, reranking, and reconstruction. Fundamentally, each task involves transforming a raw audio signal into a meaningful 'embedding'—be it a single vector, a sequence of continuous or discrete representations, or another structured form—which then serves as the basis for generating the task's final response. To accelerate progress towards robust machine auditory intelligence, we present the Massive Sound Embedding Benchmark (MSEB): an extensible framework designed to evaluate the auditory components of any multimodal system. In its first release, MSEB offers a comprehensive suite of eight core tasks, with more planned for the future, supported by diverse datasets, including the new, large-scale Simple Voice Questions (SVQ) dataset. Our initial experiments establish clear performance headrooms, highlighting the significant opportunity to improve real-world multimodal experiences where audio is a core signal. We encourage the research community to use MSEB to assess their algorithms and contribute to its growth. The library is publicly hosted at https://github.com/google-research/mseb.

View full details

Poster

TCM-Ladder: A Benchmark for Multimodal Question Answering on Traditional Chinese Medicine

Jiacheng Xie ⋅ Yang Yu ⋅ Ziyang Zhang ⋅ Shuai Zeng ⋅ Jiaxuan He ⋅ Ayush Vasireddy ⋅ Xiaoting tang ⋅ Congyu Guo ⋅ Lening Zhao ⋅ Congcong Jing ⋅ Guanghui An ⋅ Dong Xu

Dec 3, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Traditional Chinese Medicine (TCM), as an effective alternative medicine, has been receiving increasing attention. In recent years, the rapid development of large language models (LLMs) tailored for TCM has highlighted the urgent need for an objective and comprehensive evaluation framework to assess their performance on real-world tasks. However, existing evaluation datasets are limited in scope and primarily text-based, lacking a unified and standardized multimodal question-answering (QA) benchmark. To address this issue, we introduce TCM-Ladder, the first comprehensive multimodal QA dataset specifically designed for evaluating large TCM language models. The dataset covers multiple core disciplines of TCM, including fundamental theory, diagnostics, herbal formulas, internal medicine, surgery, pharmacognosy, and pediatrics. In addition to textual content, TCM-Ladder incorporates various modalities such as images and videos. The dataset was constructed using a combination of automated and manual filtering processes and comprises over 52,000 questions. These questions include single-choice, multiple-choice, fill-in-the-blank, diagnostic dialogue, and visual comprehension tasks. We trained a reasoning model on TCM-Ladder and conducted comparative experiments against nine state-of-the-art general domain and five leading TCM-specific LLMs to evaluate their performance on the dataset. Moreover, we propose Ladder-Score, an evaluation method specifically designed for TCM question answering that effectively assesses answer quality in terms of terminology usage and semantic expression. To the best of our knowledge, this is the first work to systematically evaluate mainstream general domain and TCM-specific LLMs on a unified multimodal benchmark. The datasets and leaderboard are publicly available at https://tcmladder.com and will be continuously updated. The source code is available at https://github.com/orangeshushu/TCM-Ladder.

View full details

Poster

MiNT: Multi-Network Transfer Benchmark for Temporal Graph Learning

Kiarash Shamsi ⋅ Tran Gia Bao Ngo ⋅ Razieh Shirzadkhani ⋅ Shenyang Huang ⋅ Farimah Poursafaei ⋅ Poupak Azad ⋅ Reihaneh Rabbany ⋅ Baris Coskunuzer ⋅ Guillaume Rabusseau ⋅ Cuneyt Akcora

Dec 3, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Temporal Graph Learning (TGL) aims to discover patterns in evolving networks or temporal graphs and leverage these patterns to predict future interactions. However, most existing research focuses on learning from a single network in isolation, leaving the challenges of within-domain and cross-domain generalization largely unaddressed. In this study, we introduce a new benchmark of 84 real-world temporal transaction networks and propose **Temporal Multi-network Transfer (MiNT)**, a pre-training framework designed to capture transferable temporal dynamics across diverse networks. We train MiNT models on up to 64 transaction networks and evaluate their generalization ability on 20 held-out, unseen networks. Our results show that MiNT consistently outperforms individually trained models, revealing a strong relation between the number of pre-training networks and transfer performance. These findings highlight scaling trends in temporal graph learning and underscore the importance of network diversity in improving generalization. This work establishes the first large-scale benchmark for studying transferability in TGL and lays the groundwork for developing Temporal Graph Foundation Models. Our code is available at \url{https://github.com/benjaminnNgo/ScalingTGNs}

View full details

Poster

EVAAA: A Virtual Environment Platform for Essential Variables in Autonomous and Adaptive Agents

Sungwoo Lee ⋅ Jungmin Lee ⋅ Sohee Kim ⋅ Hyebhin Yoon ⋅ Shinwon Park ⋅ Junhyeok Park ⋅ Jaehyuk Bae ⋅ Seok-Jun Hong ⋅ Choong-Wan Woo

Dec 3, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Reinforcement learning (RL) agents have demonstrated strong performance in structured environments, yet they continue to struggle in real-world settings where goals are ambiguous, conditions change dynamically, and external supervision is limited. These challenges stem not primarily from the algorithmic limitations but from the characteristics of conventional training environments, which are usually static, task-specific, and externally defined. In contrast, biological agents develop autonomy and adaptivity by interacting with complex, dynamic environments, where most behaviors are ultimately driven by internal physiological needs. Inspired by these biological constraints, we introduce EVAAA (Essential Variables in Autonomous and Adaptive Agents), a 3D virtual environment for training and evaluating egocentric RL agents endowed with internal physiological state variables. In EVAAA, agents must maintain essential variables (EVs)—e.g., satiation, hydration, body temperature, and tissue integrity (the level of damage)—within viable bounds by interacting with environments that increase in difficulty at each stage. The reward system is derived from internal state dynamics, enabling agents to generate goals autonomously without manually engineered, task-specific reward functions. Built on Unity ML-Agents, EVAAA supports multimodal sensory inputs, including vision, olfaction, thermoception, collision, as well as egocentric embodiment. It features naturalistic survival environments for curricular training and a suite of unseen experimental testbeds, allowing for the evaluation of autonomous and adaptive behaviors that emerge from the interplay between internal state dynamics and environmental constraints. By integrating physiological regulation, embodiment, continual learning, and generalization, EVAAA offers a biologically inspired benchmark for studying autonomy, adaptivity, and internally driven control in RL agents. Our code is publicly available at https://github.com/cocoanlab/evaaa

View full details

Poster

KRIS-Bench: Benchmarking Next-Level Intelligent Image Editing Models

Yongliang Wu ⋅ Zonghui Li ⋅ Xinting Hu ⋅ Xinyu Ye ⋅ Xianfang Zeng ⋅ Gang Yu ⋅ Wenbo Zhu ⋅ Bernt Schiele ⋅ Ming-Hsuan Yang ⋅ Xu Yang

Dec 3, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Recent advances in multi-modal generative models have enabled significant progress in instruction-based image editing. However, while these models produce visually plausible outputs, their capacity for knowledge-based reasoning editing tasks remains under-explored. In this paper, We introduce KRIS-Bench (Knowledge-based Reasoning in Image-editing Systems Benchmark), a diagnostic benchmark designed to assess models through a cognitively informed lens. Drawing from educational theory, KRIS-Bench categorizes editing tasks across three foundational knowledge types: Factual, Conceptual, and Procedural. Based on this taxonomy, we design 22 representative tasks spanning 7 reasoning dimensions and release 1,267 high-quality annotated editing instances. To support fine-grained evaluation, we propose a comprehensive protocol that incorporates a novel Knowledge Plausibility metric, enhanced by knowledge hints and calibrated through human studies. Empirical results on nine state-of-the-art models reveal significant gaps in reasoning performance, highlighting the need for knowledge-centric benchmarks to advance the development of intelligent image editing systems.

View full details

Poster

EndoBench: A Comprehensive Evaluation of Multi-Modal Large Language Models for Endoscopy Analysis

Shengyuan Liu ⋅ Boyun Zheng ⋅ Wenting Chen ⋅ Zhihao Peng ⋅ Zhenfei Yin ⋅ Jing Shao ⋅ Jiancong Hu ⋅ Yixuan Yuan

Dec 3, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Endoscopic procedures are essential for diagnosing and treating internal diseases, and multi-modal large language models (MLLMs) are increasingly applied to assist in endoscopy analysis. However, current benchmarks are limited, as they typically cover specific endoscopic scenarios and a small set of clinical tasks, failing to capture the real-world diversity of endoscopic scenarios and the full range of skills needed in clinical workflows. To address these issues, we introduce EndoBench, the first comprehensive benchmark specifically designed to assess MLLMs across the full spectrum of endoscopic practice with multi-dimensional capacities. EndoBench encompasses 4 distinct endoscopic scenarios, 12 specialized clinical tasks with 12 secondary subtasks, and 5 levels of visual prompting granularities, resulting in 6,832 rigorously validated VQA pairs from 21 diverse datasets. Our multi-dimensional evaluation framework mirrors the clinical workflow—spanning anatomical recognition, lesion analysis, spatial localization, and surgical operations—to holistically gauge the perceptual and diagnostic abilities of MLLMs in realistic scenarios. We benchmark 23 state-of-the-art models, including general-purpose, medical-specialized, and proprietary MLLMs, and establish human clinician performance as a reference standard. Our extensive experiments reveal: (1) proprietary MLLMs outperform open-source and medical-specialized models overall, but still trail human experts; (2) medical-domain supervised fine-tuning substantially boosts task-specific accuracy; and (3) model performance remains sensitive to prompt format and clinical task complexity. EndoBench establishes a new standard for evaluating and advancing MLLMs in endoscopy, highlighting both progress and persistent gaps between current models and expert clinical reasoning. We publicly release our benchmark and code.

View full details

Poster

UAV-Flow Colosseo: A Real-World Benchmark for Flying-on-a-Word UAV Imitation Learning

Xiangyu Wang ⋅ Donglin Yang ⋅ Yue Liao ⋅ Wenhao Zheng ⋅ wenjun wu ⋅ Bin Dai ⋅ Hongsheng Li ⋅ Si Liu

Dec 3, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Unmanned Aerial Vehicles (UAVs) are evolving into language-interactive platforms, enabling more intuitive forms of human-drone interaction. While prior works have primarily focused on high-level planning and long-horizon navigation, we shift attention to language-guided fine-grained trajectory control, where UAVs execute short-range, reactive flight behaviors in response to language instructions. We formalize this problem as the Flying-on-a-Word (Flow) task and introduce UAV imitation learning as an effective approach. In this framework, UAVs learn fine-grained control policies by mimicking expert pilot trajectories paired with atomic language instructions. To support this paradigm, we present UAV-Flow, the first real-world benchmark for language-conditioned, fine-grained UAV control. It includes a task formulation, a large-scale dataset collected in diverse environments, a deployable control framework, and a simulation suite for systematic evaluation. Our design enables UAVs to closely imitate the precise, expert-level flight trajectories of human pilots and supports direct deployment without sim-to-real gap. We conduct extensive experiments on UAV-Flow, benchmarking VLN and VLA paradigms. Results show that VLA models are superior to VLN baselines and highlight the critical role of spatial grounding in the fine-grained Flow setting.

View full details

Poster

BackdoorDM: A Comprehensive Benchmark for Backdoor Learning on Diffusion Model

Weilin Lin ⋅ Nanjun Zhou ⋅ Yanyun Wang ⋅ Jianze Li ⋅ Hui Xiong ⋅ Li Liu

Dec 3, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Backdoor learning is a critical research topic for understanding the vulnerabilities of deep neural networks. While the diffusion model (DM) has been broadly deployed in public over the past few years, the understanding of its backdoor vulnerability is still in its infancy compared to the extensive studies in discriminative models. Recently, many different backdoor attack and defense methods have been proposed for DMs, but a comprehensive benchmark for backdoor learning on DMs is still lacking. This absence makes it difficult to conduct fair comparisons and thoroughly evaluate existing approaches, thus hindering future research progress. To address this issue, we propose *BackdoorDM*, the first comprehensive benchmark designed for backdoor learning on DMs. It comprises nine state-of-the-art (SOTA) attack methods, four SOTA defense strategies, and three useful visualization analysis tools. We first systematically classify and formulate the existing literature in a unified framework, focusing on three different backdoor attack types and five backdoor target types, which are restricted to a single type in discriminative models. Then, we systematically summarize the evaluation metrics for each type and propose a unified backdoor evaluation method based on multimodal large language model (MLLM). Finally, we conduct a comprehensive evaluation and highlight several important conclusions. We believe that BackdoorDM will help overcome current barriers and contribute to building a trustworthy artificial intelligence generated content (AIGC) community. The codes are released in https://github.com/linweiii/BackdoorDM.

View full details

Poster

UrbanIng-V2X: A Large-Scale Multi-Vehicle, Multi-Infrastructure Dataset Across Multiple Intersections for Cooperative Perception

Karthikeyan Chandra Sekaran ⋅ Markus Geisler ⋅ Dominik Rößle ⋅ Adithya Mohan ⋅ Daniel Cremers ⋅ Wolfgang Utschick ⋅ Michael Botsch ⋅ Werner Huber ⋅ Torsten Schön

Dec 3, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Recent cooperative perception datasets have played a crucial role in advancing smart mobility applications by enabling information exchange between intelligent agents, helping to overcome challenges such as occlusions and improving overall scene understanding. While some existing real-world datasets incorporate both vehicle-to-vehicle and vehicle-to-infrastructure interactions, they are typically limited to a single intersection or a single vehicle. A comprehensive perception dataset featuring multiple connected vehicles and infrastructure sensors across several intersections remains unavailable, limiting the benchmarking of algorithms in diverse traffic environments. Consequently, overfitting can occur, and models may demonstrate misleadingly high performance due to similar intersection layouts and traffic participant behavior. To address this gap, we introduce UrbanIng-V2X, the first large-scale, multi-modal dataset supporting cooperative perception involving vehicles and infrastructure sensors deployed across three urban intersections in Ingolstadt, Germany. UrbanIng-V2X consists of 34 temporally aligned and spatially calibrated sensor sequences, each lasting 20 seconds. All sequences contain recordings from one of three intersections, involving two vehicles and up to three infrastructure-mounted sensor poles operating in coordinated scenarios. In total, UrbanIng-V2X provides data from 12 vehicle-mounted RGB cameras, 2 vehicle LiDARs, 17 infrastructure thermal cameras, and 12 infrastructure LiDARs. All sequences are annotated at a frequency of 10 Hz with 3D bounding boxes spanning 13 object classes, resulting in approximately 712k annotated instances across the dataset. We provide comprehensive evaluations using state-of-the-art cooperative perception methods and publicly release the codebase, dataset, HD map, and a digital twin of the complete data collection environment via https://github.com/thi-ad/UrbanIng-V2X.

View full details

Poster

MoE-CAP: Benchmarking Cost, Accuracy and Performance of Sparse Mixture-of-Experts Systems

Yinsicheng Jiang ⋅ Yao Fu ⋅ Yeqi Huang ⋅ Ping Nie ⋅ Zhan Lu ⋅ Leyang Xue ⋅ Congjie He ⋅ Man-Kit Sit ⋅ Jilong Xue ⋅ Li Dong ⋅ Ziming Miao ⋅ DaYou Du ⋅ Tairan Xu ⋅ Kai Zou ⋅ Edoardo Maria Ponti ⋅ Luo Mai

Dec 3, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

The sparse Mixture-of-Experts (MoE) architecture is increasingly favored for scaling Large Language Models (LLMs) efficiently, but it depends on heterogeneous compute and memory resources. These factors jointly affect system Cost, Accuracy, and Performance (CAP), making trade-offs inevitable. Existing benchmarks often fail to capture these trade-offs accurately, complicating practical deployment decisions. To address this, we introduce MoE-CAP, a benchmark specifically designed for MoE systems. Our analysis reveals that achieving an optimal balance across CAP is difficult with current hardware; MoE systems typically optimize two of the three dimensions at the expense of the third—a dynamic we term the MoE-CAP trade-off. To visualize this, we propose the CAP Radar Diagram. We further introduce sparsity-aware performance metrics—Sparse Memory Bandwidth Utilization (S-MBU) and Sparse Model FLOPS Utilization (S-MFU)—to enable accurate performance benchmarking of MoE systems across diverse hardware platforms and deployment scenarios. This benchmark is available on Github: https://github.com/sparse-generative-ai/MoE-CAP.

View full details

Poster

Listening to the Brain: Multi-Band sEEG Auditory Reconstruction via Dynamic Spatio-Temporal Hypergraphs

Xueyi Zhang ⋅ Ruicong Wang ⋅ Jialu Sun ⋅ Siqi Cai ⋅ Haizhou Li

Dec 3, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Speech is a fundamental form of human communication, and speech perception constitutes the initial stage of language comprehension. Although brain-to-speech interface technologies have made significant progress in recent years, most existing studies focus on neural decoding during speech production. Such approaches heavily rely on articulatory motor regions, rendering them unsuitable for individuals with speech motor impairments, such as those with aphasia or locked-in syndrome. To address this limitation, we construct and release NeuroListen, the first publicly available stereo-electroencephalography (sEEG) dataset specifically designed for auditory reconstruction. It contains over 10 hours of neural–speech paired recordings from 5 clinical participants, covering a wide range of semantic categories. Building on this dataset, we propose HyperSpeech, a multi-band neural decoding framework that employs dynamic spatio-temporal hypergraph neural networks to capture high-order dependencies across frequency, spatial, and temporal dimensions. Experimental results demonstrate that HyperSpeech significantly outperforms existing methods across multiple objective speech quality metrics, and achieves superior performance in human subjective evaluations, validating its effectiveness and advancement. This study provides a dedicated dataset and modeling framework for auditory speech decoding, offering foundations for neural language processing and assistive communication systems.

View full details

Poster

HelpSteer3-Preference: Open Human-Annotated Preference Data across Diverse Tasks and Languages

Zhilin Wang ⋅ Jiaqi Zeng ⋅ Olivier Delalleau ⋅ Hoo-Chang Shin ⋅ Felipe Soares ⋅ Alexander Bukharin ⋅ Ellie Evans ⋅ Yi Dong ⋅ Oleksii Kuchaiev

Dec 3, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Preference datasets are essential for training general-domain, instruction-following language models with Reinforcement Learning from Human Feedback (RLHF). Each subsequent data release raises expectations for future data collection, meaning there is a constant need to advance the quality and diversity of openly available preference data. To address this need, we introduce HelpSteer3-Preference, a permissively licensed (CC-BY-4.0), high-quality, human-annotated preference dataset comprising of over 40,000 samples. These samples span diverse real-world applications of large language models (LLMs), including tasks relating to STEM, coding and multilingual scenarios. Using HelpSteer3-Preference, we train Reward Models (RMs) that achieve top performance on RM-Bench (82.4%) and JudgeBench (73.7%). This represents a substantial improvement (~10% absolute) over the previously best-reported results from existing RMs. We demonstrate HelpSteer3-Preference can also be applied to train Generative RMs and how policy models can be aligned with RLHF using our RMs.

View full details

Poster

CLEAR: Command Level Annotated Dataset for Ransomware Detection

Barak Bringoltz ⋅ Elisha Halperin ⋅ Ran Feraru ⋅ Evgeny Blaichman ⋅ Amit Berman

Dec 3, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Over the last decade, ransomware detection has become a central topic in cybersecurity research. Due to ransomware's direct interaction with storage devices, analyzing I/O streams has become an effective detection method and represents a vital area of focus for research. A major challenge in this field is the lack of publicly accessible data featuring individual command labeling. To address this problem, we introduce the Command LEvel Annotated Ransomware (CLEAR) dataset, a large-scale collection of storage devices' stream data. The dataset comprises 1,045 TiB of I/O traffic data, featuring malicious traffic from 137 ransomware variants. It offers two orders of magnitude more I/O traffic data and one order of magnitude more ransomware variants than any other publicly accessible dataset. Importantly, it is the only dataset that individually labels each I/O command as either ransomware or benign activity. This labeling enables the use of advanced sequential models, which we show to outperform existing state-of-the-art models by up to 82% in data loss prevention. Additionally, this allows us to create new tasks, such as data recovery, by selectively reverting only the commands recognized as ransomware while preserving benign activity. The CLEAR dataset also includes supplementary auxiliary features derived from the data, which we demonstrate to improve performance through feature ablation studies. Lastly, a critical aspect of any ransomware detection model is its robustness to new, unseen ransomware variants, as new strains constantly emerge. Therefore, we propose a benchmark based on our dataset to evaluate performance against unknown ransomware samples and illustrate its application across different models.

View full details

Poster

PolyGuard: Massive Multi-Domain Safety Policy-Grounded Guardrail Dataset

Mintong Kang ⋅ Zhaorun Chen ⋅ Chejian Xu ⋅ Jiawei Zhang ⋅ Chengquan Guo ⋅ Minzhou Pan ⋅ Ivan Revilla ⋅ Yu Sun ⋅ Bo Li

Dec 3, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

As large language models (LLMs) become widespread across diverse applications, concerns about the security and safety of LLM interactions have intensified. Numerous guardrail models and benchmarks have been developed to ensure LLM content safety. However, existing guardrail benchmarks are often built upon ad hoc risk taxonomies that lack a principled grounding in standardized safety policies, limiting their alignment with real-world operational requirements. Moreover, they tend to overlook domain-specific risks, while the same risk category can carry different implications across different domains. To bridge these gaps, we introduce PolyGuard, the first massive multi-domain safety policy-grounded guardrail dataset. PolyGuard offers: (1) broad domain coverage across eight safety-critical domains, such as finance, law, and codeGen; (2) policy-grounded risk construction based on authentic, domain-specific safety guidelines; (3) diverse interaction formats, encompassing declarative statements, questions, instructions, and multi-turn conversations; (4) advanced benign data curation via detoxification prompting to challenge over-refusal behaviors; and (5) \textbf{attack-enhanced instances} that simulate adversarial inputs designed to bypass guardrails. Based on PolyGuard, we benchmark 19 advanced guardrail models and uncover a series of findings, such as: (1) All models achieve varied F1 scores, with many demonstrating high variance across risk categories, highlighting their limited domain coverage and insufficient handling of domain-specific safety concerns; (2) As models evolve, their coverage of safety risks broadens, but performance on common risk categories may decrease; (3) All models remain vulnerable to optimized adversarial attacks. The policy-grounded \dataset establishes the first principled and comprehensive guardrail benchmark. We believe that \dataset and the unique insights derived from our evaluations will advance the development of policy-aligned and resilient guardrail systems.

View full details

Poster

ClinBench: A Standardized Multi-Domain Framework for Evaluating Large Language Models in Clinical Information Extraction

Ismael Villanueva Miranda ⋅ Zifan Gu ⋅ Donghan Yang ⋅ Kuroush Nezafati ⋅ Jingwei Huang ⋅ Peifeng Ruan ⋅ Xiaowei Zhan ⋅ Guanghua Xiao ⋅ Yang Xie

Dec 3, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Large Language Models (LLMs) offer substantial promise for clinical natural language processing (NLP); however, a lack of standardized benchmarking methodologies limits their objective evaluation and practical translation. To address this gap, we introduce ClinBench, an open-source, multi-model, multi-domain benchmarking framework. ClinBench is designed for the rigorous evaluation of LLMs on important structured information extraction tasks (e.g., tumor staging, histologic diagnoses, atrial fibrillation, and social determinants of health) from unstructured clinical notes. The framework standardizes the evaluation pipeline by: (i) operating on consistently structured input datasets; (ii) employing dynamic, YAML-based prompting for uniform task definition; and (iii) enforcing output validation via JSON schemas, supporting robust comparison across diverse LLM architectures. We demonstrate ClinBench through a large-scale study of 11 prominent LLMs (e.g., GPT-4o series, LLaMA3 variants, Mixtral) across three clinical domains using configurations of public datasets (TCGA for lung cancer, MIMIC-IV-ECG for atrial fibrillation, and MIMIC notes for SDOH). Our results reveal significant performance-efficiency trade-offs. For example, when averaged across the four benchmarked clinical extraction tasks, GPT-3.5-turbo achieved a mean F1 score of 0.83 with a mean runtime of 16.8 minutes. In comparison, LLaMA3.1-70b obtained a similar mean F1 of 0.82 but required a substantially longer mean runtime of 42.7 minutes. GPT-4o-mini also presented a favorable balance with a mean F1 of 0.81 and a mean runtime of 13.4 minutes. ClinBench provides a unified, extensible framework and empirical insights for reproducible, fair LLM benchmarking in clinical NLP. By enabling transparent and standardized evaluation, this work advances data-centric AI research, informs model selection based on performance, cost, and clinical priorities, and supports the effective integration of LLMs into healthcare. The framework and evaluation code are publicly available at https://github.com/ismaelvillanuevamiranda/ClinBench/.

View full details

Poster

Towards precision protein-ligand affinity prediction benchmark: A Complete and Modification-Aware DAVIS Dataset

Ming Hsiu Wu ⋅ Ziqian Xie ⋅ Shuiwang Ji ⋅ Degui Zhi

Dec 3, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Advancements in AI for science unlocks capabilities for critical drug discovery tasks such as protein-ligand binding affinity prediction. However, current models overfit to existing oversimplified datasets that does not represent naturally occurring and biologically relevant proteins with modifications. In this work, we curate a complete and modification-aware version of the widely used DAVIS dataset by incorporating 4,032 kinase–ligand pairs involving substitutions, insertions, deletions, and phosphorylation events. This enriched dataset enables benchmarking of predictive models under biologically realistic conditions. Based on this new dataset, we propose three benchmark settings—Augmented Dataset Prediction, Wild-Type to Modification Generalization, and Few-Shot Modification Generalization—designed to assess model robustness in the presence of protein modifications. Through extensive evaluation of both docking-free and docking-based methods, we find that docking-based model generalize better in zero-shot settings. In contrast, docking-free models tend to overfit to wild-type proteins and struggle with unseen modifications but show notable improvement when fine-tuned on a small set of modified examples. We anticipate that the curated dataset and benchmarks offer a valuable foundation for developing models that better generalize to protein modifications, ultimately advancing precision medicine in drug discovery. The benchmark is available at: https://github.com/ZhiGroup/DAVIS-complete

View full details

Poster

C-SEO Bench: Does Conversational SEO Work?

Haritz Puerto ⋅ Martin Gubri ⋅ Tommaso Green ⋅ Seong Joon Oh ⋅ Sangdoo Yun

Dec 3, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Large Language Models (LLMs) are transforming search engines into Conversational Search Engines (CSE). Consequently, Search Engine Optimization (SEO) is being shifted into Conversational Search Engine Optimization (C-SEO). We are beginning to see dedicated C-SEO methods for modifying web documents to increase their visibility in CSE responses. However, they are often tested only for a limited breadth of application domains; we do not know whether certain C-SEO methods would be effective for a broad range of domains. Moreover, existing evaluations consider only a single-actor scenario where only one web document adopts a C-SEO method; in reality, multiple players are likely to competitively adopt the cutting-edge C-SEO techniques, drawing an analogy from the dynamics we have seen in SEO. We present C-SEO Bench, the first benchmark designed to evaluate C-SEO methods across multiple tasks, domains, and number of actors. We consider two search tasks, question answering and product recommendation, with three domains each. We also formalize a new evaluation protocol with varying adoption rates among involved actors. Our experiments reveal that most current C-SEO methods are not only largely ineffective but also frequently have a negative impact on document ranking, which is opposite to what is expected. Instead, traditional SEO strategies, those aiming to improve the ranking of the source in the LLM context, are significantly more effective. We also observe that as we increase the number of C-SEO adopters, the overall gains decrease, depicting a congested and zero-sum nature of the problem.

View full details

Poster

3EED: Ground Everything Everywhere in 3D

Rong Li ⋅ Yuhao Dong ⋅ Tianshuai Hu ⋅ Alan Liang ⋅ Youquan Liu ⋅ Dongyue Lu ⋅ Liang Pan ⋅ Lingdong Kong ⋅ Junwei Liang ⋅ Ziwei Liu

Dec 3, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Visual grounding in 3D is the key for embodied agents to localize language-referred objects in open-world environments. However, existing benchmarks are limited to indoor focus, single-platform constraints, and small scale. We introduce 3EED, a multi-platform, multi-modal 3D grounding benchmark featuring RGB and LiDAR data from vehicle, drone, and quadruped platforms. We provide over 128,000 objects and 22,000 validated referring expressions across diverse outdoor scenes -- 10x larger than existing datasets. We develop a scalable annotation pipeline combining vision-language model prompting with human verification to ensure high-quality spatial grounding. To support cross-platform learning, we propose platform-aware normalization and cross-modal alignment techniques, and establish benchmark protocols for in-domain and cross-platform evaluations. Our findings reveal significant performance gaps, highlighting the challenges and opportunities of generalizable 3D grounding. The 3EED dataset and benchmark toolkit are released to advance future research in language-driven 3D embodied perception.

View full details

Poster

SynTSBench: Rethinking Temporal Pattern Learning in Deep Learning Models for Time Series

Qitai Tan ⋅ Yiyun Chen ⋅ Mo Li ⋅ Ruiwen Gu ⋅ Yilin Su ⋅ Xiao-Ping (Steven) Zhang

Dec 3, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Recent advances in deep learning have driven rapid progress in time series forecasting, yet many state-of-the-art models continue to struggle with robust performance in real-world applications, even when they achieve strong results on standard benchmark datasets. This persistent gap can be attributed to the black-box nature of deep learning architectures and the inherent limitations of current evaluation frameworks, which frequently lack the capacity to provide clear, quantitative insights into the specific strengths and weaknesses of different models, thereby complicating the selection of appropriate models for particular forecasting scenarios.To address these issues, we propose a synthetic data-driven evaluation paradigm, SynTSBench, that systematically assesses fundamental modeling capabilities of time series forecasting models through programmable feature configuration. Our framework isolates confounding factors and establishes an interpretable evaluation system with three core analytical dimensions: (1) temporal feature decomposition and capability mapping, which enables systematic evaluation of model capacities to learn specific pattern types; (2) robustness analysis under data irregularities, which quantifies noise tolerance thresholds and anomaly recovery capabilities; and (3) theoretical optimum benchmarking, which establishes performance boundaries for each pattern type—enabling direct comparison between model predictions and mathematical optima.Our experiments show that current deep learning models do not universally approach optimal baselines across all types of temporal features.

View full details

Poster

The Impact of Coreset Selection on Spurious Correlations and Group Robustness

Amaya Dharmasiri ⋅ William Yang ⋅ Polina Kirichenko ⋅ Lydia Liu ⋅ Olga Russakovsky

Dec 3, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Coreset selection methods have shown promise in reducing the training data size while maintaining model performance for data-efficient machine learning. However, many large real-world datasets suffer from unknown spurious correlations and hidden biases. Therefore, it is crucial to understand how such biases would affect downstream tasks via the selected coresets. In this work, we conduct the first comprehensive analysis of the implications of data selection on the bias levels of the selected coresets and the robustness of downstream models trained on them. We use an extensive experimental setting spanning ten different spurious correlations benchmarks, five score metrics to characterize sample importance/ difficulty, and five data selection policies across a broad range of coreset sizes to identify important patterns and derive insights. Thereby, we unravel a series of nontrivial nuances in well-known interactions between sample difficulty and bias alignment, as well as dataset bias and resultant model robustness. For example, we show that embedding-based sample characterizations run a comparatively lower risk of inadvertently exacerbating bias when used for selecting coresets compared to characterizations based on learning dynamics. Our analysis also reveals that lower bias levels achieved by coresets of difficult samples do not reliably guarantee downstream robustness. Most importantly, we show that special considerations need to be made when the coreset size is very small, since there is a unique risk of highly prototypical coresets reaching high average performance while obscuring their low group-robustness.

View full details

Poster

Dynamic Risk Assessments for Offensive Cybersecurity Agents

Boyi Wei ⋅ Benedikt Stroebl ⋅ Jiacen Xu ⋅ Joie Zhang ⋅ Zhou Li ⋅ Peter Henderson

Dec 3, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Foundation models are increasingly becoming better autonomous programmers, raising the prospect that they could also automate dangerous offensive cyber‑operations. Current frontier model audits probe the cybersecurity risks of such agents, but most fail to account for the degrees of freedom available to adversaries in the real world.In particular, with strong verifiers and financial incentives, agents for offensive cybersecurity are amenable to iterative improvement by would-be adversaries. We argue that assessments should take into account an expanded threat model in the context of cybersecurity, emphasizing the varying degrees of freedom that an adversary may possess in _stateful_ and _non-stateful_ environments within a fixed compute budget. We show that even with a relatively small compute budget (8 H100 GPU Hours in our study), adversaries can improve an agent's cybersecurity capability on InterCode CTF by more than 40\% relative to the baseline---without any external assistance. These results highlight the need to evaluate agents' cybersecurity risk in a dynamic manner, painting a more representative picture of risk.

View full details

Poster

DynamicVL: Benchmarking Multimodal Large Language Models for Dynamic City Understanding

Weihao Xuan ⋅ Junjue Wang ⋅ Heli Qi ⋅ Zihang Chen ⋅ Zhuo Zheng ⋅ Yanfei Zhong ⋅ Junshi Xia ⋅ Naoto YOKOYA

Dec 3, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Multimodal large language models (MLLMs) have demonstrated remarkable capabilities in visual understanding, but their application to long-term Earth observation analysis remains limited, primarily focusing on single-temporal or bi-temporal imagery. To address this gap, we introduce **DVL-Suite**, a comprehensive framework for analyzing long-term urban dynamics through remote sensing imagery. Our suite comprises 14,871 high-resolution (1.0m) multi-temporal images spanning 42 major cities in the U.S. from 2005 to 2023, organized into two components: **DVL-Bench** and **DVL-Instruct**. The *DVL-Bench* includes six urban understanding tasks, from fundamental change detection (*pixel-level*) to quantitative analyses (*regional-level*) and comprehensive urban narratives (*scene-level*), capturing diverse urban dynamics including expansion/transformation patterns, disaster assessment, and environmental challenges. We evaluate 18 state-of-the-art MLLMs and reveal their limitations in long-term temporal understanding and quantitative analysis. These challenges motivate the creation of *DVL-Instruct*, a specialized instruction-tuning dataset designed to enhance models' capabilities in multi-temporal Earth observation. Building upon this dataset, we develop **DVLChat**, a baseline model capable of both image-level question-answering and pixel-level segmentation, facilitating a comprehensive understanding of city dynamics through language interactions. Project: [https://github.com/weihao1115/dynamicvl](https://github.com/weihao1115/dynamicvl).

View full details

Poster

ConnectomeBench: Can LLMs proofread the connectome?

Jeff Brown ⋅ Andrew Kirjner ⋅ Annika Vivekananthan ⋅ Edward Boyden

Dec 3, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Connectomics—the mapping of neural connections in an organism's brain—currently requires extraordinary human effort to proofread the data collected from imaging and machine-learning assisted segmentation. With the growing excitement around using AI agents to automate important scientific tasks, we explore whether current AI systems can perform multiple tasks necessary for data proofreading. We introduce ConnectomeBench, a multimodal benchmark evaluating large language model (LLM) capabilities in three critical proofreading tasks: segment type identification, split error correction, and merge error detection. Using expert annotated data from two large open-source datasets—a cubic millimeter of mouse visual cortex and the complete Drosophila brain—we evaluate proprietary multimodal LLMs including Claude 3.7/4 Sonnet, o4-mini, GPT-4.1, GPT-4o, as well as open source models like InternVL-3 and NVLM. Our results demonstrate that current models achieve surprisingly high performance in segment identification (52-82\% balanced accuracy vs. 20-25\% chance) and binary/multiple choice split error correction (75-85\% accuracy vs. 50\% chance) while generally struggling on merge error identification tasks. Overall, while the best models still lag behind expert performance, they demonstrate promising capabilities that could eventually enable them to augment and potentially replace human proofreading in connectomics.

View full details

Poster

R&D-Agent-Quant: A Multi-Agent Framework for Data-Centric Factors and Model Joint Optimization

Yuante Li ⋅ Xu Yang ⋅ Xiao Yang ⋅ Xisen Wang ⋅ Weiqing Liu ⋅ Jiang Bian

Dec 3, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Financial markets pose fundamental challenges for asset return prediction due to their high dimensionality, non-stationarity, and persistent volatility. Despite advances in large language models and multi-agent systems, current quantitative research pipelines suffer from limited automation, weak interpretability, and fragmented coordination across key components such as factor mining and model innovation. In this paper, we propose R&D-Agent for Quantitative Finance, in short R&D-Agent(Q), the first data-centric multi-agent framework designed to automate the full-stack research and development of quantitative strategies via coordinated factor-model co-optimization. R&D-Agent(Q) decomposes the quant process into two iterative stages: a Research stage that dynamically sets goal-aligned prompts, formulates hypotheses based on domain priors, and maps them to concrete tasks, and a Development stage that employs a code-generation agent, Co-STEER, to implement task-specific code, which is then executed in real-market backtests. The two stages are connected through a feedback stage that thoroughly evaluates experimental outcomes and informs subsequent iterations, with a multi-armed bandit scheduler for adaptive direction selection. Empirically, R&D-Agent(Q) achieves up to 2× higher annualized returns than classical factor libraries using 70% fewer factors, and outperforms state-of-the-art deep time-series models on real markets. Its joint factor–model optimization delivers a strong balance between predictive accuracy and strategy robustness. Our code is available at: https://github.com/microsoft/RD-Agent.

View full details

Poster

DecoyDB: A Dataset for Graph Contrastive Learning in Protein-Ligand Binding Affinity Prediction

Yupu Zhang ⋅ Zelin Xu ⋅ Tingsong Xiao ⋅ Gustavo Seabra ⋅ Yanjun Li ⋅ Chenglong Li ⋅ Zhe Jiang

Dec 3, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Predicting the binding affinity of protein-ligand complexes plays a vital role in drug discovery. Unfortunately, progress has been hindered by the lack of large-scale and high-quality binding affinity labels. The widely used PDBbind dataset has fewer than 20K labeled complexes. Self-supervised learning, especially graph contrastive learning (GCL), provides a unique opportunity to break the barrier by pretraining graph neural network models based on vast unlabeled complexes and fine-tuning the models on much fewer labeled complexes. However, the problem faces unique challenges, including a lack of a comprehensive unlabeled dataset with well-defined positive/negative complex pairs and the need to design GCL algorithms that incorporate the unique characteristics of such data. To fill the gap, we propose DecoyDB, a large-scale, structure-aware dataset specifically designed for self-supervised GCL on protein–ligand complexes. DecoyDB consists of high-resolution ground truth complexes and diverse decoy structures with computationally generated binding poses that range from realistic to suboptimal. Each decoy is annotated with a Root Mean Square Deviation (RMSD) from the native pose. We further design a customized GCL framework to pretrain graph neural networks based on DecoyDB and fine-tune the models with labels from PDBbind. Extensive experiments confirm that models pretrained with DecoyDB achieve superior accuracy, sample efficiency, and generalizability.

View full details

Poster

PhysGym: Benchmarking LLMs in Interactive Physics Discovery with Controlled Priors

Yimeng Chen ⋅ Piotr Piękos ⋅ Mateusz Ostaszewski ⋅ Firas Laakom ⋅ Jürgen Schmidhuber

Dec 3, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Evaluating the scientific discovery capabilities of large language model based agents, particularly how they cope with varying environmental complexity and utilize prior knowledge, requires specialized benchmarks currently lacking in the landscape. To address this gap, we introduce PhysGym, a novel benchmark suite and simulation platform for rigorously assessing LLM-based scientific reasoning in interactive physics environments. PhysGym's primary contribution lies in its sophisticated control over the level of prior knowledge provided to the agent. This allows researchers to dissect agent performance along axes including the complexity of the problem and the prior knowledge levels. The benchmark comprises a suite of interactive simulations, where agents must actively probe environments, gather data sequentially under constraints and formulate hypotheses about underlying physical laws. PhysGym provides standardized evaluation protocols and metrics for assessing hypothesis accuracy and model fidelity. We demonstrate the benchmark's utility by presenting results from baseline LLMs, showcasing its ability to differentiate capabilities based on varying priors and task complexity.

View full details

Poster

Torch-Uncertainty: Deep Learning Uncertainty Quantification

Adrien Lafage ⋅ Olivier Laurent ⋅ Firas Gabetni ⋅ Gianni Franchi

Dec 3, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Deep Neural Networks (DNNs) have demonstrated remarkable performance across various domains, including computer vision and natural language processing. However, they often struggle to accurately quantify their predictions' uncertainty, limiting their broader adoption in critical industrial applications. Uncertainty Quantification (UQ) for Deep Learning seeks to address this challenge by providing methodologies to improve the reliability of uncertainty estimates. While numerous techniques have been proposed, a unified tool remains lacking that offers a seamless workflow for evaluating and integrating these methods. To bridge this gap, we introduce **Torch-Uncertainty**, a *PyTorch* and *Lightning* framework designed to streamline the training and evaluation of DNNs with UQ techniques. In this paper, we outline the foundational principles of our library and present comprehensive experimental results that benchmark a diverse set of UQ methods across classification, segmentation, and regression tasks. Our library is available at: https://github.com/ENSTA-U2IS-AI/torch-uncertainty.

View full details

Poster

MMTU: A Massive Multi-Task Table Understanding and Reasoning Benchmark

Junjie Xing ⋅ Yeye He ⋅ Mengyu Zhou ⋅ Haoyu Dong ⋅ Shi Han ⋅ Lingjiao Chen ⋅ Dongmei Zhang ⋅ Surajit Chaudhuri ⋅ H. V. Jagadish

Dec 3, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Tables and table-based use cases play a crucial role in many important real-world applications, such as spreadsheets, databases, and computational notebooks, which traditionally require expert-level users like data engineers, data analysts, and database administrators to operate. Although LLMs have shown remarkable progress in working with tables (e.g., in spreadsheet and database copilot scenarios), comprehensive benchmarking of such capabilities remains limited. In contrast to an extensive and growing list of NLP benchmarks, evaluations of table-related tasks are scarce, and narrowly focus on tasks like NL-to-SQL and Table-QA, overlooking the broader spectrum of real-world tasks that professional users face. This gap limits our understanding and model progress in this important area.In this work, we introduce MMTU, a large-scale benchmark with over 28K questions across 25 real-world table tasks, designed to comprehensively evaluate models ability to understand, reason, and manipulate real tables at the expert-level. These tasks are drawn from decades’ worth of computer science research on tabular data, with a focus on complex table tasks faced by professional users. We show that MMTU require a combination of skills -- including table understanding, reasoning, and coding -- that remain challenging for today's frontier models, where even frontier reasoning models like OpenAI GPT-5 and DeepSeek R1 score only around 69% and 57% respectively, suggesting significant room for improvement. We highlight key findings in our evaluation using MMTU and hope this benchmark drives further advances in understanding and developing foundation models for structured data processing and analysis.Our code and data are available at https://github.com/MMTU-Benchmark/MMTU and https://huggingface.co/datasets/MMTU-benchmark/MMTU.

View full details

Poster

CHOICE: Benchmarking the Remote Sensing Capabilities of Large Vision-Language Models

Xiao An ⋅ Jiaxing Sun ⋅ Zihan Gui ⋅ Wei He

Dec 3, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

The rapid advancement of Large Vision-Language Models (VLMs), both general-domain models and those specifically tailored for remote sensing, has demonstrated exceptional perception and reasoning capabilities in Earth observation tasks. However, a benchmark for systematically evaluating their capabilities in this domain is still lacking. To bridge this gap, we propose CHOICE, an extensive benchmark designed to objectively evaluate the hierarchical remote sensing capabilities of VLMs. Focusing on 2 primary capability dimensions essential to remote sensing: perception and reasoning, we further categorize 6 secondary dimensions and 23 leaf tasks to ensure a well-rounded assessment coverage. CHOICE guarantees the quality of all 10,507 problems through a rigorous process of data collection from 50 globally distributed cities, question construction, and quality control. The newly curated data and the format of multiple-choice questions with definitive answers allow for an objective and straightforward performance assessment. Our evaluation of 3 proprietary and 21 open-source VLMs highlights their critical limitations within this specialized context. We hope that CHOICE will serve as a valuable resource and offer deeper insights into the challenges and potential of VLMs in the field of remote sensing. Code and dataset are available at [this https URL](https://github.com/ShawnAn-WHU/CHOICE).

View full details

Poster

Face-Human-Bench: A Comprehensive Benchmark of Face and Human Understanding for Multi-modal Assistants

Lixiong Qin ⋅ Shilong Ou ⋅ Miaoxuan Zhang ⋅ Jiangning Wei ⋅ Yuhang Zhang ⋅ Xiaoshuai Song ⋅ Yuchen Liu ⋅ Mei Wang ⋅ Weiran Xu

Dec 3, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Faces and humans are crucial elements in social interaction and are widely included in everyday photos and videos. Therefore, a deep understanding of faces and humans will enable multi-modal assistants to achieve improved response quality and broadened application scope. Currently, the multi-modal assistant community lacks a comprehensive and scientific evaluation of face and human understanding abilities. In this paper, we first propose a hierarchical ability taxonomy that includes three levels of abilities. Then, based on this taxonomy, we collect images and annotations from publicly available datasets in the face and human community and build a semi-automatic data pipeline to produce problems for the new benchmark. Finally, the obtained Face-Human-Bench includes a development set and a test set, each with 1800 problems, supporting both English and Chinese. We conduct evaluations over 25 mainstream multi-modal large language models (MLLMs) with our Face-Human-Bench, focusing on the correlation between abilities, the impact of the relative position of targets on performance, and the impact of Chain of Thought (CoT) prompting on performance. We also explore which abilities of MLLMs need to be supplemented by specialist models. The dataset and evaluation code have been made publicly available at https://face-human-bench.github.io.

View full details

Poster

RESPIN-S1.0: A read speech corpus of 10000+ hours in dialects of nine Indian Languages

Saurabh Kumar ⋅ Abhayjeet Singh ⋅ DEEKSHITHA G ⋅ Amartya veer ⋅ Jesuraj Bandekar ⋅ Savitha Murthy ⋅ Sumit Sharma ⋅ Sandhya Badiger ⋅ Sathvik Udupa ⋅ Amala Nagireddi ⋅ Srinivasa Raghavan K M ⋅ Rohan Saxena ⋅ Jai Nanavati ⋅ Raoul Nanavati ⋅ Janani Sridharan ⋅ Arjun Mehta ⋅ Ashish S ⋅ Sai Mora ⋅ Prashanthi Venkataramakrishnan ⋅ Gauri Date ⋅ Karthika P ⋅ Prasanta Ghosh

Dec 3, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

We introduce **RESPIN-S1.0**, the largest publicly available dialect-rich read-speech corpus for Indian languages, comprising more than 10,000 hours of validated audio across nine major languages: Bengali, Bhojpuri, Chhattisgarhi, Hindi, Kannada, Magahi, Maithili, Marathi, and Telugu. Indian languages exhibit high dialectal variation and are spoken by populations that remain digitally underserved. Existing speech corpora typically represent only standard dialects and lack domain and linguistic diversity. RESPIN-S1.0 addresses this limitation by collecting speech across more than 38 dialects and two high-impact domains: agriculture and finance. Text data were composed by native dialect speakers and validated through a pipeline combining automated and manual checks. Over 200,000 unique sentences were recorded through a crowdsourced mobile platform and categorised into clean, semi-noisy, and noisy subsets based on transcription quality, with the clean portion alone exceeding 10,000 hours. Along with audio and transcriptions, RESPIN provides dialect-aware phonetic lexicons, speaker metadata, and reproducible train, development, and test splits. To benchmark performance, we evaluate multiple ASR models, including TDNN-HMM, E-Branchformer, Whisper, and wav2vec2-based self-supervised models, and find that fine-tuning on RESPIN significantly improves recognition accuracy over pretrained baselines. A subset of RESPIN-S1.0 has already supported community challenges such as the SLT Code Hackathon 2022 and MADASR@ASRU 2023 and 2025, releasing more than 1,200 hours publicly. This resource supports research in dialectal ASR, language identification, and related speech technologies, establishing a comprehensive benchmark for inclusive, dialect-rich ASR in multilingual low-resource settings. **Dataset:** https://spiredatasets.ee.iisc.ac.in/respincorpus **Code:** https://github.com/labspire/respin_baselines.git

View full details

Poster

Simulating Viva Voce Examinations to Evaluate Clinical Reasoning in Large Language Models

Christopher Chiu ⋅ Silviu Pitis ⋅ Mihaela van der Schaar

Dec 3, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Clinical reasoning in medicine is a hypothesis-driven process where physicians refine diagnoses from limited information through targeted history, physical examination, and diagnostic investigations. In contrast, current medical benchmarks for large language models (LLMs) primarily assess knowledge recall through single-turn questions, where complete clinical information is provided upfront. To address this gap, we introduce VivaBench, a multi-turn benchmark that evaluates sequential clinical reasoning in LLM agents. Our dataset consists of 1762 physician-curated clinical vignettes structured as interactive scenarios that simulate a $ \textit{viva voce}$ (oral) examination in medical training, requiring agents to actively probe for relevant findings, select appropriate investigations, and synthesize information across multiple steps to reach a diagnosis. While current LLMs demonstrate competence in diagnosing conditions from well-described clinical presentations, their performance degrades significantly when required to navigate iterative diagnostic reasoning under uncertainty in our evaluation. Our analysis identified several failure modes that mirror common cognitive errors in clinical practice, including: (1) fixation on initial hypotheses, (2) inappropriate investigation ordering, (3) premature diagnostic closure, and (4) failing to screen for critical conditions. These patterns reveal fundamental limitations in how current LLMs reason and make decisions under uncertainty. Through VivaBench, we provide a standardized benchmark for evaluating conversational medical AI systems for real-world clinical decision support. Beyond medical applications, we contribute to the larger corpus of research on agentic AI by demonstrating how sequential reasoning trajectories can diverge in complex decision-making environments.

View full details

Poster

STSBench: A Large-Scale Dataset for Modeling Neuronal Activity in the Dorsal Stream of Primate Visual Cortex

Ethan Trepka ⋅ Ruobing Xia ⋅ Shude Zhu ⋅ Sharif Saleki ⋅ Danielle Lopes ⋅ Stephen Cital ⋅ Konstantin Willeke ⋅ Mindy Kim ⋅ Tirin Moore

Dec 3, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

The primate visual system is typically divided into two streams — the ventral stream, responsible for object recognition, and the dorsal stream, responsible for encoding spatial relations and motion. Recent studies have shown that convolutional neural networks (CNNs) pretrained on object recognition tasks are remarkably effective at predicting neuronal responses in the ventral stream, shedding light on the neural mechanisms underlying object recognition. However, similar models of the dorsal stream remain underdeveloped due to the lack of large scale datasets encompassing dorsal stream areas. To address this gap, we present STSBench, a dataset of large-scale, single neuron recordings from over 2,000 neurons in the superior temporal sulcus (STS), a nearly 50-fold increase over existing dorsal stream datasets, collected while Rhesus macaques viewed thousands of unique, natural videos. We show that our dataset can be used for benchmarking encoding models of dorsal stream neuronal responses and reconstructing visual input from neural activity.

View full details

Poster

ThinkBench: Dynamic Out-of-Distribution Evaluation for Robust LLM Reasoning

Shulin Huang ⋅ Linyi Yang ⋅ Yan Song ⋅ Shawn Chen ⋅ Leyang Cui ⋅ Ziyu Wan ⋅ Qingcheng Zeng ⋅ Ying Wen ⋅ Kun Shao ⋅ Weinan Zhang ⋅ Jun Wang ⋅ Yue Zhang

Dec 3, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Evaluating large language models (LLMs) poses significant challenges, particularly due to issues of data contamination and the leakage of correct answers. To address these challenges, we introduce ThinkBench, a novel evaluation framework designed to robustly evaluate the reasoning capability of LLMs. ThinkBench proposes a dynamic data generation method for constructing out-of-distribution (OOD) datasets and offers an OOD dataset that contains 2,912 samples drawn from reasoning tasks. ThinkBench unifies the evaluation of reasoning models and non-reasoning models. We evaluate 16 LLMs and 4 PRMs under identical experimental conditions and show that most of the LLMs' performance are far from robust and they face a certain level of data leakage. By dynamically generating OOD datasets, ThinkBench effectively provides a reliable evaluation of LLMs and reduces data contamination impact. Our data and codes are available at https://github.com/huangshulin123/ThinkBench.

View full details

Poster

C3Po: Cross-View Cross-Modality Correspondence by Pointmap Prediction

Kuan Wei Huang ⋅ Brandon Li ⋅ Bharath Hariharan ⋅ Noah Snavely

Dec 3, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Geometric models like DUSt3R have shown great advances in understanding the geometry of a scene from pairs of photos. However, they fail when the inputs are from vastly different viewpoints (e.g., aerial vs.\ ground) or modalities (e.g., photos vs.\ abstract drawings) compared to what was observed during training. This paper addresses a challenging version of this problem: predicting correspondences between ground-level photos and floor plans. Current datasets for joint photo--floor plan reasoning are limited, either lacking in varying modalities (VIGOR) or lacking in correspondences (WAFFLE). To address these limitations, we introduce a new dataset, C3, created by first reconstructing a number of scenes in 3D from Internet photo collections via structure-from-motion, then manually registering the reconstructions to floor plans gathered from the Internet, from which we can derive correspondence between images and floor plans. C3 contains 90K paired floor plans and photos across 597 scenes with 153M pixel-level correspondences and 85K camera poses. We find that state-of-the-art correspondence models struggle on this task. By training on our new data, we can improve on the best performing method by 34\% in RMSE. We also identify open challenges in cross-modal geometric reasoning that our dataset aims to help address. Our project website is available at: \url{https://c3po-correspondence.github.io/}.

View full details

Poster

Solving Inequality Proofs with Large Language Models

Jiayi Sheng ⋅ Luna Lyu ⋅ Jikai Jin ⋅ Tanglin Xia ⋅ Alex Gu ⋅ James Zou ⋅ Pan Lu

Dec 3, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Inequality proving, crucial across diverse scientific and mathematical fields, tests advanced reasoning skills such as discovering tight bounds and strategic theorem application. This makes it a distinct, demanding frontier for large language models (LLMs), offering insights beyond general mathematical problem-solving. Progress in this area is hampered by existing datasets that are often scarce, synthetic, or rigidly formal. We address this by proposing an *informal yet verifiable* task formulation, recasting inequality proving into two automatically checkable subtasks: bound estimation and relation prediction. Building on this, we release *IneqMath*, an expert-curated dataset of Olympiad-level inequalities, including a test set and training corpus enriched with step-wise solutions and theorem annotations. We also develop a novel LLM-as-judge evaluation suite, combining a *final-answer* judge with four specialized *step-wise* judges designed to detect common reasoning flaws. A systematic evaluation of 29 leading LLMs on *IneqMath* reveals a surprising reality: even top models like o1 achieve less than 10% overall accuracy under step-wise scrutiny; this is a drop of up to 65.5% from their accuracy considering only final answer equivalence. This discrepancy exposes fragile deductive chains and a critical gap for current LLMs between merely finding an answer and constructing a rigorous proof. Scaling model size and increasing test-time computation yield limited gains in overall proof correctness. Instead, our findings highlight promising research directions such as theorem-guided reasoning and self-refinement.

View full details

Poster

MetaBox-v2: A Unified Benchmark Platform for Meta-Black-Box Optimization

Zeyuan Ma ⋅ Yue-Jiao Gong ⋅ Hongshu Guo ⋅ Wenjie Qiu ⋅ Sijie Ma ⋅ Hongqiao Lian ⋅ Jiajun Zhan ⋅ Kaixu Chen ⋅ Chen Wang ⋅ Zhiyang Huang ⋅ Zechuan Huang ⋅ Guojun Peng ⋅ Ran Cheng ⋅ Yining Ma

Dec 3, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Meta-Black-Box Optimization (MetaBBO) streamlines the automation of optimization algorithm design through meta-learning. It typically employs a bi-level structure: the meta-level policy undergoes meta-training to reduce the manual effort required in developing algorithms for low-level optimization tasks. The original MetaBox (2023) provided the first open-source framework for reinforcement learning-based single-objective MetaBBO. However, its relatively narrow scope no longer keep pace with the swift advancement in this field. In this paper, we introduce MetaBox-v2 (\url{https://github.com/MetaEvo/MetaBox}) as a milestone upgrade with four novel features: 1) a unified architecture supporting RL, evolutionary, and gradient-based approaches, by which we reproduce $23$ up-to-date baselines; 2) efficient parallelization schemes, which reduce the training/testing time by $10-40$x; 3) a comprehensive benchmark suite of $18$ synthetic/realistic tasks ($1900$+ instances) spanning single-objective, multi-objective, multi-model, and multi-task optimization scenarios; 4) plentiful and extensible interfaces for custom analysis/visualization and integrating to external optimization tools/benchmarks. To show the utility of MetaBox-v2, we carry out a systematic case study that evaluates the built-in baselines in terms of the optimization performance, generalization ability and learning efficiency. Valuable insights are concluded from thorough and detailed analysis for practitioners and those new to the field.

View full details

Poster

Envisioning Beyond the Pixels: Benchmarking Reasoning-Informed Visual Editing

Xiangyu Zhao ⋅ Peiyuan Zhang ⋅ Kexian Tang ⋅ Xiaorong Zhu ⋅ Hao Li ⋅ Wenhao Chai ⋅ Zicheng Zhang ⋅ Renqiu Xia ⋅ Guangtao Zhai ⋅ Junchi Yan ⋅ Hua Yang ⋅ Xue Yang ⋅ Haodong Duan

Dec 3, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Large Multi-modality Models (LMMs) have made significant progress in visual understanding and generation, but they still face challenges in General Visual Editing, particularly in following complex instructions, preserving appearance consistency, and supporting flexible input formats. To study this gap, we introduce RISEBench, the first benchmark for evaluating Reasoning-Informed viSual Editing (RISE). RISEBench focuses on four key reasoning categories: Temporal, Causal, Spatial, and Logical Reasoning. We curate high-quality test cases for each category and propose an robust evaluation framework that assesses Instruction Reasoning, Appearance Consistency, and Visual Plausibility with both human judges and the LMM-as-a-judge approach. We conducted experiments evaluating nine prominent visual editing models, comprising both open-source and proprietary models. The evaluation results demonstrate that current models face significant challenges in reasoning-based editing tasks. Even the most powerful model evaluated, GPT-image-1, achieves an accuracy of merely 28.8%. RISEBench effectively highlights the limitations of contemporary editing models, provides valuable insights, and indicates potential future directions for the field of reasoning-aware visual editing. Our code and data have been released at https://github.com/PhoenixZ810/RISEBench.

View full details

Poster

BMMR: A Large-Scale Bilingual Multimodal Multi-Discipline Reasoning Dataset

Zhiheng Xi ⋅ Guanyu Li ⋅ Yutao Fan ⋅ Honglin Guo ⋅ Yufang Liu ⋅ Xiaoran Fan ⋅ Jiaqi Liu ⋅ dingjinchao ⋅ Wangmeng Zuo ⋅ Zhenfei Yin ⋅ LEI BAI ⋅ Tao Ji ⋅ Tao Gui ⋅ Qi Zhang ⋅ Xuanjing Huang

Dec 3, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

In this paper, we introduce BMMR, a large-scale bilingual, multimodal, multi-disciplinary reasoning dataset for the community to develop and evaluate large multimodal models (LMMs). BMMR comprises 100k university-level questions drawn from 300 UNESCO-defined subjects, spanning diverse formats—multiple-choice, fill-in-the-blank, and open-ended QA—and sourced from both print and digital media such as books, exams, and quizzes. All data are curated and filtered via a human-in-the-loop, automated, and scalable framework, and each instance is paired with a high-quality reasoning path. The dataset is organized into two parts: BMMR-Eval that comprises 20k high-quality instances to comprehensively assess LMMs’ knowledge and reasoning across multiple disciplines in both Chinese and English; and BMMR-Train that contains 80k instances to support further research and development, extending the current focus on mathematical reasoning to diverse disciplines and domains. In addition, we propose the process-based multi-discipline BMMR-Verifier for accurate and fine-grained evaluation of LMMs’ reasoning. Extensive experiments reveal that (i) even SOTA models leave substantial headroom on BMMR-Eval; (ii) reasoning models exhibit discipline bias and outperform LMMs only on specific subjects; (iii) open-source models still trail their proprietary counterparts; and (iv) fine-tuning on BMMR-Train narrows this gap. Additionally, we conduct reasoning-chain analyses using BMMR-Verifier and other in-depth studies, uncovering the challenges LMMs currently face in multidisciplinary reasoning. We will release the data and models, and we believe our work can offers valuable insights and contributions to the community.

View full details

Poster

CausalDynamics: A large‐scale benchmark for structural discovery of dynamical causal models

Benjamin Herdeanu ⋅ Juan Nathaniel ⋅ Carla Roesch ⋅ Jatan Buch ⋅ Gregor Ramien ⋅ Johannes Haux ⋅ Pierre Gentine

Dec 3, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Causal discovery for dynamical systems poses a major challenge in fields where active interventions are infeasible. Most methods used to investigate these systems and their associated benchmarks are tailored to deterministic, low-dimensional and weakly nonlinear time-series data. To address these limitations, we present *CausalDynamics*, a large-scale benchmark and extensible data generation framework to advance the structural discovery of dynamical causal models. Our benchmark consists of true causal graphs derived from thousands of both linearly and nonlinearly coupled ordinary and stochastic differential equations as well as two idealized climate models. We perform a comprehensive evaluation of state-of-the-art causal discovery algorithms for graph reconstruction on systems with noisy, confounded, and lagged dynamics. *CausalDynamics* consists of a plug-and-play, build-your-own coupling workflow that enables the construction of a hierarchy of physical systems. We anticipate that our framework will facilitate the development of robust causal discovery algorithms that are broadly applicable across domains while addressing their unique challenges. We provide a user-friendly implementation and documentation on https://kausable.github.io/CausalDynamics.

View full details

Poster

Robo2VLM: Improving Visual Question Answering using Large-Scale Robot Manipulation Data

Kaiyuan Eric Chen ⋅ Shuangyu Xie ⋅ Zehan Ma ⋅ Pannag Sanketi ⋅ Ken Goldberg

Dec 3, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Vision-Language Models (VLMs) acquire real-world knowledge and general reasoning ability through Internet-scale image-text corpora. They can augment robotic systems with scene understanding and task planning, and assist visuomotor policies that are trained on robot trajectory data. We explore the reverse paradigm — using rich, real, multi-modal robot trajectory data to enhance and evaluate VLMs. In this paper, we present Robo2VLM, a Visual Question Answering (VQA) dataset generation framework for VLMs. Given a human tele-operated robot demonstration with video and robot data, Robo2VLM derives ground-truth from non-visual and non-descriptive sensory modalities, such as end-effector pose, gripper aperture, and force sensing. Based on these modalities, it segments the robot trajectory into a sequence of manipulation phases. At each phase, Robo2VLM uses scene and interaction understanding to identify 3D properties of the robot, task goal, and the target object. The properties are used to generate representative VQA queries – images with textural multiple-choice questions – based on spatial, goal-conditioned, and interaction reasoning question templates. We use a subset of Open X-Embodiment to generate Robo2VLM-1, a large-scale in-the-wild dataset with 684,710 questions based on 463 distinct scenes and 3,396 robotic manipulation tasks from 176k real robot trajectories. Results suggest that Robo2VLM-1 can benchmark and improve VLM capabilities in spatial and interaction reasoning.

View full details

Poster

SmokeViz: A Large-Scale Satellite Dataset for Wildfire Smoke Detection and Segmentation

Rey Koki ⋅ Michael McCabe ⋅ Dhruv Kedar ⋅ Josh Myers-Dean ⋅ Annabel Wade ⋅ Jebb Stewart ⋅ Christina Kumler-Bonfanti ⋅ Jed Brown

Dec 3, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

The global rise in wildfire frequency and intensity over the past decade underscores the need for improved fire monitoring techniques. To advance deep learning research on wildfire detection and its associated human health impacts, we introduce **SmokeViz**, a large-scale machine learning dataset of smoke plumes in satellite imagery. The dataset is derived from expert annotations created by smoke analysts at the National Oceanic and Atmospheric Administration, which provide coarse temporal and spatial approximations of smoke presence. To enhance annotation precision, we propose **pseudo-label dimension reduction (PLDR)**, a generalizable method that applies pseudo-labeling to refine datasets with mismatching temporal and/or spatial resolutions. Unlike typical pseudo-labeling applications that aim to increase the number of labeled samples, PLDR maintains the original labels but increases the dataset quality by solving for intermediary pseudo-labels (IPLs) that align each annotation to the most representative input data. For SmokeViz, a parent model produces IPLs to identify the single satellite image within each annotations time window that best corresponds with the smoke plume. This refinement process produces a succinct and relevant deep learning dataset consisting of over 160,000 manual annotations. The SmokeViz dataset is expected to be a valuable resource to develop further wildfire-related machine learning models and is publicly available at \url{https://noaa-gsl-experimental-pds.s3.amazonaws.com/index.html#SmokeViz/}.

View full details

Poster

BO4Mob: Bayesian Optimization Benchmarks for High-Dimensional Urban Mobility Problem

Seunghee Ryu ⋅ Donghoon Kwon ⋅ Seongjin Choi ⋅ Aryan Deshwal ⋅ Seungmo Kang ⋅ Carolina Osorio

Dec 3, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

We introduce BO4Mob, a new benchmark framework for high-dimensional Bayesian Optimization (BO), driven by the challenge of origin-destination (OD) travel demand estimation in large urban road networks. Estimating OD travel demand from limited traffic sensor data is a difficult inverse optimization problem, particularly in real-world, large-scale transportation networks. This problem involves optimizing over high-dimensional continuous spaces where each objective evaluation is computationally expensive, stochastic, and non-differentiable. BO4Mob comprises five scenarios based on real-world San Jose, CA road networks, with input dimensions scaling up to 10,100. These scenarios utilize high-resolution, open-source traffic simulations that incorporate realistic nonlinear and stochastic dynamics. We demonstrate the benchmark's utility by evaluating five optimization methods: three state-of-the-art BO algorithms and two non-BO baselines. This benchmark is designed to support both the development of scalable optimization algorithms and their application for the design of data-driven urban mobility models, including high-resolution digital twins of metropolitan road networks. Code and documentation are available at https://github.com/UMN-Choi-Lab/BO4Mob.

View full details

Poster

BikeBench: A Bicycle Design Benchmark for Generative Models with Objectives and Constraints

Lyle Regenwetter ⋅ Yazan Abu Obaideh ⋅ Fabien Chiotti ⋅ Ioanna Lykourentzou ⋅ Faez Ahmed

Dec 3, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

We introduce BikeBench, an engineering design benchmark for evaluating generative models on problems with multiple real-world objectives and constraints. As generative AI's reach continues to grow, evaluating its capability to understand physical laws, human guidelines, and hard constraints grows increasingly important. Engineering product design lies at the intersection of these difficult tasks, providing new challenges for AI capabilities. BikeBench evaluates AI models' capabilities to generate bicycle designs that not only resemble the dataset, but meet specific performance objectives and constraints. To do so, BikeBench quantifies a variety of human-centered and multiphysics performance characteristics, such as aerodynamics, ergonomics, structural mechanics, human-rated usability, and similarity to subjective text or image prompts. Supporting the benchmark are several datasets of simulation results, a dataset of 10,000 human-rated bicycle assessments, and a synthetically generated dataset of 1.6M designs, each with a parametric, CAD/XML, SVG, and PNG representation. BikeBench is uniquely configured to evaluate tabular generative models, large language models (LLMs), design optimization, and hybrid algorithms side-by-side. Our experiments indicate that LLMs and tabular generative models fall short of hybrid GenAI+optimization algorithms in design quality, constraint satisfaction, and similarity scores, suggesting significant room for improvement. We hope that BikeBench, a first-of-its-kind benchmark, will help catalyze progress in generative AI for constrained multi-objective engineering design problems. We provide code, data, an interactive leaderboard, and other resources at https://github.com/Lyleregenwetter/BikeBench.

View full details

Poster

DCAD-2000: A Multilingual Dataset across 2000+ Languages with Data Cleaning as Anomaly Detection

Yingli Shen ⋅ Wen Lai ⋅ Shuo Wang ⋅ Xueren Zhang ⋅ Kangyang Luo ⋅ Alexander Fraser ⋅ Maosong Sun

Dec 3, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

The rapid development of multilingual large language models (LLMs) highlights the need for high-quality, diverse, and well-curated multilingual datasets. In this paper, we introduce DCAD-2000 (Data Cleaning as Anomaly Detection), a large-scale multilingual corpus constructed from newly extracted Common Crawl data and existing multilingual sources. DCAD-2000 covers 2,282 languages, 46.72TB of text, and 8.63 billion documents, spanning 155 high- and medium-resource languages and 159 writing scripts. To overcome the limitations of existing data cleaning approaches, which rely on manually designed heuristic thresholds, we reframe data cleaning as an anomaly detection problem. This dynamic filtering paradigm substantially improves data quality by automatically identifying and removing noisy or anomalous content. By fine-tuning LLMs on DCAD-2000, we demonstrate notable improvements in data quality, robustness of the cleaning pipeline, and downstream performance, particularly for low-resource languages across multiple multilingual benchmarks.

View full details

Poster

AHa-Bench: Benchmarking Audio Hallucinations in Large Audio-Language Models

Xize Cheng ⋅ Dongjie Fu ⋅ Chenyuhao Wen ⋅ Shannon Yu ⋅ Zehan Wang ⋅ Shengpeng Ji ⋅ Siddhant Arora ⋅ Tao Jin ⋅ Shinji Watanabe ⋅ Zhou Zhao

Dec 3, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Hallucinations present a significant challenge in the development and evaluation of large language models (LLMs), directly affecting their reliability and accuracy. While notable advancements have been made in research on textual and visual hallucinations, there is still a lack of a comprehensive benchmark for evaluating auditory hallucinations in large audio language models (LALMs). To fill this gap, we introduce **AHa-Bench**, a systematic and comprehensive benchmark for audio hallucinations. Audio data, in particular, uniquely combines the multi-attribute complexity of visual data with the semantic richness of textual data, leading to auditory hallucinations that share characteristics with both visual and textual hallucinations. Based on the source of these hallucinations, AHa-Bench categorizes them into semantic hallucinations, acoustic hallucinations, and semantic-acoustic confusion hallucinations. In addition, we systematically evaluate seven open-source local perception language models (LALMs), demonstrating the challenges these models face in audio understanding, especially when it comes to jointly understanding semantic and acoustic information. Through the development of a comprehensive evaluation framework, AHa-Bench aims to enhance the robustness and stability of LALMs, fostering more reliable and nuanced audio understanding in LALMs. The benchmark dataset is available at \url{https://huggingface.co/datasets/ahabench/AHa-Bench}.

View full details

Poster

OligoGym: Curated Datasets and Benchmarks for Oligonucleotide Drug Discovery

Rachapun Rotrattanadumrong ⋅ Carlo De Donno

Dec 3, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Oligonucleotide therapeutics offer great potential to address previously undruggable targets and enable personalized medicine. However, their progress is often hindered by insufficient safety and efficacy profiles. Predictive modeling and machine learning could significantly accelerate oligonucleotide drug discovery by identifying suboptimal compounds early on, but their application in this area lags behind other modalities. A key obstacle to the adoption of machine learning in the field is the scarcity of readily accessible and standardized datasets for model development, as data are often scattered across diverse experiments with inconsistent molecular representations. To overcome this challenge, we introduce OligoGym, a curated collection of standardized, machine learning-ready datasets encompassing various oligonucleotide therapeutic modalities and endpoints. We used OligoGym to benchmark diverse classical and deep learning methods, establishing performance baselines for each dataset across different featurization techniques, model configurations, and splitting strategies. Our work represents a crucial first step in creating a more unified framework for oligonucleotide therapeutic dataset generation and model training.

View full details

Poster

FailureSensorIQ: A Multi-Choice QA Dataset for Understanding Sensor Relationships and Failure Modes

Christodoulos Constantinides ⋅ Dhaval Patel ⋅ Shuxin Lin ⋅ Claudio Guerrero ⋅ Sunil Patil ⋅ Jayant Kalagnanam

Dec 4, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

We introduce FailureSensorIQ, a novel Multi-Choice Question-Answering (MCQA) benchmarking system designed to assess the ability of Large Language Models (LLMs) to reason and understand complex, domain-specific scenarios in Industry 4.0. Unlike traditional QA benchmarks, our system focuses on multiple aspects of reasoning through failure modes, sensor data, and the relationships between them across various industrial assets. Through this work, we envision a paradigm shift where modeling decisions are not only data-driven using statistical tools like correlation analysis and significance tests, but also domain-driven by specialized LLMs which can reason about the key contributors and useful patterns that can be captured with feature engineering.We evaluate the Industrial knowledge of over a dozen LLMs including GPT-4, Llama, and Mistral on FailureSensorIQ from different lens using Perturbation-Uncertainty-Complexity analysis, Expert Evaluation study, Asset-Specific Knowledge Gap analysis, ReAct agent using external knowledge-bases.Even though closed-source models with strong reasoning capabilities approach expert-level performance, the comprehensive benchmark reveals a significant drop in performance that is fragile to perturbations, distractions, and inherent knowledge gaps in the models.We also provide a real-world case study of how LLMs can drive the modeling decisions on 3 different failure prediction datasets related to various assets.We release: (a) expert-curated MCQA for various industrial assets, (b) FailureSensorIQ benchmark and Hugging Face leaderboard based on MCQA built from non-textual data found in ISO documents, and (c) ``LLMFeatureSelector'', an LLM-based feature selection scikit-learn pipeline. The software is available at https://github.com/IBM/FailureSensorIQ.

View full details

Poster

Measuring Fingerprints of Web-filtered Text Datasets and Fingerprint Propagation Through Training

Youssef Mansour ⋅ Reinhard Heckel

Dec 4, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

We investigate fingerprints in pretraining datasets for large language models (LLMs) through dataset classification experiments. Building on prior work demonstrating the existence of fingerprints or biases in popular computer vision datasets, we analyze popular open-source pretraining datasets for LLMs derived from CommonCrawl including C4, RefinedWeb, DolmaCC, RedPajama-V2, FineWeb, and DCLM-Baseline. Despite those datasets being obtained with similar curation steps, neural networks can classify surprisingly well which dataset a single text sequence belongs to, significantly better than a human can. This indicates that small differences in filtering and processing pipelines induce fingerprints, that we find are evident in formatting, vocabulary, and content distributions. Such fingerprints can negatively impact cross-dataset generalization. Additionally, we show that these fingerprints propagate through training: sequences generated by models trained on those datasets can be accurately classified by a classifier trained on the original datasets. This can offer insights into data characteristics that are typically undisclosed by LLM developers, including pretraining mixture proportions and finetuning data sources.

View full details

Poster

CGBench: Benchmarking Language Model Scientific Reasoning for Clinical Genetics Research

Owen Queen ⋅ Harrison Zhang ⋅ James Zou

Dec 4, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Variant and gene interpretation are fundamental to personalized medicine and translational biomedicine. However, traditional approaches are manual and labor-intensive. Generative language models (LMs) can facilitate this process, accelerating the translation of fundamental research into clinically-actionable insights. While existing benchmarks have attempted to quantify the capabilities of LMs for interpreting scientific data, these studies focus on narrow tasks that do not translate to real-world research. To meet these challenges, we introduce CGBench, a robust benchmark that tests reasoning capabilities of LMs on scientific publications. CGBench is built from ClinGen, a resource of expert-curated literature interpretations in clinical genetics. CGBench measures the ability to 1) extract relevant experimental results following precise protocols and guidelines, 2) judge the strength of evidence, and 3) categorize and describe the relevant outcome of experiments. We test 8 different LMs and find that while models show promise, substantial gaps exist in literature interpretation, especially on fine-grained instructions. Reasoning models excel in fine-grained tasks but non-reasoning models are better at high-level interpretations. Finally, we measure LM explanations against human explanations with an LM judge approach, revealing that models often hallucinate or misinterpret results even when correctly classifying evidence. CGBench reveals strengths and weaknesses of LMs for precise interpretation of scientific publications, opening avenues for future research in AI for clinical genetics and science more broadly.

View full details

Poster

AgentRecBench: Benchmarking LLM Agent-based Personalized Recommender Systems

Yu Shang ⋅ Peijie Liu ⋅ Yuwei Yan ⋅ Zijing Wu ⋅ Leheng Sheng ⋅ Yuanqing Yu ⋅ Chumeng Jiang ⋅ An Zhang ⋅ Fengli Xu ⋅ Yu Wang ⋅ Min Zhang ⋅ Yong Li

Dec 4, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

The emergence of agentic recommender systems powered by Large Language Models (LLMs) represents a paradigm shift in personalized recommendations, leveraging LLMs’ advanced reasoning and role-playing capabilities to enable autonomous, adaptive decision-making. Unlike traditional recommendation approaches, agentic recommender systems can dynamically gather and interpret user-item interactions from complex environments, generating robust recommendation strategies that generalize across diverse scenarios. However, the field currently lacks standardized evaluation protocols to systematically assess these methods. To address this critical gap, we propose: (1) an interactive textual recommendation simulator incorporating rich user and item metadata and three typical evaluation scenarios (classic, evolving-interest, and cold-start recommendation tasks); (2) a unified modular framework for developing agentic recommender systems; and (3) the first comprehensive benchmark comparing over 10 classical and agentic recommendation methods. Our findings demonstrate the superiority of agentic systems and establish actionable design guidelines for their core components. The benchmark environment has been rigorously validated through an open challenge and remains publicly available with a maintained leaderboard at https://tsinghua-fib-lab.github.io/AgentSocietyChallenge/pages/overview.html. The benchmark is available at: https://huggingface.co/datasets/SGJQovo/AgentRecBench.

View full details

Poster

Sekai: A Video Dataset towards World Exploration

Zhen Li ⋅ Chuanhao Li ⋅ Xiaofeng Mao ⋅ Shaoheng Lin ⋅ Ming Li ⋅ Shitian Zhao ⋅ Zhaopan Xu ⋅ Xinyue Li ⋅ Yukang Feng ⋅ Jianwen Sun ⋅ Zizhen Li ⋅ Fanrui Zhang ⋅ Jiaxin Ai ⋅ Zhixiang Wang ⋅ Yuwei Wu ⋅ Tong He ⋅ Yunde Jia ⋅ Kaipeng Zhang

Dec 4, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Video generation techniques have made remarkable progress, promising to be the foundation of interactive world exploration.However, existing video generation datasets are not well-suited for world exploration training as they suffer from some limitations: limited locations, short duration, static scenes, and a lack of annotations about exploration and the world.In this paper, we introduce Sekai (meaning "world" in Japanese), a high-quality first-person view worldwide video dataset with rich annotations for world exploration. It consists of over 5,000 hours of walking or drone view (FPV and UVA) videos from over 100 countries and regions across 750 cities. We develop an efficient and effective toolbox to collect, pre-process and annotate videos with location, scene, weather, crowd density, captions, and camera trajectories.Comprehensive analyses and experiments demonstrate the dataset’s scale, diversity, annotation quality, and effectiveness for training video generation models.We believe Sekai will benefit the area of video generation and world exploration, and motivate valuable applications.

View full details

Poster

DanmakuTPPBench: A Multi-modal Benchmark for Temporal Point Process Modeling and Understanding

Yue Jiang ⋅ Jichu Li ⋅ Yang Liu ⋅ Dingkang Yang ⋅ Feng Zhou ⋅ Quyu Kong

Dec 4, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

We introduce DanmakuTPPBench, a comprehensive benchmark designed to advance multi-modal Temporal Point Process (TPP) modeling in the era of Large Language Models (LLMs). While TPPs have been widely studied for modeling temporal event sequences, existing datasets are predominantly unimodal, hindering progress in models that require joint reasoning over temporal, textual, and visual information. To address this gap, DanmakuTPPBench comprises two complementary components:(1) DanmakuTPP-Events, a novel dataset derived from the Bilibili video platform, where user-generated bullet comments (Danmaku) naturally form multi-modal events annotated with precise timestamps, rich textual content, and corresponding video frames;(2) DanmakuTPP-QA, a challenging question-answering dataset constructed via a novel multi-agent pipeline powered by state-of-the-art LLMs and multi-modal LLMs (MLLMs), targeting complex temporal-textual-visual reasoning. We conduct extensive evaluations using both classical TPP models and recent MLLMs, revealing significant performance gaps and limitations in current methods’ ability to model multi-modal event dynamics. Our benchmark establishes strong baselines and calls for further integration of TPP modeling into the multi-modal language modeling landscape. Project page: https://github.com/FRENKIE-CHIANG/DanmakuTPPBench.

View full details

Poster

Evaluating Program Semantics Reasoning with Type Inference in System $F$

Yifeng He ⋅ Luning Yang ⋅ Christopher Gonzalo ⋅ Hao Chen

Dec 4, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Large Language Models (LLMs) are increasingly integrated into the software engineering ecosystem.Their test-time compute reasoning capabilities promise significant potential in understanding program logic and semantics beyond mere token recognition. However, current benchmarks evaluating reasoning LLMs for code lack a formal, program-centric deductive framework for the soundness of evaluation, incompetent in assessing of whether models genuinely reason about program semantics or merely associate superficial connections between natural language and code tokens. To bridge this gap, we introduce TF-Bench, a benchmark designed to evaluate LLM reasoning based on type inference in System F, a task we refer to as *program semantics reasoning*. By employing verified transformations to remove semantically irrelevant natural language,we construct TF-Bench_pure, a purely semantics-driven variant of TF-Bench. Our analysis reveals substantial limitations in state-of-the-art LLMs, with the best-performing LLM (Claude-3.7-sonnet) achieving only $55.85\%$ accuracy on TF-Bench_pure. Additionally, we propose two novel metrics to assess the robustness and effectiveness of extended reasoning,underscoring the critical limitation in current LLM capabilities and highlighting essential directions for future research.

View full details

Poster

Why Do Multi-Agent LLM Systems Fail?

Mert Cemri ⋅ Melissa Z Pan ⋅ Shuyi Yang ⋅ Lakshya A Agrawal ⋅ Bhavya Chopra ⋅ Rishabh Tiwari ⋅ Kurt Keutzer ⋅ Aditya Parameswaran ⋅ Dan Klein ⋅ Kannan Ramchandran ⋅ Matei A Zaharia ⋅ Joseph Gonzalez ⋅ Ion Stoica

Dec 4, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Despite enthusiasm for Multi-Agent LLM Systems (MAS), their performance gains on popular benchmarks are often minimal. This gap highlights a critical need for a principled understanding of why MAS fail. Addressing this question requires systematic identification and analysis of failure patterns. We introduce MAST-Data, a comprehensive dataset of 1600+ annotated traces collected across 7 popular MAS frameworks. MAST-Data is the first multi-agent system dataset to outline the failure dynamics in MAS for guiding the development of better future systems. To enable systematic classification of failures for MAST-Data, we build the first Multi-Agent System Failure Taxonomy (MAST). We develop MAST through rigorous analysis of 150 traces, guided closely by expert human annotators andvalidated by high inter-annotator agreement (κ = 0.88). This process identifies 14 unique modes, clustered into 3 categories: (i) system design issues, (ii) inter-agent misalignment, and (iii) task verification. To enable scalable annotation, we develop an LLM-as-a-Judge pipeline with high agreement with human annotations. We leverage MAST and MAST-Data to analyze failure patterns across models (GPT4, Claude 3, Qwen2.5, CodeLlama) and tasks (coding, math, general agent), demonstrating improvement headrooms from better MAS design. Our analysis provides insights revealing that identified failures require more sophisticated solutions, highlighting a clear roadmap for future research. We publicly release our comprehensive dataset (MAST-Data), the MAST, and our LLM annotator to facilitate widespread research and development in MAS.

View full details

Poster

TransferBench: Benchmarking Ensemble-based Black-box Transfer Attacks

Fabio Brau ⋅ Maura Pintor ⋅ Antonio Cinà ⋅ Raffaele Mura ⋅ Luca Scionis ⋅ Luca Oneto ⋅ Fabio Roli ⋅ Battista Biggio

Dec 4, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Ensemble-based black-box transfer attacks optimize adversarial examples on a set of surrogate models, claiming to reach high success rates by querying the (unknown) target model only a few times. In this work, we show that prior evaluations are systematically biased, as such methods are tested only under overly optimistic scenarios, without considering (i) how the choice of surrogate models influences transferability, (ii) how they perform against robust target models, and (iii) whether querying the target to refine the attack is really required.To address these gaps, we introduce TransferBench, a framework for evaluating ensemble-based black-box transfer attacks under more realistic and challenging scenarios than prior work. Our framework considers 17 distinct settings on CIFAR-10 and ImageNet, including diverse surrogate-target combinations, robust targets, and comparisons to baseline methods that do not use any query-based refinement mechanism. Our findings reveal that existing methods fail to generalize to more challenging scenarios, and that query-based refinement offers little to no benefit, contradicting prior claims. These results highlight that building reliable and query-efficient black-box transfer attacks remains an open challenge. We release our benchmark and evaluation code at: https://github.com/pralab/transfer-bench.

View full details

Poster

DisasterM3: A Remote Sensing Vision-Language Dataset for Disaster Damage Assessment and Response

Junjue Wang ⋅ Weihao Xuan ⋅ Heli Qi ⋅ Zhihao Liu ⋅ Kunyi Liu ⋅ Yuhan Wu ⋅ Hongruixuan Chen ⋅ JIAN SONG ⋅ Junshi Xia ⋅ Zhuo Zheng ⋅ Naoto YOKOYA

Dec 4, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Large vision-language models (VLMs) have made great achievements in Earth vision. However, complex disaster scenes with diverse disaster types, geographic regions, and satellite sensors have posed new challenges for VLM applications. To fill this gap, we curate the first remote sensing vision-language dataset (DisasterM3) for global-scale disaster assessment and response. DisasterM3 includes 26,988 bi-temporal satellite images and 123k instruction pairs across 5 continents, with three characteristics: **1) Multi-hazard**: DisasterM3 involves 36 historical disaster events with significant impacts, which are categorized into 10 common natural and man-made disasters. **2) Multi-sensor**: Extreme weather during disasters often hinders optical sensor imaging, making it necessary to combine Synthetic Aperture Radar (SAR) imagery for post-disaster scenes. **3) Multi-task**: Based on real-world scenarios, DisasterM3 includes 9 disaster-related visual perception and reasoning tasks, harnessing the full potential of VLM's reasoning ability with progressing from disaster-bearing body recognition to structural damage assessment and object relational reasoning, culminating in the generation of long-form disaster reports. We extensively evaluated 14 generic and remote sensing VLMs on our benchmark, revealing that state-of-the-art models struggle with the disaster tasks, largely due to the lack of a disaster-specific corpus, cross-sensor gap, and damage object counting insensitivity. Focusing on these issues, we fine-tune four VLMs using our dataset and achieve stable improvements (up to 10.4\%$\uparrow$QA, 2.1$\uparrow$Report, 40.8\%$\uparrow$Referring Seg.) with robust cross-sensor and cross-disaster generalization capabilities. Project: https://github.com/Junjue-Wang/DisasterM3.

View full details

Poster

Toward Engineering AGI: Benchmarking the Engineering Design Capabilities of LLMs

Xingang Guo ⋅ Yaxin Li ⋅ XiangYi Kong ⋅ YILAN JIANG ⋅ Xiayu Zhao ⋅ Zhihua Gong ⋅ Yufan Zhang ⋅ Daixuan Li ⋅ Tianle Sang ⋅ Beixiao Zhu ⋅ Gregory Jun ⋅ Yingbing Huang ⋅ Yiqi Liu ⋅ Yuqi Xue ⋅ Rahul Dev Kundu ⋅ Qi Lim ⋅ Yizhou Zhao ⋅ Luke Granger ⋅ Mohamed Younis ⋅ Darioush Keivan ⋅ Nippun Sabharwal ⋅ Shreyanka Sinha ⋅ Prakhar Agarwal ⋅ Kojo Vandyck ⋅ Hanlin Mai ⋅ Zichen Wang ⋅ Aditya Venkatesh ⋅ Ayush Barik ⋅ Jiankun Yang ⋅ Chongying Yue ⋅ Jingjie He ⋅ Libin Wang ⋅ Licheng Xu ⋅ Hao Chen ⋅ Jinwen Wang ⋅ Liujun Xu ⋅ Rushabh Shetty ⋅ Ziheng Guo ⋅ Dahui Song ⋅ Manvi Jha ⋅ Weijie Liang ⋅ Weiman Yan ⋅ Bryan Zhang ⋅ Sahil Bhandary Karnoor ⋅ Jialiang Zhang ⋅ Rutva Pandya ⋅ Xinyi Gong ⋅ Mithesh Ganesh ⋅ Feize Shi ⋅ Ruiling Xu ⋅ Yifan Zhang ⋅ Yanfeng Ouyang ⋅ Lianhui Qin ⋅ Elyse Rosenbaum ⋅ Corey Snyder ⋅ Peter Seiler ⋅ Geir Dullerud ⋅ Xiaojia Zhang ⋅ Zuofu Cheng ⋅ Pavan Kumar Hanumolu ⋅ Jian Huang ⋅ Mayank Kulkarni ⋅ Mahdi Namazifar ⋅ Huan Zhang ⋅ Bin Hu

Dec 4, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Modern engineering, spanning electrical, mechanical, aerospace, civil, and computer disciplines, stands as a cornerstone of human civilization and the foundation of our society. However, engineering design poses a fundamentally different challenge for large language models (LLMs) compared with traditional textbook-style problem solving or factual question answering. Although existing benchmarks have driven progress in areas such as language understanding, code synthesis, and scientific problem solving, real-world engineering design demands the synthesis of domain knowledge, navigation of complex trade-offs, and management of the tedious processes that consume much of practicing engineers' time. Despite these shared challenges across engineering disciplines, no benchmark currently captures the unique demands of engineering design work. In this work, we introduce EngDesign, an Engineering Design benchmark that evaluates LLMs' abilities to perform practical design tasks across nine engineering domains. Unlike existing benchmarks that focus on factual recall or question answering, EngDesign uniquely emphasizes LLMs' ability to synthesize domain knowledge, reason under constraints, and generate functional, objective-oriented engineering designs. Each task in EngDesign represents a real-world engineering design problem, accompanied by a detailed task description specifying design goals, constraints, and performance requirements. EngDesign pioneers a simulation-based evaluation paradigm that moves beyond textbook knowledge to assess genuine engineering design capabilities and shifts evaluation from static answer checking to dynamic, simulation-driven functional verification, marking a crucial step toward realizing the vision of engineering Artificial General Intelligence (AGI).

View full details

Poster

Clean First, Align Later: Benchmarking Preference Data Cleaning for Reliable LLM Alignment

Samuel (Min-Hsuan) Yeh ⋅ Sharon Li

Dec 4, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Human feedback plays a pivotal role in aligning large language models (LLMs) with human preferences. However, such feedback is often noisy or inconsistent, which can degrade the quality of reward models and hinder alignment. While various automated data cleaning methods have been proposed to mitigate this issue, a systematic evaluation of their effectiveness and generalizability remains lacking. To bridge this gap, we introduce the first comprehensive benchmark for evaluating 13 preference data cleaning methods in the context of LLM alignment. Our framework offers a standardized protocol to assess cleaning strategies in terms of alignment performance and generalizability across diverse datasets, model architectures, and optimization algorithms. By unifying disparate methods and rigorously comparing them, we uncover key factors that determine the success of data cleaning in alignment tasks. This benchmark lays the groundwork for principled and reproducible approaches to improving LLM alignment through better data quality—highlighting the crucial but underexplored role of data preprocessing in responsible AI development. We release modular implementations of all methods to catalyze further research: https://github.com/deeplearning-wisc/PrefCleanBench.

View full details

Poster

Mind2Web 2: Evaluating Agentic Search with Agent-as-a-Judge

Boyu Gou ⋅ Zanming Huang ⋅ Yuting Ning ⋅ Yu Gu ⋅ Michael Lin ⋅ Weijian Qi ⋅ Andrei Kopanev ⋅ Botao Yu ⋅ Bernal Jimenez Gutierrez ⋅ Yiheng Shu ⋅ Chan Hee (Luke) Song ⋅ Jiaman Wu ⋅ Shijie Chen ⋅ Hanane Moussa ⋅ TIANSHU ZHANG ⋅ Jian Xie ⋅ Yifei Li ⋅ Tianci Xue ⋅ Zeyi Liao ⋅ Kai Zhang ⋅ Boyuan Zheng ⋅ Zhaowei Cai ⋅ Viktor Rozgic ⋅ Morteza Ziyadi ⋅ Huan Sun ⋅ Yu Su

Dec 4, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Agentic search such as Deep Research systems-where agents autonomously browse the web, synthesize information, and return comprehensive citation-backed answers-represents a major shift in how users interact with web-scale information. While promising greater efficiency and cognitive offloading, the growing complexity and open-endedness of agentic search have outpaced existing evaluation benchmarks and methodologies, which largely assume short search horizons and static answers. In this paper, we introduce Mind2Web 2, a benchmark of 130 realistic, high-quality, and long-horizon tasks that require real-time web browsing and extensive information synthesis, constructed with over 1000 hours of human labor. To address the challenge of evaluating time-varying and complex answers, we propose a novel Agent-as-a-Judge framework. Our method constructs task-specific judge agents based on a tree-structured rubric design to automatically assess both answer correctness and source attribution. We conduct a comprehensive evaluation of ten frontier agentic search systems and human performance, along with a detailed error analysis to draw insights for future development. The best-performing system, OpenAI Deep Research, can already achieve 50-70% of human performance while spending half the time, highlighting its great potential. Altogether, Mind2Web 2 provides a rigorous foundation for developing and benchmarking the next generation of agentic search systems.

View full details

Poster

GlobalTomo: A global dataset for physics-ML seismic wavefield modeling and FWI

Shiqian Li ⋅ Zhi Li ⋅ Zhancun Mu ⋅ Shiji Xin ⋅ Zhixiang Dai ⋅ Kuangdai Leng ⋅ Rita Zhang ⋅ Xiaodong Song ⋅ Yixin Zhu

Dec 4, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Global seismic tomography, taking advantage of seismic waves from natural earthquakes, provides essential insights into the earth's internal dynamics. Advanced Full-Waveform Inversion (FWI) techniques, whose aim is to meticulously interpret every detail in seismograms, confront formidable computational demands in forward modeling and adjoint simulations on a global scale. Recent advancements in Machine Learning (ML) offer a transformative potential for accelerating the computational efficiency of FWI and extending its applicability to larger scales. This work presents the first 3D global synthetic dataset tailored for seismic wavefield modeling and full-waveform tomography, referred to as the Global Tomography (GlobalTomo) dataset. This dataset is comprehensive, incorporating explicit wave physics and robust geophysical parameterization at realistic global scales, generated through state-of-the-art forward simulations optimized for 3D global wavefield calculations. Through extensive analysis and the establishment of ML baselines, we illustrate that ML approaches are particularly suitable for global FWI, overcoming its limitations with rapid forward modeling and flexible inversion strategies. This work represents a cross-disciplinary effort to enhance our understanding of the earth's interior through physics-ML modeling.

View full details

Poster

MUniverse: A Simulation and Benchmarking Suite for Motor Unit Decomposition

Pranav Mamidanna ⋅ Thomas Klotz ⋅ Dimitrios Chalatsis ⋅ Agnese Grison ⋅ Irene Mendez Guerra ⋅ Shihan Ma ⋅ Arnault Caillet ⋅ Simon Avrillon ⋅ Robin Rohlén ⋅ Dario Farina

Dec 4, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Neural source separation enables the extraction of individual spike trains from complex electrophysiological recordings. When applied to electromyographic (EMG) signals, it provides a unique window into the motor output of the nervous system by isolating the spiking activity of motor units (MUs). MU decomposition from EMG signals is currently the only scalable neural interfacing approach available in behaving humans and has become foundational in motor neuroscience and neuroprosthetics. However, unlike related domains such as spike sorting or electroencephalography (EEG) analysis, decomposition of EMG signals lacks open benchmarks that reflect the diversity of muscles, movement contexts, and noise sources encountered in practice.To address this gap, we introduce MUniverse, a modular simulation and benchmarking suite for decomposing EMG signals into individual MU spiking activity. MUniverse provides: (1) a simulation stack with a user-friendly interface to a state-of-the-art EMG generator; (2) a curated library of datasets across synthetic, hybrid synthetic-real data with ground truth spikes, and experimental EMG; (3) a set of internal and external decomposition pipelines; and (4) a unified benchmark with well-defined tasks, standard evaluation metrics, and baseline results from established decomposition pipelines.MUniverse is designed for extensibility, reproducibility, and community use, and all datasets are distributed with standardised metadata (Croissant, BIDS). By standardising evaluation and enabling dataset simulation at scale, MUniverse aims to catalyze progress on this long-standing neural signal processing problem.

View full details

Poster

Jailbreak-AudioBench: In-Depth Evaluation and Analysis of Jailbreak Threats for Large Audio Language Models

Hao Cheng ⋅ Erjia Xiao ⋅ Jing Shao ⋅ Yichi Wang ⋅ Le Yang ⋅ Chao Shen ⋅ Philip Torr ⋅ Jindong Gu ⋅ Renjing Xu

Dec 4, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Large Language Models (LLMs) demonstrate impressive zero-shot performance across a wide range of natural language processing tasks. Integrating various modality encoders further expands their capabilities, giving rise to Multimodal Large Language Models (MLLMs) that process not only text but also visual and auditory modality inputs. However, these advanced capabilities may also pose significant safety problems, as models can be exploited to generate harmful or inappropriate content through jailbreak attack. While prior work has extensively explored how manipulating textual or visual modality inputs can circumvent safeguards in LLMs and MLLMs, the vulnerability of audio-specific Jailbreak on Large Audio-Language Models (LALMs) remains largely underexplored. To address this gap, we introduce \textbf{Jailbreak-AudioBench}, which consists of the Toolbox, curated Dataset, and comprehensive Benchmark. The Toolbox supports not only text-to-audio conversion but also various editing techniques for injecting audio hidden semantics. The curated Dataset provides diverse explicit and implicit jailbreak audio examples in both original and edited forms. Utilizing this dataset, we evaluate multiple state-of-the-art LALMs and establish the most comprehensive Jailbreak benchmark to date for audio modality. Finally, Jailbreak-AudioBench establishes a foundation for advancing future research on LALMs safety alignment by enabling the in-depth exposure of more powerful jailbreak threats, such as query-based audio editing, and by facilitating the development of effective defense mechanisms.

View full details

Poster

STAR: A Benchmark for Astronomical Star Fields Super-Resolution

WU KUO-CHENG ⋅ Guohang Zhuang ⋅ Jinyang Huang ⋅ Xiang Zhang ⋅ Wanli Ouyang ⋅ Yan Lu

Dec 4, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Super-resolution (SR) advances astronomical imaging by enabling cost-effective high-resolution capture, crucial for detecting faraway celestial objects and precise structural analysis. However, existing datasets for astronomical SR (ASR) exhibit three critical limitations: flux inconsistency, object-crop setting, and insufficient data diversity, significantly impeding ASR development. We propose STAR, a large-scale astronomical SR dataset containing 54,738 flux-consistent star field image pairs covering wide celestial regions. These pairs combine Hubble Space Telescope high-resolution observations with physically faithful low-resolution counterparts generated through a flux-preserving data generation pipeline, enabling systematic development of field-level ASR models. To further empower the ASR community, STAR provides a novel Flux Error (FE) to evaluate SR models in physical view. Leveraging this benchmark, we propose a Flux-Invariant Super Resolution (FISR) model that could accurately infer the flux-consistent high-resolution images from input photometry, suppressing several SR state-of-the-art methods by 24.84% on a novel designed flux consistency metric, showing the priority of our method for astrophysics. Extensive experiments demonstrate the effectiveness of our proposed method and the value of our dataset. Code and models are available at https://github.com/GuoCheng12/STAR.

View full details

Poster

MLRC-Bench: Can Language Agents Solve Machine Learning Research Challenges?

Yunxiang Zhang ⋅ Muhammad Khalifa ⋅ Shitanshu Bhushan ⋅ Grant Murphy ⋅ Lajanugen Logeswaran ⋅ Jaekyeom Kim ⋅ Moontae Lee ⋅ Honglak Lee ⋅ Lu Wang

Dec 4, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

We introduce **MLRC-Bench**, a benchmark designed to quantify how effectively language agents can tackle challenging **M**achine **L**earning (ML) **R**esearch **C**ompetitions, with a focus on open research problems that demand novel methodologies.Unlike prior work, e.g., AI Scientist, which evaluates the end-to-end agentic pipeline by using LLM-as-a-judge, MLRC-Bench measures the key steps of proposing and implementing novel research methods and evaluates them with rigorous protocol and objective metrics.Our curated suite of 7 competition tasks reveals significant challenges for LLM agents. Even the best-performing tested agent (gemini-exp-1206 under MLAB) closes only 9.3% of the gap between baseline and top human participant scores.Furthermore, our analysis reveals a misalignment between the *LLM-judged* innovation and their *actual* performance on cutting-edge ML research problems. MLRC-Bench is a dynamic benchmark, which is designed to continually grow with new ML competitions to encourage rigorous and objective evaluations of AI’s research capabilities. Our leaderboard and code are publicly available at https://huggingface.co/spaces/launch/MLRC_Bench.

View full details

Poster

RDB2G-Bench: A Comprehensive Benchmark for Automatic Graph Modeling of Relational Databases

Dongwon Choi ⋅ Sunwoo Kim ⋅ Juyeon Kim ⋅ Kyungho Kim ⋅ Geon Lee ⋅ Shinhwan Kang ⋅ Myunghwan Kim ⋅ Kijung Shin

Dec 4, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Recent advances have demonstrated the effectiveness of graph-based machine learning on relational databases (RDBs) for predictive tasks. Such approaches require transforming RDBs into graphs, a process we refer to as RDB-to-graph modeling, where rows of tables are represented as nodes and foreign-key relationships as edges.Yet, effective modeling of RDBs into graphs remains challenging.Specifically, there exist numerous ways to model RDBs into graphs, and performance on predictive tasks varies significantly depending on the chosen graph model of RDBs.In our analysis, we find that the best-performing graph model can yield up to a 10\% higher performance compared to the common heuristic rule for graph modeling, which remains non-trivial to identify.To foster research on intelligent RDB-to-graph modeling, we introduce RDB2G-Bench, the first benchmark framework for evaluating such methods.We construct extensive datasets covering 5 real-world RDBs and 12 predictive tasks, resulting in around 50k graph model–performance pairs for efficient and reproducible evaluations.Thanks to our precomputed datasets, we were able to benchmark 10 automatic RDB-to-graph modeling methods on the $12$ tasks about 380$\times$ faster than on-the-fly evaluation, which requires repeated GNN training.Our analysis of the datasets and benchmark results reveals key structural patterns affecting graph model effectiveness, along with practical implications for effective graph modeling.Our datasets and code are available at https://github.com/chlehdwon/RDB2G-Bench.

View full details

Poster

A Controllable Examination for Long-Context Language Models

Yijun Yang ⋅ Zeyu Huang ⋅ Wenhao Zhu ⋅ Zihan Qiu ⋅ Fei Yuan ⋅ Jeff Pan ⋅ Ivan Titov

Dec 4, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Existing frameworks for evaluating long-context language models (LCLM) can be broadly categorized into real-world applications (e.g, document summarization) and synthetic tasks (e.g, needle-in-a-haystack). Despite their utility, both approaches are accompanied by certain intrinsic limitations. Real-world tasks often involve complexity that makes interpretation challenging and suffer from data contamination, whereas synthetic tasks frequently lack meaningful coherence between the target information ("needle") and its surrounding context ("haystack"), undermining their validity as proxies for realistic applications. In response to these challenges, we posit that an ideal long-context evaluation framework should be characterized by three essential features: 1) seamless context: coherent contextual integration between target information and its surrounding context; 2) controllable setting: an extensible task setup that enables controlled studies—for example, incorporating additional required abilities such as numerical reasoning; and 3) sound evaluation: avoiding LLM-as-Judge and conduct exact-match to ensure deterministic and reproducible evaluation results.This study introduces $\textbf{LongBioBench}$, a benchmark that utilizes artificially generated biographies as a controlled environment for assessing LCLMs across dimensions of $\textit{understanding}$, $\textit{reasoning}$, and $\textit{trustworthiness}$. Our experimental evaluation, which includes $\textbf{18}$ LCLMs in total, demonstrates that most models still exhibit deficiencies in semantic understanding and elementary reasoning over retrieved results and are less trustworthy as context length increases.Our further analysis indicates some design choices employed by existing synthetic benchmarks, such as contextual non-coherence, numerical needles, and the absence of distractors, rendering them vulnerable to test the model's long-context capabilities.Moreover, we also reveal that long-context continual pretraining primarily adjusts RoPE embedding to accommodate extended context lengths, which in turn yields only marginal improvements in the model’s true capabilities. To sum up, compared to previous synthetic benchmarks, LongBioBench achieves a better trade-off between mirroring authentic language tasks and maintaining controllability, and is highly interpretable and configurable.

View full details

Poster

ConStellaration: A dataset of QI-like stellarator plasma boundaries and optimization benchmarks

Santiago Cadena ⋅ Andrea Merlo ⋅ Emanuel Laude ⋅ Alexander Bauer ⋅ Atul Agrawal ⋅ Maria Pascu ⋅ Marija Savtchouk ⋅ Lukas Bonauer ⋅ Enrico Guiraud ⋅ Stuart Hudson ⋅ Markus Kaiser

Dec 4, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Stellarators are magnetic confinement devices under active development to deliver steady-state carbon-free fusion energy. Their design involves a high-dimensional, constrained optimization problem that requires expensive physics simulations and significant domain expertise. Recent advances in plasma physics and open-source tools have made stellarator optimization more accessible. However, broader community progress is currently bottlenecked by the lack of standardized optimization problems with strong baselines and datasets that enable data-driven approaches, particularly for quasi-isodynamic (QI) stellarator configurations, considered as a promising path to commercial fusion due to their inherent resilience to current-driven disruptions. Here, we release an open dataset of diverse QI-like stellarator plasma boundary shapes, paired with their ideal magnetohydrodynamic (MHD) equilibria and performance metrics. We generated this dataset by sampling a variety of QI fields and optimizing corresponding stellarator plasma boundaries. We introduce three optimization benchmarks of increasing complexity: (1) a single-objective geometric optimization problem, (2) a "simple-to-build" QI stellarator, and (3) a multi-objective ideal-MHD stable QI stellarator that investigates trade-offs between compactness and coil simplicity. For every benchmark, we provide reference code, evaluation scripts, and strong baselines based on classical optimization techniques. Finally, we show how learned models trained on our dataset can efficiently generate novel, feasible configurations without querying expensive physics oracles. By openly releasing the dataset (https://huggingface.co/datasets/proxima-fusion/constellaration) along with benchmark problems and baselines (https://github.com/proximafusion/constellaration), we aim to lower the entry barrier for optimization and machine learning researchers to engage in stellarator design and to accelerate cross-disciplinary progress toward bringing fusion energy to the grid.

View full details

Poster

Toward Real-world Text Image Forgery Localization: Structured and Interpretable Data Synthesis

Zeqin Yu ⋅ Haotao Xie ⋅ Jian Zhang ⋅ Jiangqun Ni ⋅ Wenkang Su ⋅ Jiwu Huang

Dec 4, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Existing Text Image Forgery Localization (T-IFL) methods often suffer from poor generalization due to the limited scale of real-world datasets and the distribution gap caused by synthetic data that fails to capture the complexity of real-world tampering. To tackle this issue, we propose Fourier Series-based Tampering Synthesis (FSTS), a structured and interpretable framework for synthesizing tampered text images. FSTS first collects 16,750 real-world tampering instances from five representative tampering types, using a structured pipeline that records human-performed editing traces via multi-format logs (e.g., video, PSD, and editing logs). By analyzing these collected parameters and identifying recurring behavioral patterns at both individual and population levels, we formulate a hierarchical modeling framework. Specifically, each individual tampering parameter is represented as a compact combination of basis operation–parameter configurations, while the population-level distribution is constructed by aggregating these behaviors. Since this formulation draws inspiration from the Fourier series, it enables an interpretable approximation using basis functions and their learned weights. By sampling from this modeled distribution, FSTS synthesizes diverse and realistic training data that better reflect real-world forgery traces. Extensive experiments across four evaluation protocols demonstrate that models trained with FSTS data achieve significantly improved generalization on real-world datasets. Dataset is available at \href{https://github.com/ZeqinYu/FSTS}{Project Page}.

View full details

Poster

CodeAssistBench (CAB): Dataset & Benchmarking for Multi-turn Chat-Based Code Assistance

Myeongsoo Kim ⋅ Shweta Garg ⋅ Baishakhi Ray ⋅ Varun Kumar ⋅ Anoop Deoras

Dec 4, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Programming assistants powered by large language models have transformed software development, yet most benchmarks focus narrowly on code generation tasks. Recent efforts like InfiBench and StackEval attempt to address this gap using Stack Overflow data but remain limited to single-turn interactions in isolated contexts, require significant manual curation, and fail to represent complete project environments. We introduce CodeAssistBench (CAB), the first benchmark framework for evaluating multi-turn programming assistance in realistic settings that address questions grounded in actual codebases. Unlike existing programming Q&A benchmarks, CAB automatically generates scalable datasets from GitHub issues tagged with questions using configurable parameters (e.g., repository creation date, star count, programming languages), and includes automatic containerization of codebases for evaluation. It then evaluates models through simulated users in these containerized environments with full codebase access. Using this framework, we constructed a test set of 3,286 real-world programming questions across 214 repositories, spanning seven programming languages and diverse problem domains. Our evaluation of leading LLMs reveals a substantial capability gap: while models perform well on Stack Overflow questions with success rates of 70-83%, they resolve only up to 16.49% of CAB's issues from recent repositories (post-training cutoff). This discrepancy highlights the challenges of providing assistance in complex, project-specific contexts versus answering standalone questions. Our fully automated framework enables continuous benchmark expansion and is available at https://github.com/amazon-science/CodeAssistBench/.

View full details

Poster

TabArena: A Living Benchmark for Machine Learning on Tabular Data

Nick Erickson ⋅ Lennart Purucker ⋅ Andrej Tschalzev ⋅ David Holzmüller ⋅ Prateek Desai ⋅ David Salinas ⋅ Frank Hutter

Dec 4, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

With the growing popularity of deep learning and foundation models for tabular data, the need for standardized and reliable benchmarks is higher than ever. However, current benchmarks are static. Their design is not updated even if flaws are discovered, model versions are updated, or new models are released. To address this, we introduce TabArena, the first continuously maintained living tabular benchmarking system. To launch TabArena, we manually curate a representative collection of datasets and well-implemented models, conduct a large-scale benchmarking study to initialize a public leaderboard, and assemble a team of experienced maintainers. Our results highlight the influence of validation method and ensembling of hyperparameter configurations to benchmark models at their full potential. While gradient-boosted trees are still strong contenders on practical tabular datasets, we observe that deep learning methods have caught up under larger time budgets with ensembling. At the same time, foundation models excel on smaller datasets. Finally, we show that ensembles across models advance the state-of-the-art in tabular machine learning. We observe that some deep learning models are overrepresented in cross-model ensembles due to validation set overfitting, and we encourage model developers to address this issue. We launch TabArena with a public leaderboard, reproducible code, and maintenance protocols to create a living benchmark available at https://tabarena.ai.

View full details

Poster

CleverBirds: A Multiple-Choice Benchmark for Fine-grained Human Knowledge Tracing

Leonie Bossemeyer ⋅ Samuel Heinrich ⋅ Grant Van Horn ⋅ Oisin Mac Aodha

Dec 4, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Mastering fine-grained visual recognition, essential in many expert domains, can require that specialists undergo years of dedicated training. Modeling the progression of such expertize in humans remains challenging, and accurately inferring a human learner’s knowledge state is a key step toward understanding visual learning. We introduce CleverBirds, a large-scale knowledge tracing benchmark for fine-grained bird species recognition. Collected by the citizen-science platform eBird, it offers insight into how individuals acquire expertize in complex fine-grained classification. More than 40,000 participants have engaged in the quiz, answering over 17 million multiple-choice questions spanning over 10,000 bird species, with long-range learning patterns across an average of 400 questions per participant. We release this dataset to support the development and evaluation of new methods for visual knowledge tracing. We show that tracking learners' knowledge is challenging, especially across participant subgroups and question types, with different forms of contextual information offering varying degrees of predictive benefit. CleverBirds is among the largest benchmark of its kind, offering a substantially higher number of learnable concepts. With it, we hope to enable new avenues for studying the development of visual expertize over time and across individuals.

View full details

Poster

BenchmarkCards: Standardized Documentation for Large Language Model Benchmarks

Anna Sokol ⋅ Elizabeth Daly ⋅ Michael Hind ⋅ David Piorkowski ⋅ Xiangliang Zhang ⋅ Nuno Moniz ⋅ Nitesh Chawla

Dec 4, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Large language models (LLMs) are powerful tools capable of handling diverse tasks. Comparing and selecting appropriate LLMs for specific tasks requires systematic evaluation methods, as models exhibit varying capabilities across different domains. However, finding suitable benchmarks is difficult given the many available options. This complexity not only increases the risk of benchmark misuse and misinterpretation but also demands substantial effort from LLM users, seeking the most suitable benchmarks for their specific needs. To address these issues, we introduce BenchmarkCards, an intuitive and validated documentation framework that standardizes critical benchmark attributes such as objectives, methodologies, data sources, and limitations. Through user studies involving benchmark creators and users, we show that BenchmarkCards can simplify benchmark selection and enhance transparency, facilitating informed decision-making in evaluating LLMs.Data & Code:github.com/SokolAnn/BenchmarkCards huggingface.co/datasets/ASokol/BenchmarkCards

View full details

Poster

OctoNet: A Large-Scale Multi-Modal Dataset for Human Activity Understanding Grounded in Motion-Captured 3D Pose Labels

Dongsheng Yuan ⋅ Xie Zhang ⋅ Weiying Hou ⋅ Sheng Lyu ⋅ Yuemin Yu ⋅ Luca Jiang-Tao Yu ⋅ Chengxiao Li ⋅ Chenshu Wu

Dec 4, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

We introduce OctoNet, a large-scale, multi-modal, multi-view human activity dataset designed to advance human activity understanding and multi-modal learning. OctoNet comprises 12 heterogeneous modalities (including RGB, depth, thermal cameras, infrared arrays, audio, millimeter-wave radar, Wi-Fi, IMU, and more) recorded from 41 participants under multi-view sensor setups, yielding over 67.72M synchronized frames. The data encompass 62 daily activities spanning structured routines, freestyle behaviors, human-environment interaction, healthcare tasks, etc. Critically, all modalities are annotated by high-fidelity 3D pose labels captured via a professional motion-capture system, allowing precise alignment and rich supervision across sensors and views. OctoNet is one of the most comprehensive datasets of its kind, enabling a wide range of learning tasks such as human activity recognition, 3D pose estimation, multi-modal fusion, cross-modal supervision, and sensor foundation models. Extensive experiments have been conducted to demonstrate the sensing capacity using various baselines. OctoNet offers a unique and unified testbed for developing and benchmarking generalizable, robust models for human-centric perceptual AI.

View full details

Poster

CausalVerse: Benchmarking Causal Representation Learning with Configurable High-Fidelity Simulations

Guangyi Chen ⋅ Yunlong Deng ⋅ Peiyuan Zhu ⋅ Yan Li ⋅ Yifan Shen ⋅ Zijian Li ⋅ Kun Zhang

Dec 4, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Causal Representation Learning (CRL) aims to uncover the data-generating process and identify the underlying causal variables and relations, whose evaluation remains inherently challenging due to the requirement of known ground-truth causal variables and causal structure. Existing evaluations often rely on either simplistic synthetic datasets or downstream performance on real-world tasks, generally suffering a dilemma between realism and evaluative precision. In this paper, we introduce a new benchmark for CRL using high-fidelity simulated visual data that retains both realistic visual complexity and, more importantly, access to ground-truth causal generating processes. The dataset comprises around 200 thousand images and 3 million video frames across 24 sub-scenes in four domains: static image generation, dynamic physical simulations, robotic manipulations, and traffic situation analysis. These scenarios range from static to dynamic settings, simple to complex structures, and single to multi-agent interactions, offering a comprehensive testbed that hopefully bridges the gap between rigorous evaluation and real-world applicability. In addition, we provide flexible access to the underlying causal structures, allowing users to modify or configure them to align with the required assumptions in CRL, such as available domain labels, temporal dependencies, or intervention histories. Leveraging this benchmark, we evaluated representative CRL methods across diverse paradigms and offered empirical insights to assist practitioners and newcomers in choosing or extending appropriate CRL frameworks to properly address specific types of real problems that can benefit from the CRL perspective. Welcome to visit our: Project page: https://causal-verse.github.io/ , Dataset: https://huggingface.co/CausalVerse

View full details

Poster

From Play to Replay: Composed Video Retrieval for Temporally Fine-Grained Videos

Animesh Gupta ⋅ Jay Parmar ⋅ Ishan Rajendrakumar Dave ⋅ Mubarak Shah

Dec 4, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Composed Video Retrieval (CoVR) retrieves a target video given a query video and a modification text describing the intended change. Existing CoVR benchmarks emphasize appearance shifts or coarse event changes and therefore do not test the ability to capture subtle, fast-paced temporal differences. We introduce TF-CoVR, the first large-scale benchmark dedicated to temporally fine-grained CoVR. TF-CoVR focuses on gymnastics and diving, and provides 180K triplets drawn from FineGym and FineDiving datasets. Previous CoVR benchmarks, focusing on temporal aspect, link each query to a single target segment taken from the same video, limiting practical usefulness. In TF-CoVR, we instead construct each pair by prompting an LLM with the label differences between clips drawn from different videos; every pair is thus associated with multiple valid target videos (3.9 on average), reflecting real-world tasks such as sports-highlight generation. To model these temporal dynamics, we propose TF-CoVR-Base, a concise two-stage training framework: (i) pre-train a video encoder on fine-grained action classification to obtain temporally discriminative embeddings; (ii) align the composed query with candidate videos using contrastive learning. We conduct the first comprehensive study of image, video, and general multimodal embedding (GME) models on temporally fine-grained composed retrieval in both zero-shot and fine-tuning regimes. On TF-CoVR, TF-CoVR-Base improves zero-shot mAP@50 from 5.92 (LanguageBind) to 7.51, and after fine-tuning raises the state-of-the-art from 19.83 to 27.22.

View full details

Poster

MathArena: Evaluating LLMs on Uncontaminated Math Competitions

Mislav Balunovic ⋅ Jasper Dekoninck ⋅ Ivo Petrov ⋅ Nikola Jovanović ⋅ Martin Vechev

Dec 4, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

The rapid advancement of reasoning capabilities in large language models (LLMs) has led to notable improvements on mathematical benchmarks. However, many of the most commonly used evaluation datasets (e.g., AIME 2024) are widely available online, making it difficult to disentangle genuine reasoning from potential memorization. Furthermore, these benchmarks do not evaluate proof-writing capabilities, which are crucial for many mathematical tasks.To address this, we introduce MathArena, a new benchmark based on the following key insight: recurring math competitions provide a stream of high-quality, challenging problems that can be used for real-time evaluation of LLMs. By evaluating models as soon as new problems are released, we effectively eliminate the risk of contamination.Using this framework, we find strong signs of contamination in AIME 2024. Nonetheless, evaluations on harder competitions, such as CMIMC 2025, demonstrate impressive reasoning capabilities in top-performing models.MathArena is also the first benchmark for proof-writing capabilities. On IMO 2025, top models achieve slightly less than 40\%, demonstrating both notable progress and significant room for improvement.So far, we have evaluated over $50$ models across seven competitions, totaling $162$ problems. As an evolving benchmark, MathArena will continue to track the progress of LLMs on newly released competitions, ensuring rigorous and up-to-date evaluation of mathematical reasoning.

View full details

Poster

DataSIR: A Benchmark Dataset for Sensitive Information Recognition

Fan Mo ⋅ Bo Liu ⋅ Yuan Fan ⋅ Kun Qin ⋅ Yizhou Zhao ⋅ Jinhe Zhou ⋅ Jia Sun ⋅ Jinfei Liu ⋅ Kui Ren

Dec 4, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

With the rapid development of artificial intelligence technologies, the demand for training data has surged, exacerbating risks of data leakage. Despite increasing incidents and costs associated with such leaks, data leakage prevention (DLP) technologies lag behind evolving evasion techniques that bypass existing sensitive information recognition (SIR) models. Current datasets lack comprehensive coverage of these adversarial transformations, limiting the evaluation of robust SIR systems. To address this gap, we introduce DataSIR, a benchmark dataset specifically designed to evaluate SIR models on sensitive data subjected to diverse format transformations. We curate 26 sensitive data categories based on multiple international regulations, and collect 131,890 original samples correspondingly. Through empirical analysis of real-world evasion tactics, we implement 21 format transformation methods, which are applied to the original samples, expanding the dataset to 1,647,501 samples to simulate adversarial scenarios. We evaluated DataSIR using four traditional NLP models and four large language models (LLMs). For LLMs, we design structured prompts with varying degrees of contextual hints to assess the impact of prior knowledge on recognition accuracy. These evaluations demonstrate that our dataset effectively differentiates the performance of various SIR algorithms. Combined with its rich category and format diversity, the dataset can serve as a benchmark for evaluating related models and help develop future more advanced SIR models. Our dataset and experimental code are publicly available at https://www.kaggle.com/datasets/fanmo1/datasir and https://github.com/Fan-Mo-ZJU/DataSIR.

View full details

Poster

MVU-Eval: Towards Multi-Video Understanding Evaluation for Multimodal LLMs

Tianhao Peng ⋅ Haochen Wang ⋅ Yuanxing Zhang ⋅ Noah Wang ⋅ Zili Wang ⋅ Ge Zhang ⋅ Jian Yang ⋅ Shihao Li ⋅ Yanghai Wang ⋅ Xintao Wang ⋅ Houyi Li ⋅ Wei Ji ⋅ Pengfei Wan ⋅ Wenhao Huang ⋅ ZHAO-XIANG ZHANG ⋅ Jiaheng Liu

Dec 4, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

The advent of Multimodal Large Language Models (MLLMs) has expanded AI capabilities to visual modalities, yet existing evaluation benchmarks remain limited to single-video understanding, overlooking the critical need for multi-video understanding in real-world scenarios (e.g., sports analytics and autonomous driving). To address this significant gap, we introduce **MVU-Eval**, the first comprehensive benchmark for evaluating **M**ulti-**V**ideo **U**nderstanding for MLLMs. Specifically, our MVU-Eval mainly assesses eight core competencies through 1,824 meticulously curated question-answer pairs spanning 4,959 videos from diverse domains, addressing both fundamental perception tasks and high-order reasoning tasks. These capabilities are rigorously aligned with real-world applications such as multi-sensor synthesis in autonomous systems and cross-angle sports analytics. Through extensive evaluation of state-of-the-art open-source and closed-source models, we reveal significant performance discrepancies and limitations in current MLLMs' ability to perform understanding across multiple videos.The benchmark will be made publicly available to foster future research.

View full details

Poster

Augmenting Biological Fitness Prediction Benchmarks with Landscapes Features from GraphFLA

Mingyu Huang ⋅ Shasha Zhou ⋅ Ke Li

Dec 4, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Machine learning models increasingly map biological sequence-fitness landscapes to predict mutational effects. Effective evaluation of these models requires benchmarks curated from empirical data. Despite their impressive scales, existing benchmarks lack topographical information regarding the underlying fitness landscapes, which hampers interpretation and comparison of model performance beyond averaged scores. Here, we introduce GraphFLA, a Python framework that constructs and analyzes fitness landscapes from diverse modalities (DNA, RNA, protein, and beyond.), accommodating datasets up to millions of mutants. GraphFLA calculates 20 biologically relevant features that characterize 4 fundamental aspects of landscape topography. By applying GraphFLA to over 5,300 landscapes from ProteinGym, RNAGym, and CIS-BP, we demonstrate its utility in interpreting and comparing the performance of dozens of fitness prediction models, highlighting factors influencing model accuracy and respective advantages of different models. Additionally, we release 155 combinatorially complete empirical fitness landscapes, encompassing over 2.2 million sequences across various modalities. All the codes and datasets are available at https://github.com/COLA-Laboratory/GraphFLA.

View full details

Poster

SeniorTalk: A Chinese Conversation Dataset with Rich Annotations for Super-Aged Seniors

chen yang ⋅ Hui Wang ⋅ Shiyao Wang ⋅ Junyang Chen ⋅ Jiabei He ⋅ Jiaming Zhou ⋅ Xi Yang ⋅ Yequan Wang ⋅ Yonghua Lin ⋅ Yong Qin

Dec 4, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

While voice technologies increasingly serve aging populations, current systems exhibit significant performance gaps due to inadequate training data capturing elderly-specific vocal characteristics like presbyphonia and dialectal variations. The limited data available on super-aged individuals in existing elderly speech datasets, coupled with overly simple recording styles and annotation dimensions, exacerbates this issue. To address the critical scarcity of speech data from individuals aged 75 and above, we introduce SeniorTalk, a carefully annotated Chinese spoken dialogue dataset. This dataset contains 55.53 hours of speech from 101 natural conversations involving 202 participants, ensuring a strategic balance across gender, region, and age. Through detailed annotation across multiple dimensions, it can support a wide range of speech tasks. We perform extensive experiments on speaker verification, speaker diarization, speech recognition, and speech editing tasks, offering crucial insights for the development of speech technologies targeting this age group. Code is available at https://github.com/flageval-baai/SeniorTalk and data at https://huggingface.co/datasets/evan0617/seniortalk.

View full details

Poster

InFlux: A Benchmark for Self-Calibration of Dynamic Intrinsics of Video Cameras

Erich Liang ⋅ Roma Bhattacharjee ⋅ Sreemanti Dey ⋅ Rafael Moschopoulos ⋅ Caitlin Wang ⋅ Michel Liao ⋅ Grace Tan ⋅ Andrew Wang ⋅ Karhan Kayan ⋅ Stamatis Alexandropoulos ⋅ Jia Deng

Dec 4, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Accurately tracking camera intrinsics is crucial for achieving 3D understanding from 2D video. However, most 3D algorithms assume that camera intrinsics stay constant throughout a video, which is often not true for many real-world in-the-wild videos. A major obstacle in this field is a lack of dynamic camera intrinsics benchmarks--existing benchmarks typically offer limited diversity in scene content and intrinsics variation, and none provide per-frame intrinsic changes for consecutive video frames. In this paper, we present Intrinsics in Flux (InFlux), a real-world benchmark that provides per-frame ground truth intrinsics annotations for videos with dynamic intrinsics. Compared to prior benchmarks, InFlux captures a wider range of intrinsic variations and scene diversity, featuring 143K+ annotated frames from 386 high-resolution indoor and outdoor videos with dynamic camera intrinsics. To ensure accurate per-frame intrinsics, we build a comprehensive lookup table of calibration experiments and extend the Kalibr toolbox to improve its accuracy and robustness. Using our benchmark, we evaluate existing baseline methods for predicting camera intrinsics and find that most struggle to achieve accurate predictions on videos with dynamic intrinsics. For the dataset, code, videos, and submission, please visit https://influx.cs.princeton.edu/.

View full details

Poster

GC4NC: A Benchmark Framework for Graph Condensation on Node Classification with New Insights

Shengbo Gong ⋅ Juntong Ni ⋅ Noveen Sachdeva ⋅ Carl Yang ⋅ Wei Jin

Dec 4, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Graph condensation (GC) is an emerging technique designed to learn a significantly smaller graph that retains the essential information of the original graph. This condensed graph has shown promise in accelerating graph neural networks while preserving performance comparable to those achieved with the original, larger graphs. Additionally, this technique facilitates downstream applications like neural architecture search and deepens our understanding of redundancies in large graphs. Despite the rapid development of GC methods, particularly for node classification, a unified evaluation framework is still lacking to systematically compare different GC methods or clarify key design choices for improving their effectiveness. To bridge these gaps, we introduce GC4NC, a comprehensive framework for evaluating diverse GC methods on node classification across multiple dimensions including performance, efficiency, privacy preservation, denoising ability, NAS effectiveness, and transferability. Our systematic evaluation offers novel insights into how condensed graphs behave and the critical design choices that drive their success. These findings pave the way for future advancements in GC methods, enhancing both performance and expanding their real-world applications. The code is available at https://github.com/Emory-Melody/GraphSlim/tree/main/benchmark.

View full details

Poster

DGCBench: A Deep Graph Clustering Benchmark

Benyu Wu ⋅ Yue Liu ⋅ Qiaoyu Tan ⋅ Xinwang Liu ⋅ Wei Du ⋅ Jun Wang ⋅ Guoxian Yu

Dec 4, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Deep graph clustering (DGC) aims to partition graph nodes into distinct clusters in an unsupervised manner. Despite rapid advancements in this field, DGC remains inherently challenging due to the absence of ground-truth, which complicates the design of effective algorithms and impedes the establishment of standardized benchmarks. The lack of unified datasets, evaluation protocols, and metrics further exacerbates these challenges, making it difficult to systematically assess and compare DGC methods. To address these limitations, we introduce $\texttt{DGCBench}$, the first comprehensive and unified benchmark for DGC methods. It evaluates 12 state-of-the-art DGC methods across 12 datasets from diverse domains and scales, spanning 6 critical dimensions: $\textbf{discriminability}$, $\textbf{effectiveness}$, $\textbf{scalability}$, $\textbf{efficiency}$, $\textbf{stability}$, and $\textbf{robustness}$. Additionally, we develop $\texttt{PyDGC}$, an open-source Python library that standardizes the DGC training and evaluation paradigm. Through systematic experiments, we reveal persistent limitations in existing methods, specifically regarding the homophily bottleneck, training instability, vulnerability to perturbations, efficiency plateau, scalability challenges, and poor discriminability, thereby offering actionable insights for future research. We hope that $\texttt{DGCBench}$, $\texttt{PyDGC}$, and our analyses will collectively accelerate the progress in the DGC community. The code is available at https://github.com/Marigoldwu/PyDGC.

View full details

Poster

OpenAD: Open-World Autonomous Driving Benchmark for 3D Object Detection

Zhongyu Xia ⋅ Jishuo Li ⋅ Zhiwei Lin ⋅ Xinhao Wang ⋅ Yongtao Wang ⋅ Ming-Hsuan Yang

Dec 4, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Open-world perception aims to develop a model adaptable to novel domains and various sensor configurations and can understand uncommon objects and corner cases. However, current research lacks sufficiently comprehensive open-world 3D perception benchmarks and robust generalizable methodologies. This paper introduces OpenAD, the first real open-world autonomous driving benchmark for 3D object detection. OpenAD is built upon a corner case discovery and annotation pipeline that integrates with a multimodal large language model (MLLM). The proposed pipeline annotates corner case objects in a unified format for five autonomous driving perception datasets with 2000 scenarios. In addition, we devise evaluation methodologies and evaluate various open-world and specialized 2D and 3D models. Moreover, we propose a vision-centric 3D open-world object detection baseline and further introduce an ensemble method by fusing general and specialized models to address the issue of lower precision in existing open-world methods for the OpenAD benchmark. We host an online challenge on EvalAI. Data, toolkit codes, and evaluation codes are available at https://github.com/VDIGPKU/OpenAD.

View full details

Poster

LIFEBENCH: Evaluating Length Instruction Following in Large Language Models

Wei Zhang ⋅ Zhenhong Zhou ⋅ Kun Wang ⋅ Junfeng Fang ⋅ Rongwu Xu ⋅ Yuanhe Zhang ⋅ Rui Wang ⋅ Ge Zhang ⋅ Xinfeng Li ⋅ Li Sun ⋅ Lingjuan Lyu ⋅ Yang Liu ⋅ Sen Su

Dec 4, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

While large language models (LLMs) can solve PhD-level reasoning problems over long context inputs, they still struggle with a seemingly simpler task: *following explicit length instructions*—e.g., *write a 10,000-word novel*. Additionally, models often generate far too short outputs, terminate prematurely, or even refuse the request. Existing benchmarks focus primarily on evaluating generations quality, but often overlook whether the generations meet length constraints. To this end, we introduce **Length Instruction Following Evaluation Benchmark** (LIFEBench) to comprehensively evaluate LLMs' ability to follow length instructions across diverse tasks and a wide range of specified lengths. LIFEBench consists of 10,800 instances across 4 task categories in both English and Chinese, covering length constraints ranging from 16 to 8192 words. We evaluate 26 widely-used LLMs and find that most models reasonably follow short-length instructions but deteriorate sharply beyond a certain threshold. Surprisingly, almost all models fail to reach the vendor-claimed maximum output lengths in practice, as further confirmed by our evaluations extending up to 32K words. Even long-context LLMs, despite their extended input-output windows, counterintuitively fail to improve length-instructions following. Notably, Reasoning LLMs outperform even specialized long-text generation models, achieving state-of-the-art length following. Overall, LIFEBench uncovers fundamental limitations in current LLMs' length instructions following ability, offering critical insights for future progress.

View full details

Poster

vHector and HeisenVec: Scalable Vector Graphics Generation Through Large Language Models

Leonardo Zini ⋅ Elia Frigieri ⋅ Sebastiano Aloscari ⋅ Lorenzo Baraldi

Dec 4, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

We introduce HeisenVec, a large-scale dataset designed to advance research in vector graphics generation from natural language descriptions. Unlike conventional image generation datasets that focus on raster images, HeisenVec targets the structured and symbolic domain of Scalable Vector Graphics (SVG), where images are represented as sequences of drawing commands and style attributes. The dataset comprises 2.2 million SVGs collected from different online sources, each paired with four complementary textual descriptions generated by multi-modal models. To ensure structural consistency and efficiency for autoregressive modeling, all SVGs are standardized through a pre-processing pipeline that unifies geometric primitives as paths, applies affine transformations, and compresses syntax via custom tokens set. HeisenVec exhibits broad coverage among visual styles and sequence lengths, with a substantial portion of samples exceeding 8,000 tokens, making it particularly well-suited for benchmarking long-context language models. Our benchmark enables rigorous evaluation of text-conditioned SVG generation, encourages progress on sequence modeling with symbolic outputs, and bridges the gap between vision, graphics, and language. We release the dataset, tokenization tools, and evaluation pipeline to foster further research in this emerging domain.

View full details

Poster

UniFoil: A Universal Dataset of Airfoils in Transitional and Turbulent Regimes for Subsonic and Transonic Flows

Rohit Kanchi ⋅ Benjamin Melanson ⋅ Nithin Somasekharan ⋅ Shaowu Pan ⋅ Sicheng He

Dec 4, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

We present UniFoil, the largest publicly available universal airfoil database based on Reynolds-Averaged Navier–Stokes (RANS) simulations. It contains over 500,000 samples spanning a wide range of Reynolds and Mach numbers, capturing both transitional and fully turbulent flows across incompressible to compressible regimes. UniFoil is designed to support machine learning research in fluid dynamics, particularly for modeling complex aerodynamic phenomena.Most existing datasets are limited to incompressible, fully turbulent flows with smooth field characteristics, thus overlooking the critical physics of laminar–turbulent transition and shock-wave interactions—features that exhibit strong nonlinearity and sharp gradients. UniFoil fills this gap by offering a broad spectrum of realistic flow conditions.In the database, turbulent simulations utilize the Spalart–Allmaras (SA) model, while transitional flows are modeled using an $e^N$-based transition prediction method coupled with the SA model. The database includes a comprehensive geometry set comprising over 4,800 natural laminar flow (NLF) airfoils and 30,000 fully turbulent (FT) airfoils, effectively covering the diversity of airfoil designs relevant to aerospace, wind energy, and marine applications.This database is also highly valuable for scientific machine learning (SciML), enabling the development of data-driven models that more accurately capture the transport processes associated with laminar–turbulent transition. UniFoil is freely available under a permissive CC-BY-SA license.

View full details

Poster

DATE-LM: Benchmarking Data Attribution Evaluation for Large Language Models

Cathy Jiao ⋅ Yijun Pan ⋅ Emily Xiao ⋅ Daisy Sheng ⋅ Niket Jain ⋅ Hanzhang Zhao ⋅ Ishita Dasgupta ⋅ Jiaqi Ma ⋅ Chenyan Xiong

Dec 4, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Data attribution methods quantify the influence of training data on model outputs and are becoming increasingly relevant for a wide range of LLM research and applications, including dataset curation, model interpretability, data valuation. However, there remain critical gaps in systematic LLM-centric evaluation of data attribution methods. To this end, we introduce DATE-LM (Data Attribution Evaluation in Language Models), a unified benchmark for evaluating data attribution methods through real-world LLM applications. DATE-LM measures attribution quality through three key tasks — training data selection, toxicity/bias filtering, and factual attribution. Our benchmark is designed for ease of use, enabling researchers to configure and run large-scale evaluations across diverse tasks and LLM architectures. Furthermore, we use DATE-LM to conduct a large-scale evaluation of existing data attribution methods. Our findings show that no single method dominates across all tasks, data attribution methods have trade-offs with simpler baselines, and method performance is sensitive to task-specific evaluation design. Finally, we release a public leaderboard for quick comparison of methods and to facilitate community engagement. We hope DATE-LM serves as a foundation for future data attribution research in LLMs.

View full details

Poster

A Standardized Benchmark for Multilabel Antimicrobial Peptide Classification

Sebastian Ojeda ⋅ Rafael Velasquez ⋅ Nicolás Aparicio ⋅ Juanita Puentes ⋅ Paula Cárdenas ⋅ Nicolás Andrade ⋅ Gabriel González ⋅ Sergio Rincón ⋅ Carolina Muñoz-Camargo ⋅ Pablo Arbelaez

Dec 4, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Antimicrobial peptides have emerged as promising molecules to combat antimicrobial resistance. However, fragmented datasets, inconsistent annotations, and the lack of standardized benchmarks hinder computational approaches and slow down the discovery of new candidates. To address these challenges, we present the Expanded Standardized Collection for Antimicrobial Peptide Evaluation (ESCAPE), an experimental framework integrating over 80.000 peptides from 27 validated repositories. Our dataset separates antimicrobial peptides from negative sequences and incorporates their functional annotations into a biologically coherent multilabel hierarchy, capturing activities across antibacterial, antifungal, antiviral, and antiparasitic classes. Building on ESCAPE, we propose a transformer-based model that leverages sequence and structural information to predict multiple functional activities of peptides. Our method achieves up to a 2.56% relative average improvement in mean Average Precision over the second-best method adapted for this task, establishing a new state-of-the-art multilabel peptide classification. ESCAPE provides a comprehensive and reproducible evaluation framework to advance AI-driven antimicrobial peptide research.

View full details

Poster

The Leaderboard Illusion

Shivalika Singh ⋅ Yiyang Nan ⋅ Alex Wang ⋅ Daniel Dsouza ⋅ Sayash Kapoor ⋅ Ahmet Üstün ⋅ Sanmi Koyejo ⋅ Yuntian Deng ⋅ Shayne Longpre ⋅ Noah Smith ⋅ Beyza Ermis ⋅ Marzieh Fadaee ⋅ Sara Hooker

Dec 4, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Measuring progress is fundamental to the advancement of any scientific field. As benchmarks play an increasingly central role, they also grow more susceptible to distortion.Chatbot Arena has emerged as the go-to leaderboard for ranking the most capable AI systems. Yet, in this work we identify systematic issues that have resulted in a distorted playing field. We find that undisclosed private testing practices benefit a handful of providers who are able to test multiple variants before public release and retract scores if desired. We establish that the ability of these providers to choose the best score leads to biased Arena scores due to selective disclosure of performance results. At an extreme, we found one provider testing 27 private variants before making one model public at the second position on the leaderboard. We also establish that proprietary closed models are sampled at higher rates (number of battles) and have fewer models removed from the arena than open-weight and open-source alternatives. Both these policies lead to large data access asymmetries over time. The top two providers have individually received an estimated 19.2% and 20.4% of all data on the arena. In contrast, a combined 83 open-weight models have only received an estimated 29.7% of the total data. With conservative estimates, we show that access to Chatbot Arena data yields substantial benefits; even limited additional data can result in relative performance gains of up to 112% on ArenaHard, a test set from the arena distribution.Together, these dynamics result in overfitting to Arena-specific dynamics rather than general model quality. The Arena builds on the substantial efforts of both the organizers and an open community that maintains this valuable evaluation platform. We offer actionable recommendations to reform the Chatbot Arena's evaluation framework and promote fairer, more transparent benchmarking for the field.

View full details

Poster

MMPB: It’s Time for Multi-Modal Personalization

Jaeik Kim ⋅ Woojin Kim ⋅ Woohyeon Park ⋅ Jaeyoung Do

Dec 4, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Visual personalization is essential in user-facing AI systems such as smart homes and healthcare, where aligning model behavior with user-centric concepts is critical. However, recent large Vision-Language Models (VLMs), despite their broad applicability, remain underexplored in their ability to adapt to individual users. In this paper, we introduce MMPB, the first extensive benchmark for evaluating VLMs on personalization. MMPB comprises 10k image-query pairs and includes 111 personalizable concepts across four categories: humans, animals, objects, and characters, with the human category enriched with preference-grounded queries. We structure personalization into three main task types, each highlighting a different key property of VLMs. Using 23 widely used VLMs including both open- and closed-source models, we evaluate personalization performance via a three-stage protocol: concept injection, multi-turn dialogue, and personalized querying. Our findings indicate that most VLMs (including some closed-source models) struggle with personalization, particularly in maintaining consistency over dialogue, handling user preferences, and adapting to visual cues. Our analysis reveals that the challenges in VLM personalization (such as refusal behaviors and long-context forgetting) highlight substantial room for improvement. By identifying these limitations and offering a scalable benchmark, MMPB offers valuable insights and a solid foundation for future research toward truly personalized multi-modal AI.

View full details

Poster

EuroSpeech: A Multilingual Speech Corpus

Samuel Pfisterer ⋅ Florian Grötschla ⋅ Luca Lanzendörfer ⋅ Florian Yan ⋅ Roger Wattenhofer

Dec 4, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Recent progress in speech processing has highlighted that high-quality performance across languages requires substantial training data for each individual language. While existing multilingual datasets cover many languages, they often contain insufficient data for each language, leading to models trained on these datasets to exhibit poor performance on most supported languages. Our work addresses this challenge by introducing a scalable pipeline for constructing speech datasets from parliamentary recordings. The proposed pipeline includes robust components for media retrieval and a two-stage alignment algorithm designed to handle non-verbatim transcripts and long-form audio. Applying this pipeline to recordings from 22 European parliaments, we extract over 61k hours of aligned speech segments, achieving substantial per-language coverage with 19 languages exceeding 1k hours and 22 languages exceeding 500 hours of high-quality speech data. We obtain an average 41.8\% reduction in word error rates over baselines when finetuning an existing ASR model on our dataset, demonstrating the usefulness of our approach.

View full details

Poster

MomentSeeker: A Task-Oriented Benchmark For Long-Video Moment Retrieval

Huaying Yuan ⋅ Jian Ni ⋅ Zheng Liu ⋅ Yueze Wang ⋅ Junjie Zhou ⋅ Zhengyang Liang ⋅ Bo Zhao ⋅ Zhao Cao ⋅ Ji-Rong Wen ⋅ Zhicheng Dou

Dec 4, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Accurately locating key moments within long videos is crucial for solving long video understanding (LVU) tasks. However, existing benchmarks are either severely limited in terms of video length and task diversity, or they focus solely on the end-to-end LVU performance, making them inappropriate for evaluating whether key moments can be accurately accessed. To address this challenge, we propose MomentSeeker, a novel benchmark for long-video moment retrieval (LVMR), distinguished by the following features. First, it is created based on long and diverse videos, averaging over 1,200 seconds in duration, and collected from various domains, e.g., movie, anomaly, egocentric, and sports. Second, it covers a variety of real-world scenarios in three levels: global-level, event-level, and object-level, covering common tasks like action recognition, object localization, causal reasoning, etc. Third, it incorporates rich forms of queries, including text-only queries, image-conditioned queries, and video-conditioned queries. On top of MomentSeeker, we conduct comprehensive experiments for both generation-based approaches (directly using MLLMs) and retrieval-based approaches (leveraging video retrievers). Our results reveal the significant challenges in long-video moment retrieval in terms of accuracy and efficiency, despite improvements from the latest long-video MLLMs and task-specific fine-tuning. We have publicly released MomentSeeker to facilitate future research in this area.

View full details

Poster

DetectiumFire: A Comprehensive Multi-modal Dataset Bridging Vision and Language for Fire Understanding

Zixuan Liu ⋅ Siavash H. Khajavi ⋅ Guangkai Jiang

Dec 4, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Recent advances in multi-modal models have demonstrated strong performance in tasks such as image generation and reasoning. However, applying these models to the fire domain remains challenging due to the lack of publicly available datasets with high-quality fire domain annotations. To address this gap, we introduce $\textbf{DetectiumFire}$, a large-scale, multi-modal dataset comprising of 22.5k high-resolution fire-related images and 2.5k real-world fire-related videos covering a wide range of fire types, environments, and risk levels. The data are annotated with both traditional computer vision labels (e.g., bounding boxes) and detailed textual prompts describing the scene, enabling applications such as synthetic data generation and fire risk reasoning. DetectiumFire offers clear advantages over existing benchmarks in scale, diversity, and data quality, significantly reducing redundancy and enhancing coverage of real-world scenarios. We validate the utility of DetectiumFire across multiple tasks, including object detection, diffusion-based image generation, and vision-language reasoning. Our results highlight the potential of this dataset to advance fire-related research and support the development of intelligent safety systems. We release DetectiumFire to promote broader exploration of fire understanding in the AI community.

View full details

Poster

Towards Understanding Camera Motions in Any Video

Zhiqiu Lin ⋅ Siyuan Cen ⋅ Daniel Jiang ⋅ Jay Karhade ⋅ Hewei Wang ⋅ Chancharik Mitra ⋅ Yu Tong Tiffany Ling ⋅ Yuhan Huang ⋅ Rushikesh Zawar ⋅ Xue Bai ⋅ Yilun Du ⋅ Chuang Gan ⋅ Deva Ramanan

Dec 4, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

We introduce CameraBench, a large-scale dataset and benchmark designed to assess and improve camera motion understanding. CameraBench consists of ~3,000 diverse internet videos, annotated by experts through a rigorous multi-stage quality control process. One of our core contributions is a taxonomy or "language" of camera motion primitives, designed in collaboration with cinematographers. We find, for example, that some motions like "follow" (or tracking) require understanding scene content like moving subjects. We conduct a large-scale human study to quantify human performance, revealing that domain expertise and tutorial-based training can significantly enhance accuracy. For example, a novice may confuse zoom-in (a change of intrinsics) with translating forward (a change of extrinsics), but can be trained to differentiate the two. Using CameraBench, we evaluate Structure-from-Motion (SfM) and Video-Language Models (VLMs), finding that SfM models struggle to capture semantic primitives that depend on scene content, while generative VLMs struggle to capture geometric primitives that require precise estimation of trajectories. We then fine-tune a generative VLM on CameraBench to achieve the best of both worlds and showcase its applications, including motion-augmented captioning, video question answering, and video-text retrieval. We hope our taxonomy, benchmark, and tutorials will drive future efforts towards the ultimate goal of understanding camera motions in any video.

View full details

Poster

RealMath: A Continuous Benchmark for Evaluating Language Models on Research-Level Mathematics

Jie Zhang ⋅ Cezara Petrui ⋅ Kristina Nikolić ⋅ Florian Tramer

Dec 4, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Existing benchmarks for evaluating mathematical reasoning in large language models (LLMs) rely primarily on competition problems, formal proofs, or artificially challenging questions---failing to capture the nature of mathematics encountered in actual research environments. We introduce \textsc{RealMath}, a novel benchmark derived directly from research papers and mathematical forums that assesses LLMs' abilities on authentic mathematical tasks. Our approach addresses three critical challenges: sourcing diverse research-level content, enabling reliable automated evaluation through verifiable statements, and designing a continually refreshable dataset to mitigate contamination risks. Experimental results across multiple LLMs reveal surprising capabilities in handling research mathematics compared to competition problems, suggesting current models may already serve as valuable assistants for working mathematicians despite limitations on highly challenging problems. The code and dataset for \textsc{RealMath} are publicly available.

View full details

Poster

MJ-Bench: Is Your Multimodal Reward Model Really a Good Judge for Text-to-Image Generation?

Zhaorun Chen ⋅ Zichen Wen ⋅ Yichao Du ⋅ Yiyang Zhou ⋅ Chenhang Cui ⋅ Siwei Han ⋅ Jen Weng ⋅ Chaoqi Wang ⋅ Zhengwei Tong ⋅ Leria HUANG ⋅ Canyu Chen ⋅ Haoqin Tu ⋅ Qinghao Ye ⋅ Zhihong Zhu ⋅ Yuqing Zhang ⋅ Jiawei Zhou ⋅ Zhuokai Zhao ⋅ Rafael Rafailov ⋅ Chelsea Finn ⋅ Huaxiu Yao

Dec 4, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

While text-to-image models like GPT-4o-Image and FLUX are rapidly proliferating, they often encounter challenges such as hallucination, bias, and the production of unsafe, low-quality output. To effectively address these issues, it is crucial to align these models with desired behaviors based on feedback from a multimodal judge. Despite their significance, current multimodal judges frequently undergo inadequate evaluation of their capabilities and limitations, potentially leading to misalignment and unsafe fine-tuning outcomes. To address this issue, we introduce MJ-Bench, a novel benchmark which incorporates a comprehensive preference dataset to evaluate multimodal judges in providing feedback for image generation models across six key perspectives: alignment, safety, image quality, bias, composition, and visualization. Specifically, we evaluate a large variety of multimodal judges including smaller-sized CLIP-based scoring models, open-source VLMs, and close-source VLMs on each decomposed subcategory of our preference dataset. Experiments reveal that close-source VLMs generally provide better feedback, with GPT-4o outperforming other judges in average. Compared with open-source VLMs, smaller-sized scoring models can provide better feedback regarding text-image alignment and image quality, while VLMs provide more accurate feedback regarding safety and generation bias due to their stronger reasoning capabilities. Further studies in feedback scale reveal that VLM judges can generally provide more accurate and stable feedback in natural language than numerical scales. Notably, human evaluations on end-to-end and fine-tuned models using separate feedback from these multimodal judges provide similar conclusions, further confirming the effectiveness of MJ-Bench.

View full details

Poster

OneIG-Bench: Omni-dimensional Nuanced Evaluation for Image Generation

Jingjing Chang ⋅ Yixiao Fang ⋅ Peng Xing ⋅ Shuhan Wu ⋅ Wei Cheng ⋅ Rui Wang ⋅ Xianfang Zeng ⋅ Gang Yu ⋅ Hai-Bao Chen

Dec 4, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Text-to-image (T2I) models have garnered significant attention for generating high-quality images aligned with text prompts. However, rapid T2I model advancements reveal limitations in early benchmarks, lacking comprehensive evaluations, especially for text rendering and style. Notably, recent state-of-the-art models, with their rich knowledge modeling capabilities, show potential in reasoning-driven image generation, yet existing evaluation systems have not adequately addressed this frontier. To systematically address these gaps, we introduce $\textbf{OneIG-Bench}$, a meticulously designed comprehensive benchmark framework for fine-grained evaluation of T2I models across multiple dimensions, including subject-element alignment, text rendering precision, reasoning-generated content, stylization, and diversity. By structuring the evaluation, this benchmark enables in-depth analysis of model performance, helping researchers and practitioners pinpoint strengths and bottlenecks in the full pipeline of image generation. Our codebase and dataset are now publicly available to facilitate reproducible evaluation studies and cross-model comparisons within the T2I research community.

View full details

Poster

GUARD: Constructing Realistic Two-Player Matrix and Security Games for Benchmarking Game-Theoretic Algorithms

Noah Krever ⋅ Jakub Cerny ⋅ Moise Blanchard ⋅ Christian Kroer

Dec 4, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Game-theoretic algorithms are commonly benchmarked on recreational games, classical constructs from economic theory such as congestion and dispersion games, or entirely random game instances. While the past two decades have seen the rise of security games -- grounded in real-world scenarios like patrolling and infrastructure protection -- their practical evaluation has been hindered by limited access to the datasets used to generate them. In particular, although the structural components of these games (e.g., patrol paths derived from maps) can be replicated, the critical data defining target values -- central to utility modeling -- remain inaccessible. In this paper, we introduce a flexible framework that leverages open-access datasets to generate realistic matrix and security game instances. These include animal movement data for modeling anti-poaching scenarios and demographic and infrastructure data for infrastructure protection. Our framework allows users to customize utility functions and game parameters, while also offering a suite of preconfigured instances. We provide theoretical results highlighting the degeneracy and limitations of benchmarking on random games, and empirically compare our generated games against random baselines across a variety of standard algorithms for computing Nash and Stackelberg equilibria, including linear programming, incremental strategy generation, and self-play with no-regret learners.

View full details

Poster

AVerImaTeC: A Dataset for Automatic Verification of Image-Text Claims with Evidence from the Web

RUI CAO ⋅ Zifeng Ding ⋅ Zhijiang Guo ⋅ Michael Schlichtkrull ⋅ Andreas Vlachos

Dec 4, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Textual claims are often accompanied by images to enhance their credibility and spread on social media, but this also raises concerns about the spread of misinformation.Existing datasets for automated verification of image-text claims remain limited, as they often consist of synthetic claims and lack evidence annotations to capture the reasoning behind the verdict.In this work, we introduce AVerImaTeC, a dataset consisting of 1,297 real-world image-text claims. Each claim is annotated with question-answer (QA) pairs containing evidence from the web, reflecting a decomposed reasoning regarding the verdict.We mitigate common challenges in fact-checking datasets such as contextual dependence, temporal leakage, and evidence insufficiency, via claim normalization, temporally constrained evidence annotation, and a two-stage sufficiency check. We assess the consistency of the annotation in AVerImaTeC via inter-annotator studies, achieving a $\kappa=0.742$ on verdicts and $74.7\%$ consistency on QA pairs. We also propose a novel evaluation method for evidence retrieval and conduct extensive experiments to establish baselines for verifying image-text claims using open-web evidence.

View full details

Poster

ConViS-Bench: Estimating Video Similarity Through Semantic Concepts

Benedetta Liberatori ⋅ Alessandro Conti ⋅ Lorenzo Vaquero ⋅ Yiming Wang ⋅ Elisa Ricci ⋅ Paolo Rota

Dec 4, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

What does it mean for two videos to be similar? Videos may appear similar when judged by the actions they depict, yet entirely different if evaluated based on the locations where they were filmed. While humans naturally compare videos by taking different aspects into account, this ability has not been thoroughly studied and presents a challenge for models that often depend on broad global similarity scores. Large Multimodal Models (LMMs) with video understanding capabilities open new opportunities for leveraging natural language in comparative video tasks. We introduce Concept-based Video Similarity estimation (ConViS), a novel task that compares pairs of videos by computing interpretable similarity scores across a predefined set of key semantic concepts. ConViS allows for human-like reasoning about video similarity and enables new applications such as concept-conditioned video retrieval. To support this task, we also introduce ConViS-Bench, a new benchmark comprising carefully annotated video pairs spanning multiple domains. Each pair comes with concept-level similarity scores and textual descriptions of both differences and similarities. Additionally, we benchmark several state-of-the-art models on ConViS, providing insights into their alignment with human judgments. Our results reveal significant performance differences on ConViS, indicating that some concepts present greater challenges for estimating video similarity. We believe that ConViS-Bench will serve as a valuable resource for advancing research in language-driven video understanding.

View full details

Poster

SafeVid: Toward Safety Aligned Video Large Multimodal Models

Yixu Wang ⋅ Jiaxin Song ⋅ Yifeng Gao ⋅ Xin Wang ⋅ YANG YAO ⋅ Yan Teng ⋅ Xingjun Ma ⋅ Yingchun Wang ⋅ Yu-Gang Jiang

Dec 4, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

As Video Large Multimodal Models (VLMMs) rapidly advance, their inherent complexity introduces significant safety challenges, particularly the issue of mismatched generalization where static safety alignments fail to transfer to dynamic video contexts. We introduce SafeVid, a framework designed to instill video-specific safety principles in VLMMs. SafeVid uniquely transfers robust textual safety alignment capabilities to the video domain by employing detailed textual video descriptions as an interpretive bridge, facilitating LLM-based rule-driven safety reasoning. This is achieved through a closed-loop system comprising: 1) generation of SafeVid-350K, a novel 350,000-pair video-specific safety preference dataset; 2) targeted alignment of VLMMs using Direct Preference Optimization (DPO); and 3) comprehensive evaluation via our new SafeVidBench benchmark. Alignment with SafeVid-350K significantly enhances VLMM safety, with models like LLaVA-NeXT-Video demonstrating substantial improvements (e.g., up to 42.39%) on SafeVidBench. SafeVid provides critical resources and a structured approach, demonstrating that leveraging textual descriptions as a conduit for safety reasoning markedly improves the safety alignment of VLMMs in complex multimodal scenarios.

View full details

Poster

Satellites Reveal Mobility: A Commuting Origin-destination Flow Generator for Global Cities

Can Rong ⋅ Xin Zhang ⋅ Yanxin Xi ⋅ HONGJIE SUI ⋅ Jingtao Ding ⋅ Yong Li

Dec 4, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Commuting Origin-destination (OD) flows, capturing daily population mobility of citizens, are vital for sustainable development across cities around the world. However, it is challenging to obtain the data due to the high cost of travel surveys and privacy concerns. Surprisingly, we find that satellite imagery, publicly available across the globe, contains rich urban semantic signals to support high-quality OD flow generation, with over 98\% expressiveness of traditional multisource hard-to-collect urban sociodemographic, economics, land use, and point of interest data. This inspires us to design a novel data generator, GlODGen (Global-scale OriginDestination Flow Generator), which can generate OD flow data for any cities of interest around the world. Specifically, GlODGen first leverages Vision-Language Geo-Foundation Models to extract urban semantic signals related to human mobility from satellite imagery. These features are then combined with population data to form region-level representations, which are used to generate OD flows via graph diffusion models. Extensive experiments on 4 continents and 6 representative cities show that GlODGen has great generalizability across diverse urban environments on different continents and can generate OD flow data for global cities highly consistent with real-world mobility data. We implement GlODGen as an automated tool, seamlessly integrating data acquisition and curation, urban semantic feature extraction, and OD flow generation together. It has been released at https://github.com/tsinghua-fib-lab/generate-od-pubtools.

View full details

Poster

AGI-Elo: How Far Are We From Mastering A Task?

Shuo Sun ⋅ Yimin Zhao ⋅ Christina Lee ⋅ JIAWEI SUN ⋅ Chengran Yuan ⋅ Zefan Huang ⋅ Dongen Li ⋅ Justin Yeoh ⋅ Alok Prakash ⋅ Thomas Malone ⋅ Marcelo Ang Jr

Dec 4, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

As the field progresses toward Artificial General Intelligence (AGI), there is a pressing need for more comprehensive and insightful evaluation frameworks that go beyond aggregate performance metrics. This paper introduces a unified rating system that jointly models the difficulty of individual test cases and the competency of AI models (or humans) across vision, language, and action domains. Unlike existing metrics that focus solely on models, our approach allows for fine-grained, difficulty-aware evaluations through competitive interactions between models and tasks, capturing both the long-tail distribution of real-world challenges and the competency gap between current models and full task mastery. We validate the generalizability and robustness of our system through extensive experiments on multiple established datasets and models across distinct AGI domains. The resulting rating distributions offer novel perspectives and interpretable insights into task difficulty, model progression, and the outstanding challenges that remain on the path to achieving full AGI task mastery. We have made our code and results publicly available at https://ss47816.github.io/AGI-Elo/.

View full details

Poster

QUT-DV25: A Dataset for Dynamic Analysis of Next-Gen Software Supply Chain Attacks

Sk Tanzir Mehedi ⋅ Raja Jurdak ⋅ Chadni Islam ⋅ Gowri Sankar Ramachandran

Dec 4, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Securing software supply chains is a growing challenge due to the inadequacy of existing datasets in capturing the complexity of next-gen attacks, such as multiphase malware execution, remote access activation, and dynamic payload generation. Existing datasets, which rely on metadata inspection and static code analysis, are inadequate for detecting such attacks. This creates a critical gap because these datasets do not capture what happens during and after a package is installed. To address this gap, we present QUT-DV25, a dynamic analysis dataset specifically designed to support and advance research on detecting and mitigating supply chain attacks within the Python Package Index (PyPI) ecosystem. This dataset captures install and post-install-time traces from 14,271 Python packages, of which 7,127 are malicious. The packages are executed in an isolated sandbox environment using an extended Berkeley Packet Filter (eBPF) kernel and user-level probes. It captures 36 real-time features, that includes system calls, network traffic, resource usages, directory access patterns, dependency logs, and installation behaviors, enabling the study of next-gen attack vectors. ML analysis using the QUT-DV25 dataset identified four malicious PyPI packages previously labeled as benign, each with thousands of downloads. These packages deployed covert remote access and multi-phase payloads, were reported to PyPI maintainers, and subsequently removed. This highlights the practical value of QUT-DV25, as it outperforms reactive, metadata, and static datasets, offering a robust foundation for developing and benchmarking advanced threat detection within the evolving software supply chain ecosystem.

View full details

Poster

VideoUFO: A Million-Scale User-Focused Dataset for Text-to-Video Generation

Wenhao Wang ⋅ Yi Yang

Dec 4, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Text-to-video generative models convert textual prompts into dynamic visual content, offering wide-ranging applications in film production, gaming, and education. However, their real-world performance often falls short of user expectations. One key reason is that these models have not been trained on videos related to some topics users want to create. In this paper, we propose VideoUFO, the first Video dataset specifically curated to align with Users' FOcus in real-world scenarios. Beyond this, our VideoUFO also features: (1) minimal (0.29\%) overlap with existing video datasets, and (2) videos searched exclusively via YouTube's official API under the Creative Commons license. These two attributes provide future researchers with greater freedom to broaden their training sources. The VideoUFO comprises over 1.09 million video clips, each paired with both a brief and a detailed caption (description). Specifically, through clustering, we first identify 1,291 user-focused topics from the million-scale real text-to-video prompt dataset, VidProM. Then, we use these topics to retrieve videos from YouTube, split the retrieved videos into clips, and generate both brief and detailed captions for each clip. After verifying the clips with specified topics, we are left with about 1.09 million video clips. Our experiments reveal that (1) current 16 text-to-video models do not achieve consistent performance across all user-focused topics; and (2) a simple model trained on VideoUFO outperforms others on worst-performing topics. The dataset and code are publicly available at https://huggingface.co/datasets/WenhaoWang/VideoUFO and https://github.com/WangWenhao0716/BenchUFO under the CC BY 4.0 License.

View full details

Poster

Measuring what Matters: Construct Validity in Large Language Model Benchmarks

Andrew M. Bean ⋅ Ryan Othniel Kearns ⋅ Angelika Romanou ⋅ Franziska Sofia Hafner ⋅ Harry Mayne ⋅ Jan Batzner ⋅ Negar Foroutan Eghlidi ⋅ Chris Schmitz ⋅ Karolina Korgul ⋅ Hunar Batra ⋅ Oishi Deb ⋅ Emma Beharry ⋅ Cornelius Emde ⋅ Thomas Foster ⋅ Anna Gausen ⋅ María Grandury ⋅ Sophia Han ⋅ Valentin Hofmann ⋅ Lujain Ibrahim ⋅ Hazel Kim ⋅ Hannah Rose Kirk ⋅ Fangru Lin ⋅ Gabrielle Liu ⋅ Lennart Luettgau ⋅ Jabez Magomere ⋅ Jonathan Rystrøm ⋅ Anna Sotnikova ⋅ Yushi Yang ⋅ Yilun Zhao ⋅ Adel Bibi ⋅ Antoine Bosselut ⋅ Ronald Clark ⋅ Arman Cohan ⋅ Jakob Foerster ⋅ Yarin Gal ⋅ Scott Hale ⋅ Deborah Raji ⋅ Christopher Summerfield ⋅ Philip Torr ⋅ Cozmin Ududec ⋅ Luc Rocher ⋅ Adam Mahdi

Dec 4, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Evaluating large language models (LLMs) is crucial for both assessing their capabilities and identifying safety or robustness issues prior to deployment. Reliably measuring abstract and complex phenomena such as `safety' and `robustness' requires strong construct validity, that is, having measures that represent what matters to the phenomenon. With a team of 29 expert reviewers, we conduct a systematic review of 445 LLM benchmarks from leading conferences in natural language processing and machine learning. Across the reviewed articles, we find patterns related to the measured phenomena, tasks, and scoring metrics which undermine the validity of the resulting claims. To address these shortcomings, we provide eight key recommendations and detailed actionable guidance to researchers and practitioners in developing LLM benchmarks.

View full details

Poster

ReinAD: Towards Real-world Industrial Anomaly Detection with a Comprehensive Contrastive Dataset

Xu Wang ⋅ Jingyuan Zhuo ⋅ Zhiyuan You ⋅ Zhiyu Tan ⋅ Yikuan Yu ⋅ Siyu Wang ⋅ Xinyi Le

Dec 4, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Recent years have witnessed significant advancements in industrial anomaly detection (IAD) thanks to existing anomaly detection datasets. However, the large performance gap between these benchmarks and real industrial practice reveals critical limitations in existing datasets. We argue that the mismatch between current datasets and real industrial scenarios becomes the primary barrier to practical IAD deployment. To this end, we propose ReinAD dataset, a comprehensive contrastive dataset towards Real-world industrial Anomaly Detection. Our dataset prioritizes three critical real-world requirements: 1) Contrast-based anomaly definition that is essential for industrial practice, 2) Fine-grained unaligned image pairs reflecting real inspections, and 3) Large-scale data from active production lines spanning multiple industrial categories. Based on our dataset, we introduce the ReinADNet. It takes both normal reference and test images as inputs, achieving anomaly detection through normal-anomaly comparison. To address the fine-grained and unaligned properties of real industrial scenes, our method integrates pyramidal similarity aggregation for comprehensive anomaly characterization and global-local feature fusion for spatial misalignment tolerance. Our method outperforms all baselines on the ReinAD dataset (e.g., 64.5% v.s. 59.5% in 1-shot image-level AP) under all settings. Extensive experiments across several datasets demonstrate our dataset's challenging nature and our method's superior generalization. This work provides a solid foundation for practical industrial anomaly detection. Dataset and code are available at https://tocmac.github.io/ReinAD.

View full details

Poster

FAIR Universe HiggsML Uncertainty Dataset and Competition

Wahid Bhimji ⋅ Ragansu Chakkappai ⋅ Po-Wen Chang ⋅ Yuan-Tang Chou ⋅ Sascha Diefenbacher ⋅ Jordan Dudley ⋅ Ibrahim Elsharkawy ⋅ Steven Farrell ⋅ Aishik Ghosh ⋅ Cristina Giordano ⋅ Isabelle Guyon ⋅ Chris Harris ⋅ Yota Hashizume ⋅ Shih-Chieh Hsu ⋅ Elham E Khoda ⋅ Claudius Krause ⋅ Ang Li ⋅ Benjamin Nachman ⋅ David Rousseau ⋅ Robert Schöfbeck ⋅ Maryam Shooshtari ⋅ Dennis Schwarz ⋅ Ihsan Ullah ⋅ Daohan Wang ⋅ Yulei Zhang

Dec 4, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

The FAIR Universe – HiggsML Uncertainty Challenge focused on measuring the physical properties of elementary particles with imperfect simulators. Participants were required to compute and report confidence intervals for a parameter of interest regarding the Higgs boson while accounting for various systematic (epistemic) uncertainties. The dataset is a tabular dataset of 28 features and 280 million instances. Each instance represents a simulated proton-proton collision as observed at CERN’s Large Hadron Collider in Geneva, Switzerland. The features of these simulations were chosen to capture key characteristics of different types of particles. These include primary attributes, such as the energy and three-dimensional momentum of the particles, as well as derived attributes, which are calculated from the primary ones using domain-specific knowledge. Additionally, a label feature designates each instance’s type of proton-proton collision, distinguishing the Higgs boson events of interest from three background sources. As outlined in this paper, the permanent dataset release allows long-term benchmarking of new techniques. The leading submissions, including Contrastive Normalising Flows and Density Ratios estimation through classification, are described. Our challenge has brought together the physics and machine learning communities to advance our understanding and methodologies in handling systematic uncertainties within AI techniques.

View full details

Poster

SWE-rebench: An Automated Pipeline for Task Collection and Decontaminated Evaluation of Software Engineering Agents

Ibragim Badertdinov ⋅ Alexander Golubev ⋅ Maksim Nekrashevich ⋅ Anton Shevtsov ⋅ Simon Karasik ⋅ Andrei Andriushchenko ⋅ Maria Trofimova ⋅ Daria Litvintseva ⋅ Boris Yangel

Dec 4, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

LLM-based agents have shown promising capabilities in a growing range of software engineering (SWE) tasks. However, advancing this field faces two critical challenges. First, high-quality training data is scarce, especially data that reflects real-world SWE scenarios, where agents must interact with development environments, execute code and adapt behavior based on the outcomes of their actions. Existing datasets are either limited to one-shot code generation or comprise small, manually curated collections of interactive tasks, lacking both scale and diversity. Second, the lack of fresh interactive SWE tasks affects evaluation of rapidly improving models, as static benchmarks quickly become outdated due to contamination issues. To address these limitations, we introduce a novel, automated, and scalable pipeline to continuously extract real-world interactive SWE tasks from diverse GitHub repositories. Using this pipeline, we construct SWE-rebench, a public dataset comprising over 21,000 interactive Python-based SWE tasks, suitable for reinforcement learning of SWE agents at scale. Additionally, we use continuous supply of fresh tasks collected using SWE-rebench methodology to build a contamination-free benchmark for agentic software engineering. We compare results of various LLMs on this benchmark to results on SWE-bench Verified and show that performance of some language models might be inflated due to contamination issues.

View full details

Poster

Video-SafetyBench: A Benchmark for Safety Evaluation of Video LVLMs

Xuannan Liu ⋅ Zekun Li ⋅ Zheqi He ⋅ Peipei Li ⋅ shuhan xia ⋅ Xing Cui ⋅ Huaibo Huang ⋅ Xi Yang ⋅ Ran He

Dec 4, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

The increasing deployment of Large Vision-Language Models (LVLMs) raises safety concerns under potential malicious inputs. However, existing multimodal safety evaluations primarily focus on model vulnerabilities exposed by static image inputs, ignoring the temporal dynamics of video that may induce distinct safety risks. To bridge this gap, we introduce Video-SafetyBench, the first comprehensive benchmark designed to evaluate the safety of LVLMs under video-text attacks. It comprises 2,264 video-text pairs spanning 48 fine-grained unsafe categories, each pairing a synthesized video with either a harmful query, which contains explicit malice, or a benign query, which appears harmless but triggers harmful behavior when interpreted alongside the video. To generate semantically accurate videos for safety evaluation, we design a controllable pipeline that decomposes video semantics into subject images (what is shown) and motion text (how it moves), which jointly guide the synthesis of query-relevant videos. To effectively evaluate uncertain or borderline harmful outputs, we propose RJScore, a novel LLM-based metric that incorporates the confidence of judge models and human-aligned decision threshold calibration. Extensive experiments show that benign-query video composition achieves average attack success rates of 67.2%, revealing consistent vulnerabilities to video-induced attacks. We believe Video-SafetyBench will catalyze future research into video-based safety evaluation and defense strategies.

View full details

Poster

MMOT: The First Challenging Benchmark for Drone-based Multispectral Multi-Object Tracking

Tianhao Li ⋅ Tingfa Xu ⋅ Ying Wang ⋅ Haolin Qin ⋅ Xu Lin ⋅ Jianan Li

Dec 4, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Drone-based multi-object tracking is essential yet highly challenging due to small targets, severe occlusions, and cluttered backgrounds. Existing RGB-based multi-object tracking algorithms heavily depend on spatial appearance cues such as color and texture, which often degrade in aerial views, compromising tracking reliability. Multispectral imagery, capturing pixel-level spectral reflectance, provides crucial spectral cues that significantly enhance object discriminability under degraded spatial conditions. However, the lack of dedicated multispectral UAV datasets has hindered progress in this domain. To bridge this gap, we introduce **MMOT**, the first challenging benchmark for drone-based multispectral multi-object tracking dataset. It features three key characteristics: (i) **Large Scale** — 125 video sequences with over 488.8K annotations across eight object categories; (ii) **Comprehensive Challenges** — covering diverse real-world challenges such as extreme small targets, high-density scenarios, severe occlusions and complex platform motion; and (iii) **Precise Oriented Annotations** — enabling accurate localization and reduced object ambiguity under aerial perspectives. To better extract spectral features and leverage oriented annotations, we further present a multispectral and orientation-aware MOT scheme adapting existing MOT methods, featuring: (i) a lightweight Spectral 3D-Stem integrating spectral features while preserving compatibility with RGB pretraining; (ii) a orientation-aware Kalman filter for precise state estimation; and (iii) an end-to-end orientation-adaptive transformer architecture. Extensive experiments across representative trackers consistently show that multispectral input markedly improves tracking performance over RGB baselines, particularly for small and densely packed objects. We believe our work will benefit the community for advancing drone-based multispectral multi-object tracking research. Our MMOT, code and benchmarks are publicly available at https://github.com/Annzstbl/MMOT.

View full details

Poster

RAM-W600: A Multi-Task Wrist Dataset and Benchmark for Rheumatoid Arthritis

YANG SONGXIAO ⋅ Haolin Wang ⋅ Yao Fu ⋅ Ye Tian ⋅ Tamostu Kamishima ⋅ Masayuki Ikebe ⋅ Yafei Ou ⋅ Masatoshi Okutomi

Dec 4, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Rheumatoid arthritis (RA) is a common autoimmune disease that has been the focus of research in computer-aided diagnosis (CAD) and disease monitoring. In clinical settings, conventional radiography (CR) is widely used for the screening and evaluation of RA due to its low cost and accessibility. The wrist is a critical region for the diagnosis of RA. However, CAD research in this area remains limited, primarily due to the challenges in acquiring high-quality instance-level annotations. (i) The wrist comprises numerous small bones with narrow joint spaces, complex structures, and frequent overlaps, requiring detailed anatomical knowledge for accurate annotation. (ii) Disease progression in RA often leads to osteophyte, bone erosion (BE), and even bony ankylosis, which alter bone morphology and increase annotation difficulty, necessitating expertise in rheumatology.This work presents a multi-task dataset for wrist bone in CR, including two tasks: (i) wrist bone instance segmentation and (ii) Sharp/van der Heijde (SvdH) BE scoring, which is the first public resource for wrist bone instance segmentation. This dataset comprises 1048 wrist conventional radiographs of 388 patients from six medical centers, with pixel-level instance segmentation annotations for 618 images and SvdH BE scores for 800 images. This dataset can potentially support a wide range of research tasks related to RA, including joint space narrowing (JSN) progression quantification, BE detection, bone deformity evaluation, and osteophyte detection. It may also be applied to other wrist-related tasks, such as carpal bone fracture localization.We hope this dataset will significantly lower the barrier to research on wrist RA and accelerate progress in CAD research within the RA-related domain.Benchmark \& Code: https://github.com/YSongxiao/RAM-W600Data \& Dataset Card: https://huggingface.co/datasets/TokyoTechMagicYang/RAM-W600

View full details

Poster

IndustryEQA: Pushing the Frontiers of Embodied Question Answering in Industrial Scenarios

Yifan Li ⋅ Yuhang Chen ⋅ Anh Dao ⋅ Lichi Li ⋅ Zhongyi Cai ⋅ Zhen Tan ⋅ Tianlong Chen ⋅ Yu Kong

Dec 4, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Existing Embodied Question Answering (EQA) benchmarks primarily focus on household environments, often overlooking safety-critical aspects and reasoning processes pertinent to industrial settings. This drawback limits the evaluation of agent readiness for real-world industrial applications. To bridge this, we introduce IndustryEQA, the first benchmark dedicated to evaluating embodied agent capabilities within safety-critical industrial warehouse scenarios. Built upon the NVIDIA Isaac Sim platform, IndustryEQA provides high-fidelity episodic memory videos featuring diverse industrial assets, dynamic human agents, and carefully designed hazardous situations inspired by real-world safety guidelines. The benchmark includes rich annotations covering six categories: equipment safety, human safety, object recognition, attribute recognition, temporal understanding, and spatial understanding. Besides, it also provides extra reasoning evaluation based on these categories. Specifically, it comprises 971 question-answer pairs generated from small warehouse scenarios and 373 pairs from large ones, incorporating scenarios with and without human. We further propose a comprehensive evaluation framework, including various baseline models, to assess their general perception and reasoning abilities in industrial environments. IndustryEQA aims to steer EQA research towards developing more robust, safety-aware, and practically applicable embodied agents for complex industrial environments.

View full details

Poster

DAVE: Diagnostic benchmark for Audio Visual Evaluation

Gorjan Radevski ⋅ Teodora Popordanoska ⋅ Matthew Blaschko ⋅ Tinne Tuytelaars

Dec 4, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Audio-visual understanding is a rapidly evolving field that seeks to integrate and interpret information from both auditory and visual modalities. Despite recent advances in multi-modal learning, existing benchmarks often suffer from strong visual bias -- when answers can be inferred from visual data alone -- and provide only aggregate scores that conflate multiple sources of error. This makes it difficult to determine whether models struggle with visual understanding, audio interpretation, or audio-visual alignment. In this work, we introduce DAVE: Diagnostic Audio Visual Evaluation, a novel benchmark dataset designed to systematically evaluate audio-visual models across controlled settings. DAVE alleviates existing limitations by (i) ensuring both modalities are necessary to answer correctly and (ii) decoupling evaluation into atomic subcategories. Our detailed analysis of state-of-the-art models reveals specific failure modes and provides targeted insights for improvement. By offering this standardized diagnostic framework, we aim to facilitate more robust development of audio-visual models.Dataset: https://huggingface.co/datasets/gorjanradevski/daveCode: https://github.com/gorjanradevski/dave

View full details

Poster

EyeBench: Predictive Modeling from Eye Movements in Reading

Omer Shubi ⋅ David Robert Reich ⋅ Keren Gruteke Klein ⋅ Yuval Angel ⋅ Paul Prasse ⋅ Lena A. Jäger ⋅ Yevgeni Berzak

Dec 4, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

We present EyeBench, the first benchmark designed to evaluate machine learning models that decode cognitive and linguistic information from eye movements during reading. EyeBench offers an accessible entry point to the challenging and underexplored domain of modeling eye tracking data paired with text, aiming to foster innovation at the intersection of multimodal AI and cognitive science. The benchmark provides a standardized evaluation framework for predictive models, covering a diverse set of datasets and tasks, ranging from assessment of reading comprehension to detection of developmental dyslexia. Progress on the EyeBench challenge will pave the way for both practical real-world applications, such as adaptive user interfaces and personalized education, and scientific advances in understanding human language processing. The benchmark is released as an open-source software package which includes data downloading and harmonization scripts, baselines and state-of-the-art models, as well as evaluation code, publicly available at https://github.com/EyeBench/eyebench.

View full details

Poster

ColorBench: Can VLMs See and Understand the Colorful World? A Comprehensive Benchmark for Color Perception, Reasoning, and Robustness

Yijun Liang ⋅ Ming Li ⋅ Chenrui Fan ⋅ Ziyue Li ⋅ Dang Nguyen ⋅ Kwesi Cobbina ⋅ Shweta Bhardwaj ⋅ Jiuhai Chen ⋅ Fuxiao Liu ⋅ Tianyi Zhou

Dec 4, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Color plays an important role in human perception and usually provides critical clues in visual reasoning. However, it is unclear whether and how vision-language models (VLMs) can perceive, understand, and leverage color as humans.This paper introduces ColorBench, an innovative benchmark meticulously crafted to assess the capabilities of VLMs in color understanding, including color perception, reasoning, and robustness. By curating a suite of diverse test scenarios, with grounding in real applications, ColorBench evaluates how these models perceive colors, infer meanings from color-based cues, and maintain consistent performance under varying color transformations. Through an extensive evaluation of 32 VLMs with varying language models and vision encoders, our paper reveals some undiscovered findings: (i) The scaling law (larger models are better) still holds on ColorBench, while the language model plays a more important role than the vision encoder. (ii) However, the performance gaps across models are relatively small, indicating that color understanding has been largely neglected by existing VLMs. (iii) CoT reasoning improves color understanding accuracies and robustness, though they are vision-centric tasks. (iv) Color clues are indeed leveraged by VLMs on ColorBench but they can also mislead models in some tasks.These findings highlight the critical limitations of current VLMs and underscore the need to enhance color comprehension. Our ColorBench can serve as a foundational tool for advancing the study of human-level color understanding of multimodal AI.

View full details

Poster

CoreaSpeech: Korean Speech Corpus via JAMO-based Coreset Selection for Efficient and Robust Korean Speech Generation

Ki-Joong Kwon ⋅ Junho So ⋅ Sang-Hoon Lee

Dec 4, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

While substantial advances have been achieved in TTS for languages such as English and Mandarin, Korean remains comparatively underrepresented due to the lack of rigorous preprocessing methods, systematically constructed datasets, a shortage of standardized Korean TTS benchmarks, and explicitly optimized models for Korean. To address these limitations, we propose a Korean-tailored data-refinement and coreset selection pipeline. It refines speech data and performs textual normalization especially for numerals and English terms, followed by a novel coreset selection strategy that leverages Jamo-based linguistic and phonological features unique to Korean. As a result, we release CoreaSpeech, an efficient and robust Korean speech corpus comprising 700 hours across 21,449 speakers. This refined core subset, evenly balanced across utterances ranging from 0 to 30 seconds, is derived from 2,058 hours of widely used Korean datasets. Building on this, we conducted extensive experiments via cross-lingual fine-tuning with our CoreaSpeech dataset. Furthermore, we introduce a new universal Korean TTS benchmark dataset including clean, noisy, and numeric subsets. Additionally, we demonstrate that our Korean-specific text normalization serves as a plug-and-play module, reliably improving performance regardless of the underlying TTS architecture. We publicly release our dataset, pipeline code, and evaluation benchmarks to support reproducible research and further advances in Korean and multilingual speech synthesis.

View full details

Poster

FineGRAIN: Evaluating Failure Modes of Text-to-Image Models with Vision Language Model Judges

Kevin Hayes ⋅ Micah Goldblum ⋅ Vikash Sehwag ⋅ Gowthami Somepalli ⋅ Ashwinee Panda ⋅ Tom Goldstein

Dec 4, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Text-to-image (T2I) models are capable of generating visually impressive images, yet they often fail to accurately capture specific attributes in user prompts, such as the correct number of objects with the specified colors. The diversity of such errors underscores the need for a hierarchical evaluation framework that can compare prompt adherence abilities of different image generation models. Simultaneously, benchmarks of vision language models (VLMs) have not kept pace with the complexity of scenes that VLMs are used to annotate. In this work, we propose a structured methodology for jointly evaluating T2I models and VLMs by testing whether VLMs can identify 27 specific failure modes in the images generated by T2I models conditioned on challenging prompts. Our second contribution is a dataset of prompts and images generated by 5 T2I models (Flux, SD3-Medium, SD3-Large, SD3.5-Medium, SD3.5-Large) and the corresponding annotations from VLMs (Molmo, InternVL3, Pixtral) annotated by an LLM (Llama3) to test whether VLMs correctly identify the failure mode in a generated image. By analyzing failure modes on a curated set of prompts, we reveal systematic errors in attribute fidelity and object representation. Our findings suggest that current metrics are insufficient to capture these nuanced errors, highlighting the importance of targeted benchmarks for advancing generative model reliability and interpretability.

View full details

Poster

EOC-Bench: Can MLLMs Identify, Recall, and Forecast Objects in an Egocentric World?

Yuqian Yuan ⋅ Ronghao Dang ⋅ long li ⋅ Wentong Li ⋅ Dian Jiao ⋅ Xin Li ⋅ Deli Zhao ⋅ Fan Wang ⋅ Wenqiao Zhang ⋅ Jun Xiao ⋅ Yueting Zhuang

Dec 4, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

The emergence of multimodal large language models (MLLMs) has driven breakthroughs in egocentric vision applications. These applications necessitate persistent, context-aware understanding of objects, as users interact with tools in dynamic and cluttered environments. However, existing embodied benchmarks primarily focus on static scene exploration, emphasizing object's appearance and spatial attributes while neglecting the assessment of dynamic changes arising from users' interactions.capabilities in object-level spatiotemporal reasoning required for real-world interactions.To address this gap, we introduce EOC-Bench, an innovative benchmark designed to systematically evaluate object-centric embodied cognition in dynamic egocentric scenarios.Specially, EOC-Bench features 3,277 meticulously annotated QA pairs categorized into three temporal categories: Past, Present, and Future, covering 11 fine-grained evaluation dimensions and 3 visual object referencing types.To ensure thorough assessment, we develop a mixed-format human-in-the-loop annotation frameworkBased on EOC-Bench, we conduct comprehensive evaluations of various proprietary, open-source, and object-level MLLMs. EOC-Bench serves as a crucial tool for advancing the embodied object cognitive capabilities of MLLMs, establishing a robust foundation for developing reliable core models for embodied systems.

View full details

Poster

Toward a Vision-Language Foundation Model for Medical Data: Multimodal Dataset and Benchmarks for Vietnamese PET/CT Report Generation

Tien Nguyen ⋅ Dac Nguyen ⋅ Duc Nguyen The Minh ⋅ Trung Thanh Nguyen ⋅ Truong Thao Nguyen ⋅ Hieu Pham ⋅ Johan Barthelemy ⋅ Tran Minh Quan ⋅ Quoc Viet Hung Nguyen ⋅ Thanh Tam Nguyen ⋅ Mai Son ⋅ Chau Anh ⋅ Thanh Nguyen ⋅ Phi Le Nguyen

Dec 4, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Vision-Language Foundation Models (VLMs), trained on large-scale multimodal datasets, have driven significant advances in Artificial Intelligence (AI) by enabling rich cross-modal reasoning. Despite their success in general domains, applying these models to medical imaging remains challenging due to the limited availability of diverse imaging modalities and multilingual clinical data. Most existing medical VLMs are trained on a subset of imaging modalities and focus primarily on high-resource languages, thus limiting their generalizability and clinical utility. To address these limitations, we introduce a novel Vietnamese-language multimodal medical dataset consisting of 2,757 whole-body PET/CT volumes from independent patients and their corresponding full-length clinical reports. This dataset is designed to fill two pressing gaps in medical AI development: (1) the lack of PET/CT imaging data in existing VLMs training corpora, which hinders the development of models capable of handling functional imaging tasks; and (2) the underrepresentation of low-resource languages, particularly the Vietnamese language, in medical vision-language research. To the best of our knowledge, this is the first dataset to provide comprehensive PET/CT-report pairs in Vietnamese. We further introduce a training framework to enhance VLMs' learning, including data augmentation and expert-validated test sets. We conduct comprehensive experiments benchmarking state-of-the-art VLMs on downstream tasks, including medical report generation and visual question answering. The experimental results show that incorporating our dataset significantly improves the performance of existing VLMs. However, despite these advancements, the models still underperform on clinically critical criteria, particularly the diagnosis of lung cancer, indicating substantial room for future improvement. We believe this dataset and benchmark will serve as a pivotal step in advancing the development of more robust VLMs for medical imaging, particularly in low-resource languages, and improving their clinical relevance in Vietnamese healthcare.

View full details

Poster

RAG-IGBench: Innovative Evaluation for RAG-based Interleaved Generation in Open-domain Question Answering

Rongyang Zhang ⋅ Yuqing Huang ⋅ Chengqiang Lu ⋅ Qimeng Wang ⋅ Yan Gao ⋅ YIWU ⋅ Yao Hu ⋅ Yin Xu ⋅ Wei Wang ⋅ Hao Wang ⋅ Enhong Chen

Dec 4, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

In real-world scenarios, providing user queries with visually enhanced responses can considerably benefit understanding and memory, underscoring the great value of interleaved image-text generation. Despite recent progress, like the visual autoregressive model that unifies text and image processing in a single transformer architecture, generating high-quality interleaved content remains challenging. Moreover, evaluations of these interleaved sequences largely remain underexplored, with existing benchmarks often limited by unimodal metrics that inadequately assess the intricacies of combined image-text outputs. To address these issues, we present RAG-IGBench, a thorough benchmark designed specifically to evaluate the task of Interleaved Generation based on Retrieval-Augmented Generation (RAG-IG) in open-domain question answering. RAG-IG integrates multimodal large language models (MLLMs) with retrieval mechanisms, enabling the models to access external image-text information for generating coherent multimodal content. Distinct from previous datasets, RAG-IGBench draws on the latest publicly available content from social platforms and introduces innovative evaluation metrics that measure the quality of text and images, as well as their consistency. Through extensive experiments with state-of-the-art MLLMs (both open-source and proprietary) on RAG-IGBench, we provide an in-depth analysis examining the capabilities and limitations of these models. Additionally, we validate our evaluation metrics by demonstrating their high correlation with human assessments. Models fine-tuned on RAG-IGBench's training set exhibit improved performance across multiple benchmarks, confirming both the quality and practical utility of our dataset. Our benchmark is available at https://github.com/zry13/RAG-IGBench.

View full details

Poster

Impromptu VLA: Open Weights and Open Data for Driving Vision-Language-Action Models

Haohan Chi ⋅ Huan-ang Gao ⋅ Ziming Liu ⋅ Jianing Liu ⋅ Chenyu Liu ⋅ Jinwei Li ⋅ Kaisen Yang ⋅ Yangcheng Yu ⋅ Zeda Wang ⋅ Wenyi Li ⋅ Leichen Wang ⋅ Xingtao HU ⋅ HAO SUN ⋅ Hang Zhao ⋅ Hao Zhao

Dec 4, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Vision-Language-Action (VLA) models for autonomous driving show promise but falter in unstructured corner case scenarios, largely due to a scarcity of targeted benchmarks. To address this, we introduce Impromptu VLA. Our core contribution is the Impromptu VLA Dataset: over 80,000 meticulously curated video clips, distilled from over 2M source clips sourced from 8 open-source large-scale datasets. This dataset is built upon our novel taxonomy of four challenging unstructured categories and features rich, planning-oriented question-answering annotations and action trajectories. Crucially, experiments demonstrate that VLAs trained with our dataset achieve substantial performance gains on established benchmarks—improving closed-loop NeuroNCAP scores and collision rates, and reaching near state-of-the-art L2 accuracy in open-loop nuScenes trajectory prediction. Furthermore, our Q&A suite serves as an effective diagnostic, revealing clear VLM improvements in perception, prediction, and planning. Our code, data and models are available at https://github.com/ahydchh/Impromptu-VLA

View full details

Poster

A Dataset for Distilling Knowledge Priors from Literature for Therapeutic Design

Haydn Jones ⋅ Natalie Maus ⋅ Josh magnus Ludan ⋅ Maggie Huan ⋅ Jiaming Liang ⋅ Marcelo Der Torossian Torres ⋅ Jiatao Liang ⋅ Zachary Ives ⋅ Yoseph Barash ⋅ Cesar de la Fuente-Nunez ⋅ Jacob Gardner ⋅ Mark Yatskar

Dec 4, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

AI-driven discovery can greatly reduce design time and enhance new therapeutics' effectiveness. Models using simulators explore broad design spaces but risk violating implicit constraints due to a lack of experimental priors. For example, in a new analysis across diverse models on the GuacaMol benchmark using supervised classifiers, over 60\% of molecules proposed had a high probability of being mutagenic. In this work, we introduce Medex, a dataset of priors for design problems extracted from literature describing compounds used in lab settings. It is constructed with LLM pipelines for discovering therapeutic entities in relevant paragraphs and summarizing information in concise fair-use facts. Medex consists of 32.3 million pairs of natural language facts, and appropriate entity representations (i.e. SMILES or RefSeq IDs). To demonstrate the potential of the data, we train LLM, CLIP, and LLaVA architectures to reason jointly about text and design targets and evaluate on tasks from the Therapeutic Data Commons (TDC). Medex is highly effective for creating models with strong priors: in supervised prediction problems that use our data for pretraining, our best models with 15M learnable parameters outperform larger 2B TxGemma on both regression and classification TDC tasks, and perform comparably to 9B models on average. Models built with Medex can be used as constraints while optimizing for novel molecules in GuacaMol, resulting in proposals that are safer and nearly as effective. We release our dataset on HuggingFace at https://huggingface.co/datasets/DocAndDesign/Medex, and will provide expanded versions as the available literature grows.

View full details

Poster

CarbonGlobe: A Global-Scale, Multi-Decade Dataset and Benchmark for Carbon Forecasting in Forest Ecosystems

Zhihao Wang ⋅ Lei Ma ⋅ George Hurtt ⋅ Xiaowei Jia ⋅ Yanhua Li ⋅ Ruohan Li ⋅ Zhili Li ⋅ Shuo Xu ⋅ Yiqun Xie

Dec 4, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Forest ecosystems play a critical role in the Earth system as major carbon sinks that are essential for carbon neutralization and climate change mitigation. However, the Earth has undergone significant deforestation and forest degradation, and the remaining forested areas are also facing increasing pressures from socioeconomic factors and climate change, potentially pushing them towards tipping points.Responding to the grand challenge, a theory-based Ecosystem Demography (ED) model has been continuously developed over the past two decades and serves as a key component in major initiatives, including the Global Carbon Budget, NASA Carbon Monitoring System, and US Greenhouse Gas Center. Despite its growing importance in combating climate change and shaping carbon policies, ED's expensive computation significantly limits its ability to estimate carbon dynamics at the global scale with high spatial resolution.Recently, machine learning (ML) models have shown promising potential in approximating theory-based models with interesting success in various domains including weather forecasting, thanks to the open-source benchmark datasets made available.However, there are currently no publicly available ML-ready datasets for global carbon dynamics forecasting in forest ecosystems. The limited data availability hinders the development of corresponding ML emulators. Furthermore, the inputs needed for running ED are highly complex with over a hundred variables from various remote sensing products. To bridge the gap, we develop a new ML-ready benchmark dataset, \textit{CarbonGlobe}, for carbon dynamics forecasting, featuring that: (1) the data has a global-scale coverage at 0.5$^\circ$ resolution; (2) the temporal range spans 40 years; (3) the inputs integrate extensive multi-source data from different sensing products, with calibrated outputs from ED; (4) the data is formatted in ML-ready forms and split into different evaluation scenarios based on climate conditions, etc.; (5) a set of problem-driven metrics is designed to develop benchmarks using various ML models to best align with the needs of downstream applications. Our dataset and code are publicly available on Kaggle and GitHub: https://www.kaggle.com/datasets/zhihaow/carbonglobe and https://github.com/zhwang0/carbon-globe.

View full details

Poster

UMU-Bench: Closing the Modality Gap in Multimodal Unlearning Evaluation

Chengye Wang ⋅ Yuyuan Li ⋅ XiaoHua Feng ⋅ Chaochao Chen ⋅ Xiaolin Zheng ⋅ Jianwei Yin

Dec 4, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Although Multimodal Large Language Models (MLLMs) have advanced numerous fields, their training on extensive multimodal datasets introduces significant privacy concerns, prompting the necessity for efficient unlearning methods.However, current multimodal unlearning approaches often directly adapt techniques from unimodal contexts, largely overlooking the critical issue of modality alignment, i.e., consistently removing knowledge across both unimodal and multimodal settings. To close this gap, we introduce UMU-bench, a unified benchmark specifically targeting modality misalignment in multimodal unlearning. UMU-bench consists of a meticulously curated dataset featuring 653 individual profiles, each described with both unimodal and multimodal knowledge.Additionally, novel tasks and evaluation metrics focusing on modality alignment are introduced, facilitating a comprehensive analysis of unimodal and multimodal unlearning effectiveness. Through extensive experimentation with state-of-the-art unlearning algorithms on UMU-bench, we demonstrate prevalent modality misalignment issues in existing methods. These findings underscore the critical need for novel multimodal unlearning approaches explicitly considering modality alignment.

View full details

Poster

TraffiDent: A Dataset for Understanding the Interplay Between Traffic Dynamics and Incidents

Xiaochuan Gou ⋅ Ziyue Li ⋅ Tian Lan ⋅ Junpeng Lin ⋅ zhishuai Li ⋅ Bingyu Zhao ⋅ Chen Zhang ⋅ Di Wang ⋅ Xiangliang Zhang

Dec 4, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Long-separated research has been conducted on two highly correlated tracks: traffic and incidents. Traffic track witnesses complicating deep learning models, e.g., to push the prediction a few percent more accurate, and the incident track only studies the incidents alone, e.g., to infer the incident risk. We, for the first time, spatiotemporally aligned the two tracks in a large-scale region (16,972 traffic nodes) from year 2022 to 2024: our TraffiDent dataset includes traffic, i.e., time-series indexes on traffic flow, lane occupancy, and average vehicle speed, and incident, whose records are spatiotemporally aligned with traffic data, with seven different incident classes. Additionally, each node includes detailed physical and policy-level meta-attributes of lanes. Previous datasets typically contain only traffic or incident data in isolation, limiting research to general forecasting tasks. TraffiDent integrates both, enabling detailed analysis of traffic-incident interactions and causal relationships. To demonstrate its broad applicability, we design: (1) post-incident traffic forecasting to quantify the impact of different incidents on traffic indexes; (2) incident classification using traffic indexes to determine the incidents types for precautions measures; (3) global causal analysis among the traffic indexes, meta-attributes, and incidents to give high-level guidance of the interrelations of various factors; (4) local causal analysis within road nodes to examine how different incidents affect the road segments' relations. The dataset is available at https://xaitraffic.github.io.

View full details

Poster

CrypticBio: A Large Multimodal Dataset for Visually Confusing Species

Georgiana Manolache ⋅ Gerard Schouten ⋅ Joaquin Vanschoren

Dec 4, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

We present CrypticBio, the largest publicly available multimodal dataset of visually confusing species, specifically curated to support the development of AI models in the context of biodiversity applications. Visually confusing or cryptic species are groups of two or more taxa that are nearly indistinguishable based on visual characteristics alone. While much existing work addresses taxonomic identification in a broad sense, datasets that directly address the morphological confusion of cryptic species are small, manually curated, and target only a single taxon. Thus, the challenge of identifying such subtle differences in a wide range of taxa remains unaddressed. Curated from real-world trends in species misidentification among community annotators of iNaturalist, CrypticBio contains 52K unique cryptic groups spanning 67K species represented in 166 million images. Records in the dataset include research-grade image annotations—scientific, multicultural, and multilingual species terminology, hierarchical taxonomy, spatiotemporal context, and associated cryptic groups. To facilitate easy subset curation from CrypticBio, we provide an open-source pipeline, CrypticBio-Curate. The multimodal design of the dataset provides complementary cues such as spatiotemporal context that support the identification of cryptic species. To highlight the importance of the dataset, we benchmark a suite of state-of-the-art foundation models across CrypticBio subsets of common, unseen, endangered, and invasive species, and demonstrate the substantial impact of spatiotemporal context on vision-language zero-shot learning for cryptic species. By introducing CrypticBio, we aim to catalyze progress toward real-world-ready fine-grained species classification models for biodiversity monitoring capable of handling the nuanced challenges of species ambiguity. The data and the code are publicly available in the project website https://georgianagmanolache.github.io/crypticbio.

View full details

Poster

FlowerTune: A Cross-Domain Benchmark for Federated Fine-Tuning of Large Language Models

Yan Gao ⋅ Massimo R. Scamarcia ⋅ Javier Fernandez-Marques ⋅ Mohammad Naseri ⋅ Chong Ng ⋅ Dimitris Stripelis ⋅ Zexi Li ⋅ Tao Shen ⋅ Jiamu Bai ⋅ Daoyuan Chen ⋅ Zikai Zhang ⋅ Rui Hu ⋅ InSeo Song ⋅ KangYoon Lee ⋅ Hong Jia ⋅ Ting Dang ⋅ Junyan Wang ⋅ Zheyuan Liu ⋅ Daniel J. Beutel ⋅ Lingjuan Lyu ⋅ Nicholas Lane

Dec 4, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Large Language Models (LLMs) have achieved state-of-the-art results across diverse domains, yet their development remains reliant on vast amounts of publicly available data, raising concerns about data scarcity and the lack of access to domain-specific, sensitive information. Federated Learning (FL) presents a compelling framework to address these challenges by enabling decentralized fine-tuning on pre-trained LLMs without sharing raw data. However, the compatibility and performance of pre-trained LLMs in FL settings remain largely under explored. We introduce the FlowerTune LLM Leaderboard, a first-of-its-kind benchmarking suite designed to evaluate federated fine-tuning of LLMs across four diverse domains: general NLP, finance, medical, and coding. Each domain includes federated instruction-tuning datasets and domain-specific evaluation metrics. Our results, obtained through a collaborative, open-source and community-driven approach, provide the first comprehensive comparison across 26 pre-trained LLMs with different aggregation and fine-tuning strategies under federated settings, offering actionable insights into model performance, resource constraints, and domain adaptation. This work lays the foundation for developing privacy-preserving, domain-specialized LLMs for real-world applications.

View full details

Poster

MME-VideoOCR: Evaluating OCR-Based Capabilities of Multimodal LLMs in Video Scenarios

Yang Shi ⋅ Huanqian Wang ⋅ Xie ⋅ Huanyao Zhang ⋅ Lijie Zhao ⋅ yifan zhang ⋅ Xinfeng Li ⋅ Chaoyou Fu ⋅ Zhuoer Wen ⋅ Wenting Liu ⋅ Zhuoran Zhang ⋅ Xinlong Chen ⋅ Bohan Zeng ⋅ Sihan Yang ⋅ Yushuo Guan ⋅ Zhang Zhang ⋅ Liang Wang ⋅ Haoxuan Li ⋅ Zhouchen Lin ⋅ Yuanxing Zhang ⋅ Pengfei Wan ⋅ Haotian Wang ⋅ Wenjing Yang

Dec 4, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Multimodal Large Language Models (MLLMs) have achieved considerable accuracy in Optical Character Recognition (OCR) from static images. However, their efficacy in video OCR is significantly diminished due to factors such as motion blur, temporal variations, and visual effects inherent in video content. To provide clearer guidance for training practical MLLMs, we introduce MME-VideoOCR benchmark, which encompasses a comprehensive range of video OCR application scenarios. MME-VideoOCR features 10 task categories comprising 25 individual tasks and spans 44 diverse scenarios. These tasks extend beyond text recognition to incorporate deeper comprehension and reasoning of textual content within videos. The benchmark consists of 1,464 videos with varying resolutions, aspect ratios, and durations, along with 2,000 meticulously curated, manually annotated question-answer pairs. We evaluate 18 state-of-the-art MLLMs on MME-VideoOCR, revealing that even the best-performing model (Gemini-2.5 Pro) achieves only an accuracy of 73.7%. Fine-grained analysis indicates that while existing MLLMs demonstrate strong performance on tasks where relevant texts are contained within a single or few frames, they exhibit limited capability in effectively handling tasks that demand holistic video comprehension. These limitations are especially evident in scenarios that require spatio-temporal reasoning, cross-frame information integration, or resistance to language prior bias. Our findings also highlight the importance of high-resolution visual input and sufficient temporal coverage for reliable OCR in dynamic video scenarios.

View full details

Poster

Nemotron-CLIMB: Clustering-based Iterative Data Mixture Bootstrapping for Language Model Pre-training

Shizhe Diao ⋅ Yu Yang ⋅ Yonggan Fu ⋅ Xin Dong ⋅ Dan SU ⋅ Markus Kliegl ⋅ ZIJIA CHEN ⋅ Peter Belcak ⋅ Yoshi Suhara ⋅ Hongxu Yin ⋅ Mostofa Patwary ⋅ Yingyan (Celine) Lin ⋅ Jan Kautz ⋅ Pavlo Molchanov

Dec 4, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Pre-training datasets are typically collected from web content and lack inherent domain divisions. For instance, widely used datasets like Common Crawl do not include explicit domain labels, while manually curating labeled datasets such as The Pile is labor-intensive. Consequently, identifying an optimal pre-training data mixture remains a challenging problem, despite its significant benefits for pre-training performance. To address these challenges, we propose CLustering-based Iterative Data Mixture Bootstrapping (Nemotron-CLIMB), an automated framework that discovers, evaluates, and refines data mixtures in a pre-training setting. Specifically, Nemotron-CLIMB embeds and clusters large-scale datasets in a semantic space and then iteratively searches for optimal mixtures using a smaller proxy model and a predictor. This strategy enables effective domain adaptation without relying solely on curated data. When continuously trained on 400B tokens with this mixture, our 1B model exceeds the state-of-the-art Llama-3.2-1B by 2.0%. Moreover, we observe that optimizing for a specific domain (e.g., Social Sciences) yields a 5% improvement over random sampling. Finally, we introduce Nemotron-ClimbLab, a filtered 1.2-trillion-token corpus with 20 clusters as a research playground, and Nemotron-ClimbMix, a compact yet powerful 400-billion-token dataset designed for efficient pre-training that delivers superior performance under an equal token budget. We analyze the final data mixture, elucidating the characteristics of an optimal data mixture.

View full details

Poster

Disentanglement Beyond Static vs. Dynamic: A Benchmark and Evaluation Framework for Multi-Factor Sequential Representations

Tal Barami ⋅ Nimrod Berman ⋅ Ilan Naiman ⋅ Amos H Hason ⋅ Rotem Ezra ⋅ Omri Azencot

Dec 4, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Learning disentangled representations in sequential data is a key goal in deep learning, with broad applications in vision, audio, and time series. While real-world data involves multiple interacting semantic factors over time, prior work has mostly focused on simpler two-factor static and dynamic settings, primarily because such settings make data collection easier, thereby overlooking the inherently multi-factor nature of real-world data. We introduce the first standardized benchmark for evaluating multi-factor sequential disentanglement across six diverse datasets spanning video, audio, and time series. Our benchmark includes modular tools for dataset integration, model development, and evaluation metrics tailored to multi-factor analysis. We additionally propose a post-hoc Latent Exploration Stage to automatically align latent dimensions with semantic factors, and introduce a Koopman-inspired model that achieves state-of-the-art results. Moreover, we show that Vision-Language Models can automate dataset annotation and serve as zero-shot disentanglement evaluators, removing the need for manual labels and human intervention. Together, these contributions provide a robust and scalable foundation for advancing multi-factor sequential disentanglement. Our code is available on GitHub, and the datasets and trained models are available on Hugging Face.

View full details

Poster

InfoChartQA: A Benchmark for Multimodal Question Answering on Infographic Charts

Tianchi Xie ⋅ Minzhi Lin ⋅ Mengchen Liu ⋅ Yilin Ye ⋅ Changjian Chen ⋅ Shixia Liu

Dec 4, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Understanding infographic charts with design-driven visual elements (e.g., pictograms, icons) requires both visual recognition and reasoning, posing challenges for multimodal large language models (MLLMs). However, existing visual question answering benchmarks fall short in evaluating these capabilities of MLLMs due to the lack of paired plain charts and visual-element-based questions. To bridge this gap, we introduce InfoChartQA, a benchmark for evaluating MLLMs on infographic chart understanding. It includes 5,642 pairs of infographic and plain charts, each sharing the same underlying data but differing in visual presentations. We further design visual-element-based questions to capture their unique visual designs and communicative intent. Evaluation of 20 MLLMs reveals a substantial performance decline on infographic charts, particularly for visual-element-based questions related to metaphors. The paired infographic and plain charts enable fine-grained error analysis and ablation studies, which highlight new opportunities for advancing MLLMs in infographic chart understanding. We release InfoChartQA at https://github.com/CoolDawnAnt/InfoChartQA.

View full details

Poster

SolidGeo: Measuring Multimodal Spatial Math Reasoning in Solid Geometry

Peijie Wang ⋅ Chao Yang ⋅ Zhong-Zhi Li ⋅ Fei yin ⋅ Dekang Ran ⋅ Mi Tian ⋅ Zhilong Ji ⋅ Jinfeng Bai ⋅ Cheng-lin Liu

Dec 4, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Geometry is a fundamental branch of mathematics and plays a crucial role in evaluating the reasoning capabilities of multimodal large language models (MLLMs). However, existing multimodal mathematics benchmarks mainly focus on plane geometry and largely ignore solid geometry, which requires spatial reasoning and is more challenging than plane geometry. To address this critical gap, we introduce **SolidGeo**, the first large-scale benchmark specifically designed to evaluate the performance of MLLMs on mathematical reasoning tasks in solid geometry. SolidGeo consists of 3,113 real-world K–12 and competition-level problems, each paired with visual context and annotated with difficulty levels and fine-grained solid geometry categories. Our benchmark covers a wide range of 3D reasoning subjects such as projection, unfolding, spatial measurement, and spatial vector, offering a rigorous testbed for assessing solid geometry. Through extensive experiments, we observe that MLLMs encounter substantial challenges in solid geometry math tasks, with a considerable performance gap relative to human capabilities on SolidGeo. Moreover, we analyze the performance, inference effiency and error patterns of various models, offering insights into the solid geometric mathematical reasoning capabilities of MLLMs. We hope SolidGeo serves as a catalyst for advancing MLLMs toward deeper geometric reasoning and spatial intelligence. The dataset is released at https://huggingface.co/datasets/HarryYancy/SolidGeo/

View full details

Poster

RoFt-Mol: Benchmarking Robust Fine-tuning with Molecular Graph Foundation Models

Shikun Liu ⋅ Deyu Zou ⋅ Nima Shoghi ⋅ Victor Fung ⋅ Kai Liu ⋅ Pan Li

Dec 4, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

In the era of foundation models, fine-tuning pre-trained models for specific downstream tasks has become crucial. This drives the need for robust fine-tuning methods to address challenges such as model overfitting and sparse labeling. Moleculargraph foundation models (MGFMs) face unique difficulties that complicate fine-tuning. These models are limited by smaller pre-training datasets and more severedata scarcity for downstream tasks, both of which require enhanced model generalization. Moreover, MGFMs must accommodate diverse objectives, including bothregression and classification tasks. To better understand and improve fine-tuningtechniques under these conditions, we classify eight fine-tuning methods into threemechanisms: weight-based, representation-based, and partial fine-tuning. Webenchmark these methods on downstream regression and classification tasks acrosssupervised and self-supervised pre-trained models in diverse labeling settings. Thisextensive evaluation provides valuable insights and informs the design of a refinedrobust fine-tuning method, ROFT-MOL. This approach combines the strengths ofsimple post-hoc weight interpolation with more complex weight ensemble fine-tuning methods, delivering improved performance across both task types whilemaintaining the ease of use inherent in post-hoc weight interpolation.

View full details

Poster

MedAgentBoard: Benchmarking Multi-Agent Collaboration with Conventional Methods for Diverse Medical Tasks

Yinghao Zhu ⋅ Ziyi He ⋅ Haoran Hu ⋅ Xiaochen Zheng ⋅ Xichen Zhang ⋅ Wang ⋅ Junyi Gao ⋅ Liantao Ma ⋅ Lequan Yu

Dec 4, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

The rapid advancement of Large Language Models (LLMs) has stimulated interest in multi-agent collaboration for addressing complex medical tasks. However, the practical advantages of multi-agent collaboration approaches remain insufficiently understood. Existing evaluations often lack generalizability, failing to cover diverse tasks reflective of real-world clinical practice, and frequently omit rigorous comparisons against both single-LLM-based and established conventional methods. To address this critical gap, we introduce MedAgentBoard, a comprehensive benchmark for the systematic evaluation of multi-agent collaboration, single-LLM, and conventional approaches. MedAgentBoard encompasses four diverse medical task categories: (1) medical (visual) question answering, (2) lay summary generation, (3) structured Electronic Health Record (EHR) predictive modeling, and (4) clinical workflow automation, across text, medical images, and structured EHR data. Our extensive experiments reveal a nuanced landscape: while multi-agent collaboration demonstrates benefits in specific scenarios, such as enhancing task completeness in clinical workflow automation, it does not consistently outperform advanced single LLMs (e.g., in textual medical QA) or, critically, specialized conventional methods that generally maintain better performance in tasks like medical VQA and EHR-based prediction. MedAgentBoard offers a vital resource and actionable insights, emphasizing the necessity of a task-specific, evidence-based approach to selecting and developing AI solutions in medicine. It underscores that the inherent complexity and overhead of multi-agent collaboration must be carefully weighed against tangible performance gains. All code, datasets, detailed prompts, and experimental results are open-sourced at [this link](https://medagentboard.netlify.app/).

View full details

Poster

Benchmarking Spatiotemporal Reasoning in LLMs and Reasoning Models: Capabilities and Challenges

Pengrui Quan ⋅ Brian Wang ⋅ Kang Yang ⋅ Liying Han ⋅ Mani Srivastava

Dec 4, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Spatiotemporal reasoning plays a key role in Cyber-Physical Systems (CPS). Despite advances in Large Language Models (LLMs) and Large Reasoning Models (LRMs), their capacity to reason about complex spatiotemporal signals remains underexplored. This paper proposes a hierarchical SpatioTemporal reAsoning benchmaRK, STARK, to systematically evaluate LLMs across three levels of reasoning complexity: state estimation (e.g., predicting field variables, localizing and tracking events in space and time), spatiotemporal reasoning over states (e.g., inferring spatial-temporal relationships), and world-knowledge-aware reasoning that integrates contextual and domain knowledge (e.g., intent prediction, landmark-aware navigation). We curate 26 distinct spatiotemporal tasks with diverse sensor modalities, comprising 14,552 challenges where models answer directly or by Python Code Interpreter. Evaluating 3 LRMs and 8 LLMs, we find LLMs achieve limited success in tasks requiring geometric reasoning (e.g., multilateration or triangulation), particularly as complexity increases. Surprisingly, LRMs show robust performance across tasks with various levels of difficulty, often competing or surpassing traditional first-principle-based methods. Our results show that in reasoning tasks requiring world knowledge, the performance gap between LLMs and LRMs narrows, with some LLMs even surpassing LRMs. However, the LRM o3 model continues to achieve leading performance across all evaluated tasks, a result attributed primarily to the larger size of the reasoning models. STARK motivates future innovations in model architectures and reasoning paradigms for intelligent CPS by providing a structured framework to identify limitations in the spatiotemporal reasoning of LLMs and LRMs.

View full details

Poster

ImgEdit: A Unified Image Editing Dataset and Benchmark

Yang Ye ⋅ Xianyi He ⋅ Zongjian Li ⋅ lin bin ⋅ Shenghai Yuan ⋅ Zhiyuan Yan ⋅ Bohan Hou ⋅ Li Yuan

Dec 4, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Recent advancements in generative models have enabled high-fidelity text-to-image generation. However, open-source image-editing models still lag behind their proprietary counterparts, primarily due to limited high-quality data and insufficient benchmarks.To overcome these limitations, we introduce **ImgEdit**, a large-scale, high-quality image-editing dataset comprising one million carefully curated edit pairs, which contain both novel and complex single-turn edits, as well as challenging multi-turn tasks.To ensure the data quality, we employ a multi-stage pipeline that integrates a cutting-edge vision-language model, a detection model, a segmentation model, alongside task-specific in-painting procedures and strict post-processing. ImgEdit surpasses existing datasets in both task novelty and data quality.Using ImgEdit, we train **ImgEdit-E1**, an editing model using Vision Language Model to process the reference image and editing prompt, which outperforms existing open-source models on multiple tasks, highlighting the value of ImgEdit and model design.For comprehensive evaluation, we introduce **ImgEdit-Bench**, a benchmark designed to evaluate image editing performance in terms of instruction adherence, editing quality, and detail preservation.It includes a basic testsuite, a challenging single-turn suite, and a dedicated multi-turn suite. We evaluate both open-source and proprietary models, as well as ImgEdit-E1, providing deep analysis and actionable insights into the current behavior of image-editing models.

View full details

Poster

Is Artificial Intelligence Generated Image Detection a Solved Problem?

Ziqiang Li ⋅ Jiazhen Yan ⋅ Ziwen He ⋅ Kai Zeng ⋅ Weiwei Jiang ⋅ Lizhi Xiong ⋅ Zhangjie Fu

Dec 4, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

The rapid advancement of generative models, such as GANs and Diffusion models, has enabled the creation of highly realistic synthetic images, raising serious concerns about misinformation, deepfakes, and copyright infringement. Although numerous Artificial Intelligence Generated Image (AIGI) detectors have been proposed, often reporting high accuracy, their effectiveness in real-world scenarios remains questionable. To bridge this gap, we introduce AIGIBench, a comprehensive benchmark designed to rigorously evaluate the robustness and generalization capabilities of state-of-the-art AIGI detectors. AIGIBench simulates real-world challenges through four core tasks: multi-source generalization, robustness to image degradation, sensitivity to data augmentation, and impact of test-time pre-processing. It includes 23 diverse fake image subsets that span both advanced and widely adopted image generation techniques, along with real-world samples collected from social media and AI art platforms. Extensive experiments on 11 advanced detectors demonstrate that, despite their high reported accuracy in controlled settings, these detectors suffer significant performance drops on real-world data, limited benefits from common augmentations, and nuanced effects of pre-processing, highlighting the need for more robust detection strategies. By providing a unified and realistic evaluation framework, AIGIBench offers valuable insights to guide future research toward dependable and generalizable AIGI detection.

View full details

Poster

SwitchLingua: The First Large-Scale Multilingual and Multi-Ethnic Code-Switching Dataset

Peng Xie ⋅ Xingyuan Liu ⋅ Yequan Bie ⋅ Tsz Wai Chan ⋅ Yangqiu Song ⋅ Yang Wang ⋅ Hao CHEN ⋅ Kani Chen

Dec 4, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Code-switching (CS) is the alternating use of two or more languages within a conversation or utterance, often influenced by social context and speaker identity. This linguistic phenomenon poses challenges for Automatic Speech Recognition (ASR) systems, which are typically designed for a single language and struggle to handle multilingual inputs. The growing global demand for multilingual applications, including Code-Switching ASR (CSASR), Text-to-Speech (TTS), and Cross-Lingual Information Retrieval (CLIR), highlights the inadequacy of existing monolingual datasets. Although some code-switching datasets exist, most are limited to bilingual mixing within homogeneous ethnic groups, leaving a critical need for a large-scale, diverse benchmark akin to ImageNet in computer vision. To bridge this gap, we introduce \textbf{LinguaMaster}, a multi-agent collaboration framework specifically designed for efficient and scalable multilingual data synthesis. Leveraging this framework, we curate \textbf{SwitchLingua}, the first large-scale multilingual and multi-ethnic code-switching dataset, including: (1) 420K CS textual samples across 12 languages, and (2) over 80 hours of audio recordings from 174 speakers representing 18 countries/regions and 63 racial/ethnic backgrounds, based on the textual data. This dataset captures rich linguistic and cultural diversity, offering a foundational resource for advancing multilingual and multicultural research. Furthermore, to address the issue that existing ASR evaluation metrics lack sensitivity to code-switching scenarios, we propose the \textbf{Semantic-Aware Error Rate (SAER)}, a novel evaluation metric that incorporates semantic information, providing a more accurate and context-aware assessment of system performance. Benchmark experiments on SwitchLingua with state-of-the-art ASR models reveal substantial performance gaps, underscoring the dataset’s utility as a rigorous benchmark for CS capability evaluation. In addition, SwitchLingua aims to encourage further research to promote cultural inclusivity and linguistic diversity in speech technology, fostering equitable progress in the ASR field. LinguaMaster (Code): github.com/Shelton1013/SwitchLingua, SwitchLingua (Data): https://huggingface.co/datasets/Shelton1013/SwitchLingua_text, https://huggingface.co/datasets/Shelton1013/SwitchLingua_audio

View full details

Poster

The ML.ENERGY Benchmark: Toward Automated Inference Energy Measurement and Optimization

Jae-Won Chung ⋅ Jeff J. Ma ⋅ Ruofan Wu ⋅ Jiachen Liu ⋅ Oh Jun Kweon ⋅ Yuxuan Xia ⋅ Zhiyu Wu ⋅ Mosharaf Chowdhury

Dec 4, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

As the adoption of Generative AI in real-world services grow explosively, energy has emerged as a critical bottleneck resource. However, energy remains a metric that is often overlooked, under-explored, or poorly understood in the context of building ML systems. We present the ML.ENERGY Benchmark, a benchmark suite and tool for measuring inference energy consumption under realistic service environments, and the corresponding ML.ENERGY Leaderboard, which have served as a valuable resource for those hoping to understand and optimize the energy consumption of their generative AI services. In this paper, we explain four key design principles for benchmarking ML energy we have acquired over time, and then describe how they are implemented in the ML.ENERGY Benchmark. We then highlight results from the early 2025 iteration of the benchmark, including energy measurements of 40 widely used model architectures across 6 different tasks, case studies of how ML design choices impact energy consumption, and how automated optimization recommendations can lead to significant (sometimes more than 40%) energy savings without changing what is being computed by the model. The ML.ENERGY Benchmark is open-source and can be easily extended to various customized models and application scenarios.

View full details

Poster

From Objects to Anywhere: A Holistic Benchmark for Multi-level Visual Grounding in 3D Scenes

Tianxu Wang ⋅ Zhuofan Zhang ⋅ Ziyu Zhu ⋅ Yue Fan ⋅ Jing Xiong ⋅ Pengxiang Li ⋅ Xiaojian (Shawn) Ma ⋅ Qing Li

Dec 4, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

3D visual grounding has made notable progress in localizing objects within complex 3D scenes. However, grounding referring expressions beyond objects in 3D scenes remains unexplored. In this paper, we introduce Anywhere3D-Bench, a holistic 3D visual grounding benchmark consisting of 2,886 referring expression-3D bounding box pairs spanning four different grounding levels: human-activity areas, unoccupied space beyond objects, individual objects in the scene, and fine-grained object parts. We assess a range of state-of-the-art 3D visual grounding methods alongside large language models (LLMs) and multimodal LLMs (MLLMs) on Anywhere3D-Bench. Experimental results reveal that space-level and part-level visual grounding pose the greatest challenges: space-level tasks require a more comprehensive spatial reasoning ability, for example, modeling distances and spatial relations within 3D space, while part-level tasks demand fine-grained perception of object composition. Even the best performance model, OpenAI o4-mini, achieves only 23.00% accuracy on space-level tasks and 31.46% on part-level tasks, significantly lower than its performance on area-level and object-level tasks. These findings underscore a critical gap in current models’ capacity to understand and reason about 3D scenes beyond object-level semantics.

View full details

Poster

PerturBench: Benchmarking Machine Learning Models for Cellular Perturbation Analysis

Yan Wu ⋅ Esther Wershof ⋅ Sebastian Schmon ⋅ Marcel Nassar ⋅ Błażej Osiński ⋅ Ridvan Eksi ⋅ Zichao Yan ⋅ Rory Stark ⋅ Kun Zhang ⋅ Thore Graepel

Dec 4, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

We introduce a comprehensive framework for modeling single cell transcriptomic responses to perturbations, aimed at standardizing benchmarking in this rapidly evolving field. Our approach includes a modular and user-friendly model development and evaluation platform, a collection of diverse perturbational datasets, and a set of metrics designed to fairly compare models and dissect their performance. Through extensive evaluation of both published and baseline models across diverse datasets, we highlight the limitations of widely used models, such as mode collapse. We also demonstrate the importance of rank metrics which complement traditional model fit measures, such as RMSE, for validating model effectiveness. Notably, our results show that while no single model architecture clearly outperforms others, simpler architectures are generally competitive and scale well with larger datasets. Overall, this benchmarking exercise sets new standards for model evaluation, supports robust model development, and furthers the use of these models to simulate genetic and chemical screens for therapeutic discovery.

View full details

Poster

HawkBench: Investigating Resilience of RAG Methods on Stratified Information-Seeking Tasks

Hongjin Qian ⋅ Zheng Liu ⋅ Chao Gao ⋅ Yankai Wang ⋅ Defu Lian ⋅ Zhicheng Dou

Dec 4, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

In real-world information-seeking scenarios, users have dynamic and diverse needs, requiring RAG systems to demonstrate adaptable resilience. To comprehensively evaluate the resilience of current RAG methods, we introduce HawkBench, a human-labeled, multi-domain benchmark designed to rigorously assess RAG performance across categorized task types. By stratifying tasks based on information-seeking behaviors, HawkBench provides a systematic evaluation of how well RAG systems adapt to diverse user needs.Unlike existing benchmarks, which focus primarily on specific task types (mostly factoid queries) and rely on varying knowledge bases, HawkBench offers: (1) systematic task stratification to cover a broad range of query types, including both factoid and rationale queries, (2) integration of multi-domain corpora across all task types to mitigate corpus bias, and (3) rigorous annotation for high-quality evaluation.HawkBench includes 1,600 high-quality test samples, evenly distributed across domains and task types. Using this benchmark, we evaluate representative RAG methods, analyzing their performance in terms of answer quality and response latency. Our findings highlight the need for dynamic task strategies that integrate decision-making, query interpretation, and global knowledge understanding to improve RAG generalizability. We believe HawkBench serves as a pivotal benchmark for advancing the resilience of RAG methods and their ability to achieve general-purpose information seeking.

View full details

Poster

VIKI‑R: Coordinating Embodied Multi-Agent Cooperation via Reinforcement Learning

Li Kang ⋅ Xiufeng Song ⋅ Heng Zhou ⋅ Yiran Qin ⋅ Jie Yang ⋅ Xiaohong Liu ⋅ Philip Torr ⋅ LEI BAI ⋅ Zhenfei Yin

Dec 4, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Coordinating multiple embodied agents in dynamic environments remains a core challenge in artificial intelligence, requiring both perception-driven reasoning and scalable cooperation strategies. While recent works have leveraged large language models (LLMs) for multi-agent planning, a few have begun to explore vision-language models (VLMs) for visual reasoning. However, these VLM-based approaches remain limited in their support for diverse embodiment types. In this work, we introduce VIKI-Bench, the first hierarchical benchmark tailored for embodied multi-agent cooperation, featuring three structured levels: agent activation, task planning, and trajectory perception. VIKI-Bench includes diverse robot embodiments, multi-view visual observations, and structured supervision signals to evaluate reasoning grounded in visual inputs. To demonstrate the utility of VIKI-Bench, we propose VIKI-R, a two-stage framework that fine-tunes a pretrained vision-language model (VLM) using Chain-of-Thought annotated demonstrations, followed by reinforcement learning under multi-level reward signals. Our extensive experiments show that VIKI-R significantly outperforms baselines method across all task levels. Furthermore, we show that reinforcement learning enables the emergence of compositional cooperation patterns among heterogeneous agents. Together, VIKI-Bench and VIKI-R offer a unified testbed and method for advancing multi-agent, visual-driven cooperation in embodied AI systems.

View full details

Poster

Martian World Model: Controllable Video Synthesis with Physically Accurate 3D Reconstructions

Longfei Li ⋅ Zhiwen Fan ⋅ Wenyan Cong ⋅ Xinhang Liu ⋅ Yuyang Yin ⋅ Matt Foutter ⋅ Panwang Pan ⋅ Chenyu You ⋅ Yue Wang ⋅ Zhangyang "Atlas" Wang ⋅ Yao Zhao ⋅ Marco Pavone ⋅ Yunchao Wei

Dec 4, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

The synthesis of realistic Martian landscape videos, essential for mission rehearsal and robotic simulation, presents unique challenges. These primarily stem from the scarcity of high-quality Martian data and the significant domain gap relative to terrestrial imagery.To address these challenges, we introduce a holistic solution comprising two main components: 1) a data curation framework, Multimodal Mars Synthesis (M3arsSynth), which processes stereo navigation images to render high-fidelity 3D video sequences. 2) a video-based Martian terrain generator (MarsGen), that utilizes multimodal conditioning data to accurately synthesize novel, 3D-consistent frames. Our data are sourced from NASA’s Planetary Data System (PDS), covering diverse Martian terrains and dates, enabling the production of physics-accurate 3D surface models at metric-scale resolution. During inference, MarsGen is conditioned on an initial image frame and can be guided by specified camera trajectories or textual prompts to generate new environments.Experimental results demonstrate that our solution surpasses video synthesis approaches trained on terrestrial data, achieving superior visual quality and 3D structural consistency.

View full details

Poster

Fire360: A Benchmark for Robust Perception and Episodic Memory in Degraded 360° Firefighting Video

Aditi Tiwari ⋅ Farzaneh Masoud ⋅ Dac Nguyen ⋅ Jill Kraft ⋅ Heng Ji ⋅ Klara Nahrstedt

Dec 4, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Modern AI systems struggle most in environments where reliability is critical - scenes with smoke, poor visibility, and structural deformation. Each year, tens of thousands of firefighters are injured on duty, often due to breakdowns in situational perception. We introduce Fire360, a benchmark for evaluating perception and reasoning in safety-critical firefighting scenarios. The dataset includes 228 360° videos from professional training sessions under diverse conditions (e.g., low light, thermal distortion), annotated with action segments, object locations, and degradation metadata. Fire360 supports five tasks: Visual Question Answering, Temporal Action Captioning, Object Localization, Safety-Critical Reasoning, and Transformed Object Retrieval (TOR). TOR tests whether models can match pristine exemplars to fire-damaged counterparts in unpaired scenes, evaluating episodic memory under irreversible visual transformations. While human experts achieve 83.5% on TOR, models like GPT-4o lag significantly, exposing failures in reasoning under degradation. By releasing Fire360 and its evaluation suite, we aim to advance models that not only see, but also remember, reason, and act under uncertainty. The dataset is available at https://uofi.box.com/v/fire360dataset

View full details

Poster

PUO-Bench: A Panel Understanding and Operation Benchmark with A Privacy-Preserving Framework

Wei LIN ⋅ Yiwei Zhou ⋅ Junkai Zhang ⋅ Rui Shao ⋅ Zhiyuan Zhao ⋅ Junyu Gao ⋅ Antoni Chan ⋅ Xuelong Li

Dec 4, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Recent advancements in Vision-Language Models (VLMs) have enabled GUI agents to leverage visual features for interface understanding and operation in the digital world. However, limited research has addressed the interpretation and interaction with control panels in real-world settings. To bridge this gap, we propose the Panel Understanding and Operation (PUO) benchmark, comprising annotated panel images from appliances and associated vision-language instruction pairs. Experimental results on the benchmark demonstrate significant performance disparities between zero-shot and fine-tuned VLMs, revealing the lack of PUO-specific capabilities in existing language models. Furthermore, we introduce a Privacy-Preserving Framework (PPF) to address privacy concerns in cloud-based panel parsing and reasoning. PPF employs a dual-stage architecture, performing panel understanding on edge devices while delegating complex reasoning to cloud-based LLMs. Although this design introduces a performance trade-off due to edge model limitations, it eliminates the transmission of raw visual data, thereby mitigating privacy risks. Overall, this work provides foundational resources and methodologies for advancing interactive human-machine systems and robotic field in panel-centric applications.

View full details

Poster

MRSAudio: A Large-Scale Multimodal Recorded Spatial Audio Dataset with Refined Annotations

Wenxiang Guo ⋅ Changhao Pan ⋅ Zhiyuan Zhu ⋅ Xintong Hu ⋅ Yu Zhang ⋅ Li Tang ⋅ Rui Yang ⋅ Han Wang ⋅ Zongbao Zhang ⋅ Yuhan Wang ⋅ Yixuan Chen ⋅ Hankun Xu ⋅ Ke Xu ⋅ PengFei Fan ⋅ ZheTao Chen ⋅ Yanhao Yu ⋅ Qiange Huang ⋅ Fei Wu ⋅ Zhou Zhao

Dec 4, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Humans rely on multisensory integration to perceive spatial environments, where auditory cues enable sound source localization in three-dimensional space. Despite the critical role of spatial audio in immersive technologies such as VR/AR, most existing multimodal datasets provide only monaural audio, which limits the development of spatial audio generation and understanding. To address these challenges, we introduce MRSAudio, a large-scale multimodal spatial audio dataset designed to advance research in spatial audio understanding and generation. MRSAudio spans four distinct components: MRSLife, MRSSpeech, MRSMusic, and MRSSing, covering diverse real-world scenarios. The dataset includes synchronized binaural and ambisonic audio, exocentric and egocentric video, motion trajectories, and fine-grained annotations such as transcripts, phoneme boundaries, lyrics, scores, and prompts.To demonstrate the utility and versatility of MRSAudio, we establish five foundational tasks: audio spatialization, and spatial text to speech, spatial singing voice synthesis, spatial music generation and sound event localization and detection. Results show that MRSAudio enables high-quality spatial modeling and supports a broad range of spatial audio research.Demos and dataset access are available at https://mrsaudio.github.io.

View full details

Poster

EconGym: A Scalable AI Testbed with Diverse Economic Tasks

Qirui Mi ⋅ Qipeng Yang ⋅ Zijun Fan ⋅ Wentian Fan ⋅ Heyang Ma ⋅ Chengdong Ma ⋅ Siyu Xia ⋅ Bo An ⋅ Jun Wang ⋅ Haifeng Zhang

Dec 4, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Artificial intelligence (AI) has become a powerful tool for economic research, enabling large-scale simulation and policy optimization. However, applying AI effectively requires simulation platforms for scalable training and evaluation—yet existing environments remain limited to simplified, narrowly scoped tasks, falling short of capturing complex economic challenges such as demographic shifts, multi-government coordination, and large-scale agent interactions.To address this gap, we introduce EconGym, a scalable and modular testbed that connects diverse economic tasks with AI algorithms. Grounded in rigorous economic modeling, EconGym implements 11 heterogeneous role types (e.g., households, firms, banks, governments), their interaction mechanisms, and agent models with well-defined observations, actions, and rewards. Users can flexibly compose economic roles with diverse agent algorithms to simulate rich multi-agent trajectories across 25+ economic tasks for AI-driven policy learning and analysis.Experiments show that EconGym supports diverse and cross-domain tasks—such as coordinating fiscal, pension, and monetary policies—and enables benchmarking across AI, economic methods, and hybrids. Results indicate that richer task composition and algorithm diversity expand the policy space, while AI agents guided by classical economic methods perform best in complex settings. EconGym also scales to 100k agents with high realism and efficiency.

View full details

Poster

MineAnyBuild: Benchmarking Spatial Planning for Open-world AI Agents

Ziming Wei ⋅ Bingqian Lin ⋅ Zijian Jiao ⋅ Yunshuang Nie ⋅ Liang Ma ⋅ Yuecheng Liu ⋅ Yuzheng Zhuang ⋅ Xiaodan Liang

Dec 4, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Spatial Planning is a crucial part in the field of spatial intelligence, which requires the understanding and planning about object arrangements in space perspective. AI agents with the spatial planning ability can better adapt to various real-world applications, including robotic manipulation, automatic assembly, urban planning etc. Recent works have attempted to construct benchmarks for evaluating the spatial intelligence of Multimodal Large Language Models (MLLMs). Nevertheless, these benchmarks primarily focus on spatial reasoning based on typical Visual Question-Answering (VQA) forms, which suffers from the gap between abstract spatial understanding and concrete task execution. In this work, we take a step further to build a comprehensive benchmark called MineAnyBuild, aiming to evaluate the spatial planning ability of open-world AI agents in the Minecraft game. Specifically, MineAnyBuild requires an agent to generate executable architecture building plans based on the given multi-modal human instructions. It involves 4,000 curated spatial planning tasks and also provides a paradigm for infinitely expandable data collection by utilizing rich player-generated content. MineAnyBuild evaluates spatial planning through four core supporting dimensions: spatial understanding, spatial reasoning, creativity, and spatial commonsense. Based on MineAnyBuild, we perform a comprehensive evaluation for existing MLLM-based agents, revealing the severe limitations but enormous potential in their spatial planning abilities. We believe our MineAnyBuild will open new avenues for the evaluation of spatial intelligence and help promote further development for open-world AI agents capable of spatial planning.

View full details

Poster

OVERT: A Benchmark for Over-Refusal Evaluation on Text-to-Image Models

Ziheng Cheng ⋅ Yixiao Huang ⋅ Hui Xu ⋅ Somayeh Sojoudi ⋅ Xuandong Zhao ⋅ Dawn Song ⋅ Song Mei

Dec 4, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Text-to-Image (T2I) models have achieved remarkable success in generating visual content from text inputs. Although multiple safety alignment strategies have been proposed to prevent harmful outputs, they often lead to overly cautious behavior---rejecting even benign prompts---a phenomenon known as \textit{over-refusal} that reduces the practical utility of T2I models. Despite over-refusal having been observed in practice, there is no large-scale benchmark that systematically evaluates this phenomenon for T2I models. In this paper, we present an automatic workflow to construct synthetic evaluation data, resulting in OVERT (\textbf{OVE}r-\textbf{R}efusal evaluation on \textbf{T}ext-to-image models), the first large-scale benchmark for assessing over-refusal behaviors in T2I models. OVERT includes 4,600 seemingly harmful but benign prompts across nine safety-related categories, along with 1,785 genuinely harmful prompts (OVERT-unsafe) to evaluate the safety–utility trade-off. Using OVERT, we evaluate several leading T2I models and find that over-refusal is a widespread issue across various categories (Figure 1), underscoring the need for further research to enhance the safety alignment of T2I models without compromising their functionality. As a preliminary attempt to reduce over-refusal, we explore prompt rewriting; however, we find it often compromises faithfulness to the meaning of the original prompts.Finally, we demonstrate the flexibility of our generation framework in accommodating diverse safety requirements by generating customized evaluation data adapting to user-defined policies.

View full details

Poster

The Catechol Benchmark: Time-series Solvent Selection Data for Few-shot Machine Learning

Toby Boyne ⋅ Juan Campos ⋅ Rebecca Langdon ⋅ Jixiang Qing ⋅ Yilin Xie ⋅ Shiqiang Zhang ⋅ Calvin Tsay ⋅ Ruth Misener ⋅ Daniel Davies ⋅ Kim Jelfs ⋅ Sarah Boyall ⋅ Thomas Dixon ⋅ Linden Schrecker ⋅ Jose Pablo Folch

Dec 4, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Machine learning has promised to change the landscape of laboratory chemistry, with impressive results in molecular property prediction and reaction retro-synthesis. However, chemical datasets are often inaccessible to the machine learning community as they tend to require cleaning, thorough understanding of the chemistry, or are simply not available. In this paper, we introduce a novel dataset for yield prediction, providing the first-ever transient flow dataset for machine learning benchmarking, covering over 1200 process conditions. While previous datasets focus on discrete parameters, our experimental set-up allow us to sample a large number of continuous process conditions, generating new challenges for machine learning models. We focus on solvent selection, a task that is particularly difficult to model theoretically and therefore ripe for machine learning applications. We showcase benchmarking for regression algorithms, transfer-learning approaches, feature engineering, and active learning, with important applications towards solvent replacement and sustainable manufacturing.

View full details

Poster

GreenHyperSpectra: A multi-source hyperspectral dataset for global vegetation trait prediction

Eya Cherif ⋅ Arthur Ouaknine ⋅ Luke Brown ⋅ Phuong Dao ⋅ Kyle Kovach ⋅ Bing Lu ⋅ Daniel Mederer ⋅ Hannes Feilhauer ⋅ Teja Kattenborn ⋅ David Rolnick

Dec 4, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Plant traits such as leaf carbon content and leaf mass are essential variables in the study of biodiversity and climate change. However, conventional field sampling cannot feasibly cover trait variation at ecologically meaningful spatial scales. Machine learning represents a valuable solution for plant trait prediction across ecosystems, leveraging hyperspectral data from remote sensing.Nevertheless, trait prediction from hyperspectral data is challenged by label scarcity and substantial domain shifts (\eg across sensors, ecological distributions), requiring robust cross-domain methods.Here, we present GreenHyperSpectra, a pretraining dataset encompassing real-world cross-sensor and cross-ecosystem samples designed to benchmark trait prediction with semi- and self-supervised methods. We adopt an evaluation framework encompassing in-distribution and out-of-distribution scenarios. We successfully leverage GreenHyperSpectra to pretrain label-efficient multi-output regression models that outperform the state-of-the-art supervised baseline. Our empirical analyses demonstrate substantial improvements in learning spectral representations for trait prediction, establishing a comprehensive methodological framework to catalyze research at the intersection of representation learning and plant functional traits assessment. We also share the dataset\footnotemark[1], code and pretrained model objects for this study \href{https://github.com/echerif18/HyspectraSSL}{here}.\footnotetext[1]{GreenHyperSpectra dataset: \href{https://huggingface.co/datasets/Avatarr05/GreenHyperSpectra}{https://huggingface.co/datasets/Avatarr05/GreenHyperSpectra}}

View full details

Poster

FLiP: Towards Comprehensive and Reliable Evaluation of Federated Prompt Learning

Dongping Liao ⋅ Xitong Gao ⋅ Cheng-Zhong Xu

Dec 4, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

The increasing emphasis on privacy and data security has driven the adoption of federated learning (FL). Prompt learning (PL), which fine-tunes prompt embeddings of pretrained models, has gained a surge of interest in FL community, marked by the emergence of an influx of federated prompt learning (FPL) algorithms. Despite recent advancements, a systematic understanding of their underlying mechanisms and principled guidelines for deploying these techniques in different FL scenarios remain absent. Moreover, inconsistent experimental protocols, limited evaluation scenarios, and the lack of the proper assessment of centralized PL methods in existing works have obscured the essence of these algorithms. To close these gaps, we introduce a comprehensive benchmark, named F LIP, to achieve standardized FPL evaluation. F LIP assesses the performance of 13 centralized and FPL methods across 3 FL protocols and 12 open datasets, considering 6 distinct evaluation scenarios. Our findings demonstrate that PL maintains strong generalization performance in both in-distribution and out-of-distribution settings with minimal resource consumption, but there is no silver bullet found for diverse FPL scenarios. The results (1) pinpoint the suitable application scenarios of each FPL algorithm, (2) demonstrate the competitiveness of adapted centralized PL methods, and (3) offer notable insights to interpret their effectiveness and remaining challenges. All benchmarks and code are available to facilitate further research in this domain.

View full details

Poster

AlgoTune: Can Language Models Speed Up General-Purpose Numerical Programs?

Ori Press ⋅ Brandon Amos ⋅ Haoyu Zhao ⋅ Yikai Wu ⋅ Samuel Ainsworth ⋅ Dominik Krupke ⋅ Patrick Kidger ⋅ Touqir Sajed ⋅ Bartolomeo Stellato ⋅ Jisun Park ⋅ Nathanael Bosch ⋅ Eli Meril ⋅ Albert Steppi ⋅ Arman Zharmagambetov ⋅ Fangzhao Zhang ⋅ David Pérez-Piñeiro ⋅ Alberto Mercurio ⋅ Ni Zhan ⋅ Talor Abramovich ⋅ Kilian Lieret ⋅ Hanlin Zhang ⋅ Shirley Huang ⋅ Matthias Bethge ⋅ Ofir Press

Dec 4, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Despite progress in language model (LM) capabilities, evaluations have thus far focused on models' performance on tasks that humans have previously solved, including in programming (SWE-Bench) and mathematics (FrontierMath). We therefore propose testing models' ability to design and implement algorithms in an open-ended benchmark: We task LMs with writing code that efficiently solves computationally challenging problems in computer science, physics, and mathematics. Our AlgoTune benchmark consists of 120 tasks collected from domain experts and a framework for validating and timing LM-synthesized solution code, which is compared to reference implementations from popular open-source packages.In addition, we develop a baseline LM agent, AlgoTuner, and evaluate its performance across a suite of frontier models.AlgoTuner achieves an average 1.58x speedup against reference solvers, including methods from packages such as SciPy, scikit-learn and CVXPY.However, we find that current models fail to discover algorithmic innovations, instead preferring surface-level optimizations. We hope that AlgoTune catalyzes the development of LM agents exhibiting creative problem solving beyond state-of-the-art human performance.

View full details

Poster

SuperGPQA: Scaling LLM Evaluation across 285 Graduate Disciplines

Xeron Du ⋅ Yifan Yao ⋅ Kaijing Ma ⋅ Bingli Wang ⋅ Tianyu Zheng ⋅ Zhu ⋅ Minghao Liu ⋅ Yiming Liang ⋅ Xiaolong Jin ⋅ Zhenlin Wei ⋅ Chujie Zheng ⋅ Kaixin Deng ⋅ Shuyue Guo ⋅ Shian Jia ⋅ Sichao Jiang ⋅ Yiyan Liao ⋅ Rui Li ⋅ Qinrui Li ⋅ Sirun Li ⋅ Yizhi Li ⋅ Yunwen Li ⋅ Dehua Ma ⋅ Yuansheng Ni ⋅ Haoran Que ⋅ Qiyao Wang ⋅ Zhoufutu Wen ⋅ Siwei Wu ⋅ Tianshun Xing ⋅ 明许 ⋅ Zhenzhu Yang ⋅ Noah Wang ⋅ Junting Zhou ⋅ yuelin bai ⋅ Xingyuan Bu ⋅ chenglin cai ⋅ Liang Chen ⋅ Yifan Chen ⋅ Cheng Chengtuo ⋅ Tianhao Cheng ⋅ Keyi Ding ⋅ Siming Huang ⋅ HUANG YUN ⋅ Yaoru Li ⋅ Yizhe Li ⋅ Zhaoqun Li ⋅ Tianhao Liang ⋅ Chengdong Lin ⋅ Hongquan Lin ⋅ Yinghao Ma ⋅ Zhongyuan Peng ⋅ Zifan Peng ⋅ Qige Qi ⋅ Shi Qiu ⋅ Xingwei Qu ⋅ Shanghaoran Quan ⋅ Yizhou Tan ⋅ Zili Wang ⋅ 王晨清 ⋅ Hao Wang ⋅ Yiya Wang ⋅ Yubo Wang ⋅ Jiajun Xu ⋅ Kexin Yang ⋅ Ruibin Yuan ⋅ Yuanhao Yue ⋅ Tianyang Zhan ⋅ Chun Zhang ⋅ Jinyang Zhang ⋅ Xiyue Zhang ⋅ Owen Zhang ⋅ Yue Zhang ⋅ Yongchi Zhao ⋅ Xiangyu Zheng ⋅ ChenghuaZhong ⋅ Yang Gao ⋅ Zhoujun Li ⋅ Dayiheng Liu ⋅ Qian Liu ⋅ Tianyu Liu ⋅ Shiwen Ni ⋅ Junran Peng ⋅ Yujia Qin ⋅ Wenbo Su ⋅ Guoyin Wang ⋅ Shi Wang ⋅ Jian Yang ⋅ Min Yang ⋅ Meng Cao ⋅ Xiang Yue ⋅ ZHAO-XIANG ZHANG ⋅ Wangchunshu Zhou ⋅ Jiaheng Liu ⋅ Qunshu Lin ⋅ Wenhao Huang ⋅ Ge Zhang

Dec 4, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Large language models (LLMs) have demonstrated remarkable proficiency in mainstream academic disciplines such as mathematics, physics, and computer science. However, human knowledge encompasses over 200 specialized disciplines, far exceeding the scope of existing benchmarks. The capabilities of LLMs in many of these specialized fields-particularly in light industry, agriculture, and service-oriented disciplines-remain inadequately evaluated. To address this gap, we present SuperGPQA, a comprehensive benchmark that evaluates graduate-level knowledge and reasoning capabilities across 285 disciplines. Our benchmark employs a novel Human-LLM collaborative filtering mechanism to eliminate trivial or ambiguous questions through iterative refinement based on both LLM responses and expert feedback. Our experimental results reveal significant room for improvement in the performance of current state-of-the-art LLMs across diverse knowledge domains (e.g., the reasoning-focused model Gemini-2.5-Pro achieved the highest accuracy of 63.56% on SuperGPQA), highlighting the considerable gap between current model capabilities and artificial general intelligence. Additionally, we present comprehensive insights from our management of a large-scale annotation process, involving over 80 expert annotators and an interactive Human-LLM collaborative system, offering valuable methodological guidance for future research initiatives of comparable scope.

View full details

Poster

Worse than Zero-shot? A Fact-Checking Dataset for Evaluating the Robustness of RAG Against Misleading Retrievals

Linda Zeng ⋅ Rithwik Gupta ⋅ Divij Motwani ⋅ Yi Zhang ⋅ Diji Yang

Dec 4, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Retrieval-augmented generation (RAG) has shown impressive capabilities in mitigating hallucinations in large language models (LLMs). However, LLMs struggle to maintain consistent reasoning when exposed to misleading or conflicting evidence, especially in real-world domains such as politics, where information is polarized or selectively framed. Mainstream RAG benchmarks evaluate models under clean retrieval settings, where systems generate answers from gold-standard documents, or under synthetically perturbed settings, where documents are artificially injected with noise. These assumptions fail to reflect real-world conditions, often leading to an overestimation of RAG system performance. To address this gap, we introduce \textsc{RAGuard}, the first benchmark to evaluate the robustness of RAG systems against \textit{misleading} retrievals. Unlike prior benchmarks that rely on synthetic noise, our fact-checking dataset captures naturally occurring misinformation by constructing its retrieval corpus from Reddit discussions. It categorizes retrieved evidence into three types: \textit{supporting}, \textit{misleading}, and \textit{unrelated}, providing a realistic and challenging testbed for assessing how well RAG systems navigate different types of evidence. Our experiments reveal that, when exposed to potentially misleading retrievals, all tested LLM-powered RAG systems perform worse than their zero-shot baselines (i.e., no retrieval at all), while human annotators consistently perform better, highlighting LLMs' susceptibility to noisy environments. To our knowledge, \textsc{RAGuard} is the first benchmark to systematically assess the robustness of the RAG against misleading evidence.We expect this benchmark to drive future research toward improving RAG systems beyond idealized datasets, making them more reliable for real-world applications.

View full details

Poster

VolleyBots: A Testbed for Multi-Drone Volleyball Game Combining Motion Control and Strategic Play

Zelai Xu ⋅ Ruize Zhang ⋅ Chao Yu ⋅ Huining Yuan ⋅ Xiangmin Yi ⋅ Shilong Ji ⋅ Chuqi Wang ⋅ Wenhao Tang ⋅ Feng Gao ⋅ Wenbo Ding ⋅ Xinlei Chen ⋅ Yu Wang

Dec 4, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Robot sports, characterized by well-defined objectives, explicit rules, and dynamic interactions, present ideal scenarios for demonstrating embodied intelligence. In this paper, we present VolleyBots, a novel robot sports testbed where multiple drones cooperate and compete in the sport of volleyball under physical dynamics. VolleyBots integrates three features within a unified platform: competitive and cooperative gameplay, turn-based interaction structure, and agile 3D maneuvering.These intertwined features yield a complex problem combining motion control and strategic play, with no available expert demonstrations.We provide a comprehensive suite of tasks ranging from single-drone drills to multi-drone cooperative and competitive tasks, accompanied by baseline evaluations of representative reinforcement learning (RL), multi-agent reinforcement learning (MARL) and game-theoretic algorithms. Simulation results show that on-policy RL methods outperform off-policy methods in single-agent tasks, but both approaches struggle in complex tasks that combine motion control and strategic play.We additionally design a hierarchical policy which achieves 69.5% win rate against the strongest baseline in the 3 vs 3 task, demonstrating its potential for tackling the complex interplay between low-level control and high-level strategy.To highlight VolleyBots’ sim-to-real potential, we further demonstrate the zero-shot deployment of a policy trained entirely in simulation on real-world drones.

View full details

Poster

CellVerse: Do Large Language Models Really Understand Cell Biology?

Fan Zhang ⋅ Tianyu Liu ⋅ Zhihong Zhu ⋅ Hao Wu ⋅ Haixin Wang ⋅ Donghao Zhou ⋅ Yefeng Zheng ⋅ Kun Wang ⋅ Xian Wu ⋅ Pheng-Ann Heng

Dec 4, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Recent studies have demonstrated the feasibility of modeling single-cell data as natural languages and the potential of leveraging powerful large language models (LLMs) for understanding cell biology. However, a comprehensive evaluation of LLMs' performance on language-driven single-cell analysis tasks still remains unexplored. Motivated by this challenge, we introduce CellVerse, a unified language-centric question-answering benchmark that integrates four types of single-cell multi-omics data and encompasses three hierarchical levels of single-cell analysis tasks: cell type annotation (cell-level), drug response prediction (drug-level), and perturbation analysis (gene-level). Going beyond this, we systematically evaluate the performance across 14 open-source and closed-source LLMs ranging 160M $\rightarrow$ 671B on CellVerse. Remarkably, the experimental results reveal: (1) Existing specialist models (C2S-Pythia) fail to make reasonable decisions across all sub-tasks within CellVerse, while generalist models such as Qwen, Llama, GPT, and DeepSeek family models exhibit preliminary understanding capabilities within the realm of cell biology. (2) The performance of current LLMs falls short of expectations and has substantial room for improvement. Notably, in the widely studied drug response prediction task, none of the evaluated LLMs demonstrate significant performance improvement over random guessing. CellVerse offers the first large-scale empirical demonstration that significant challenges still remain in applying LLMs to cell biology. By introducing CellVerse, we lay the foundation for advancing cell biology through natural languages and hope this paradigm could facilitate next-generation single-cell analysis. Project Page: https://cellverse-cuhk.github.io

View full details

Poster

SECODEPLT: A Unified Benchmark for Evaluating the Security Risks and Capabilities of Code GenAI

Yuzhou Nie ⋅ Zhun Wang ⋅ Yu Yang ⋅ Ruizhe Jiang ⋅ Yuheng Tang ⋅ Xander Davies ⋅ Yarin Gal ⋅ Bo Li ⋅ Wenbo Guo ⋅ Dawn Song

Dec 4, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Existing benchmarks for evaluating the security risks and capabilities (e.g., vulnerability detection) of code-generating large language models (LLMs) face several key limitations:(1) limited coverage of risk and capabilities;(2) reliance on static evaluation metrics such as LLM judgments or rule-based detection, which lack the precision of dynamic analysis; and(3) a trade-off between data quality and benchmark scale.To address these challenges, we introduce a general and scalable benchmark construction framework that begins with manually validated, high-quality seed examples and expands them via targeted mutations.Each mutated sample retains the seed’s security semantics while providing diverse, unseen instances. The resulting benchmark bundles every artifact required for dynamic evaluation, including prompts, vulnerable and patched code, test cases, and ground-truth proofs of concept, enabling rigorous measurement of insecure coding, vulnerability detection, and patch generation. Applying this framework to Python, C/C++, and Java, we build SECODEPLT, a dataset of more than 5.9k samples spanning 44 CWE-based risk categories and three security capabilities. Compared with state-of-the-art benchmarks, SECODEPLT offers broader coverage, higher data fidelity, and substantially greater scale. We use SECODEPLT to evaluate leading code-generation LLMs and agents, revealing their strengths and weaknesses in both generating secure code and identifying or fixing vulnerabilities.We provide our code in \url{https://github.com/ucsb-mlsec/SeCodePLT}, data in \url{https://huggingface.co/datasets/UCSB-SURFI/SeCodePLT}

View full details

Poster

A Technical Report on “Erasing the Invisible”: The 2024 NeurIPS Competition on Stress Testing Image Watermarks

Mucong Ding ⋅ Bang An ⋅ Tahseen Rabbani ⋅ Chenghao Deng ⋅ Anirudh Satheesh ⋅ Souradip Chakraborty ⋅ Mehrdad Saberi ⋅ Yuxin Wen ⋅ Kyle Sang ⋅ Aakriti Agrawal ⋅ Xuandong Zhao ⋅ Mo Zhou ⋅ Mary-Anne Hartley ⋅ Lei Li ⋅ Yu-Xiang Wang ⋅ Vishal Patel ⋅ Soheil Feizi ⋅ Tom Goldstein ⋅ Furong Huang

Dec 4, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

AI-generated images have become pervasive, raising critical concerns around content authenticity, intellectual property, and the spread of misinformation. Invisible watermarks offer a promising solution for identifying AI-generated images, preserving content provenance without degrading visual quality. However, their real-world robustness remains uncertain due to the lack of standardized evaluation protocols and large-scale stress testing. To bridge this gap, we organized “Erasing the Invisible,” a NeurIPS 2024 competition and newly established benchmark designed to systematically stress testing the resilience of watermarking techniques. The competition introduced two attack tracks—Black-box and Beige-box—that simulate practical scenarios with varying levels of attacker knowledge on watermarks, providing a comprehensive assessment of watermark robustness. The competition attracted significant global participation, with 2,722 submissions from 298 teams. Through a rigorous evaluation pipeline featuring real-time feedback and human-verified final rankings, participants developed and demonstrated new attack strategies that revealed critical vulnerabilities in state-of-the-art watermarking methods. On average, the top-5 teams in both tracks could remove watermarks from $\geq$ 89% of the images while preserving high visual quality, setting strong baselines for future research on watermark attacks and defenses. To support continued progress in this field, we summarize the insights and lessons learned from this competition in this paper, and release the benchmark dataset, evaluation toolkit, and competition results. “Erasing the Invisible” establishes a valuable open resource for advancing more robust watermarking techniques and strengthening content provenance in the era of generative AI.

View full details

Poster

Data-Juicer 2.0: Cloud-Scale Adaptive Data Processing for and with Foundation Models

Daoyuan Chen ⋅ Yilun Huang ⋅ Xuchen Pan ⋅ Jiang Nana ⋅ Haibin Wang ⋅ Yilei Zhang ⋅ Ce Ge ⋅ Yushuo Chen ⋅ Wenhao Zhang ⋅ Zhijian Ma ⋅ Jun Huang ⋅ Wei Lin ⋅ Yaliang Li ⋅ Bolin Ding ⋅ Jingren Zhou

Dec 4, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Foundation models demand advanced data processing for their vast, multimodal datasets.However, traditional frameworks struggle with the unique complexities of multimodal data.In response, we present Data-Juicer 2.0, a data processing system backed by 100+ data processing operators spanning text, image, video, and audio modalities, supporting more critical tasks including data analysis, synthesis, annotation, and foundation model post-training.With seamless compatibility and dedicated optimization for popular dataset hubs like Hugging Face and computing engines like Ray, it improves upon its predecessor in terms of usability, efficiency, and programmability.It features an easily accessible user interface layer that supports decoupled Python interactions, RESTful APIs, and conversational commands. Its new runtime layer offers adaptive execution across diverse scales and environments, abstracting away system complexities.Extensive empirical evaluations demonstrate Data-Juicer 2.0's remarkable performance and scalability, highlighting its capability to efficiently process TB-level data with 10k+ CPU cores. The system is publicly available and has been widely adopted in diverse research fields and real-world products such as Alibaba Cloud PAI. We actively maintain the system and share practical insights to foster research and applications of next-generation foundation models.

View full details

Poster

IR-OptSet: An Optimization-Sensitive Dataset for Advancing LLM-Based IR Optimizer

Zi Yang ⋅ Lei Qiu ⋅ FANG LYU ⋅ Ming Zhong ⋅ Zhilei Chai ⋅ Haojie Zhou ⋅ Huimin Cui ⋅ Xiaobing Feng

Dec 4, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Compiler optimization is essential for improving program performance, yet modern compilers still depend on manually crafted transformation rules over intermediate representations (IRs). As compilers grow in complexity, maintaining these rule-based optimizations becomes increasingly labor-intensive and difficult to scale. Recent advances in large language models (LLMs) offer a promising alternative, but their effectiveness in compiler optimization remains limited—primarily due to the lack of IR-oriented datasets that expose models to diverse transformation samples in real-world scenarios (*optimization-sensitive samples*), hindering LLMs from learning rich and generalizable optimization strategies.In this paper, we introduce IR-OptSet, the first public optimization-sensitive dataset for advancing LLM-based IR optimizers. It comprises 170K LLVM IR samples from open-source repositories across 8 representative optimization domains. IR-OptSet defines two core tasks: Code Analysis and Optimized Code Generation, and provides tools for correctness verification, performance evaluation, and dataset expansion. In our experiments, fine-tuning three representative LLMs on IR-OptSet leads to significant accuracy improvements across both tasks. Moreover, the LLM fine-tuned with IR-OptSet *outperforms traditional compiler with the -O3 option* in 64 test cases in terms of performance. Further analysis reveals that IR-OptSet provides greater transformation diversity and representativeness than three widely used IR-oriented datasets, highlighting its potential to drive model-based IR optimization. IR-OptSet is publicly available at [https://huggingface.co/datasets/YangziResearch/IR-OptSet](https://huggingface.co/datasets/YangziResearch/IR-OptSet).

View full details

Poster

MONITRS: Multimodal Observations of Natural Incidents Through Remote Sensing

Shreelekha Revankar ⋅ Utkarsh Mall ⋅ Cheng Perng Phoo ⋅ Kavita Bala ⋅ Bharath Hariharan

Dec 4, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Natural disasters cause devastating damage to communities and infrastructure every year. Effective disaster response is hampered by the difficulty of accessing affected areas during and after events. Remote sensing has allowed us to monitor natural disasters in a remote way. More recently there have been advances in computer vision and deep learning that help automate satellite imagery analysis, However, they remain limited by their narrow focus on specific disaster types, reliance on manual expert interpretation, and lack of datasets with sufficient temporal granularity or natural language annotations for tracking disaster progression. We present MONITRS, a novel multimodal dataset of $\sim$10,000 FEMA disaster events with temporal satellite imagery with natural language annotations from news articles, accompanied by geotagged locations, and question-answer pairs. We demonstrate that fine-tuning existing MLLMs on our dataset yields significant performance improvements for disaster monitoring tasks, establishing a new benchmark for machine learning-assisted disaster response systems.

View full details

Poster

VideoGameQA-Bench: Evaluating Vision-Language Models for Video Game Quality Assurance

Mohammad Reza Taesiri ⋅ Abhijay Ghildyal ⋅ Saman Zadtootaghaj ⋅ Nabajeet Barman ⋅ Cor-Paul Bezemer

Dec 4, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

With video games leading in entertainment revenues, optimizing game development workflows is critical to the industry’s long-term success. Recent advances in vision-language models (VLMs) hold significant potential to automate and enhance various aspects of game development—particularly video game quality assurance (QA), which remains one of the most labor-intensive processes with limited automation. To effectively measure VLM performance in video game QA tasks and evaluate their ability to handle real-world scenarios, there is a clear need for standardized benchmarks, as current ones fall short in addressing this domain. To bridge this gap, we introduce VideoGameQA-Bench - a comprehensive benchmark designed to encompass a wide range of game QA activities, including visual unit testing, visual regression testing, needle-in-a-haystack, glitch detection, and bug report generation for both images and videos.

View full details

Poster

ALE-Bench: A Benchmark for Long-Horizon Objective-Driven Algorithm Engineering

Yuki Imajuku ⋅ Kohki Horie ⋅ Yoichi Iwata ⋅ Kensho Aoki ⋅ Naohiro Takahashi ⋅ Takuya Akiba

Dec 4, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

How well do AI systems perform in algorithm engineering for hard optimization problems in domains such as package-delivery routing, crew scheduling, factory production planning, and power-grid balancing?We introduce $\textit{ALE-Bench}$, a new benchmark for evaluating AI systems on score-based algorithmic programming contests. Drawing on real tasks from the AtCoder Heuristic Contests, ALE-Bench presents optimization problems that are computationally hard and admit no known exact solution.Unlike short-duration, pass/fail coding benchmarks, ALE-Bench encourages iterative solution refinement over long time horizons.Our software framework supports interactive agent architectures that leverage test-run feedback and visualizations. Our evaluation of frontier LLMs revealed that while they demonstrate high performance on specific problems, a notable gap remains compared to humans in terms of consistency across problems and long-horizon problem-solving capabilities. This highlights the need for this benchmark to foster future AI advancements.

View full details

Poster

UniEdit: A Unified Knowledge Editing Benchmark for Large Language Models

Qizhou Chen ⋅ Dakan Wang ⋅ Taolin Zhang ⋅ Zaoming Yan ⋅ Chengsong You ⋅ Chengyu Wang ⋅ Xiaofeng He

Dec 4, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Model editing aims to efficiently revise incorrect or outdated knowledge within LLMs without incurring the high cost of full retraining and risking catastrophic forgetting. Currently, most LLM editing datasets are confined to narrow knowledge domains and cover a limited range of editing evaluation. They often overlook the broad scope of editing demands and the diversity of ripple effects resulting from edits. In this context, we introduce \uniedit, a unified benchmark for LLM editing grounded in open-domain knowledge. First, we construct editing samples by selecting entities from 25 common domains across five major categories, utilizing the extensive triple knowledge available in open-domain knowledge graphs to ensure comprehensive coverage of the knowledge domains. To address the issues of generality and locality in editing, we design an Neighborhood Multi-hop Chain Sampling (NMCS) algorithm to sample subgraphs based on a given knowledge piece to entail comprehensive ripple effects to evaluate. Finally, we employ proprietary LLMs to convert the sampled knowledge subgraphs into natural language text, guaranteeing grammatical accuracy and syntactical diversity. Extensive statistical analysis confirms the scale, comprehensiveness, and diversity of our \uniedit benchmark. We conduct comprehensive experiments across multiple LLMs and editors, analyzing their performance to highlight strengths and weaknesses in editing across open knowledge domains and various evaluation criteria, thereby offering valuable insights for future research endeavors.

View full details

Poster

FlareX: A Physics-Informed Dataset for Lens Flare Removal via 2D Synthesis and 3D Rendering

Lishen Qu ⋅ Zhihao Liu ⋅ Jinshan Pan ⋅ Shihao Zhou ⋅ Jinglei Shi ⋅ Duosheng Chen ⋅ Jufeng Yang

Dec 4, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Lens flare occurs when shooting towards strong light sources, significantly degrading the visual quality of images. Due to the difficulty in capturing flare-corrupted and flare-free image pairs in the real world, existing datasets are typically synthesized in 2D by overlaying artificial flare templates onto background images. However, the lack of flare diversity in templates and the neglect of physical principles in the synthesis process hinder models trained on these datasets from generalizing well to real-world scenarios. To address these challenges, we propose a new physics-informed method for flare data generation, which consists of three stages: parameterized template creation, the laws of illumination-aware 2D synthesis, and physical engine-based 3D rendering, which finally gives us a mixed flare dataset that incorporates both 2D and 3D perspectives, namely FlareX. This dataset offers 9,500 2D templates derived from 95 flare patterns and 3,000 flare image pairs rendered from 60 3D scenes. Furthermore, we design a masking approach to obtain real-world flare-free images from their corrupted counterparts to measure the performance of the model on real-world images. Extensive experiments demonstrate the effectiveness of our method and dataset.

View full details

Poster

OpenS2V-Nexus: A Detailed Benchmark and Million-Scale Dataset for Subject-to-Video Generation

Shenghai Yuan ⋅ Xianyi He ⋅ Yufan Deng ⋅ Yang Ye ⋅ Jinfa Huang ⋅ lin bin ⋅ Chongyang Ma ⋅ Jiebo Luo ⋅ Li Yuan

Dec 4, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Subject-to-Video (S2V) generation aims to create videos that faithfully incorporate reference content, providing enhanced flexibility in the production of videos. To establish the infrastructure for S2V generation, we propose **OpenS2V-Nexus**, consisting of (i) **OpenS2V‑Eval**, a fine‑grained benchmark, and (ii) **OpenS2V‑5M**, a million‑scale dataset.In contrast to existing S2V benchmarks inherited from VBench that focus on global and coarse-grained assessment of generated videos, *OpenS2V-Eval* focuses on the model's ability to generate subject-consistent videos with natural subject appearance and identity fidelity. For these purposes, *OpenS2V-Eval* introduces 180 prompts from seven major categories of S2V, which incorporate both real and synthetic test data. Furthermore, to accurately align human preferences with S2V benchmarks, we propose three automatic metrics, NexusScore, NaturalScore and GmeScore, to separately quantify subject consistency, naturalness, and text relevance in generated videos. Building on this, we conduct a comprehensive evaluation of 18 representative S2V models, highlighting their strengths and weaknesses across different content. Moreover, we create the first open-source large-scale S2V generation dataset *OpenS2V-5M*, which consists of five million high-quality 720P subject-text-video triplets. Specifically, we ensure subject‐information diversity in our dataset by (1) segmenting subjects and building pairing information via cross‐video associations and (2) prompting GPT-4o on raw frames to synthesize multi-view representations. Through *OpenS2V-Nexus*, we deliver a robust infrastructure to accelerate future S2V generation research.

View full details

Poster

PARROT: A Benchmark for Evaluating LLMs in Cross-System SQL Translation

Wei Zhou ⋅ Guoliang Li ⋅ Haoyu Wang ⋅ Yuxing Han ⋅ Xufei Wu ⋅ Fan Wu ⋅ Xuanhe Zhou

Dec 4, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Large language models (LLMs) have shown increasing effectiveness in Text-to-SQL tasks. However, another closely related problem, Cross-System SQL Translation (a.k.a., SQL-to-SQL), which adapts a query written for one database system (e.g., MySQL) into its equivalent one for another system (e.g., ClickHouse), is of great practical importance but remains underexplored. Existing SQL benchmarks are not well-suited for SQL-to-SQL evaluation, which (1) focus on a limited set of database systems (often just SQLite) and (2) cannot capture many system-specific SQL dialects (e.g., customized functions, data types, and syntax rules). Thus, in this paper, we introduce PARROT, a Practical And Realistic BenchmaRk for CrOss-System SQL Translation. PARROT comprises 598 translation pairs from 38 open-source benchmarks and real-world business services, specifically prepared to challenge system-specific SQL understanding (e.g., LLMS achieve lower than 38.53% accuracy on average). We also provide multiple benchmark variants, including PARROT-Diverse with 28,003 translation (for extensive syntax testing) and PARROT-Simple with 5,306 representative samples (for focused stress testing), covering 22 production-grade database systems. To promote future research, we release a public leaderboard and source code at: https://code4db.github.io/parrot-bench/.

View full details

Poster

MedChain: Bridging the Gap Between LLM Agents and Clinical Practice with Interactive Sequence

Jie Liu ⋅ Wenxuan Wang ⋅ Zizhan Ma ⋅ Guolin Huang ⋅ Yihang SU ⋅ Kao-Jung Chang ⋅ Haoliang Li ⋅ Linlin Shen ⋅ Michael R Lyu ⋅ Wenting Chen

Dec 4, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Clinical decision making (CDM) is a complex, dynamic process crucial to healthcare delivery, yet it remains a significant challenge for artificial intelligence systems. While Large Language Model (LLM)-based agents have been tested on general medical knowledge using licensing exams and knowledge question-answering tasks, their performance in the CDM in real-world scenarios is limited due to the lack of comprehensive benchmark that mirror actual medical practice. To address this gap, we present MedChain, a dataset of 12,163 clinical cases that covers five key stages of clinical workflow. MedChain distinguishes itself from existing benchmarks with three key features of real-world clinical practice: personalization, interactivity, and sequentiality. Further, to tackle real-world CDM challenges, we also propose MedChain-Agent, an AI system that integrates a feedback mechanism and a MedCase-RAG module to learn from previous cases and adapt its responses. MedChain-Agent demonstrates remarkable adaptability in gathering information dynamically and handling sequential clinical tasks, significantly outperforming existing approaches. The relevant dataset and code will be released upon acceptance of this paper.

View full details

Poster

Reasoning Gym: Reasoning Environments for Reinforcement Learning with Verifiable Rewards

Zafir Stojanovski ⋅ Oliver Stanley ⋅ Joe Sharratt ⋅ Richard Jones ⋅ Abdulhakeem Adefioye ⋅ Jean Kaddour ⋅ Andreas Köpf

Dec 4, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

We introduce Reasoning Gym, a library of reasoning environments for reinforcement learning with verifiable rewards (RLVR). It provides over 100 tasks spanning multiple domains including algebra, arithmetic, computation, cognition, geometry, graph theory, logic, and various common games. Its key innovation is the ability to generate virtually infinite training data with adjustable complexity, unlike most previous reasoning datasets, which are typically fixed. This procedural generation approach allows for continuous evaluation across varying difficulty levels and task configurations. Our experimental results demonstrate the efficacy of Reasoning Gym in both evaluating and reinforcement learning of reasoning models.

View full details

Poster

AutoOpt: A Dataset and a Unified Framework for Automating Optimization Problem Solving

Ankur Sinha ⋅ Shobhit Arora ⋅ Dhaval Pujara

Dec 4, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

This study presents AutoOpt-11k, a unique image dataset of over 11,000 handwritten and printed mathematical optimization models corresponding to single-objective, multi-objective, multi-level, and stochastic optimization problems exhibiting various types of complexities such as non-linearity, non-convexity, non-differentiability, discontinuity, and high-dimensionality. The labels consist of the LaTeX representation for all the images and modeling language representation for a subset of images. The dataset is created by 25 experts following ethical data creation guidelines and verified in two-phases to avoid errors. Further, we develop AutoOpt framework, a machine learning based automated approach for solving optimization problems, where the user just needs to provide an image of the formulation and AutoOpt solves it efficiently without any further human intervention. AutoOpt framework consists of three Modules: (i) M1 (Image_to_Text)- a deep learning model performs the Mathematical Expression Recognition (MER) task to generate the LaTeX code corresponding to the optimization formulation in image; (ii) M2 (Text_to_Text)- a small-scale fine-tuned LLM generates the PYOMO script (optimization modeling language) from LaTeX code; (iii) M3 (Optimization)- a Bilevel Optimization based Decomposition (BOBD) method solves the optimization formulation described in the PYOMO script. We use AutoOpt-11k dataset for training and testing of deep learning models employed in AutoOpt. The deep learning model for MER task (M1) outperforms ChatGPT, Gemini and Nougat on BLEU score metric. BOBD method (M3), which is a hybrid approach, yields better results on complex test problems compared to common approaches, like interior-point algorithm and genetic algorithm.

View full details

Poster

MultiHuman-Testbench: Benchmarking Image Generation for Multiple Humans

Shubhankar Borse ⋅ Seokeon Choi ⋅ Sunghyun Park ⋅ Jeongho Kim ⋅ Shreya Kadambi ⋅ Risheek Garrepalli ⋅ Sungrack Yun ⋅ Durga Malladi ⋅ Fatih Porikli

Dec 4, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Generation of images containing multiple humans, performing complex actions, while preserving their facial identities, is a significant challenge. A major factor contributing to this is the lack of a a dedicated benchmark. To address this, we introduce MultiHuman-Testbench, a novel benchmark for rigorously evaluating generative models for multi-human generation. The benchmark comprises 1800 samples, including carefully curated text prompts, describing a range of simple to complex human actions. These prompts are matched with a total of 5,550 unique human face images, sampled uniformly to ensure diversity across age, ethnic background, and gender. Alongside captions, we provide human-selected pose conditioning images which accurately match the prompt. We propose a multi-faceted evaluation suite employing four key metrics to quantify face count, ID similarity, prompt alignment, and action detection. We conduct a thorough evaluation of a diverse set of models, including zero-shot approaches and training-based methods, with and without regional priors. We also propose novel techniques to incorporate image and region isolation using human segmentation and Hungarian matching, significantly improving ID similarity. Our proposed benchmark and key findings provide valuable insights and a standardized tool for advancing research in multi-human image generation.

View full details

Poster

PHYBench: Holistic Evaluation of Physical Perception and Reasoning in Large Language Models

Shi Qiu ⋅ Shaoyang Guo ⋅ Zhuo-Yang Song ⋅ Yunbo Sun ⋅ Zeyu Cai ⋅ Jiashen Wei ⋅ Tianyu Luo ⋅ Yixuan Yin ⋅ Zhang Haoxu ⋅ Yi Hu ⋅ Chenyang Wang ⋅ Chencheng Tang ⋅ Haoling Chang ⋅ Qi Liu ⋅ Ziheng Zhou ⋅ Tianyu Zhang ⋅ Jingtian Zhang ⋅ Zhangyi Liu ⋅ Minghao Li ⋅ Yuku Zhang ⋅ Boxuan Jing ⋅ Xianqi Yin ⋅ Yutong Ren ⋅ Zizhuo Fu ⋅ Jiaming Ji ⋅ Weike Wang ⋅ Xudong Tian ⋅ Anqi Lv ⋅ Laifu Man ⋅ Jianxiang Li ⋅ Feiyu Tao ⋅ Qihua Sun ⋅ Zhou Liang ⋅ Yushu Mu ⋅ Zhongxuan Li ⋅ Jing-Jun Zhang ⋅ Shutao Zhang ⋅ Xiaotian Li ⋅ Xingqi Xia ⋅ Jiawei Lin ⋅ Zheyu Shen ⋅ Jiahang Chen ⋅ Qiuhao Xiong ⋅ Binran Wang ⋅ Fengyuan Wang ⋅ Niziyang ⋅ Bohan Zhang ⋅ Fan Cui ⋅ shaochangkun ⋅ Qing-Hong Cao ⋅ Ming-xing Luo ⋅ Muhan Zhang ⋅ Hua Xing Zhu

Dec 4, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Current benchmarks for evaluating the reasoning capabilities of Large Language Models (LLMs) face significant limitations: task oversimplification, data contamination, and flawed evaluation items. These deficiencies necessitate more rigorous assessment methods. To address these limitations, we introduce PHYBench, a benchmark of 500 original physics problems ranging from high school to Physics Olympiad difficulty. PHYBench addresses data contamination through original content and employs a systematic curation pipeline to eliminate flawed items. Evaluations show that PHYBench activates more tokens and provides stronger differentiation between reasoning models compared to other baselines like AIME 2024, OlympiadBench and GPQA. Even the best-performing model, Gemini 2.5 Pro, achieves only 36.9\% accuracy compared to human experts' 61.9\%. To further enhance evaluation precision, we introduce the Expression Edit Distance (EED) Score for mathematical expression assessment, which improves sample efficiency by 204\% over binary scoring. Moreover, PHYBench effectively elicits multi-step and multi-condition reasoning, providing a platform for examining models' reasoning robustness, preferences, and deficiencies. The benchmark results and dataset are publicly available at https://www.phybench.cn/.

View full details

Poster

Hyperphantasia: A Benchmark for Evaluating the Mental Visualization Capabilities of Multimodal LLMs

Mohammad Shahab Sepehri ⋅ Berk Tinaz ⋅ Zalan Fabian ⋅ Mahdi Soltanolkotabi

Dec 4, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Mental visualization, the ability to construct and manipulate visual representations internally, is a core component of human cognition and plays a vital role in tasks involving reasoning, prediction, and abstraction. Despite the rapid progress of Multimodal Large Language Models (MLLMs), current benchmarks primarily assess passive visual perception, offering limited insight into the more active capability of internally constructing visual patterns to support problem solving. Yet mental visualization is a critical cognitive skill in humans, supporting abilities such as spatial navigation, predicting physical trajectories, and solving complex visual problems through imaginative simulation. To bridge this gap, we introduce Hyperphantasia, a synthetic benchmark designed to evaluate the mental visualization abilities of MLLMs through four carefully constructed puzzles. Each task is procedurally generated and presented at three difficulty levels, enabling controlled analysis of model performance across increasing complexity. Our comprehensive evaluation of state-of-the-art models reveals a substantial gap between the performance of humans and MLLMs. Additionally, we explore the potential of reinforcement learning to improve visual simulation capabilities. Our findings suggest that while some models exhibit partial competence in recognizing visual patterns, robust mental visualization remains an open challenge for current MLLMs.

View full details

Poster

MMCSBench: A Fine-Grained Benchmark for Large Vision-Language Models in Camouflage Scenes

Jin Zhang ⋅ Ruiheng Zhang ⋅ Zhe Cao ⋅ Kaizheng Chen

Dec 4, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Current camouflaged object detection methods predominantly follow discriminative segmentation paradigms and heavily rely on predefined categories present in the training data, limiting their generalization to unseen or emerging camouflage objects. This limitation is further compounded by the labor-intensive and time-consuming nature of collecting camouflage imagery. Although Large Vision-Language Models (LVLMs) show potential to improve such issues with their powerful generative capabilities, their understanding of camouflage scenes is still insufficient. To bridge this gap, we introduce MMCSBench, the first comprehensive multimodal benchmark designed to evaluate and advance LVLM capabilities in camouflage scenes. MMCSBench comprises 22,537 images and 76,843 corresponding image-text pairs across five fine-grained camouflage tasks. Additionally, we propose a new task, Camouflage Efficacy Assessment (CEA), aimed at quantitatively evaluating the camouflage effectiveness of objects in images and enabling automated collection of camouflage images from large-scale databases. Extensive experiments on 26 LVLMs reveal significant shortcomings in models' ability to perceive and interpret camouflage scenes. These findings highlight the fundamental differences between natural and camouflaged visual inputs, offering insights for future research in advancing LVLM capabilities within this challenging domain.

View full details

Poster

Open CaptchaWorld: A Comprehensive Web-based Platform for Testing and Benchmarking Multimodal LLM Agents

Yaxin Luo ⋅ Zhaoyi Li ⋅ Jiacheng Liu ⋅ Jiacheng Cui ⋅ Xiaohan Zhao ⋅ Zhiqiang Shen

Dec 4, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

CAPTCHAs have been a critical bottleneck for deploying web agents in real-world applications, often blocking them from completing end-to-end automation tasks. While modern multimodal LLM agents have demonstrated impressive performance in static perception tasks, their ability to handle interactive, multi-step reasoning challenges like CAPTCHAs is largely untested. To address this gap, we introduce **Open CaptchaWorld**, the first web-based benchmark and platform specifically designed to evaluate the visual reasoning and interaction capabilities of MLLM-powered agents through diverse and dynamic CAPTCHA puzzles. Our benchmark spans 20 modern CAPTCHA types, totaling 225 CAPTCHAs, annotated with a new metric we propose: CAPTCHA Reasoning Depth, which quantifies the number of cognitive and motor steps required to solve each puzzle. Experimental results show that humans consistently achieve near-perfect scores, state-of-the-art MLLM agents struggle significantly, with success rates at most **40.0\%** by Browser-Use Openai-o3, far below human-level performance,**93.3\%**. This highlights Open CaptchaWorld as a vital benchmark for diagnosing the limits of current multimodal agents and guiding the development of more robust multimodal reasoning systems.

View full details

Poster

PHANTOM: A Benchmark for Hallucination Detection in Financial Long-Context QA

Lanlan Ji ⋅ Dominic Seyler ⋅ Gunkirat Kaur ⋅ Manjunath Hegde ⋅ Koustuv Dasgupta ⋅ Bing Xiang

Dec 4, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

While Large Language Models (LLMs) show great promise, their tendencies to hallucinate pose significant risks in high-stakes domains like finance, especially when used for regulatory reporting and decision-making. Existing hallucination detection benchmarks fail to capture the complexities of financial benchmarks, which require high numerical precision, nuanced understanding of the language of finance, and ability to handle long-context documents. To address this, we introduce PHANTOM, a novel benchmark dataset for evaluating hallucination detection in long-context financial QA. Our approach first generates a seed dataset of high-quality "query-answer-document (chunk)" triplets, with either hallucinated or correct answers - that are validated by human annotators and subsequently expanded to capture various context lengths and information placements. We demonstrate how PHANTOM allows fair comparison of hallucination detection models and provides insights into LLM performance, offering a valuable resource for improving hallucination detection in financial applications. Further, our benchmarking results highlight the severe challenges out-of-the-box models face in detecting real-world hallucinations on long context data, and establish some promising directions towards alleviating these challenges, by fine-tuning open-source LLMs using PHANTOM.

View full details

Poster

MolVision: Molecular Property Prediction with Vision Language Models

Deepan Adak ⋅ Yogesh Rawat ⋅ Shruti Vyas

Dec 4, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Molecular property prediction is a fundamental task in computational chemistry with critical applications in drug discovery and materials science. While recent works have explored Large Language Models (LLMs) for this task, they primarily rely on textual molecular representations such as SMILES/SELFIES, which can be ambiguous and structurally uninformative. In this work, we introduce MolVision, a novel approach that leverages Vision-Language Models (VLMs) by integrating both molecular structure images and textual descriptions to enhance property prediction. We construct a benchmark spanning nine diverse datasets, covering both classification and regression tasks. Evaluating nine different VLMs in zero-shot, few-shot, and fine-tuned settings, we find that visual information improves prediction performance, particularly when combined with efficient fine-tuning strategies such as LoRA. Our results reveal that while visual information alone is insufficient, multimodal fusion significantly enhances generalization across molecular properties. Adaptation of vision encoder for molecular images in conjunction with LoRA further improves the performance. The code and data is available at : https://molvision.github.io/MolVision/.

View full details

Poster

QuestBench: Can LLMs ask the right question to acquire information in reasoning tasks?

Belinda Li ⋅ Been Kim ⋅ Zi Wang

Dec 4, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Large language models (LLMs) have shown impressive performance on reasoning benchmarks like math and logic. While many works have largely assumed well-defined tasks, real-world queries are often underspecified and only solvable by acquiring missing information. We formalize this information-gathering problem as a constraint satisfaction problem (CSP) with missing variable assignments. Using a special case where only one necessary variable assignment is missing, we can evaluate an LLM's ability to identify the minimal necessary question to ask. We present QuestBench, a set of underspecified reasoning tasks solvable by asking at most one question, which includes: (1) Logic-Q: logical reasoning tasks with one missing proposition, (2) Planning-Q: PDDL planning problems with partially-observed initial states, (3) GSM-Q: human-annotated grade school math problems with one unknown variable, and (4) GSME-Q: equation-based version of GSM-Q. The LLM must select the correct clarification question from multiple options. While current models excel at GSM-Q and GSME-Q, they achieve only 40-50% accuracy on Logic-Q and Planning-Q. Analysis shows that the ability to solve well-specified reasoning problems is not sufficient for success on our benchmark: models struggle to identify the right question even when they can solve the fully specified version. This highlights the need for specifically optimizing models' information acquisition capabilities.

View full details

Poster

AgentDAM: Privacy Leakage Evaluation for Autonomous Web Agents

Arman Zharmagambetov ⋅ Chuan Guo ⋅ Ivan Evtimov ⋅ Maya Pavlova ⋅ Ruslan Salakhutdinov ⋅ Kamalika Chaudhuri

Dec 4, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Autonomous AI agents that can follow instructions and perform complex multi-step tasks have tremendous potential to boost human productivity. However, to perform many of these tasks, the agents need access to personal information from their users, raising the question of whether they are capable of using it appropriately. In this work, we introduce a new benchmark **AgentDAM** that measures if AI web-navigation agents follow the privacy principle of *"data minimization"*. For the purposes of our benchmark, data minimization means that the agent uses a piece of potentially sensitive information only if it is "necessary" to complete a particular task. Our benchmark simulates realistic web interaction scenarios end-to-end and is adaptable to all existing web navigation agents. We use AgentDAM to evaluate how well AI agents built on top of GPT-4, Llama-3 and Claude can limit processing of potentially private information, and show that they are prone to inadvertent use of unnecessary sensitive information. We also propose a prompting-based defense that reduces information leakage, and demonstrate that our end-to-end benchmarking provides a more realistic measure than probing LLMs about privacy. Our results highlight that further research is needed to develop AI agents that can prioritize data minimization at inference time. We open source our benchmark at: https://github.com/facebookresearch/ai-agent-privacy

View full details

Poster

SeePhys: Does Seeing Help Thinking? – Benchmarking Vision-Based Physics Reasoning

Kun Xiang ⋅ Heng Li ⋅ Terry Jingchen Zhang ⋅ Yinya Huang ⋅ Zirong Liu ⋅ Peixin Qu ⋅ Jixi He ⋅ Jiaqi Chen ⋅ Yu-Jie Yuan ⋅ Jianhua Han ⋅ Hang Xu ⋅ Hanhui Li ⋅ Mrinmaya Sachan ⋅ Xiaodan Liang

Dec 4, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

We present SeePhys, a large-scale multimodal benchmark for LLM reasoning grounded in physics questions ranging from middle school to PhD qualifying exams. The benchmark covers 7 fundamental domains spanning the physics discipline, incorporating 21 categories of highly heterogeneous diagrams. In contrast to prior works where visual elements mainly serve auxiliary purposes, our benchmark features a substantial proportion of vision-essential problems (75%) that mandate visual information extraction for correct solutions. Through extensive evaluation, we observe that even the most advanced visual reasoning models (e.g., Gemini-2.5-pro and o4-mini) achieve sub-60% accuracy on our benchmark. These results reveal fundamental challenges in current large language models' visual understanding capabilities, particularly in: (i) establishing rigorous coupling between diagram interpretation and physics reasoning, and (ii) overcoming their persistent reliance on textual cues as cognitive shortcuts.Project Page: github.com/SeePhys/seephys-projectHugging Face: huggingface.co/datasets/SeePhys/SeePhys

View full details

Poster

Semantic-KG: Using Knowledge Graphs to Construct Benchmarks for Measuring Semantic Similarity

Qiyao Wei ⋅ Edward R Morrell ⋅ Lea Goetz ⋅ Mihaela van der Schaar

Dec 4, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Evaluating the open-form textual responses generated by Large Language Models (LLMs) typically requires measuring the semantic similarity of the response to a (human generated) reference. However, there is evidence that current semantic similarity methods may capture syntactic or lexical forms over semantic content. While benchmarks exist for semantic equivalence, they often suffer from high generation costs due to reliance on subjective human judgment, limited availability for domain-specific applications, and unclear definitions of equivalence. This paper introduces a novel method for generating benchmarks to evaluate semantic similarity methods for LLM outputs, specifically addressing these limitations. Our approach leverages knowledge graphs (KGs) to generate pairs of natural-language statements that are semantically similar or dissimilar, with dissimilar pairs categorized into one of four sub-types. We generate benchmark datasets in four different domains (general knowledge, biomedicine, finance, biology), and conduct a comparative study of semantic similarity methods including traditional natural language processing scores and LLM-as-a-judge predictions. We observe that the sub-type of semantic variation, as well as the domain of the benchmark impact the performance of semantic similarity methods, with no method being consistently superior. Our results present important implications for the use of LLM-as-a-judge in detecting the semantic content of text. Code is available at \url{https://github.com/QiyaoWei/semantic-kg} and the dataset is available at \url{https://huggingface.co/datasets/QiyaoWei/Semantic-KG}.

View full details

Poster

All that structure matches does not glitter

Maya Martirossyan ⋅ Thomas Egg ⋅ Philipp Höllmer ⋅ George Karypis ⋅ Mark Transtrum ⋅ Adrian Roitberg ⋅ Mingjie Liu ⋅ Richard Hennig ⋅ Ellad Tadmor ⋅ Stefano Martiniani

Dec 4, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Generative models for materials, especially inorganic crystals, hold potential to transform the theoretical prediction of novel compounds and structures. Advancement in this field depends critically on robust benchmarks and minimal, information-rich datasets that enable meaningful model evaluation. This paper critically examines common datasets and reported metrics for a crystal structure prediction task—generating the most likely structures given the chemical composition of a material. We focus on three key issues: First, materials datasets should contain unique crystal structures; for example, we show that the widely-utilized carbon-24 dataset only contains $\approx 40$% unique structures. Second, materials datasets should not be split randomly if polymorphs of many different compositions are numerous—which we find to be the case for the perov-5 and MP-20 datasets. Third, benchmarks can mislead if used uncritically, e.g., reporting a match rate metric without considering the structural variety exhibited by identical building blocks. To address these oft-overlooked issues, we introduce several fixes. We provide revised versions of the carbon-24 dataset: one with duplicates removed, one deduplicated and split by number of atoms $N$, one with enantiomorphs, and two containing only identical structures but with different unit cells. We also propose new splits for datasets with polymorphs, ensuring that polymorphs are grouped within each split subset, setting a more sensible standard for benchmarking model performance. Finally, we present METRe and cRMSE, new model evaluation metrics that can correct existing issues with the match rate metric.

View full details

Poster

IndEgo: A Dataset of Industrial Scenarios and Collaborative Work for Egocentric Assistants

Vivek Chavan ⋅ Yasmina Imgrund ⋅ Tung Dao ⋅ Sanwantri Bai ⋅ Bosong Wang ⋅ Ze Lu ⋅ Oliver Heimann ⋅ Jörg Krüger

Dec 4, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

We introduce IndEgo, a multimodal egocentric and exocentric dataset addressing common industrial tasks, including assembly/disassembly, logistics and organisation, inspection and repair, woodworking, and others. The dataset contains 3,460 egocentric recordings (approximately 197 hours), along with 1,092 exocentric recordings (approximately 97 hours). A key focus of the dataset is collaborative work, where two workers jointly perform cognitively and physically intensive tasks. The egocentric recordings include rich multimodal data and added context via eye gaze, narration, sound, motion, and others. We provide detailed annotations (actions, summaries, mistake annotations, narrations), metadata, processed outputs (eye gaze, hand pose, semi-dense point cloud), and benchmarks on procedural and non-procedural task understanding, Mistake Detection, and reasoning-based Question Answering. Baseline evaluations for Mistake Detection, Question Answering and collaborative task understanding show that the dataset presents a challenge for the state-of-the-art multimodal models. Our dataset is available at: https://huggingface.co/datasets/FraunhoferIPK/IndEgo

View full details

Poster

EgoExoBench: A Benchmark for First- and Third-person View Video Understanding in MLLMs

Yuping He ⋅ Yifei Huang ⋅ Guo Chen ⋅ Baoqi Pei ⋅ Jilan Xu ⋅ Tong Lu ⋅ Jiangmiao Pang

Dec 4, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Transferring and integrating knowledge across first-person (egocentric) and third-person (exocentric) viewpoints is intrinsic to human intelligence, enabling humans to learn from others and convey insights from their own experiences. Despite rapid progress in multimodal large language models (MLLMs), their ability to perform such cross-view reasoning remains unexplored. To address this, we introduce EgoExoBench, the first benchmark for egocentric exocentric video understanding and reasoning. Built from publicly available datasets, EgoExoBench comprises over 7300 question–answer pairs spanning eleven sub-tasks organized into three core challenges: semantic alignment, viewpoint association, and temporal reasoning. We evaluate 13 state-of-the-art MLLMs and find that while these models excel on single-view tasks, they struggle to align semantics across perspectives, accurately associate views, and infer temporal dynamics in the ego-exo context. We hope EgoExoBench can serve as a valuable resource for research on embodied agents and intelligent assistants seeking human-like cross-view intelligence.

View full details

Poster

mmWalk: Towards Multi-modal Multi-view Walking Assistance

Kedi Ying ⋅ Ruiping Liu ⋅ Chongyan Chen ⋅ Mingzhe Tao ⋅ Hao Shi ⋅ Kailun Yang ⋅ Jiaming Zhang ⋅ Rainer Stiefelhagen

Dec 4, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Walking assistance in extreme or complex environments remains a significant challenge for people with blindness or low vision (BLV), largely due to the lack of a holistic scene understanding. Motivated by the real-world needs of the BLV community, we build mmWalk, a simulated multi-modal dataset that integrates multi-view sensor and accessibility-oriented features for outdoor safe navigation. Our dataset comprises $120$ manually controlled, scenario-categorized walking trajectories with $62k$ synchronized frames. It contains over $559k$ panoramic images across RGB, depth, and semantic modalities. Furthermore, to emphasize real-world relevance, each trajectory involves outdoor corner cases and accessibility-specific landmarks for BLV users. Additionally, we generate mmWalkVQA, a VQA benchmark with over $69k$ visual question-answer triplets across $9$ categories tailored for safe and informed walking assistance. We evaluate state-of-the-art Vision-Language Models (VLMs) using zero- and few-shot settings and found they struggle with our risk assessment and navigational tasks. We validate our mmWalk-finetuned model on real-world datasets and show the effectiveness of our dataset for advancing multi-modal walking assistance.

View full details

Poster

EffiBench-X: A Multi-Language Benchmark for Measuring Efficiency of LLM-Generated Code

Yuhao Qing ⋅ Boyu Zhu ⋅ Mingzhe Du ⋅ Zhijiang Guo ⋅ Terry Yue Zhuo ⋅ Qianru Zhang ⋅ Jie Zhang ⋅ Heming Cui ⋅ Siu Ming Yiu ⋅ Dong HUANG ⋅ See-Kiong Ng ⋅ Anh Tuan Luu

Dec 4, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Existing code generation benchmarks primarily evaluate functional correctness, with limited attention to code efficiency, and they are often restricted to a single language such as Python. To address this gap, we introduce EffiBench‑X, the first large‑scale multi‑language benchmark specifically designed for robust efficiency evaluation of LLM‑generated code. EffiBench‑X supports Python, C++, Java, JavaScript, Ruby, and Go, and comprises competitive programming tasks paired with human‑expert solutions as efficiency baselines. Evaluating state‑of‑the‑art LLMs on EffiBench‑X reveals that while models frequently generate functionally correct code, they consistently underperform human experts in efficiency. Even the most efficient LLM‑generated solutions (e.g., Qwen3‑32B) achieve only around 62% of human efficiency on average, with significant language‑specific variation: models tend to perform better in Python, Ruby, and JavaScript than in Java, C++, and Go (e.g., DeepSeek‑R1’s Python code is markedly more efficient than its Java code). These findings highlight the need for research into optimization‑oriented methods to improve the efficiency of LLM‑generated code across diverse languages. The dataset and evaluation infrastructure are publicly available at https://github.com/EffiBench/EffiBench-X.git and https://huggingface.co/datasets/EffiBench/effibench-x.

View full details

Poster

RBench-V: A Primary Assessment for Visual Reasoning Models with Multimodal Outputs

Meng-Hao Guo ⋅ Xuanyu Chu ⋅ Qianrui Yang ⋅ Zhe-Han Mo ⋅ Yiqing Shen ⋅ Pei-lin Li ⋅ Xinjie Lin ⋅ Jinnian Zhang ⋅ Xin-Sheng Chen ⋅ Yi Zhang ⋅ Kiyohiro Nakayama ⋅ Zhengyang Geng ⋅ Houwen Peng ⋅ Han Hu ⋅ Shi-min Hu

Dec 4, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

The rapid advancement of native multi-modal models and omni-models, exemplified by GPT-4o, Gemini and o3 with their capability to process and generate content across modalities such as text and images, marks a significant milestone in the evolution of intelligence. Systematic evaluation of their multi-modal output capabilities in visual thinking process (a.k.a., multi-modal chain of thought, M-CoT) becomes critically important. However, existing benchmarks for evaluating multi-modal models primarily focus on assessing multi-modal inputs and text-only reasoning process while neglecting the importance of reasoning through multi-modal outputs. In this paper, we present a benchmark, dubbed as RBench-V, designed to assess models’ vision-indispensable reasoning. To conduct RBench-V, we carefully hand-pick 803 questions covering math, physics, counting and games. Unlike problems in previous benchmarks, which typically specify certain input modalities, RBench-V presents problems centered on multi-modal outputs, which require image manipulation, such as generating novel images and constructing auxiliary lines to support reasoning process. We evaluate numerous open- and closed-source models on RBench-V, including o3, Gemini 2.5 pro, Qwen2.5-VL, etc. Even the best-performing model, o3, achieves only 25.8% accuracy on RBench-V, far below the human score of 82.3%, which shows current models struggle to leverage multi-modal reasoning. Data and code are available at https://evalmodels.github.io/rbenchv.

View full details

Poster

VeriThoughts: Enabling Automated Verilog Code Generation using Reasoning and Formal Verification

Patrick Yubeaton ⋅ Andre Nakkab ⋅ Weihua Xiao ⋅ Luca Collini ⋅ Ramesh Karri ⋅ Chinmay Hegde ⋅ Siddharth Garg

Dec 4, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

This paper introduces VeriThoughts, a novel dataset designed for reasoning-based Verilog code generation. We establish a new benchmark framework grounded in formal verification methods to evaluate the quality and correctness of generated hardware descriptions. Additionally, we present a suite of specialized small-scale models optimized specifically for Verilog generation. Our work addresses the growing need for automated hardware design tools that can produce verifiably correct implementations from high-level specifications, potentially accelerating the hardware development process while maintaining rigorous correctness guarantees.

View full details

Poster

Factorio Learning Environment

Jack Hopkins ⋅ Mart Bakler ⋅ Akbir Khan

Dec 4, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Large Language Models (LLMs) are rapidly saturating existing benchmarks, necessitating new open-ended evaluations. We introduce the Factorio Learning Environment (FLE), based on the game of Factorio, that tests agents in long-term planning, spatial reasoning, program synthesis, and resource optimization. FLE provides exponentially scaling challenges -- from basic automation to complex factories processing millions of resource units per second. We provide two settings: (1) open-play with the open-ended task of building the largest factory on an procedurally generated map and (2) lab-play consisting of 33 bounded tasks accross three settings with fixed resources. We demonstrate across both settings that models still lack strong spatial reasoning. In lab-play, we find that LLMs exhibit promising short-horizon skills, yet are unable to operate effectively in constrained environments, reflecting limitations in error analysis. In open-play, while LLMs discover automation strategies that improve growth (e.g electric-powered drilling), they fail to achieve complex automation (e.g electronic-circuit manufacturing)

View full details

Poster

T1: A Tool-Oriented Conversational Dataset for Multi-Turn Agentic Planning

Amartya Chakraborty ⋅ Paresh Dashore ⋅ Nadia Bathaee ⋅ Anmol Jain ⋅ Anirban Das ⋅ Shi-Xiong Zhang ⋅ Sambit Sahu ⋅ Milind Naphade ⋅ Genta Winata

Dec 4, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Large Language Models (LLMs) have demonstrated impressive capabilities as intelligent agents capable of solving complex problems. However, effective planning in scenarios involving dependencies between API or tool calls-particularly in multi-turn conversations-remains a significant challenge. To address this, we introduce T1, a tool-augmented, multi-domain, multi-turn conversational dataset specifically designed to capture and manage inter-tool dependencies across diverse domains. T1 enables rigorous evaluation of agents' ability to coordinate tool use across nine distinct domains (4 single domain and 5 multi-domain) with the help of an integrated caching mechanism for both short- and long-term memory, while supporting dynamic replanning-such as deciding whether to recompute or reuse cached results. Beyond facilitating research on tool use and planning, T1 also serves as a benchmark for evaluating the performance of open-weight and proprietary large language models. We present results powered by T1-Agent highlighting their ability to plan and reason in complex, tool-dependent scenarios.

View full details

Poster

APIGen-MT: Agentic Pipeline for Multi-Turn Data Generation via Simulated Agent-Human Interplay

Akshara Prabhakar ⋅ Zuxin Liu ⋅ Ming Zhu ⋅ Jianguo Zhang ⋅ Tulika Manoj Awalgaonkar ⋅ Shiyu Wang ⋅ Zhiwei Liu ⋅ Haolin Chen ⋅ Thai Hoang ⋅ Juan Carlos Niebles ⋅ Shelby Heinecke ⋅ Weiran Yao ⋅ Huan Wang ⋅ Silvio Savarese ⋅ Caiming Xiong

Dec 4, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Training effective AI agents for multi-turn interactions requires high-quality data that captures realistic human-agent dynamics, yet such data is scarce and expensive to collect manually. We introduce APIGen-MT, a two-phase framework that generates verifiable and diverse multi-turn agent data. In the first phase, our agentic pipeline produces detailed task blueprints with ground-truth actions, leveraging a committee of LLM reviewers and iterative feedback loops. These blueprints are then transformed into complete interaction trajectories through simulated human-agent interplay. We train a family of models---the xLAM-2-fc-r series with sizes ranging from 1B to 70B parameters. Our models outperform frontier models such as GPT-4o and Claude 3.5 on $\tau$-bench and BFCL benchmarks, with the smaller models surpassing their larger counterparts, particularly in multi-turn settings, while maintaining superior consistency across multiple trials. Comprehensive experiments demonstrate that our verified blueprint-to-details approach yields high-quality training data, enabling the development of more reliable, efficient, and capable agents. We open-source both the synthetic data collected and the trained xLAM-2-fc-r models to advance research in AI agents.Dataset: https://huggingface.co/datasets/Salesforce/APIGen-MT-5k & Models: https://huggingface.co/collections/Salesforce/xlam-2-67ef5be12949d8dcdae354c4

View full details

Poster

LooGLE v2: Are LLMs Ready for Real World Long Dependency Challenges?

Ziyuan He ⋅ Yuxuan Wang ⋅ Jiaqi Li ⋅ Kexin Liang ⋅ Muhan Zhang

Dec 5, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Large language models (LLMs) are equipped with increasingly extended context windows recently, yet their long context understanding capabilities over long dependency tasks remain fundamentally limited and underexplored. This gap is especially significant in many real-world long-context applications that were rarely benchmarked. In this paper, we introduce $\textbf{LooGLE v2}$, a novel benchmark designed to evaluate LLMs' long context ability in real-world applications and scenarios. Our benchmark consists of automatically collected real-world long texts, ranging from 16k to 2M tokens, encompassing domains in law, finance, game and code. Accordingly, we delicately design 10 types of domain-specific long-dependency tasks and generate 1,934 QA instances with various diversity and complexity in a scalable data curation pipeline for further practical needs. We conduct a comprehensive assessment of 6 locally deployed and 4 API-based LLMs. The evaluation results show that even the best-performing model achieves only a 59.2\% overall score on our benchmark. Despite the extensive context windows, popular LLMs are only capable of understanding a much shorter length of context than they claim to be, revealing significant limitations in their ability to handle real-world tasks with long dependencies and highlighting substantial room for model improvement in practical long-context understanding.

View full details

Poster

TAPAS: Datasets for Learning the Learning with Errors Problem

Eshika Saxena ⋅ Alberto Alfarano ⋅ Francois Charton ⋅ Emily Wenger ⋅ Kristin E. Lauter

Dec 5, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

AI-powered attacks on Learning with Errors (LWE)—an important hard math problem in post-quantum cryptography—rival or outperform "classical" attacks on LWE under certain parameter settings. Despite the promise of this approach, a dearth of accessible data limits AI practitioners' ability to study and improve these attacks. Creating LWE data for AI model training is time- and compute-intensive and requires significant domain expertise. To fill this gap and accelerate AI research on LWE attacks, we propose the TAPAS datasets, a ${\bf t}$oolkit for ${\bf a}$nalysis of ${\bf p}$ost-quantum cryptography using ${\bf A}$I ${\bf s}$ystems. These datasets cover several LWE settings and can be used off-the-shelf by AI practitioners to prototype new approaches to cracking LWE. This work documents TAPAS dataset creation, establishes attack performance baselines, and lays out directions for future work.

View full details

Poster

Do You Really Need Public Data? Surrogate Public Data for Differential Privacy on Tabular Data

Shlomi Hod ⋅ Lucas Rosenblatt ⋅ Julia Stoyanovich

Dec 5, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Differentially private (DP) machine learning often relies on the availability of public data for tasks like privacy-utility trade-off estimation, hyperparameter tuning, and pretraining. While public data assumptions may be reasonable in text and image data, they are less likely to hold for tabular data due to tabular data heterogeneity across domains. We propose leveraging powerful priors to address this limitation; specifically, we synthesize realistic tabular data directly from schema-level specifications -- such as variable names, types, and permissible ranges -- without ever accessing sensitive records. To that end, this work introduces the notion of ``surrogate'' public data -- datasets generated independently of sensitive data, which consume no privacy loss budget and are constructed solely from publicly available schema or metadata. Surrogate public data are intended to encode plausible statistical assumptions (informed by publicly available information) into a dataset with many downstream uses in private mechanisms. We automate the process of generating surrogate public data with large language models (LLMs); in particular, we propose two methods: direct record generation as CSV files, and automated structural causal model (SCM) construction for sampling records. Through extensive experiments, we demonstrate that surrogate public tabular data can effectively replace traditional public data when pretraining differentially private tabular classifiers. To a lesser extent, surrogate public data are also useful for hyperparameter tuning of DP synthetic data generators, and for estimating the privacy-utility tradeoff.

View full details

Poster

DQVis Dataset: Natural Language to Biomedical Visualization

Devin Lange ⋅ Pengwei Sui ⋅ Shanghua Gao ⋅ Marinka Zitnik ⋅ Nils Gehlenborg

Dec 5, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Biomedical research data portals are essential resources for scientific inquiry, and interactive exploratory visualizations are an integral component for querying such data repositories. Increasingly, machine learning is being integrated into visualization systems to create natural language interfaces where questions about data can be answered with visualizations, and follow-up questions can build on the previous state. This paper introduces a framework that takes abstract low-level questions about data and a visualization grammar specification that can answer such a question, reifies them with data entities and fields that meet certain constraints, and paraphrases the question language to produce the final collection of realized data-question-visualization triplets. Furthermore, we can link these foundational elements together to construct chains of queries, visualizations, and follow-up queries. We developed an open-source review interface for evaluating the results of these datasets. We applied this framework to five biomedical research data repositories, resulting in DQVis, a dataset of 1.08 million data-question-visualization triplets and 11.4 thousand two-step question samples. Five visualization experts provided feedback on the generated dataset through our review interface. We present a summary of their input and publish the full reviews as an additional resource alongside the dataset.The DQVis dataset and generation code are available at https://huggingface.co/datasets/HIDIVE/DQVis and https://github.com/hms-dbmi/DQVis-Generation.

View full details

Poster

MLR-Bench: Evaluating AI Agents on Open-Ended Machine Learning Research

Hui Chen ⋅ Miao Xiong ⋅ Yujie Lu ⋅ Wei Han ⋅ Ailin Deng ⋅ Yufei He ⋅ Jiaying Wu ⋅ Yibo Li ⋅ Yue Liu ⋅ Bryan Hooi

Dec 5, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Recent advancements in AI agents have demonstrated their growing potential to drive and support scientific discovery. In this work, we introduce MLR-Bench, a comprehensive benchmark for evaluating AI agents on open-ended machine learning research. MLR-Bench includes three key components: (1) 201 research tasks sourced from NeurIPS, ICLR, and ICML workshops covering diverse ML topics; (2) MLR-Judge, an automated evaluation framework combining LLM-based reviewers with carefully designed review rubrics to assess research quality; and (3) MLR-Agent, a modular agent scaffold capable of completing research tasks through four stages: idea generation, proposal formulation, experimentation, and paper writing. Our framework supports both stepwise assessment across these distinct research stages, and end-to-end evaluation of the final research paper. We then use MLR-Bench to evaluate six frontier LLMs and an advanced coding agent, finding that while LLMs are effective at generating coherent ideas and well-structured papers, current coding agents frequently (e.g., in 80\% of the cases) produce fabricated or invalidated experimental results—posing a major barrier to scientific reliability. We validate MLR-Judge through human evaluation, showing high agreement with expert reviewers, supporting its potential as a scalable tool for research evaluation. We open-source MLR-Bench to help the community benchmark, diagnose, and improve AI research agents toward trustworthy and transparent scientific discovery.

View full details

Poster

The Common Pile v0.1: An 8TB Dataset of Public Domain and Openly Licensed Text

Nikhil Kandpal ⋅ Brian Lester ⋅ Colin Raffel ⋅ Sebastian Majstorovic ⋅ Stella Biderman ⋅ Baber Abbasi ⋅ Luca Soldaini ⋅ Enrico Shippole ⋅ A. Feder Cooper ⋅ Aviya Skowron ⋅ Shayne Longpre ⋅ Lintang Sutawika ⋅ Alon Albalak ⋅ Zhenlin Xu ⋅ Guilherme Penedo ⋅ Loubna Ben allal ⋅ Elie Bakouch ⋅ John Pressman ⋅ Honglu Fan ⋅ Dashiell Stander ⋅ Guangyu Song ⋅ Aaron Gokaslan ⋅ John Kirchenbauer ⋅ Tom Goldstein ⋅ Brian Bartoldson ⋅ Bhavya Kailkhura ⋅ Tyler Murray

Dec 5, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Large language models (LLMs) are typically trained on enormous quantities of unlicensed text, a practice that has led to scrutiny due to possible intellectual property infringement and ethical concerns. Training LLMs on openly licensed text presents a first step towards addressing these issues, but prior data collection efforts have yielded datasets too small or low-quality to produce performant LLMs. To address this gap, we collect, curate, and release the Common Pile v0.1, an eight terabyte collection of openly licensed text designed for LLM pretraining. The Common Pile comprises content from 30 sources that span diverse domains including research papers, code, books, encyclopedias, educational materials, audio transcripts, and more. Crucially, we validate our efforts by training two 7 billion parameter LLMs on text from the Common Pile: Comma v0.1-1T and Comma v0.1-2T, trained on 1 and 2 trillion tokens respectively. Both models attain competitive performance to LLMs trained on unlicensed text with similar computational budgets, such as Llama 1 and 2 7B. In addition to releasing the Common Pile v0.1 itself, we also release the code used in its creation as well as the training mixture and checkpoints for the Comma v0.1 models.

View full details

Poster

FAVOR-Bench: A Comprehensive Benchmark for Fine-Grained Video Motion Understanding

Chongjun Tu ⋅ Lin Zhang ⋅ pengtao chen ⋅ Peng Ye ⋅ Xianfang Zeng ⋅ Wei Cheng ⋅ Gang Yu ⋅ Tao Chen

Dec 5, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Multimodal Large Language Models (MLLMs) have shown impressive video content understanding capabilities but struggle with fine-grained motion comprehension. To comprehensively assess the motion understanding ability of existing MLLMs, we introduce FAVOR-Bench, which comprises 1,776 videos from both ego-centric and third-person perspectives and enables assessment through both close-ended and open-ended tasks. For close-ended evaluation, we carefully design 8,184 multiple-choice question-answer pairs spanning six distinct sub-tasks. For open-ended evaluation, we employ the GPT-assisted evaluation and develop a novel cost-efficient LLM-free assessment method, where the latter can enhance benchmarking interpretability and accessibility. Comprehensive experiments with21 state-of-the-art MLLMs reveal significant limitations in their ability to comprehend and describe detailed temporal dynamics in video motions. To alleviate this limitation, we further build FAVOR-Train, a dataset of 17,152 videos with fine-grained motion annotations. Finetuning Qwen2.5-VL on FAVOR-Train yields consistent improvements on motion-related tasks across TVBench, MotionBenchand our FAVOR-Bench. Our assessment results demonstrate that the proposed FAVOR-Bench and FAVOR-Train provide valuable tools for the community to develop more powerful video understanding models.

View full details

Poster

The Curse of Multi-Modalities: Evaluating Hallucinations of Large Multimodal Models across Language, Visual, and Audio

Sicong Leng ⋅ Yun Xing ⋅ Zesen Cheng ⋅ Yang Zhou ⋅ Hang Zhang ⋅ Xin Li ⋅ Deli Zhao ⋅ Shijian Lu ⋅ Chunyan Miao ⋅ Lidong Bing

Dec 5, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Recent advancements in large multimodal models (LMMs) have significantly enhanced performance across diverse tasks, with ongoing efforts to further integrate additional modalities such as video and audio. However, most existing LMMs remain vulnerable to hallucinations, the discrepancy between the factual multimodal input and the generated textual output, which has limited their applicability in various real-world scenarios. This paper presents the first systematic investigation of hallucinations in LMMs involving the three most common modalities: language, visual, and audio. Our study reveals two key contributors to hallucinations: overreliance on unimodal priors and spurious inter-modality correlations. To address these challenges, we introduce the benchmark The Curse of Multi-Modalities (CMM), which comprehensively evaluates hallucinations in LMMs, providing a detailed analysis of their underlying issues. Our findings highlight key vulnerabilities, including imbalances in modality integration and biases from training data, underscoring the need for balanced cross-modal learning and enhanced hallucination mitigation strategies. Based on our observations and findings, we suggest potential research directions that could enhance the reliability of LMMs.

View full details

Poster

Can LLMs Outshine Conventional Recommenders? A Comparative Evaluation

Qijiong Liu ⋅ Jieming Zhu ⋅ Lu Fan ⋅ Kun Wang ⋅ Hengchang Hu ⋅ Wei Guo ⋅ Yong Liu ⋅ Xiao-Ming Wu

Dec 5, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Integrating large language models (LLMs) into recommender systems has created new opportunities for improving recommendation quality. However, a comprehensive benchmark is needed to thoroughly evaluate and compare the recommendation capabilities of LLMs with traditional recommender systems. In this paper, we introduce \recbench{}, which systematically investigates various item representation forms (including unique identifier, text, semantic embedding, and semantic identifier) and evaluates two primary recommendation tasks, i.e., click-through rate prediction (CTR) and sequential recommendation (SeqRec). Our extensive experiments cover up to 17 large models and are conducted across five diverse datasets from fashion, news, video, books, and music domains. Our findings indicate that LLM-based recommenders outperform conventional recommenders, achieving up to a 5% AUC improvement in CTR and up to a 170% NDCG@10 improvement in SeqRec. However, these substantial performance gains come at the expense of significantly reduced inference efficiency, rendering LLMs impractical as real-time recommenders. We have released our code and data to enable other researchers to reproduce and build upon our experimental results.

View full details

Poster

Through the Lens: Benchmarking Deepfake Detectors Against Moiré-Induced Distortions

Razaib Tariq ⋅ Minji Heo ⋅ Shahroz Tariq ⋅ Simon Woo

Dec 5, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Deepfake detection remains a pressing challenge, particularly in real-world settings where smartphone-captured media from digital screens often introduces Moiré artifacts that can distort detection outcomes. This study systematically evaluates state-of-the-art (SOTA) deepfake detectors on Moiré-affected videos—an issue that has received little attention. We collected a dataset of 12,832 videos, spanning 35.64 hours, from Celeb-DF, DFD, DFDC, UADFV, and FF++ datasets, capturing footage under diverse real-world conditions, including varying screens, smartphones, lighting setups, and camera angles. To further examine the influence of Moiré patterns on deepfake detection, we conducted additional experiments using our DeepMoiréFake, referred to as (DMF) dataset, and two synthetic Moiré generation techniques. Across 15 top-performing detectors, our results show that Moiré artifacts degrade performance by as much as 25.4\%, while synthetically generated Moiré patterns lead to a 21.4\% drop in accuracy. Surprisingly, demoiréing methods, intended as a mitigation approach, instead worsened the problem, reducing accuracy by up to 16\%. These findings underscore the urgent need for detection models that can robustly handle Moiré distortions alongside other real-world challenges, such as compression, sharpening, and blurring. By introducing the DMF dataset, we aim to drive future research toward closing the gap between controlled experiments and practical deepfake detection.

View full details

Poster

Benchmarking End-To-End Performance of AI-Based Chip Placement Algorithms

Zhihai Wang ⋅ Zijie Geng ⋅ Zhaojie Tu ⋅ Jie Wang ⋅ Yuxi Qian ⋅ Zhexuan Xu ⋅ Ziyan Liu ⋅ Siyuan Xu ⋅ Zhentao Tang ⋅ Shixiong Kai ⋅ Mingxuan Yuan ⋅ Jianye Hao ⋅ Bin Li ⋅ Feng Wu

Dec 5, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Chip placement is a critical step in the Electronic Design Automation (EDA) workflow, which aims to arrange chip modules on the canvas to optimize the performance, power, and area (PPA) metrics of final designs.Recent advances show great potential of AI-based algorithms in chip placement.However, due to the lengthy EDA workflow, evaluations of these algorithms often focus on intermediate surrogate metrics, which are computationally efficient but often misalign with the final end-to-end performance (i.e., the final design PPA).To address this challenge, we propose to build ChiPBench, a comprehensive benchmark specifically designed to evaluate the effectiveness of AI-based algorithms in final design PPA metrics.Specifically, we generate a diverse evaluation dataset from $20$ circuits across various domains, such as CPUs, GPUs, and NPUs.We then evaluate six state-of-the-art AI-based chip placement algorithms on the dataset and conduct a thorough analysis of their placement behavior.Extensive experiments show that AI-based chip placement algorithms produce unsatisfactory final PPA results, highlighting the significant influence of often-overlooked factors like regularity and dataflow.We believe ChiPBench will effectively bridge the gap between academia and industry.

View full details

Poster

CLIMB: Class-imbalanced Learning Benchmark on Tabular Data

Zhining Liu ⋅ Zihao Li ⋅ Ze Yang ⋅ Tianxin Wei ⋅ Jian Kang ⋅ Yada Zhu ⋅ Hendrik Hamann ⋅ Jingrui He ⋅ Hanghang Tong

Dec 5, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Class-imbalanced learning (CIL) on tabular data is important in many real-world applications where the minority class holds the critical but rare outcomes. In this paper, we present CLIMB, a comprehensive benchmark for class-imbalanced learning on tabular data. CLIMB includes 73 real-world datasets across diverse domains and imbalance levels, along with unified implementations of 29 representative CIL algorithms. Built on a high-quality open-source Python package with unified API designs, detailed documentation, and rigorous code quality controls, CLIMB supports easy implementation and comparison between different CIL algorithms. Through extensive experiments, we provide practical insights on method accuracy and efficiency, highlighting the limitations of naive rebalancing, the effectiveness of ensembles, and the importance of data quality. Our code, documentation, and examples are available at https://github.com/ZhiningLiu1998/imbalanced-ensemble.

View full details

Poster

Decompile-Bench: Million-Scale Binary-Source Function Pairs for Real-World Binary Decompilation

hanzhuo tan ⋅ Xiaolong Tian ⋅ Hanrui Qi ⋅ Jiaming Liu ⋅ Siyi Wang ⋅ GAO Zuchen ⋅ Qi Luo ⋅ Jing Li ⋅ Yuqun Zhang

Dec 5, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Recent advances in LLM-based decompilers have been shown effective to convert low-level binaries into human-readable source code. However, there still lacks a comprehensive benchmark that provides large-scale binary-source function pairs, which is critical for advancing the LLM decompilation technology. Creating accurate binary-source mappings incurs severe issues caused by complex compilation settings and widespread function inlining that obscure the correspondence between binaries and their original source code. Previous efforts have either relied on used contest‐style benchmarks, synthetic binary–source mappings that diverge significantly from the mappings in real world, or partially matched binaries with only code lines or variable names, compromising the effectiveness of analyzing the binary functionality. To alleviate these issues, we introduce Decompile-Bench, the first open-source dataset comprising two million binary-source function pairs condensed from 100 million collected function pairs, i.e., 450GB of binaries compiled from permissively licensed GitHub projects. For the evaluation purposes, we also developed a benchmark Decompile-Bench-Eval including manually crafted binaries from the well-established HumanEval and MBPP, alongside the compiled GitHub repositories released after 2025 to mitigate data leakage issues. We further explore commonly-used evaluation metrics to provide a thorough assessment of the studied LLM decompilers and find that fine-tuning with Decompile-Bench causes a 20% improvement over previous benchmarks in terms of the re-executability rate. Our code and data has been released in HuggingFace and Github. https://github.com/anonepo/LLM4Decompile

View full details

Poster

Escaping the SpuriVerse: Can Large Vision-Language Models Generalize Beyond Seen Spurious Correlations?

Yiwei Yang ⋅ Chung Peng Lee ⋅ Shangbin Feng ⋅ Dora Zhao ⋅ Bingbing Wen ⋅ Anthony Liu ⋅ Yulia Tsvetkov ⋅ Bill Howe

Dec 5, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Spurious correlations occur when models rely on non-essential features that coincidentally co-vary with target labels, leading to incorrect reasoning under distribution shift. We consider spurious correlations in multi-modal Large Vision Language Models (LVLMs) pretrained on extensive and diverse datasets without explicit task supervision. We develop a benchmark by sourcing GPT-4o errors on real-world visual-question-answering (VQA) benchmarks, then curating a subset through LVLM-human annotation and synthetic counterfactual evaluation to identify errors caused by spurious correlations. This process yields SpuriVerse, a novel benchmark comprised of 124 distinct types of spurious correlations extracted from real-world datasets, each containing 1 realistic and 10 synthetic VQA samples for a total of 1364 multiple choice questions. We evaluate 15 open and closed-source LVLMs on SpuriVerse, finding that even state-of-the-art closed-source models struggle significantly, achieving at best only 35.0\% accuracy. Fine-tuning on synthetic examples that emphasize the spurious correlation improves performance to 78.4\%, suggesting that training on diverse spurious patterns generalizes to unseen situations: models appear to learn to avoid "shortcuts" and attend to the overall image context.

View full details

Poster

Knot So Simple: A Minimalistic Environment for Spatial Reasoning

Zizhao Chen ⋅ Yoav Artzi

Dec 5, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

We propose KnotGym, an interactive environment for complex, spatial reasoning and manipulation. KnotGym includes goal-oriented rope manipulation tasks with varying levels of complexity, all requiring acting from pure image observations.Tasks are defined along a clear and quantifiable axis of complexity based on the number of knot crossings, creating a natural generalization test.KnotGym has a simple observation space, allowing for scalable development, yet it highlights core challenges in integrating acute perception, spatial reasoning, and grounded manipulation.We evaluate methods of different classes, including model-based RL, model-predictive control, and chain-of-thought reasoning, and illustrate the challenges KnotGym presents.

View full details

Poster

MS-Bench: Evaluating LMMs in Ancient Manuscript Study through a Dunhuang Case Study

Yuqing Zhang ⋅ Yue Han ⋅ Shuanghe Zhu ⋅ Haoxiang Wu ⋅ Hangqi Li ⋅ Shengyu Zhang ⋅ Junchi Yan ⋅ Zemin Liu ⋅ Kun Kuang ⋅ Huaiyong Dou ⋅ Yongquan Zhang ⋅ Fei Wu

Dec 5, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Analyzing ancient manuscripts has traditionally been a labor-intensive and time-consuming task for philologists. While recent advancements in LMMs have demonstrated their potential across diverse domains, their effectiveness in manuscript study remains underexplored. In this paper, we introduce MS-Bench, the first comprehensive benchmark co-developed with archaeologists, comprising 5,076 high-resolution images from 4th to 14th century and 9,982 expert-curated questions across nine sub-tasks aligned with archaeological workflows. Through four prompting strategies, we systematically evaluate 32 LMMs on their effectiveness, robustness, and cultural contextualization. Our analysis reveals scale-driven performance and reliability improvements, prompting strategies' impact on performance (CoT has two-sides effect, while visual retrieval-augmented prompts provide consistent boost), and task-specific preferences depending on LMM’s visual capabilities. Although current LMMs are not yet capable of replacing domain expertise, they demonstrate promising potential to accelerate manuscript research through future human–AI collaboration.

View full details

Poster

AbstentionBench: Reasoning LLMs Fail on Unanswerable Questions

Polina Kirichenko ⋅ Mark Ibrahim ⋅ Kamalika Chaudhuri ⋅ Samuel J. Bell

Dec 5, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

For Large Language Models (LLMs) to be reliably deployed in both everyday and high-stakes domains, knowing when not to answer is equally critical as answering correctly.Real-world user queries, which can be underspecified, ill-posed, or fundamentally unanswerable, require LLMs to reason about uncertainty and selectively abstain---i.e., refuse to answer definitively.However, abstention remains understudied, without a systematic evaluation framework for modern LLMs.In this work, we introduce AbstentionBench: a large-scale benchmark for holistically evaluating abstention across 20 diverse datasets, including questions with unknown answers, underspecification, false premises, subjective interpretations, and outdated information.Evaluating 20 frontier LLMs reveals abstention is an unsolved problem, and one where scaling models is of little use.While recent reasoning LLMs have shown impressive results in complex problem solving, surprisingly, we find that reasoning fine-tuning degrades abstention (by 24\% on average), even for math and science domains on which reasoning models are explicitly trained.We find that while a carefully crafted system prompt can boost abstention in practice, it does not resolve models’ fundamental inability to reason about uncertainty.We release AbstentionBench to foster research into advancing LLM reliability.

View full details

Poster

nvBench 2.0: Resolving Ambiguity in Text-to-Visualization through Stepwise Reasoning

Tianqi Luo ⋅ Chuhan Huang ⋅ Leixian Shen ⋅ Boyan Li ⋅ Shuyu Shen ⋅ Wei Zeng ⋅ Nan Tang ⋅ Yuyu Luo

Dec 5, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Text-to-Visualization (Text2VIS) enables users to create visualizations from natural language queries, making data insights more accessible. However, Text2VIS faces challenges in interpreting ambiguous queries, as users often express their visualization needs in imprecise language. To address this challenge, we introduce nBench 2.0, a new benchmark designed to evaluate Text2VIS systems in scenarios involving ambiguous queries. nvBench 2.0 includes 7,878 natural language queries and 24,076 corresponding visualizations, derived from 780 tables across 153 domains. It is built using a controlled ambiguity-injection pipeline that generates ambiguous queries through a reverse-generation workflow. By starting with unambiguous seed visualizations and selectively injecting ambiguities, the pipeline yields multiple valid interpretations for each query, with each ambiguous query traceable to its corresponding visualization through step-wise reasoning paths.We evaluate various Large Language Models (LLMs) on their ability to perform ambiguous Text2VIS tasks using nBench 2.0. We also propose Step-Text2Vis, an LLM-based model trained on nvBench 2.0, which enhances performance in ambiguous scenarios through step-wise preference optimization. Our results show that Step-Text2Vis outperforms all baselines, setting a new state-of-the-art for ambiguous Text2VIS tasks. Our source code and data are available at https://nvbench2.github.io/

View full details

Poster

V2X-Radar: A Multi-modal Dataset with 4D Radar for Cooperative Perception

Lei Yang ⋅ Xinyu Zhang ⋅ Jun Li ⋅ Chen Wang ⋅ Jiaqi Ma ⋅ Zhiying Song ⋅ Tong Zhao ⋅ Ziying Song ⋅ Li Wang ⋅ Mo Zhou ⋅ Yang Shen ⋅ Kai WU ⋅ Chen Lv

Dec 5, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Modern autonomous vehicle perception systems often struggle with occlusions and limited perception range. Previous studies have demonstrated the effectiveness of cooperative perception in extending the perception range and overcoming occlusions, thereby enhancing the safety of autonomous driving. In recent years, a series of cooperative perception datasets have emerged; however, these datasets primarily focus on cameras and LiDAR, neglecting 4D Radar—a sensor used in single-vehicle autonomous driving to provide robust perception in adverse weather conditions. In this paper, to bridge the gap created by the absence of 4D Radar datasets in cooperative perception, we present V2X-Radar, the first large-scale, real-world multi-modal dataset featuring 4D Radar. V2X-Radar dataset is collected using a connected vehicle platform and an intelligent roadside unit equipped with 4D Radar, LiDAR, and multi-view cameras. The collected data encompasses sunny and rainy weather conditions, spanning daytime, dusk, and nighttime, as well as various typical challenging scenarios. The dataset consists of 20K LiDAR frames, 40K camera images, and 20K 4D Radar data, including 350K annotated boxes across five categories. To support various research domains, we have established V2X-Radar-C for cooperative perception, V2X-Radar-I for roadside perception, and V2X-Radar-V for single-vehicle perception. Furthermore, we provide comprehensive benchmarks across these three sub-datasets.

View full details

Poster

URB - Urban Routing Benchmark for RL-equipped Connected Autonomous Vehicles

Ahmet Onur Akman ⋅ Anastasia Psarou ⋅ Michał Hoffmann ⋅ Łukasz Gorczyca ⋅ Lukasz Kowalski ⋅ Paweł Gora ⋅ Grzegorz Jamróz ⋅ Rafal Kucharski

Dec 5, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Connected Autonomous Vehicles (CAVs) promise to reduce congestion in future urban networks, potentially by optimizing their routing decisions. Unlike for human drivers, these decisions can be made with collective, data-driven policies, developed using machine learning algorithms. Reinforcement learning (RL) can facilitate the development of such collective routing strategies, yet standardized and realistic benchmarks are missing. To that end, we present $\texttt{URB}$: Urban Routing Benchmark for RL-equipped Connected Autonomous Vehicles. $\texttt{URB}$ is a comprehensive benchmarking environment that unifies evaluation across 29 real-world traffic networks paired with realistic demand patterns. $\texttt{URB}$ comes with a catalog of predefined tasks, multi-agent RL (MARL) algorithm implementations, three baseline methods, domain-specific performance metrics, and a modular configuration scheme. Our results show that, despite the lengthy and costly training, state-of-the-art MARL algorithms rarely outperformed humans. The experimental results reported in this paper initiate the first leaderboard for MARL in large-scale urban routing optimization. They reveal that current approaches struggle to scale, emphasizing the urgent need for advancements in this domain.

View full details

Poster

PF∆: A Benchmark Dataset for Power Flow under Load, Generation, and Topology Variations

Ana Rivera Him ⋅ Anvita Bhagavathula ⋅ Alvaro Carbonero ⋅ Priya Donti

Dec 5, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Power flow (PF) calculations are the backbone of real-time grid operations, across workflows such as contingency analysis (where repeated PF evaluations assess grid security under outages) and topology optimization (which involves PF-based searches over combinatorially large action spaces). Running these calculations at operational timescales or across large evaluation spaces remains a major computational bottleneck. Additionally, growing uncertainty in power system operations from the integration of renewables and climate-induced extreme weather also calls for tools that can accurately and efficiently simulate a wide range of scenarios and operating conditions. Machine learning methods offer a potential speedup over traditional solvers, but their performance has not been systematically assessed on benchmarks that capture real-world variability. This paper introduces PF∆, a benchmark dataset for power flow that captures diverse variations in load, generation, and topology. PF∆ contains 859,800 solved power flow instances spanning six different bus system sizes, capturing three types of contingency scenarios (N , N –1, and N –2), and including close-to-infeasible cases near steady-state voltage stability limits. We evaluate traditional solvers and GNN-based methods, highlighting key areas where existing approaches struggle, and identifying open problems for future research. Our dataset is available at https://huggingface.co/datasets/pfdelta/pfdelta/tree/main and our code with data generation scripts and model implementations is at https: //github.com/MOSSLab-MIT/pfdelta

View full details

Poster

TAPVid-360: Tracking Any Point in 360 from Narrow Field of View Video

Finlay Hudson ⋅ James Gardner ⋅ William Smith

Dec 5, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Humans excel at constructing panoramic mental models of their surroundings, maintaining object permanence and inferring scene structure beyond visible regions. In contrast, current artificial vision systems struggle with persistent, panoramic understanding, often processing scenes egocentrically on a frame-by-frame basis. This limitation is pronounced in the Track Any Point (TAP) task, where existing methods fail to track 2D points outside the field of view. To address this, we introduce TAP-Vid 360, a novel task that requires predicting the 3D direction to queried scene points across a video sequence, even when far outside the narrow field of view of the observed video. This task fosters learning allocentric scene representations without needing dynamic 4D ground truth scene models for training. Instead, we exploit 360 videos as a source of supervision, resampling them into narrow field-of-view perspectives while computing ground truth directions by tracking points across the full panorama using a 2D pipeline. We introduce a new dataset and benchmark, TAP360-10k comprising 10k perspective videos with ground truth directional point tracking. Our baseline adapts CoTracker v3 to predict per-point rotations for direction updates, outperforming existing TAP and TAP-Vid 3D methods.

View full details

Poster

Fantastic Bugs and Where to Find Them in AI Benchmarks

Sang Truong ⋅ Yuheng Tu ⋅ Michael Hardy ⋅ Anka Reuel-Lamparth ⋅ Zeyu Tang ⋅ Jirayu Burapacheep ⋅ Jonathan Perera ⋅ Chibuike Uwakwe ⋅ Benjamin Domingue ⋅ Nick Haber ⋅ Sanmi Koyejo

Dec 5, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Benchmarks are pivotal in driving AI progress, and invalid benchmark questions frequently undermine their reliability. Manually identifying and correcting errors among thousands of benchmark questions is not only infeasible but also a critical bottleneck for reliable evaluation. In this work, we introduce a framework for systematic benchmark revision that leverages statistical analysis of response patterns to flag potentially invalid questions for further expert review. Our approach builds on a core assumption commonly used in AI evaluations that the mean score sufficiently summarizes model performance. This implies a unidimensional latent construct underlying the measurement experiment, yielding expected ranges for various statistics for each item. When empirically estimated values for these statistics fall outside the expected range for an item, the item is more likely to be problematic. Across nine widely used benchmarks, our method guides expert review to identify problematic questions with up to 84\% precision. In addition, we introduce an LLM‑judge first pass to review questions, further reducing human effort. Together, these components provide an efficient and scalable framework for systematic benchmark revision.

View full details

Poster

Long-term Intracortical Neural activity and Kinematics (LINK): An intracortical neural dataset for chronic brain-machine interfaces, neuroscience, and machine learning

Hisham Temmar ⋅ Yixuan Wang ⋅ Nina Gill ⋅ Nicholas Mellon ⋅ Chang Liu ⋅ Luis Cubillos ⋅ Rio Parsons ⋅ Joseph Costello ⋅ Matteo Ceradini ⋅ Madison Kelberman ⋅ Matthew Mender ⋅ Aren Hite ⋅ Dylan Wallace ⋅ Samuel Nason-Tomaszewski ⋅ Parag Patil ⋅ Matt Willsey ⋅ Anne Draelos ⋅ Cynthia Chestek

Dec 5, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Intracortical brain-machine interfaces (iBMIs) have enabled movement and speech in people living with paralysis by using neural data to decode behaviors in real-time. However, intracortical neural recordings exhibit significant instabilities over time, which poses problems for iBMIs, neuroscience, and machine learning. For iBMIs, neural instabilities require frequent decoder recalibration to maintain high performance, a critical bottleneck for real-world translation. Several approaches have been developed to address this issue, and the field has recognized the need for standardized datasets on which to compare them, but no standard dataset exists for evaluation over year-long timescales. In neuroscience, a growing body of research attempts to elucidate the latent computations performed by populations of neurons. Nonstationarity in neural recordings imposes significant challenges to the design of these studies, so a dataset containing recordings over large time spans would improve methods to account for instabilities. In machine learning, continuous domain adaptation of temporal data is an area of active research, and a dataset containing shift distributions on long time scales would be beneficial to researchers. To address these gaps, we present the LINK Dataset (Long-term Intracortical Neural activity and Kinematics), which contains intracortical spiking activity and kinematic data from 312 sessions of a non-human primate performing a dexterous, 2 degree-of-freedom finger movement task, spanning 1,242 days. We also present longitudinal analyses of the dataset’s neural spiking activity and its relationship to kinematics, as well as overall decoding performance using linear and neural network models. The LINK dataset (https://dandiarchive.org/dandiset/001201) and code (https://github.com/chesteklab/LINK_dataset) are freely available to the public.

View full details

Poster

EPFL-Smart-Kitchen: An Ego-Exo Multi-Modal Dataset for Challenging Action and Motion Understanding in Video-Language Models

Andy Bonnetto ⋅ Haozhe Qi ⋅ Franklin Leong ⋅ Matea Tashkovska ⋅ Mahdi Rad ⋅ Solaiman Shokur ⋅ Friedhelm C. Hummel ⋅ Silvestro Micera ⋅ Marc Pollefeys ⋅ Alexander Mathis

Dec 5, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Understanding behavior requires datasets that capture humans while carrying out complex tasks. The kitchen is an excellent environment for assessing human motor and cognitive function, as many complex actions are naturally exhibited in kitchens from chopping to cleaning. Here, we introduce the EPFL-Smart-Kitchen-30 dataset, collected in a noninvasive motion capture platform inside a kitchen environment. Nine static RGB-D cameras, inertial measurement units (IMUs) and one head-mounted HoloLens~2 headset were used to capture 3D hand, body, and eye movements. The EPFL-Smart-Kitchen-30 dataset is a multi-view action dataset with synchronized exocentric, egocentric, depth, IMUs, eye gaze, body and hand kinematics spanning 29.7 hours of 16 subjects cooking four different recipes. Action sequences were densely annotated with 33.78 action segments per minute. Leveraging this multi-modal dataset, we propose four benchmarks to advance behavior understanding and modeling through 1) a vision-language benchmark, 2) a semantic text-to-motion generation benchmark, 3) a multi-modal action recognition benchmark, 4) a pose-based action segmentation benchmark. We expect the EPFL-Smart-Kitchen-30 dataset to pave the way for better methods as well as insights to understand the nature of ecologically-valid human behavior. Code and data are available at https://amathislab.github.io/EPFL-Smart-Kitchen

View full details

Poster

3D-RAD: A Comprehensive 3D Radiology Med-VQA Dataset with Multi-Temporal Analysis and Diverse Diagnostic Tasks

Xiaotang Gai ⋅ Jiaxiang Liu ⋅ Yichen Li ⋅ Zijie Meng ⋅ Jian Wu ⋅ Zuozhu Liu

Dec 5, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Medical Visual Question Answering (Med-VQA) holds significant potential for clinical decision support, yet existing efforts primarily focus on 2D imaging with limited task diversity. This paper presents 3D-RAD, a large-scale dataset designed to advance 3D Med-VQA using radiology CT scans. The 3D-RAD dataset encompasses six diverse VQA tasks: anomaly detection, image observation, medical computation, existence detection, static temporal diagnosis, and longitudinal temporal diagnosis. It supports both open- and closed-ended questions while introducing complex reasoning challenges, including computational tasks and multi-stage temporal analysis, to enable comprehensive benchmarking. Extensive evaluations demonstrate that existing vision-language models (VLMs), especially medical VLMs exhibit limited generalization, particularly in multi-temporal tasks, underscoring the challenges of real-world 3D diagnostic reasoning. To drive future advancements, we release a high-quality training set 3D-RAD-T of 136,195 expert-aligned samples, showing that fine-tuning on this dataset could significantly enhance model performance. Our dataset and code, aiming to catalyze multimodal medical AI research and establish a robust foundation for 3D medical visual understanding, are publicly available.

View full details

Poster

CaMiT: A Time-Aware Car Model Dataset for Classification and Generation

Frédéric Lin ⋅ Biruk Abere Ambaw ⋅ Adrian Popescu ⋅ Hejer AMMAR ⋅ Romaric Audigier ⋅ Hervé Le Borgne

Dec 5, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

AI systems must adapt to the evolving visual landscape, especially in domains where object appearance shifts over time. While prior work on time-aware vision models has primarily addressed commonsense-level categories, we introduce Car Models in Time (CaMiT). This fine-grained dataset captures the temporal evolution of this representative subset of technological artifacts. CaMiT includes 787K labeled samples of 190 car models (2007–2023) and 5.1M unlabeled samples (2005–2023), supporting supervised and self-supervised learning. We show that static pretraining on in-domain data achieves competitive performance with large-scale generalist models, offering a more resource-efficient solution. However, accuracy degrades when testing a year's models backward and forward in time. To address this, we evaluate CaMiT in a time-incremental classification setting, a realistic continual learning scenario with emerging, evolving, and disappearing classes. We investigate two mitigation strategies: time-incremental pretraining, which updates the backbone model, and time-incremental classifier learning, which updates the final classification layer, with positive results in both cases. Finally, we introduce time-aware image generation by consistently using temporal metadata during training. Results indicate improved realism compared to standard generation. CaMiT provides a rich resource for exploring temporal adaptation in a fine-grained visual context for discriminative and generative AI systems.

View full details

Poster

Mars-Bench: A Benchmark for Evaluating Foundation Models for Mars Science Tasks

Mirali Purohit ⋅ Bimal Gajera ⋅ Vatsal Malaviya ⋅ Irish Mehta ⋅ Kunal Kasodekar ⋅ Jacob Adler ⋅ Steven Lu ⋅ Umaa Rebbapragada ⋅ Hannah Kerner

Dec 5, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Foundation models have enabled rapid progress across many specialized domains by leveraging large-scale pre-training on unlabeled data, demonstrating strong generalization to a variety of downstream tasks. While such models have gained significant attention in fields like Earth Observation, their application to Mars science remains limited. A key enabler of progress in other domains has been the availability of standardized benchmarks that support systematic evaluation. In contrast, Mars science lacks such benchmarks and standardized evaluation frameworks, which have limited progress toward developing foundation models for Martian tasks. To address this gap, we introduce Mars-Bench, the first benchmark designed to systematically evaluate models across a broad range of Mars-related tasks using both orbital and surface imagery. Mars-Bench comprises 20 datasets spanning classification, segmentation, and object detection, focused on key geologic features such as craters, cones, boulders, and frost. We provide standardized, ready-to-use datasets and baseline evaluations using models pre-trained on natural images, Earth satellite data, and state-of-the-art vision-language models. Results from all analyses suggest that Mars-specific foundation models may offer advantages over general-domain counterparts, motivating further exploration of domain-adapted pre-training. Mars-Bench aims to establish a standardized foundation for developing and comparing machine learning models for Mars science. Our data, models, and code are available at: https://mars-bench.github.io/.

View full details

Poster

OpenLex3D: A Tiered Benchmark for Open-Vocabulary 3D Scene Representations

Christina Kassab ⋅ Sacha Morin ⋅ Martin Büchner ⋅ Matias Mattamala ⋅ Kumaraditya Gupta ⋅ Abhinav Valada ⋅ Liam Paull ⋅ Maurice Fallon

Dec 5, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

3D scene understanding has been transformed by open-vocabulary language models that enable interaction via natural language. However, at present the evaluation of these representations is limited to datasets with closed-set semantics that do not capture the richness of language. This work presents OpenLex3D, a dedicated benchmark for evaluating 3D open-vocabulary scene representations. OpenLex3D provides entirely new label annotations for scenes from Replica, ScanNet++, and HM3D, which capture real-world linguistic variability by introducing synonymical object categories and additional nuanced descriptions. Our label sets provide 13 times more labels per scene than the original datasets. By introducing an open-set 3D semantic segmentation task and an object retrieval task, we evaluate various existing 3D open-vocabulary methods on OpenLex3D, showcasing failure cases, and avenues for improvement. Our experiments provide insights on feature precision, segmentation, and downstream capabilities. The benchmark is publicly available at: https://openlex3d.github.io/.

View full details

Poster

Generalizing Verifiable Instruction Following

Valentina Pyatkin ⋅ Saumya Malik ⋅ Victoria Graf ⋅ Hamish Ivison ⋅ Shengyi Huang ⋅ Pradeep Dasigi ⋅ Nathan Lambert ⋅ Hanna Hajishirzi

Dec 5, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

A crucial factor for successful human and AI interaction is the ability of language models or chatbots to follow human instructions precisely. A common feature of instructions are output constraints like ``only answer with yes or no" or ``mention the word `abracadabra' at least 3 times" that the user adds to craft a more useful answer.Even today's strongest models struggle with fulfilling such constraints. We find that most models strongly overfit on a small set of verifiable constraints from the benchmarks that test these abilities, a skill called precise instruction following, and are not able to generalize well to unseen output constraints. We introduce a new benchmark, IFBench, to evaluate precise instruction following generalization on 58 new, diverse, and challenging verifiable out-of-domain constraints. In addition, we perform an extensive analysis of how and on what data models can be trained to improve precise instruction following generalization. Specifically, we carefully design constraint verification modules and show that reinforcement learning with verifiable rewards (RLVR) significantly improves instruction following. In addition to IFBench, we release 29 additional new hand-annotated training constraints and verification functions, RLVR training prompts, and code.

View full details

Poster

BurstDeflicker: A Benchmark Dataset for Flicker Removal in Dynamic Scenes

Lishen Qu ⋅ Zhihao Liu ⋅ Shihao Zhou ⋅ LUO YAQI ⋅ Jie Liang ⋅ Hui Zeng ⋅ Lei Zhang ⋅ Jufeng Yang

Dec 5, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Flicker artifacts in short-exposure images are caused by the interplay between the row-wise exposure mechanism of rolling shutter cameras and the temporal intensity variations of alternating current (AC)-powered lighting. These artifacts typically appear as uneven brightness distribution across the image, forming noticeable dark bands. Beyond compromising image quality, this structured noise also affects high-level tasks, such as object detection and tracking, where reliable lighting is crucial. Despite the prevalence of flicker, the lack of a large-scale, realistic dataset has been a significant barrier to advancing research in flicker removal. To address this issue, we present BurstDeflicker, a scalable benchmark constructed using three complementary data acquisition strategies. First, we develop a Retinex-based synthesis pipeline that redefines the goal of flicker removal and enables controllable manipulation of key flicker-related attributes (e.g., intensity, area, and frequency), thereby facilitating the generation of diverse flicker patterns. Second, we capture 4,000 real-world flicker images from different scenes, which help the model better understand the spatial and temporal characteristics of real flicker artifacts and generalize more effectively to wild scenarios. Finally, due to the non-repeatable nature of dynamic scenes, we propose a green-screen method to incorporate motion into image pairs while preserving real flicker degradation. Comprehensive experiments demonstrate the effectiveness of our dataset and its potential to advance research in flicker removal.

View full details

Poster

Roboflow100-VL: A Multi-Domain Object Detection Benchmark for Vision-Language Models

Matvei Popov ⋅ Peter Robicheaux ⋅ Anish Madan ⋅ Isaac Robinson ⋅ Joseph Nelson ⋅ Deva Ramanan ⋅ Neehar Peri

Dec 5, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Vision-language models (VLMs) trained on internet-scale data achieve remarkable zero-shot detection performance on common objects like car, truck, and pedestrian. However, state-of-the-art models still struggle to generalize to out-of-distribution classes, tasks and imaging modalities not typically found in their pre-training. Rather than simply re-training VLMs on more visual data, we argue that one should align VLMs to new concepts with annotation instructions containing a few visual examples and rich textual descriptions. To this end, we introduce Roboflow100-VL, a large-scale collection of 100 multi-modal object detection datasets with diverse concepts not commonly found in VLM pre-training. We evaluate state-of-the-art models on our benchmark in zero-shot, few-shot, semi-supervised, and fully-supervised settings, allowing for comparison across data regimes. Notably, we find that VLMs like GroundingDINO and Qwen2.5-VL achieve less than 2% zero-shot accuracy on challenging medical imaging datasets within Roboflow100-VL, demonstrating the need for few-shot concept alignment. Lastly, we discuss our recent CVPR 2025 Foundational FSOD competition and share insights from the community. Notably, the winning team significantly outperforms our baseline by 17 mAP! Our code and dataset are available on GitHub and Roboflow.

View full details

Poster

MTBBench: A Multimodal Sequential Clinical Decision-Making Benchmark in Oncology

Kiril Vasilev ⋅ Alexandre Misrahi ⋅ Eeshaan Jain ⋅ Phil F Cheng ⋅ Petros Liakopoulos ⋅ Olivier Michielin ⋅ Michael Moor ⋅ Charlotte Bunne

Dec 5, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Multimodal Large Language Models (LLMs) hold promise for biomedical reasoning, but current benchmarks fail to capture the complexity of real-world clinical workflows. Existing evaluations primarily assess unimodal, decontextualized question-answering, overlooking multi-agent decision-making environments such as Molecular Tumor Boards (MTBs). MTBs bring together diverse experts in oncology, where diagnostic and prognostic tasks require integrating heterogeneous data and evolving insights over time. Current benchmarks lack this longitudinal and multimodal complexity. We introduce **MTBBench**, an agentic benchmark simulating MTB-style decision-making through clinically challenging, multimodal, and longitudinal oncology questions. Ground truth annotations are validated by clinicians via a co-developed app, ensuring clinical relevance. We benchmark multiple open and closed-source LLMs and show that, even at scale, they lack reliability—frequently hallucinating, struggling with reasoning from time-resolved data, and failing to reconcile conflicting evidence or different modalities. To address these limitations, MTBBench goes beyond benchmarking by providing an agentic framework with foundation model-based tools that enhance multi-modal and longitudinal reasoning, leading to task-level performance gains of up to 9.0% and 11.2%, respectively. Overall, MTBBench offers a challenging and realistic testbed for advancing multimodal LLM reasoning, reliability, and tool-use with a focus on MTB environments in precision oncology.

View full details

Poster

VADB: A Large-Scale Video Aesthetic Database with Professional and Multi-Dimensional Annotations

Qianqian Qiao ⋅ DanDan Zheng ⋅ Yihang Bo ⋅ Bao Peng ⋅ Heng Huang ⋅ Longteng Jiang ⋅ HuayeWang ⋅ Jingdong Chen ⋅ Jun Zhou ⋅ Xin Jin

Dec 5, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Video aesthetic assessment, a vital area in multimedia computing, integrates computer vision with human cognition. Its progress is limited by the lack of standardized datasets and robust models, as the temporal dynamics of video and multimodal fusion challenges hinder direct application of image-based methods. This study introduces VADB, the largest video aesthetic database with 10,490 diverse videos annotated by 37 professionals across multiple aesthetic dimensions, including overall and attribute-specific aesthetic scores, rich language comments and objective tags. We propose VADB-Net, a dual-modal pre-training framework with a two-stage training strategy, which outperforms existing video quality assessment models in scoring tasks and supports downstream video aesthetic assessment tasks. The dataset and source code are available at https://github.com/BestiVictory/VADB.

View full details

Poster

Diffusion Classifiers Understand Compositionality, but Conditions Apply

Yujin Jeong ⋅ Arnas Uselis ⋅ Seong Joon Oh ⋅ Anna Rohrbach

Dec 5, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Understanding visual scenes is fundamental to human intelligence. While discriminative models have significantly advanced computer vision, they often struggle with compositional understanding. In contrast, recent generative text-to-image diffusion models excel at synthesizing complex scenes, suggesting inherent compositional capabilities.Building on this, zero-shot diffusion classifiers have been proposed to repurpose diffusion models for discriminative tasks. While prior work offered promising results in discriminative compositional scenarios, these results remain preliminary due to a small number of benchmarks and a relatively shallow analysis of conditions under which the models succeed. To address this, we present a comprehensive study of the discriminative capabilities of diffusion classifiers on a wide range of compositional tasks. Specifically, our study covers three diffusion models (SD 1.5, 2.0, and, for the first time, 3-m) spanning 10 datasets and over 30 tasks. Further, we shed light on the role that target dataset domains play in respective performance; to isolate the domain effects, we introduce a new diagnostic benchmark \textsc{Self-Bench} comprised of images created by diffusion models themselves. Finally, we explore the importance of timestep weighting and uncover a relationship between domain gap and timestep sensitivity, particularly for SD3-m.To sum up, diffusion classifiers understand compositionality, but conditions apply! Code and dataset are available at https://github.com/eugene6923/Diffusion-Classifiers-Compositionality

View full details

Poster

UniHG: A Large-scale Universal Heterogeneous Graph Dataset and Benchmark for Representation Learning and Cross-Domain Transferring

Yide Qiu ⋅ Tong Zhang ⋅ Shaoxiang Ling ⋅ Xing Cai ⋅ Ziqi Gu ⋅ Zhen Cui

Dec 5, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Irregular data in the real world are usually organized as heterogeneous graphs consisting of multiple types of nodes and edges. However, current heterogeneous graph research confronts three fundamental challenges: i) Benchmark Deficiency, ii) Semantic Disalignment, and iii) Propagation Degradation. In this paper, we construct a large-scale, universal, and joint multi-domain heterogeneous graph dataset named UniHG to facilitate heterogeneous graph representation learning and cross-domain knowledge mining. Overall, UniHG contains 77.31 million nodes and 564 million directed edges with thousands of labels and attributes, which is currently the largest universal heterogeneous graph dataset available to the best of our knowledge. To perform effective learning and provide comprehensively benchmarks on UniHG , two key measures are taken, including i) the semantic alignment strategy for multi-attribute entities, which projects the feature description of multi-attribute nodes and edges into a common embedding space to facilitate information aggregation; ii) proposing the novel Heterogeneous Graph Decoupling (HGD) framework with a specifically designed Anisotropy Feature Propagation (AFP) module for learning effective multi-hop anisotropic propagation kernels. These two strategies enable efficient information propagation among a tremendous number of multi-attribute entities and meanwhile mine multi-attribute association adaptively through the multi-hop aggregation in large-scale heterogeneous graphs. Comprehensive benchmark results demonstrate that our model significantly outperforms existing methods with an accuracy improvement of 28.93\%. And the UniHG can facilitate downstream tasks, achieving an NDCG@20 improvement rate of 11.48\% and 11.71\%. The UniHG dataset and benchmark codes have been released at https://anonymous.4open.science/r/UniHG-AA78.

View full details

Poster

FreshStack: Building Realistic Benchmarks for Evaluating Retrieval on Technical Documents

Nandan Thakur ⋅ Jimmy Lin ⋅ Samuel Havens ⋅ Michael Carbin ⋅ Omar Khattab ⋅ Andrew Drozdov

Dec 5, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

We introduce FreshStack, a holistic framework for automatically building information retrieval (IR) evaluation benchmarks by incorporating challenging questions and answers. FreshStack conducts the following steps:(1) automatic corpus collection from code and technical documentation,(2) nugget generation from community-asked questions and answers, and(3) nugget-level support, retrieving documents using a fusion of retrieval techniques and hybrid architectures.We use FreshStack to build five datasets on fast-growing, recent, and niche domains to ensure the tasks are sufficiently challenging. On FreshStack, existing retrieval models, when applied out-of-the-box, significantly underperform oracle approaches on all five domains, denoting plenty of headroom to improve IR quality. In addition, we identify cases where rerankers do not improve first-stage retrieval accuracy (two out of five domains) and oracle context helps an LLM generator generate a high-quality RAG answer.We hope FreshStack will facilitate future work toward constructing realistic, scalable, and uncontaminated IR and RAG evaluation benchmarks.

View full details

Poster

Alchemist: Turning Public Text-to-Image Data into Generative Gold

Valerii Startsev ⋅ Alexander Ustyuzhanin ⋅ Alexey Kirillov ⋅ Dmitry Baranchuk ⋅ Sergey Kastryulin

Dec 5, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Pre-training equips text-to-image (T2I) models with broad world knowledge, but this alone is often insufficient to achieve high aesthetic quality and alignment. Consequently, supervised fine-tuning (SFT) is crucial for further refinement. However, its effectiveness highly depends on the quality of the fine-tuning dataset.Existing public SFT datasets frequently target narrow domains (e.g., anime or specific art styles), and the creation of high-quality, general-purpose SFT datasets remains a significant challenge.Current curation methods are often costly and struggle to identify truly impactful samples.This challenge is further complicated by the scarcity of public general-purpose datasets, as leading models often rely on large, proprietary, and poorly documented internal data, hindering broader research progress.This paper introduces a novel methodology for creating general-purpose SFT datasets by leveraging a pre-trained generative model as an estimator of high-impact training samples. We apply this methodology to construct and release Alchemist, a compact (3,350 samples) yet highly effective SFT dataset. Experiments demonstrate that Alchemist substantially improves the generative quality of five public T2I models while preserving diversity and style. Additionally, we release the fine-tuned models' weights to the public.

View full details

Poster

DroneAudioset: An Audio Dataset for Drone-based Search and Rescue

Chitralekha Gupta ⋅ Soundarya Ramesh ⋅ Praveen Sasikumar ⋅ Kian Yeo ⋅ Suranga Nanayakkara

Dec 5, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Unmanned Aerial Vehicles (UAVs) or drones, are increasingly used in search and rescue missions to detect human presence. Existing systems primarily leverage vision-based methods which are prone to fail under low-visibility or occlusion. Drone-based audio perception offers promise but suffers from extreme ego-noise that masks sounds indicating human presence. Existing datasets are either limited in diversity or synthetic, lacking real acoustic interactions, and there are no standardized setups for drone audition. To this end, we present DroneAudioset (The dataset is publicly available at https://huggingface.co/datasets/ahlab-drone-project/DroneAudioSet/ under the MIT license), a comprehensive drone audition dataset featuring 23.5 hours of annotated recordings, covering a wide range of signal-to-noise ratios (SNRs) from -57.2 dB to -2.5 dB, across various drone types, throttles, microphone configurations as well as environments. The dataset enables development and systematic evaluation of noise suppression and classification methods for human-presence detection under challenging conditions, while also informing practical design considerations for drone audition systems, such as microphone placement trade-offs, and development of drone noise-aware audio processing. This dataset is an important step towards enabling design and deployment of drone-audition systems.

View full details

Poster

Bubbleformer: Forecasting Boiling with Transformers

Sheikh Md Shakeel Hassan ⋅ Xianwei Zou ⋅ Akash Dhruv ⋅ Aparna Chandramowlishwaran

Dec 5, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Modeling boiling---an inherently chaotic, multiphase process central to energy and thermal systems---remains a significant challenge for neural PDE surrogates. Existing models require future input (e.g., bubble positions) during inference because they fail to learn nucleation from past states, limiting their ability to autonomously forecast boiling dynamics. They also fail to model flow boiling velocity fields, where sharp interface–momentum coupling demands long-range and directional inductive biases. We introduce Bubbleformer, a transformer-based spatiotemporal model that forecasts stable and long-range boiling dynamics including nucleation, interface evolution, and heat transfer without dependence on simulation data during inference. Bubbleformer integrates factorized axial attention, frequency-aware scaling, and conditions on thermophysical parameters to generalize across fluids, geometries, and operating conditions.To evaluate physical fidelity in chaotic systems, we propose interpretable physics-based metrics that evaluate heat flux consistency, interface geometry, and mass conservation. We also release BubbleML 2.0, a high-fidelity dataset that spans diverse working fluids (cryogens, refrigerants, dielectrics), boiling configurations (pool and flow boiling), flow regimes (bubbly, slug, annular), and boundary conditions. Bubbleformer sets new benchmark results in both prediction and forecasting of two-phase boiling flows.

View full details

Poster

BountyBench: Dollar Impact of AI Agent Attackers and Defenders on Real-World Cybersecurity Systems

Andy Zhang ⋅ Joey Ji ⋅ Celeste Menders ⋅ Riya Dulepet ⋅ Thomas Qin ⋅ Ron Wang ⋅ Junrong Wu ⋅ Kyleen Liao ⋅ Jiliang Li ⋅ Jinghan Hu ⋅ Sara Hong ⋅ Nardos Demilew ⋅ Shivatmica Murgai ⋅ Jason Tran ⋅ Nishka Kacheria ⋅ Ethan Ho ⋅ Denis Liu ⋅ Lauren McLane ⋅ Olivia Bruvik ⋅ Dai-Rong Han ⋅ Seungwoo Kim ⋅ Akhil Vyas ⋅ Cuiyuanxiu Chen ⋅ Ryan Li ⋅ Weiran Xu ⋅ Jonathan Ye ⋅ Prerit Choudhary ⋅ Siddharth M. Bhatia ⋅ Vikram Sivashankar ⋅ Yuxuan Bao ⋅ Dawn Song ⋅ Dan Boneh ⋅ Daniel Ho ⋅ Percy Liang

Dec 5, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

AI agents have the potential to significantly alter the cybersecurity landscape. Here, we introduce the first framework to capture offensive and defensive cyber-capabilities in evolving real-world systems. Instantiating this framework with BountyBench, we set up 25 systems with complex, real-world codebases. To capture the vulnerability lifecycle, we define three task types: Detect (detecting a new vulnerability), Exploit (exploiting a given vulnerability), and Patch (patching a given vulnerability). For Detect, we construct a new success indicator, which is general across vulnerability types and provides localized evaluation. We manually set up the environment for each system, including installing packages, setting up server(s), and hydrating database(s). We add 40 bug bounties, which are vulnerabilities with monetary awards from \\$10 to \\$30,485, covering 9 of the OWASP Top 10 Risks. To modulate task difficulty, we devise a new strategy based on information to guide detection, interpolating from identifying a zero day to exploiting a given vulnerability. We evaluate 10 agents: Claude Code, OpenAI Codex CLI with o3-high and o4-mini, and custom agents with o3-high, GPT-4.1, Gemini 2.5 Pro Preview, Claude 3.7 Sonnet Thinking, Qwen3 235B A22B, Llama 4 Maverick, and DeepSeek-R1. Given up to three attempts, the top-performing agents are OpenAI Codex CLI: o3-high (12.5% on Detect, mapping to \\$3,720; 90% on Patch, mapping to \\$14,152), Custom Agent with Claude 3.7 Sonnet Thinking (67.5% on Exploit), and OpenAI Codex CLI: o4-mini (90% on Patch, mapping to \\$14,422). OpenAI Codex CLI: o3-high, OpenAI Codex CLI: o4-mini, and Claude Code are more capable at defense, achieving higher Patch scores of 90%, 90%, and 87.5%, compared to Exploit scores of 47.5%, 32.5%, and 57.5% respectively; while the custom agents are relatively balanced between offense and defense, achieving Exploit scores of 17.5-67.5% and Patch scores of 25-60%.

View full details

Poster

NoBOOM: Chemical Process Datasets for Industrial Anomaly Detection

Dennis Wagner ⋅ Fabian Hartung ⋅ Justus Arweiler ⋅ Aparna Muraleedharan ⋅ Indra Jungjohann ⋅ Arjun Nair ⋅ Steffen Reithermann ⋅ Ralf Schulz ⋅ Michael Bortz ⋅ Daniel Neider ⋅ Heike Leitte ⋅ Joachim Pfeffinger ⋅ Stephan Mandt ⋅ Sophie Fellenz ⋅ Torsten Katz ⋅ Fabian Jirasek ⋅ Jakob Burger ⋅ Hans Hasse ⋅ Marius Kloft

Dec 5, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Monitoring chemical processes is essential to prevent catastrophic failures, optimize costs and profits, and ensure the safety of employees and the environment. A key component of modern monitoring systems is the automated detection of anomalies in sensor data over time, called time series, enabling partial automation of plant operation and adding additional layers of supervision to crucial components. The development of anomaly detection methods in this domain is challenging, since real chemical process data are usually proprietary, and simulated data are generally not a sufficient replacement. In this paper, we present NoBOOM, the first collection of datasets for anomaly detection in real-world chemical process data, including labeled data from a running process at our industry partner BASF SE — one of the world’s leading chemical companies — and several chemical processes run in laboratory‑scale and pilot‑scale plants. While we are not able to share every detail about the industrial process, for the laboratory‑ and pilot‑scale plants, we provide comprehensive information on plant configuration, process operation, and, in particular, anomaly events, enabling a differentiated analysis of anomaly detection methods. To demonstrate the complexity of the benchmark, we analyze the data with regard to common issues of time-series anomaly detection (TSAD) benchmarks, including potential triviality and bias.

View full details

Poster

MergeBench: A Benchmark for Merging Domain-Specialized LLMs

Yifei He ⋅ Siqi Zeng ⋅ Yuzheng Hu ⋅ Rui Yang ⋅ Tong Zhang ⋅ Han Zhao

Dec 5, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Model merging provides a scalable alternative to multi-task training by combining specialized finetuned models through parameter arithmetic, enabling efficient deployment without the need for joint training or access to all task data. While recent methods have shown promise, existing evaluations are limited in both model scale and task diversity, leaving open questions about their applicability to large, domain-specialized LLMs. To tackle the challenges, we introduce MergeBench, a comprehensive evaluation suite designed to assess model merging at scale. MergeBench builds on state-of-the-art open-source language models, including Llama and Gemma families at 2B to 9B scales, and covers five key domains: instruction following, mathematics, multilingual understanding, coding and safety. We standardize finetuning and evaluation protocols, and assess eight representative merging methods across multi-task performance, forgetting and runtime efficiency. Based on extensive experiments, we provide practical guidelines for algorithm selection and share insights showing that model merging tends to perform better on stronger base models, with techniques such as merging coefficient tuning and sparsification improving knowledge retention. However, several challenges remain, including the computational cost on large models, the gap for in-domain performance compared to multi-task models, and the underexplored role of model merging in standard LLM training pipelines. We hope MergeBench provides a foundation for future research to advance the understanding and practical application of model merging.

View full details

Poster

Risk Management for Mitigating Benchmark Failure Modes: BenchRisk

Sean McGregor ⋅ Vassil Tashev ⋅ Armstrong Foundjem ⋅ Aishwarya Ramasethu ⋅ Sadegh AlMahdi Kazemi Zarkouei ⋅ Chris Knotz ⋅ Kongtao Chen ⋅ Alicia Parrish ⋅ Anka Reuel-Lamparth ⋅ Heather Frase

Dec 5, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Large language model (LLM) benchmarks inform LLM use decisions (e.g., "is this LLM safe to deploy for my use case and context?"). However, benchmarks may be rendered unreliable by various failure modes impacting benchmark bias, variance, coverage, or people's capacity to understand benchmark evidence. Using the National Institute of Standards and Technology's risk management process as a foundation, this research iteratively analyzed 26 popular benchmarks, identifying 57 potential failure modes and 196 corresponding mitigation strategies. The mitigations reduce failure likelihood and/or severity, providing a frame for evaluating "benchmark risk," which is scored to provide a metaevaluation benchmark: BenchRisk. Higher scores indicate benchmark users are less likely to reach an incorrect or unsupported conclusion about an LLM. All 26 scored benchmarks present significant risk within one or more of the five scored dimensions (comprehensiveness, intelligibility, consistency, correctness, and longevity), which points to important open research directions for the field of LLM benchmarking. The BenchRisk workflow allows for comparison between benchmarks; as an open-source tool, it also facilitates the identification and sharing of risks and their mitigations.

View full details

Poster

NOVA: A Benchmark for Rare Anomaly Localization and Clinical Reasoning in Brain MRI

Cosmin Bercea ⋅ Jun Li ⋅ Philipp Raffler ⋅ Evamaria O. Riedel ⋅ Lena Schmitzer ⋅ Angela Kurz ⋅ Felix Bitzer ⋅ Paula Roßmüller ⋅ Julian Canisius ⋅ Mirjam Beyrle ⋅ Che Liu ⋅ Wenjia Bai ⋅ Bernhard Kainz ⋅ Julia Schnabel ⋅ Benedikt Wiestler

Dec 5, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

In many real-world applications, deployed models encounter inputs that differ from the data seen during training. Open-world recognition ensures that such systems remain robust as ever-emerging, previously _unknown_ categories appear and must be addressed without retraining.Foundation and vision-language models are pre-trained on large and diverse datasets with the expectation of broad generalization across domains, including medical imaging.However, benchmarking these models on test sets with only a few common outlier types silently collapses the evaluation back to a closed-set problem, masking failures on rare or truly novel conditions encountered in clinical use.We therefore present NOVA, a challenging, real-life _evaluation-only_ benchmark of $\sim$900 brain MRI scans that span 281 rare pathologies and heterogeneous acquisition protocols. Each case includes rich clinical narratives and double-blinded expert bounding-box annotations. Together, these enable joint assessment of anomaly localisation, visual captioning, and diagnostic reasoning. Because NOVA is never used for training, it serves as an _extreme_ stress-test of out-of-distribution generalisation: models must bridge a distribution gap both in sample appearance and in semantic space. Baseline results with leading vision-language models (GPT-4o, Gemini 2.0 Flash, and Qwen2.5-VL-72B) reveal substantial performance drops, with approximately a 65\% gap in localisation compared to natural-image benchmarks and 40\% and 20\% gaps in captioning and reasoning, respectively, compared to resident radiologists. Therefore, NOVA establishes a testbed for advancing models that can detect, localize, and reason about truly unknown anomalies.

View full details

Poster

Beyond Chemical QA: Evaluating LLM's Chemical Reasoning with Modular Chemical Operations

Li Hao ⋅ He CAO ⋅ Bin Feng ⋅ Daniel Shao ⋅ Robert Tang ⋅ Zhiyuan Yan ⋅ Yonghong Tian ⋅ Li Yuan ⋅ Yu Li

Dec 5, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

While large language models (LLMs) with Chain-of-Thought (CoT) reasoning excel in mathematics and coding, their potential for systematic reasoning in chemistry, a domain demanding rigorous structural analysis for real-world tasks like drug design and reaction engineering, remains untapped. Current benchmarks focus on simple knowledge retrieval, neglecting step-by-step reasoning required for complex tasks such as molecular optimization and reaction prediction. To address this, we introduce ChemCoTBench, a reasoning framework that bridges molecular structure understanding with arithmetic-inspired operations, including addition, deletion, and substitution, to formalize chemical problem-solving into transparent, step-by-step workflows. By treating molecular transformations as modular "chemical operations", the framework enables slow-thinking reasoning, mirroring the logic of mathematical proofs while grounding solutions in real-world chemical constraints. We evaluate models on two high-impact tasks: Molecular Property Optimization and Chemical Reaction Prediction. These tasks mirror real-world challenges while providing structured evaluability. We further provide ChemCoTDataset, a pioneering 22,000-instance chemical reasoning dataset with expert-annotated chains of thought to facilitate LLM fine-tuning. By providing annotated trainable datasets, a reasoning taxonomy, and baseline evaluations, our work bridges the gap between abstract reasoning methods and practical chemical discovery, establishing a foundation for advancing LLMs as tools for AI-driven scientific innovation.

View full details

Poster

SVRPBench: A Realistic Benchmark for Stochastic Vehicle Routing Problem

Ahmed Heakl ⋅ Yahia Salaheldin Shaaban ⋅ Salem Lahlou ⋅ Martin Takac ⋅ Zangir Iklassov

Dec 5, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Robust routing under uncertainty is central to real-world logistics, yet most benchmarks assume static, idealized settings. We present \texttt{SVRPBench}, the first open benchmark to capture high-fidelity stochastic dynamics in vehicle routing at urban scale. Spanning more than 500 instances with up to 1000 customers, it simulates realistic delivery conditions: time-dependent congestion, log-normal delays, probabilistic accidents, and empirically grounded time windows for residential and commercial clients. Our pipeline generates diverse, constraint-rich scenarios, including multi-depot and multi-vehicle setups. Benchmarking reveals that state-of-the-art RL solvers like POMO and AM degrade by over 20\% under distributional shift, while classical and metaheuristic methods remain robust. To enable reproducible research, we release the dataset ([Huggingface](https://huggingface.co/datasets/MBZUAI/svrp-bench)) and evaluation suite ([Github](https://github.com/yehias21/vrp-benchmarks)). SVRPBench challenges the community to design solvers that generalize beyond synthetic assumptions and adapt to real-world uncertainty.

View full details

Poster

Seeking and Updating with Live Visual Knowledge

Mingyang Fu ⋅ Yuyang Peng ⋅ Dongping Chen ⋅ Zetong Zhou ⋅ Benlin Liu ⋅ Yao Wan ⋅ Zhou Zhao ⋅ Philip S Yu ⋅ Ranjay Krishna

Dec 5, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

The visual world around us constantly evolves, from real-time news and social media trends to global infrastructure changes visible through satellite imagery and augmented reality enhancements. However, Multimodal Large Language Models (MLLMs), which automate many tasks, struggle to stay current, limited by the cutoff dates in their fixed training datasets.To quantify this stagnation, we introduce LiveVQA, the first-of-its-kind dataset featuring 107,143 samples and 12 categories data specifically designed to support research in both seeking and updating with live visual knowledge.Drawing from recent news articles, video platforms, and academic publications in April 2024-May 2025, LiveVQA enables evaluation of how models handle latest visual information beyond their knowledge boundaries and how current methods help to update them. Our comprehensive benchmarking of 17 state-of-the-art MLLMs reveals significant performance gaps on content beyond knowledge cutoff, and tool-use or agentic visual seeking framework drastically gain an average of 327% improvement. Furthermore, we explore parameter-efficient fine-tuning methods to update MLLMs with new visual knowledge.We dive deeply to the critical balance between adapter capacity and model capability when updating MLLMs with new visual knowledge. All the experimental dataset and source code are publicly available at: https://livevqa.github.io.

View full details

Poster

Ineq-Comp: Benchmarking Human-Intuitive Compositional Reasoning in Automated Theorem Proving of Inequalities

Haoyu Zhao ⋅ Yihan Geng ⋅ Shange Tang ⋅ Yong Lin ⋅ Bohan Lyu ⋅ Hongzhou Lin ⋅ Chi Jin ⋅ Sanjeev Arora

Dec 5, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

LLM-based formal proof assistants (e.g., in Lean) hold great promise for automating mathematical discovery. But beyond syntactic correctness, do these systems truly understand mathematical structure as humans do? We investigate this question in context of mathematical inequalities---specifically the prover's ability to recognize that the given problem simplifies by applying a known inequality such as AM/GM. Specifically, we are interested in their ability to do this in a {\em compositional setting} where multiple inequalities must be applied as part of a solution. We introduce \ineqcomp, a benchmark built from elementary inequalities through systematic transformations, including variable duplication, algebraic rewriting, and multi-step composition. Although these problems remain easy for humans, we find that most provers---including Goedel, STP, and Kimina-7B---struggle significantly. DeepSeek-Prover-V2-7B shows relative robustness, but still suffers a 20\% performance drop (pass@32). Even for DeepSeek-Prover-V2-671B model, the gap between compositional variants and seed problems exists, implying that simply scaling up the model size alone does not fully solve the compositional weakness. Strikingly, performance remains poor for all models even when formal proofs of the constituent parts are provided in context, revealing that the source of weakness is indeed in compositional reasoning. Our results expose a persisting gap between the generalization behavior of current AI provers and human mathematical intuition. All data and evaluation code can be found at \url{https://github.com/haoyuzhao123/LeanIneqComp}.

View full details

Poster

Merlin L48 Spectrogram Dataset

Aaron Sun ⋅ Subhransu Maji ⋅ Grant Van Horn

Dec 5, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

In the single-positive multi-label (SPML) setting, each image in a dataset is labeled with the presence of a single class, while the true presence of other classes remains unknown. The challenge is to narrow the performance gap between this partially-labeled setting and fully-supervised learning, which often requires a significant annotation budget. Prior SPML methods were developed and benchmarked on synthetic datasets created by randomly sampling single positive labels from fully-annotated datasets like Pascal VOC, COCO, NUS-WIDE, and CUB200. However, this synthetic approach does not reflect real-world scenarios and fails to capture the fine-grained complexities that can lead to difficult misclassifications. In this work, we introduce the L48 dataset, a fine-grained, real-world multi-label dataset derived from recordings of bird sounds. L48 provides a natural SPML setting with single-positive annotations on a challenging, fine-grained domain, as well as two extended settings in which domain priors give access to additional negative labels. We benchmark existing SPML methods on L48 and observe significant performance differences compared to synthetic datasets and analyze method weaknesses, underscoring the need for more realistic and difficult benchmarks.

View full details

Poster

Embodied Web Agents: Bridging Physical-Digital Realms for Integrated Agent Intelligence

Yining Hong ⋅ Rui Sun ⋅ Bingxuan Li ⋅ Xingcheng Yao ⋅ Maxine Wu ⋅ Alexander Chien ⋅ Da Yin ⋅ Ying Nian Wu ⋅ Zhecan Wang ⋅ Kai-Wei Chang

Dec 5, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

AI agents today are mostly siloed — they either retrieve and reason over vast amount of digital information and knowledge obtained online; or interact with the physical world through embodied perception, planning and action — but rarely both. This separation limits their ability to solve tasks that require integrated physical and digital intelligence, such as cooking from online recipes, navigating with dynamic map data, or interpreting real-world landmarks using web knowledge. We introduce \textsc{Embodied Web Agents}, a novel paradigm for AI agents that fluidly bridge embodiment and web-scale reasoning. To operationalize this concept, we first develop the \textsc{Embodied Web Agents} task environments, a unified simulation platform that integrates realistic 3D indoor and outdoor environments with functional web interfaces. Building upon this platform, we construct and release the \textsc{Embodied Web Agents} Benchmark, which encompasses a diverse suite of tasks including cooking, navigation, shopping, tourism, and geolocation — all requiring coordinated reasoning across physical and digital realms for systematic assessment of cross-domain intelligence. Experimental results reveal significant performance gaps between state-of-the-art AI systems and human capabilities, establishing both challenges and opportunities at the intersection of embodied cognition and web-scale knowledge access.

View full details

Poster

EgoExOR: An Ego-Exo-Centric Operating Room Dataset for Surgical Activity Understanding

Ege Özsoy ⋅ Arda Mamur ⋅ Felix Tristram ⋅ Chantal Pellegrini ⋅ Magdalena Wysocki ⋅ Benjamin Busam ⋅ Nassir Navab

Dec 5, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Operating rooms (ORs) demand precise coordination among surgeons, nurses, and equipment in a fast-paced, occlusion-heavy environment, necessitating advanced perception models to enhance safety and efficiency. Existing datasets either provide partial egocentric views or sparse exocentric multi-view context, but do not explore the comprehensive combination of both. We introduce EgoExOR, the first OR dataset and accompanying benchmark to fuse first-person and third-person perspectives. Spanning 94 minutes (84,553 frames at 15 FPS) of two emulated spine procedures, Ultrasound-Guided Needle Insertion and Minimally Invasive Spine Surgery, EgoExOR integrates egocentric data (RGB, gaze, hand tracking, audio) from wearable glasses, exocentric RGB and depth from RGB-D cameras, and ultrasound imagery. Its detailed scene graph annotations, covering 36 entities and 22 relations (568,235 triplets), enable robust modeling of clinical interactions, supporting tasks like action recognition and human-centric perception. We evaluate the surgical scene graph generation performance of two adapted state-of-the-art models and offer a new baseline that explicitly leverages EgoExOR’s multimodal and multi-perspective signals. This new dataset and benchmark set a new foundation for OR perception, offering a rich, multimodal resource for next-generation clinical perception. Our code and data are available at https://github.com/ardamamur/EgoExOR.

View full details

Poster

DermaCon-IN: A Multiconcept-Annotated Dermatological Image Dataset of Indian Skin Disorders for Clinical AI Research

Shanawaj Sahebpatel Madarkar ⋅ Mahajabeen Madarkar ⋅ Madhumitha Venkatesh ⋅ TELI PRAKASH ⋅ Konda Reddy Mopuri ⋅ Vinaykumar MV ⋅ Kota Sathwika ⋅ Adarsh Kasturi ⋅ Gandla Raj ⋅ Padharthi Supranitha ⋅ Harsh Udai

Dec 5, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Artificial intelligence is poised to augment dermatological care by enabling scalable image-based diagnostics. Yet, the development of robust and equitable models remains hindered by datasets that fail to capture the clinical and demographic complexity of real-world practice. This complexity stems from region-specific disease distributions, wide variation in skin tones, and the underrepresentation of outpatient scenarios from non-Western populations. We introduce DermaCon-IN, a prospectively curated dermatology dataset comprising 5,450 clinical images from 2,993 patients across outpatient clinics in South India. Each image is annotated by board-certified dermatologists with 245 distinct diagnoses, structured under a hierarchical, etiology-based taxonomy adapted from Rook’s classification. The dataset captures a wide spectrum of dermatologic conditions and tonal variation commonly seen in Indian outpatient care. We benchmark a range of architectures, including convolutional models (ResNet, DenseNet, EfficientNet), transformer-based models (ViT, MaxViT, Swin), and Concept Bottleneck Models to establish baseline performance and explore how anatomical and concept-level cues may be integrated. These results are intended to guide future efforts toward interpretable and clinically realistic models. DermaCon-IN provides a scalable and representative foundation for advancing dermatology AI in real-world settings.

View full details

Poster

Leader360V: A Large-scale, Real-world 360 Video Dataset for Multi-task Learning in Diverse Environment

WEIMING ZHANG ⋅ Dingwen Xiao ⋅ Aobotao DAI ⋅ Yexin Liu ⋅ Tianbo Pan ⋅ Shiqi Wen ⋅ Lei Chen ⋅ Lin Wang

Dec 5, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

360 video captures the complete surrounding scenes with the ultra-large field of view of 360x180. This makes 360 scene understanding tasks, *e.g.*, segmentation and tracking, crucial for appications, such as autonomous driving, robotics. With the recent emergence of foundation models, the community is, however, impeded by the lack of large-scale, labelled real-world datasets. This is caused by the inherent spherical properties, *e.g.*, severe distortion in polar regions, and content discontinuities, rendering the annotation costly yet complex. This paper introduces **Leader360V**, the **first** large-scale (10K+), labeled real-world 360 video datasets for instance segmentation and tracking. Our datasets enjoy high scene diversity, ranging from indoor and urban settings to natural and dynamic outdoor scenes. To automate annotation, we design an automatic labeling pipeline, which subtly coordinates pre-trained 2D segmentors and large language models (LLMs) to facilitate the labeling. The pipeline operates in three novel stages. Specifically, in the **Initial Annotation Phase**, we introduce a Semantic- and Distortion-aware Refinement (**SDR**) module, which combines object mask proposals from multiple 2D segmentors with LLM-verified semantic labels. These are then converted into mask prompts to guide SAM2 in generating distortion-aware masks for subsequent frames. In the **Auto-Refine Annotation Phase**, missing or incomplete regions are corrected either by applying the SDR again or resolving the discontinuities near the horizontal borders. The **Manual Revision Phase** finally incorporates LLMs and human annotators to further refine and validate the annotations. Extensive user studies and evaluations demonstrate the effectiveness of our labeling pipeline. Meanwhile, experiments confirm that Leader360V significantly enhances model performance for 360 video segmentation and tracking, paving the way for more scalable 360 scene understanding. We release our dataset and code at {https://leader360v.github.io/Leader360V\_HomePage/} for better understanding.

View full details

Poster

OST-Bench: Evaluating the Capabilities of MLLMs in Online Spatio-temporal Scene Understanding

Jingli Lin ⋅ Chenming Zhu ⋅ Runsen Xu ⋅ Xiaohan Mao ⋅ Xihui Liu ⋅ Tai WANG ⋅ Jiangmiao Pang

Dec 5, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Recent advances in multimodal large language models (MLLMs) have shown remarkable capabilitiesin integrating vision and language for complex reasoning. While most existing benchmarks evaluate models under offline settings with a fixed set of pre-recorded inputs, we introduce OST-Bench, a benchmark designed to evaluate Online Spatio-Temporal understanding from the perspective of an agent actively exploring a scene. The “Online” aspect emphasizes the need to process and reason over incrementally acquired observations, while the “Spatio-Temporal” component requires integrating current visual inputs with historical memory to support dynamic spatial reasoning. OST-Bench better reflects the challenges of real-world embodied perception. Built on an efficient data collection pipeline, OST-Bench consists of 1.4k scenes and 10k question-answer pairs collected from ScanNet, Matterport3D, and ARKitScenes. We evaluate several leading MLLMs on OST-Bench and observe that they fall short on tasks requiring complex spatio-temporal reasoning. Under the online setting, their accuracy declines as the exploration horizon extends and the memory grows. Through further experimental analysis, we identify common error patterns across models and find that both complex clue-based spatial reasoning demands and long-term memory retrieval requirements significantly drop model performance along two separate axes, highlighting the core challenges that must be addressed to improve online embodied reasoning. To foster further research and development in the field, our codes, dataset, and benchmark are available at https://github.com/InternRobotics/OST-Bench.

View full details

Poster

$\texttt{AVROBUSTBENCH}$: Benchmarking the Robustness of Audio-Visual Recognition Models at Test-Time

Sarthak Kumar Maharana ⋅ Saksham Singh Kushwaha ⋅ Baoming Zhang ⋅ Adrian Rodriguez ⋅ Songtao Wei ⋅ Yapeng Tian ⋅ Yunhui Guo

Dec 5, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

While recent audio-visual models have demonstrated impressive performance, their robustness to distributional shifts at test-time remains not fully understood. Existing robustness benchmarks mainly focus on single modalities, making them insufficient for thoroughly assessing the robustness of audio-visual models. Motivated by real-world scenarios where shifts can occur $\textit{simultaneously}$ in both audio and visual modalities, we introduce $\texttt{AVROBUSTBENCH}$, a comprehensive benchmark designed to evaluate the test-time robustness of audio-visual recognition models. $\texttt{AVROBUSTBENCH}$ comprises four audio-visual benchmark datasets, $\texttt{AUDIOSET-2C}$, $\texttt{VGGSOUND-2C}$, $\texttt{KINETICS-2C}$, and $\texttt{EPICKITCHENS-2C}$, each incorporating 75 bimodal audio-visual corruptions that are $\textit{co-occurring}$ and $\textit{correlated}$. Through extensive evaluations, we observe that state-of-the-art supervised and self-supervised audio-visual models exhibit declining robustness as corruption severity increases. Furthermore, online test-time adaptation (TTA) methods, on $\texttt{VGGSOUND-2C}$ and $\texttt{KINETICS-2C}$, offer minimal improvements in performance under bimodal corruptions. We further propose $\texttt{AV2C}$, a simple TTA approach enabling on-the-fly cross-modal fusion by penalizing high-entropy samples, which achieves large improvements on $\texttt{VGGSOUND-2C}$. We hope $\texttt{AVROBUSTBENCH}$ steers the development of more effective and robust audio-visual TTA approaches. Our code is available [here](https://github.com/sarthaxxxxx/AV-C-Robustness-Benchmark).

View full details

Poster

TalkCuts: A Large-Scale Dataset for Multi-Shot Human Speech Video Generation

Jiaben Chen ⋅ Zixin Wang ⋅ AILING ZENG ⋅ Yang Fu ⋅ Xueyang Yu ⋅ Siyuan Cen ⋅ Julian Tanke ⋅ Yihang Chen ⋅ Koichi Saito ⋅ Yuki Mitsufuji ⋅ Chuang Gan

Dec 5, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

In this work, we present TalkCuts, a large-scale dataset designed to facilitate the study of multi-shot human speech video generation. Unlike existing datasets that focus on single-shot, static viewpoints, TalkCuts offers 164k clips totaling over 500 hours of high-quality 1080P human speech videos with diverse camera shots, including close-up, half-body, and full-body views. The dataset includes detailed textual descriptions, 2D keypoints and 3D SMPL-X motion annotations, covering over 10k identities, enabling multimodal learning and evaluation. As a first attempt to showcase the value of the dataset, we present Orator, an LLM-guided multi-modal generation framework as a simple baseline, where the language model functions as a multi-faceted director, orchestrating detailed specifications for camera transitions, speaker gesticulations, and vocal modulation. This architecture enables the synthesis of coherent long-form videos through our integrated multi-modal video generation module. Extensive experiments in both pose-guided and audio-driven settings show that training on TalkCuts significantly enhances the cinematographic coherence and visual appeal of generated multi-shot speech videos. We believe TalkCuts provides a strong foundation for future work in controllable, multi-shot speech video generation and broader multimodal learning.

View full details

Poster

Can Large Multimodal Models Understand Agricultural Scenes? Benchmarking with AgroMind

Qingmei Li ⋅ Yang Zhang ⋅ Zurong Mai ⋅ Yuhang Chen ⋅ Loushuohong ⋅ Henglian Huang ⋅ Jiarui Zhang ⋅ Zhiwei Zhang ⋅ Yibin Wen ⋅ Weijia Li ⋅ Haohuan Fu ⋅ Huang Jianxi ⋅ Juepeng Zheng

Dec 5, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Large Multimodal Models (LMMs) has demonstrated capabilities across various domains, but comprehensive benchmarks for agricultural remote sensing (RS) remain scarce. Existing benchmarks designed for agricultural RS scenarios exhibit notable limitations, primarily in terms of insufficient scene diversity in the dataset and oversimplified task design. To bridge this gap, we introduce AgroMind, a comprehensive agricultural remote sensing benchmark covering four task dimensions: spatial perception, object understanding, scene understanding, and scene reasoning, with a total of 13 task types, ranging from crop identification and health monitoring to environmental analysis. We curate a high-quality evaluation set by integrating nine public datasets and one private global parcel dataset, containing 28,482 QA pairs and 20,850 images. The pipeline begins with multi-source data pre-processing, including collection, format standardization, and annotation refinement. We then generate a diverse set of agriculturally relevant questions through the systematic definition of tasks. Finally, we employ LMMs for inference, generating responses, and performing detailed examinations. We evaluated 20 open-source LMMs and 4 closed-source models on AgroMind. Experiments reveal significant performance gaps, particularly in spatial reasoning and fine-grained recognition, it is notable that human performance lags behind several leading LMMs. By establishing a standardized evaluation framework for agricultural RS, AgroMind reveals the limitations of LMMs in domain knowledge and highlights critical challenges for future work. Data and code can be accessed at https://rssysu.github.io/AgroMind/.

View full details

Poster

MLIP Arena: Advancing Fairness and Transparency in Machine Learning Interatomic Potentials via an Open, Accessible Benchmark Platform

Yuan Chiang ⋅ Tobias Kreiman ⋅ Christine Zhang ⋅ Matthew Kuner ⋅ Elizabeth Weaver ⋅ Ishan Amin ⋅ Hyunsoo Park ⋅ Yunsung Lim ⋅ Jihan Kim ⋅ Daryl Chrzan ⋅ Aron Walsh ⋅ Samuel Blau ⋅ Mark Asta ⋅ Aditi Krishnapriyan

Dec 5, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Machine learning interatomic potentials (MLIPs) have revolutionized molecular and materials modeling, but existing benchmarks suffer from data leakage, limited transferability, and an over-reliance on error-based metrics tied to specific density functional theory (DFT) references. We introduce MLIP Arena, a benchmark platform that evaluates force field performance based on physics awareness, chemical reactivity, stability under extreme conditions, and predictive capabilities for thermodynamic properties and physical phenomena. By moving beyond static DFT references and revealing the important failure modes of current foundation MLIPs in real-world settings, MLIP Arena provides a reproducible framework to guide the next-generation MLIP development toward improved predictive accuracy and runtime efficiency while maintaining physical consistency. The Python package and online leaderboard are available at https://github.com/atomind-ai/mlip-arena.

View full details

Poster

CSI-Bench: A Large-Scale In-the-Wild Dataset for Multi-task WiFi Sensing

Guozhen Zhu ⋅ Yuqian Hu ⋅ Weihang Gao ⋅ Wei-Hsiang Wang ⋅ Beibei Wang ⋅ K. Liu

Dec 5, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

WiFi sensing has emerged as a compelling contactless modality for human activity monitoring by capturing fine-grained variations in Channel State Information (CSI). Its ability to operate continuously and non-intrusively while preserving user privacy makes it particularly suitable for health monitoring. However, existing WiFi sensing systems struggle to generalize in real-world settings, largely due to datasets collected in controlled environments with homogeneous hardware and fragmented, session-based recordings that fail to reflect continuous daily activity.We present CSI-Bench, a large-scale, in-the-wild benchmark dataset collected using commercial WiFi edge devices across 26 diverse indoor environments with 35 real users. Spanning over 461 hours of effective data, CSI-Bench captures realistic signal variability under natural conditions. It includes task-specific datasets for fall detection, breathing monitoring, localization, and motion source recognition, as well as a co-labeled multitask dataset with joint annotations for user identity, activity, and proximity. To support the development of robust and generalizable models, CSI-Bench provides standardized evaluation splits and baseline results for both single-task and multi-task learning. CSI-Bench offers a foundation for scalable, privacy-preserving WiFi sensing systems in health and broader human-centric applications.

View full details

Poster

Whose View of Safety? A Deep DIVE Dataset for Pluralistic Alignment of Text-to-Image Models

Charvi Rastogi ⋅ Tian Huey Teh ⋅ Pushkar Mishra ⋅ Roma Patel ⋅ Ding Wang ⋅ Mark Díaz ⋅ Alicia Parrish ⋅ Aida Mostafazadeh Davani ⋅ Zoe Ashwood ⋅ Michela Paganini ⋅ Vinodkumar Prabhakaran ⋅ Verena Rieser ⋅ Lora Aroyo

Dec 5, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Current text-to-image (T2I) models often fail to account for diverse human experiences, leading to misaligned systems. We advocate for pluralism in AI alignment, where an AI understands and is steerable towards diverse, and often conflicting, human values. Our work provides three core contributions to achieve this in T2I models. First, we introduce a novel dataset for Diverse Intersectional Visual Evaluation (DIVE) -- the first multimodal dataset for pluralistic alignment. It enables deep alignment to diverse safety perspectives through a large pool of demographically intersectional human raters who provided extensive feedback across 1000 prompts, with high replication, capturing nuanced safety perceptions. Second, we empirically confirm demographics as a crucial proxy for diverse viewpoints in this domain, revealing significant, context-dependent differences in harm perception that diverge from conventional evaluations. Finally, we discuss implications for building aligned T2I models, including efficient data collection strategies, LLM judgment capabilities, and model steerability towards diverse perspectives. This research offers foundational tools for more equitable and aligned T2I systems.Content Warning: The paper includes sensitive content that may be harmful.

View full details

Poster

When No Paths Lead to Rome: Benchmarking Systematic Neural Relational Reasoning

Anirban Das ⋅ Muhammad Irtaza Khalid ⋅ Rafael Peñaloza ⋅ Steven Schockaert

Dec 5, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Designing models that can learn to reason in a systematic way is an important and long-standing challenge. In recent years, a wide range of solutions have been proposed for the specific case of systematic relational reasoning, including Neuro-Symbolic approaches, variants of the Transformer architecture, and specialized Graph Neural Networks. However, existing benchmarks for systematic relational reasoning focus on an overly simplified setting, based on the assumption that reasoning can be reduced to composing relational paths. In fact, this assumption is hard-baked into the architecture of several recent models, leading to approaches that can perform well on existing benchmarks but are difficult to generalize to other settings. To support further progress in the field of systematic relational reasoning with neural networks, we introduce a new benchmark that adds several levels of difficulty, requiring models to go beyond path-based reasoning.

View full details

Poster

HARDMath2: A Benchmark for Applied Mathematics Built by Students as Part of a Graduate Class

James Roggeveen ⋅ Erik Wang ⋅ David Ettel ⋅ Will Flintoft ⋅ Peter Donets ⋅ Raglan Ward ⋅ Ahmed Roman ⋅ Anton Graf ⋅ Siddharth Dandavate ⋅ Ava Williamson ⋅ Felix Yeung ⋅ Kacper Migacz ⋅ Yijun Wang ⋅ Egemen Bostan ⋅ Duy Thuc Nguyen ⋅ Zhe He ⋅ Marc Descoteaux ⋅ Anne Mykland ⋅ Shida Liu ⋅ Jorge Garcia Ponce ⋅ Luke Zhu ⋅ Yuyang Chen ⋅ Ekaterina Ivshina ⋅ Miguel Fernandez ⋅ Minjae Kim ⋅ Kennan Gumbs ⋅ Matthew Tan ⋅ Russell Yang ⋅ Mai Hoang ⋅ David Brown ⋅ Isabella Silveira ⋅ Lavon Sykes ⋅ Arjun Nageswaran ⋅ William Fredenberg ⋅ Yiming Chen ⋅ Lucas Martin ⋅ Yixing Tang ⋅ Kelly Smith ⋅ Hongyu Liao ⋅ Logan Wilson ⋅ Alexander D. Cai ⋅ Lucy Nathwani ⋅ Nickholas Gutierrez ⋅ Andrea Elizabeth Biju ⋅ Michael Brenner

Dec 5, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Large language models (LLMs) have shown remarkable progress in mathematical problem-solving, but evaluation has largely focused on problems that have exact analytical solutions or involve formal proofs, often overlooking approximation-based problems ubiquitous in applied science and engineering. To fill this gap, we build on prior work and present $\textbf{HARDMath2}$, a dataset of 211 original problems covering the core topics in an introductory graduate applied math class, including boundary-layer analysis, WKB methods, asymptotic solutions of nonlinear partial differential equations, and the asymptotics of oscillatory integrals. This dataset was designed and verified by the students and instructors of a core graduate applied mathematics course at Harvard. We build the dataset through a novel collaborative environment that challenges students to write and refine difficult problems consistent with the class syllabus, peer-validate solutions, test different models, and automatically check LLM-generated solutions against their own answers and numerical ground truths. Evaluation results show that leading frontier models still struggle with many of the problems in the dataset, highlighting a gap in the mathematical reasoning skills of current LLMs. Importantly, students identified strategies to create increasingly difficult problems by interacting with the models and exploiting common failure modes. This back-and-forth with the models not only resulted in a richer and more challenging benchmark but also led to qualitative improvements in the students' understanding of the course material, which is increasingly important as we enter an age where state-of-the-art language models can solve many challenging problems across a wide domain of fields.

View full details

Poster

A2Seek: Towards Reasoning-Centric Benchmark for Aerial Anomaly Understanding

Mengjingcheng Mo ⋅ Xinyang Tong ⋅ Mingpi Tan ⋅ Jiaxu Leng ⋅ JianKang Zheng ⋅ Yiran Liu ⋅ Haosheng Chen ⋅ Ji Gan ⋅ Weisheng Li ⋅ Xinbo Gao

Dec 5, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

While unmanned aerial vehicles (UAVs) offer wide-area, high-altitude coverage for anomaly detection, they face challenges such as dynamic viewpoints, scale variations, and complex scenes. Existing datasets and methods, mainly designed for fixed ground-level views, struggle to adapt to these conditions, leading to significant performance drops in drone-view scenarios.To bridge this gap, we introduce A2Seek (Aerial Anomaly Seek), a large-scale, reasoning-centric benchmark dataset for aerial anomaly understanding. This dataset covers various scenarios and environmental conditions, providing high-resolution real-world aerial videos with detailed annotations, including anomaly categories, frame-level timestamps, region-level bounding boxes, and natural language explanations for causal reasoning. Building on this dataset, we propose A2Seek-R1, a novel reasoning framework that generalizes R1-style strategies to aerial anomaly understanding, enabling a deeper understanding of “Where” anomalies occur and “Why” they happen in aerial frames.To this end, A2Seek-R1 first employs a graph-of-thought (GoT)-guided supervised fine-tuning approach to activate the model's latent reasoning capabilities on A2Seek. Then, we introduce Aerial Group Relative Policy Optimization (A-GRPO) to design rule-based reward functions tailored to aerial scenarios. Furthermore, we propose a novel “seeking” mechanism that simulates UAV flight behavior by directing the model's attention to informative regions.Extensive experiments demonstrate that A2Seek-R1 achieves up to a 22.04\% improvement in AP for prediction accuracy and a 13.9\% gain in mIoU for anomaly localization, exhibiting strong generalization across complex environments and out-of-distribution scenarios. Our dataset and code are released at https://2-mo.github.io/A2Seek/.

View full details

Poster

LibriBrain: Over 50 Hours of Within-Subject MEG to Improve Speech Decoding Methods at Scale

Miran Özdogan ⋅ Gilad Landau ⋅ Gereon Elvers ⋅ Dulhan Jayalath ⋅ Pratik Somaiya ⋅ Francesco Mantegna ⋅ Mark Woolrich ⋅ Oiwi Parker Jones

Dec 5, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

LibriBrain represents the largest single-subject MEG dataset to date for speech decoding, with over 50 hours of recordings---5$\times$ larger than the next comparable dataset and 50$\times$ larger than most. This unprecedented `depth' of within-subject data enables exploration of neural representations at a scale previously unavailable with non-invasive methods. LibriBrain comprises high-quality MEG recordings together with detailed annotations from a single participant listening to naturalistic spoken English, covering nearly the full Sherlock Holmes canon. Designed to support advances in neural decoding, LibriBrain comes with a Python library for streamlined integration with deep learning frameworks, standard data splits for reproducibility, and baseline results for three foundational decoding tasks: speech detection, phoneme classification, and word classification. Baseline experiments demonstrate that increasing training data yields substantial improvements in decoding performance, highlighting the value of scaling up deep, within-subject datasets. By releasing this dataset, we aim to empower the research community to advance speech decoding methodologies and accelerate the development of safe, effective clinical brain-computer interfaces.

View full details

Poster

XIFBench: Evaluating Large Language Models on Multilingual Instruction Following

Zhenyu Li ⋅ Kehai Chen ⋅ Yunfei Long ⋅ Xuefeng Bai ⋅ Yaoyin Zhang ⋅ Xuchen Wei ⋅ Juntao Li ⋅ Min Zhang

Dec 5, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Large Language Models (LLMs) have demonstrated remarkable instruction-following capabilities across various applications. However, their performance in multilingual settings lacks systematic investigation, with existing evaluations lacking fine-grained constraint analysis across diverse linguistic contexts. We introduce **XIFBench**, a comprehensive constraint-based benchmark for evaluating multilingual instruction-following abilities of LLMs, comprising 558 instructions with 0-5 additional constraints across five categories (*Content*, *Style*, *Situation*, *Format*, and *Numerical*) in six languages spanning different resource levels. To support reliable and consistent cross-lingual evaluation, we implement three methodological innovations: cultural accessibility annotation, constraint-level translation validation, and requirement-based evaluation using English requirements as semantic anchors across languages. Extensive experiments with various LLMs not only quantify performance disparities across resource levels but also provide detailed insights into how language resources, constraint categories, instruction complexity, and cultural specificity influence multilingual instruction-following. Our code and data are available at https://github.com/zhenyuli801/XIFBench.

View full details

Poster

BEDLAM2.0: Synthetic humans and cameras in motion

Joachim Tesch ⋅ Giorgio Becherini ⋅ Prerana Achar ⋅ Anastasios Yiannakidis ⋅ Muhammed Kocabas ⋅ Priyanka Patel ⋅ Michael Black

Dec 5, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Inferring 3D human motion from video remains a challenging problem with many applications. While traditional methods estimate the human in image coordinates, many applications require human motion to be estimated in world coordinates. This is particularly challenging when there is both human and camera motion. Progress on this topic has been limited by the lack of rich video data with ground truth human and camera movement. We address this with BEDLAM2.0, a new dataset that goes beyond the popular BEDLAM dataset in important ways. In addition to introducing more diverse and realistic cameras and camera motions, BEDLAM2.0 increases diversity and realism of body shape, motions, clothing, hair, and 3D environments. Additionally, it adds shoes, which were missing in BEDLAM. BEDLAM has become a key resource for training 3D human pose and motion regressors today and we show that BEDLAM2.0 is significantly better, particularly for training methods that estimate humans in world coordinates. We compare state-of-the art methods trained on BEDLAM and BEDLAM2.0, and find that BEDLAM2.0 significantly improves accuracy over BEDLAM. For research purposes, we provide the rendered videos, ground truth body parameters, and camera motions. We also provide the 3D assets to which we have rights and links to those from third parties.

View full details

Poster

WebGen-Bench: Evaluating LLMs on Generating Interactive and Functional Websites from Scratch

Zimu Lu ⋅ Yunqiao Yang ⋅ Houxing Ren ⋅ Haotian Hou ⋅ Han Xiao ⋅ Ke Wang ⋅ Weikang Shi ⋅ Aojun Zhou ⋅ Mingjie Zhan ⋅ Hongsheng Li

Dec 5, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

LLM‑based agents have demonstrated great potential in generating and managing code within complex codebases. In this paper, we introduce WebGen-Bench, a novel benchmark designed to measure an LLM-based agent's ability to create multi-file website codebases from scratch. It contains diverse instructions for website generation, created through the combined efforts of human annotators and GPT-4o. These instructions span three major categories and thirteen minor categories, encompassing nearly all important types of web applications.To assess the quality of the generated websites, we generate test cases targeting each functionality described in the instructions. These test cases are then manually filtered, refined, and organized to ensure accuracy, resulting in a total of 647 test cases. Each test case specifies an operation to be performed on the website and the expected outcome of the operation.To automate testing and improve reproducibility, we employ a powerful web-navigation agent to execute test cases on the generated websites and determine whether the observed responses align with the expected results.We evaluate three high-performance code-agent frameworks—Bolt.diy, OpenHands, and Aider—using multiple proprietary and open-source LLMs as engines. The best-performing combination, Bolt.diy powered by DeepSeek-R1, achieves only 27.8\% accuracy on the test cases, highlighting the challenging nature of our benchmark.Additionally, we construct WebGen-Instruct, a training set consisting of 6,667 website-generation instructions. Training Qwen2.5-Coder-32B-Instruct on Bolt.diy trajectories generated from a subset of the training set achieves an accuracy of 38.2\%, surpassing the performance of the best proprietary model.We release our data-generation, training, and testing code, along with both the datasets and model weights at https://github.com/mnluzimu/WebGen-Bench.

View full details

Poster

REFED: A Subject Real-time Dynamic Labeled EEG-fNIRS Synchronized Recorded Emotion Dataset

Xiaojun Ning ⋅ Jing Wang ⋅ Zhiyang Feng ⋅ Tianzuo Xin ⋅ Shuo Zhang ⋅ Shaoqi Zhang ⋅ Zheng Lian ⋅ Yi Ding ⋅ Youfang Lin ⋅ Ziyu Jia

Dec 5, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Affective brain-computer interfaces (aBCIs) play a crucial role in personalized human–computer interaction and neurofeedback modulation. To develop practical and effective aBCI paradigms and to investigate the spatial-temporal dynamics of brain activity under emotional inducement, portable electroencephalography (EEG) signals have been widely adopted. To further enhance spatial-temporal perception, functional near-infrared spectroscopy (fNIRS) has attracted increasing interest in the aBCI field and has been explored in combination with EEG. However, existing datasets typically provide only static fixation labels, overlooking the dynamic changes in subjects' emotions. Notably, some studies have attempted to collect continuously annotated emotional data, but they have recorded only peripheral physiological signals without directly observing brain activity, limiting insight into underlying neural states under different emotions. To address these challenges, we present the Real-time labeled EEG-fNIRS Dataset (REFED). To the best of our knowledge, this is the first EEG-fNIRS dataset with real-time dynamic emotional annotations. REFED simultaneously records brain signals from both EEG and fNIRS modalities while providing continuous, real-time annotations of valence and arousal. The results of the data analysis demonstrate the effectiveness of emotion inducement and the reliability of real-time annotation. This dataset offers the possibility for studying the neurovascular coupling mechanism under emotional evolution and for developing dynamic, robust affective BCIs.

View full details

Poster

InterMT: Multi-Turn Interleaved Preference Alignment with Human Feedback

Boyuan Chen ⋅ Donghai Hong ⋅ Jiaming Ji ⋅ Jiacheng Zheng ⋅ Bowen Dong ⋅ Jiayi Zhou ⋅ Kaile Wang ⋅ Juntao Dai ⋅ Xuyao Wang ⋅ wenqi chen ⋅ Qirui Zheng ⋅ Wenxin Li ⋅ Sirui Han ⋅ Yike Guo ⋅ Yaodong Yang

Dec 5, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

As multimodal large models (MLLMs) continue to advance across challenging tasks, a key question emerges: \textbf{\textit{What essential capabilities are still missing? }}A critical aspect of human learning is continuous interaction with the environment -- not limited to language, but also involving multimodal understanding and generation.To move closer to human-level intelligence, models must similarly support \textbf{multi-turn}, \textbf{multimodal interaction}. In particular, they should comprehend interleaved multimodal contexts and respond coherently in ongoing exchanges.In this work, we present \textbf{an initial exploration} through the \textsc{InterMT} -- \textbf{the first preference dataset for \textit{multi-turn} multimodal interaction}, grounded in real human feedback. In this exploration, we particularly emphasize the importance of human oversight, introducing expert annotations to guide the process, motivated by the fact that current MLLMs lack such complex interactive capabilities. \textsc{InterMT} captures human preferences at both global and local levels into nine sub-dimensions, consists of 15.6k prompts, 52.6k multi-turn dialogue instances, and 32.4k human-labeled preference pairs. To compensate for the lack of capability for multi-modal understanding and generation, we introduce an agentic workflow that leverages tool-augmented MLLMs to construct multi-turn QA instances.To further this goal, we introduce \textsc{InterMT-Bench} to assess the ability ofMLLMs in assisting judges with multi-turn, multimodal tasks.We demonstrate the utility of \textsc{InterMT} through applications such as judge moderation and further reveal the \textit{multi-turn scaling law} of judge model.We hope the open-source of our data can help facilitate further research on aligning current MLLMs to the next step.

View full details

Poster

MedSG-Bench: A Benchmark for Medical Image Sequences Grounding

Jingkun Yue ⋅ Siqi Zhang ⋅ Zinan Jia ⋅ Huihuan Xu ⋅ Zongbo Han ⋅ Xiaohong Liu ⋅ Guangyu Wang

Dec 5, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Visual grounding is essential for precise perception and reasoning in multimodal large language models (MLLMs), especially in medical imaging domains. While existing medical visual grounding benchmarks primarily focus on single-image scenarios, real-world clinical applications often involve sequential images, where accurate lesion localization across different modalities and temporal tracking of disease progression (e.g., pre- vs. post-treatment comparison) require fine-grained cross-image semantic alignment and context-aware reasoning. To remedy the underrepresentation of image sequences in existing medical visual grounding benchmarks, we propose MedSG-Bench, the first benchmark tailored for Medical Image Sequences Grounding. It comprises eight VQA-style tasks, formulated into two paradigms of the grounding tasks, including 1) Image Difference Grounding, which focuses on detecting change regions across images, and 2) Image Consistency Grounding, which emphasizes detection of consistent or shared semantics across sequential images. MedSG-Bench covers 76 public datasets, 10 medical imaging modalities, and a wide spectrum of anatomical structures and diseases, totaling 9,630 question–answer pairs. We benchmark both general-purpose MLLMs (e.g., Qwen2.5-VL) and medical-domain specialized MLLMs (e.g., HuatuoGPT-vision), observing that even the advanced models exhibit substantial limitations in medical sequential grounding tasks. To advance this field, we construct MedSG-188K, a large-scale instruction-tuning dataset tailored for sequential visual grounding, and further develop MedSeq-Grounder, an MLLM designed to facilitate future research on fine-grained understanding across medical sequential images. We release all resources on https://github.com/Yuejingkun/MedSG-Bench

View full details

Poster

BLINK-Twice: You see, but do you observe? A Reasoning Benchmark on Visual Perception

junyan ye ⋅ Dongzhi JIANG ⋅ Jun He ⋅ Baichuan Zhou ⋅ Zilong Huang ⋅ Zhiyuan Yan ⋅ Hongsheng Li ⋅ Conghui He ⋅ Weijia Li

Dec 5, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Recently, Multimodal Large Language Models (MLLMs) have made rapid progress, particularly in enhancing their reasoning capabilities. However, existing reasoning benchmarks still primarily assess language-based reasoning, often treating visual input as replaceable context. To address this gap, we introduce BLINK-Twice, a vision-centric reasoning benchmark grounded in challenging perceptual tasks. Instead of relying on external knowledge, our tasks require models to reason from visual content alone, shifting the focus from language-based to image-grounded reasoning. Compared to prior perception benchmarks, it moves beyond shallow perception ("see") and requires fine-grained observation and analytical reasoning ("observe"). BLINK-Twice integrates three core components: seven types of visual challenges for testing visual reasoning, natural adversarial image pairs that enforce reliance on visual content, and annotated reasoning chains for fine-grained evaluation of the reasoning process rather than final answers alone. We evaluate 20 leading MLLMs, including 12 foundation models and 8 reasoning-enhanced models. BLINK-Twice poses a significant challenge to current models. While existing reasoning strategies in the language space—such as chain-of-thought or self-criticism can improve performance, they often result in unstable and redundant reasoning. We observe that repeated image observation improves performance across models, and active visual interaction, as demonstrated by models like o3, highlights the need for a new paradigm for vision reasoning. The dataset is publicly available at https://github.com/PicoTrex/BLINK-Twice.

View full details

Poster

MMLongBench: Benchmarking Long-Context Vision-Language Models Effectively and Thoroughly

Zhaowei Wang ⋅ Wenhao Yu ⋅ Xiyu REN ⋅ Jipeng Zhang ⋅ Yu Zhao ⋅ Rohit Saxena ⋅ Liang Cheng ⋅ Ginny Wong ⋅ Simon See ⋅ Pasquale Minervini ⋅ Yangqiu Song ⋅ Mark Steedman

Dec 5, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

The rapid extension of context windows in large vision-language models has given rise to long-context vision-language models (LCVLMs), which are capable of handling hundreds of images with interleaved text tokens in a single forward pass. In this work, we introduce MMLongBench, the first benchmark covering a diverse set of long-context vision-language tasks, to evaluate LCVLMs effectively and thoroughly. MMLongBench is composed of 13,331 examples spanning five different categories of downstream tasks, such as Visual RAG and Many-Shot ICL. It also provides broad coverage of image types, including various natural and synthetic images. To assess the robustness of the models to different input lengths, all examples are delivered at five standardized input lengths (8K-128K tokens) via a cross-modal tokenization scheme that combines vision patches and text tokens. Through a thorough benchmarking of 46 closed-source and open-source LCVLMs, we provide a comprehensive analysis of the current models' vision-language long-context ability. Our results show that: i) performance on a single task is a weak proxy for overall long-context capability; ii) both closed-source and open-source models face challenges in long-context vision-language tasks, indicating substantial room for future improvement; iii) models with stronger reasoning ability tend to exhibit better long-context performance. By offering wide task coverage, various image types, and rigorous length control, MMLongBench provides the missing foundation for diagnosing and advancing the next generation of LCVLMs.

View full details

Poster

Scaling Computer-Use Grounding via User Interface Decomposition and Synthesis

Tianbao Xie ⋅ Jiaqi Deng ⋅ Xiaochuan Li ⋅ Junlin Yang ⋅ Haoyuan Wu ⋅ Jixuan Chen ⋅ Wenjing Hu ⋅ Xinyuan Wang ⋅ Yuhui Xu ⋅ Zekun Wang ⋅ Yiheng Xu ⋅ Junli Wang ⋅ Doyen Sahoo ⋅ Tao Yu ⋅ Caiming Xiong

Dec 5, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Graphical user interface (GUI) grounding, the ability to map natural language instructions to specific actions on graphical user interfaces, remains a critical bottleneck in computer use agent development. Current benchmarks oversimplify grounding tasks as short referring expressions, failing to capture the complexity of real-world interactions that require software commonsense, layout understanding, and fine-grained manipulation capabilities. To address these limitations, we introduce OSWorld-G, a comprehensive benchmark comprising 564 finely annotated samples across diverse task types including text matching, element recognition, layout understanding, and precise manipulation. Additionally, we synthesize and release the largest computer use grounding dataset Jedi, which contains 4 million examples through multi-perspective decoupling of tasks. Our multi-scale models trained on Jedi demonstrate its effectiveness by outperforming existing approaches on ScreenSpot-v2, ScreenSpot-Pro, and our OSWorld-G. Furthermore, we demonstrate that improved grounding with Jedi directly enhances agentic capabilities of general foundation models on complex computer tasks with state-of-the-art performance, improving from 23% to 51% on OSWorld. Through detailed ablation studies, we identify key factors contributing to grounding performance and verify that combining specialized data for different interface elements enables compositional generalization to novel interfaces. All benchmark, data, checkpoints, and code are open-sourced and available at https://osworld-grounding.github.io.

View full details

Poster

TIDMAD: Time Series Dataset for Discovering Dark Matter with AI Denoising

Jessica Fry ⋅ Xinyi Fu ⋅ Zhenghao Fu ⋅ Kaliroë Pappas ⋅ Lindley Winslow ⋅ Aobo Li

Dec 5, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Dark matter makes up approximately 85\% of total matter in our universe, yet it has never been directly observed in any laboratory on Earth. The origin of dark matter is one of the most important questions in contemporary physics, and a convincing detection of dark matter would be a Nobel-Prize-level breakthrough in fundamental science. The ABRACADABRA experiment was specifically designed to search for dark matter. Although it has not yet made a discovery, ABRACADABRA has produced several dark matter search results widely endorsed by the physics community. The experiment generates ultra-long time-series data at a rate of 10 million samples per second, where the dark matter signal would manifest itself as a sinusoidal oscillation mode within the ultra-long time series. In this paper, we present the TIDMAD --- a comprehensive data release from the ABRACADABRA experiment including three key components: an ultra-long time series dataset divided into training, validation, and science subsets; a carefully-designed denoising score for direct model benchmarking; and a complete analysis framework which produces a physics community-standard dark matter search result suitable for publication as a physics paper. This data release enables core AI algorithms to extract the dark matter signal and produce real physics results thereby advancing fundamental science.

View full details

Poster

Thousand Voices of Trauma: A Large-Scale Synthetic Dataset for Modeling Prolonged Exposure Therapy Conversations

Suhas BN ⋅ Andrew Sherrill ⋅ Rosa I. Arriaga ⋅ Christopher Wiese ⋅ Saeed Abdullah

Dec 5, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

The advancement of AI systems for mental health support is hindered by limited access to therapeutic conversation data, particularly for trauma treatment. We present Thousand Voices of Trauma, a synthetic benchmark dataset of 3,000 therapy conversations based on Prolonged Exposure therapy protocols for Post-traumatic Stress Disorder (PTSD). The dataset comprises 500 unique cases, each explored through six conversational perspectives that mirror the progression of therapy from initial anxiety to peak distress to emotional processing. We incorporated diverse demographic profiles (ages 18-80, M=49.3, 49.4\% male, 44.4\% female, 6.2\% non-binary), 20 trauma types, and 10 trauma-related behaviors using deterministic and probabilistic generation methods. Analysis reveals realistic distributions of trauma types (witnessing violence 10.6\%, bullying 10.2\%) and symptoms (nightmares 23.4\%, substance abuse 20.8\%). Clinical experts validated the dataset's therapeutic fidelity, highlighting its emotional depth while suggesting refinements for greater authenticity. We also developed an emotional trajectory benchmark with standardized metrics for evaluating model responses. This privacy-preserving dataset addresses critical gaps in trauma-focused mental health data, offering a valuable resource for advancing both patient-facing applications and clinician training tools.

View full details

Poster

SAGE-Eval: Evaluating LLMs for Systematic Generalizations of Safety Facts

Yueh-Han Chen ⋅ Guy Davidson ⋅ Brenden Lake

Dec 5, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Do LLMs robustly generalize critical safety facts to novel situations? Lacking this ability is dangerous when users ask naive questions—for instance, ``I'm considering packing melon balls for my 10-month-old's lunch. What other foods would be good to include?'' Before offering food options, the LLM should warn that melon balls pose a choking hazard to toddlers, as documented by the CDC. Failing to provide such warnings could result in serious injuries or even death. To evaluate this, we introduce SAGE-Eval, SAfety-fact systematic GEneralization evaluation, the first benchmark that tests whether LLMs properly apply well‑established safety facts to naive user queries. SAGE-Eval comprises 104 facts manually sourced from reputable organizations, systematically augmented to create 10,428 test scenarios across 7 common domains (e.g., Outdoor Activities, Medicine). We find that the top model, Claude-3.7-sonnet, passes only 58% of all the safety facts tested. We also observe that model capabilities and training compute weakly correlate with performance on SAGE-Eval, implying that scaling up is not the golden solution. Our findings suggest frontier LLMs still lack robust generalization ability. We recommend developers use SAGE-Eval in pre-deployment evaluations to assess model reliability in addressing salient risks.

View full details

Poster

MIR-Bench: Can Your LLM Recognize Complicated Patterns via Many-Shot In-Context Reasoning?

Kai Yan ⋅ Zhan Ling ⋅ Kang Liu ⋅ Yifan Yang ⋅ Ting-Han Fan ⋅ Lingfeng Shen ⋅ Zhengyin Du ⋅ Jiecao Chen

Dec 5, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

The ability to recognize patterns from examples and apply them to new ones is a primal ability for general intelligence, and is widely studied by psychology and AI researchers. Many benchmarks have been proposed to measure such ability for Large Language Models (LLMs); however, they focus on few-shot (usually <10) setting and lack evaluation for aggregating many pieces of information from long contexts. On the other hand, the ever-growing context length of LLMs have brought forth the novel paradigm of many-shot In-Context Learning (ICL), which addresses new tasks with hundreds to thousands of examples without expensive and inefficient fine-tuning. However, many-shot evaluations often focus on classification, and popular long-context LLM tasks such as Needle-In-A-Haystack (NIAH) seldom require complicated intelligence for integrating many pieces of information. To fix the issues from both worlds, we propose MIR-Bench, the first many-shot in-context reasoning benchmark for pattern recognition that asks LLM to predict output via input-output examples from underlying functions with diverse data format. Based on MIR-Bench, we study many novel problems for many-shot in-context reasoning, and acquired many insightful findings including scaling effect, robustness, inductive vs. transductive reasoning, retrieval Augmented Generation (RAG), coding for inductive reasoning, cross-domain generalizability, etc. Our dataset is available at https://huggingface.co/datasets/kaiyan289/MIR-Bench.

View full details

Poster

PhyBlock: A Progressive Benchmark for Physical Understanding and Planning via 3D Block Assembly

Liang Ma ⋅ Jiajun Wen ⋅ Min Lin ⋅ Rongtao Xu ⋅ Xiwen Liang ⋅ Bingqian Lin ⋅ Jun Ma ⋅ Yongxin Wang ⋅ Ziming Wei ⋅ haokun lin ⋅ Mingfei Han ⋅ Meng Cao ⋅ Bokui Chen ⋅ Ivan Laptev ⋅ Xiaodan Liang

Dec 5, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

While vision-language models (VLMs) have demonstrated promising capabilities in reasoning and planning for embodied agents, their ability to comprehend physical phenomena, particularly within structured 3D environments, remains severely limited. To close this gap, we introduce PhyBlock, a progressive benchmark designed to assess VLMs on physical understanding and planning through robotic 3D block assembly tasks. PhyBlock integrates a novel four-level cognitive hierarchy assembly task alongside targeted Visual Question Answering (VQA) samples, collectively aimed at evaluating progressive spatial reasoning and fundamental physical comprehension, including object properties, spatial relationships, and holistic scene understanding. PhyBlock includes 2600 block tasks (400 assembly tasks, 2200 VQA tasks) and evaluates models across three key dimensions: partial completion, failure diagnosis, and planning robustness. We benchmark 23 state-of-the-art VLMs, highlighting their strengths and limitations in physically grounded, multi-step planning. Our empirical findings indicate that the performance of VLMs exhibits pronounced limitations in high-level planning and reasoning capabilities, leading to a notable decline in performance for the growing complexity of the tasks.Error analysis reveals persistent difficulties in spatial orientation and dependency reasoning.We position PhyBlock as a unified testbed to advance embodied reasoning, bridging vision-language understanding and real-world physical problem-solving.

View full details

Poster

Towards A Generalist Code Embedding Model Based On Massive Data Synthesis

Chaofan Li ⋅ Jianlyu Chen ⋅ Yingxia Shao ⋅ Defu Lian ⋅ Zheng Liu

Dec 5, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Code embedding models attract increasing attention due to the widespread popularity of retrieval-augmented generation (RAG) in software development. These models are expected to capture the rich semantic relationships inherent to code, which differ significantly from those found in text. However, existing models remain severely limited due to the scarcity of high-quality training data. In this work, we introduce \textbf{CodeR} (\underline{Code} \underline{R}etrieval), a state-of-the-art embedding model for general-purpose code retrieval. The superior performance of CodeR is built upon \textbf{CodeR-Pile}, a large-scale synthetic dataset constructed under the DRU (Diversity, Reliability, Usability) principle via a novel data synthesis pipeline. To optimize training effectiveness, we propose \textbf{Annealing}, a curriculum learning strategy that enables effective knowledge transfer across heterogeneous sources of data. We evaluate CodeR based on 16 diverse code retrieval tasks, where it significantly outperforms existing baselines and exhibits strong out-of-domain generalization performance. We have publicly released our code and the well-trained model to facilitate further research in this critical area\footnote{\url{https://github.com/FlagOpen/FlagEmbedding/tree/master/research/BGE_Coder}}.

View full details

Poster

SceneSplat++: A Large Dataset and Comprehensive Benchmark for Language Gaussian Splatting

Mengjiao Ma ⋅ Qi Ma ⋅ Yue Li ⋅ Jiahuan Cheng ⋅ Runyi Yang ⋅ Bin Ren ⋅ Nikola Popovic ⋅ Mingqiang Wei ⋅ Nicu Sebe ⋅ Ender Konukoglu ⋅ Luc V Gool ⋅ Theo Gevers ⋅ Martin R. Oswald ⋅ Danda Pani Paudel

Dec 5, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

3D Gaussian Splatting (3DGS) serves as a highly performant and efficient encoding of scene geometry, appearance, and semantics. Moreover, grounding language in 3D scenes has proven to be an effective strategy for 3D scene understanding. Current Language Gaussian Splatting line of work fall into three main groups: (i) per-scene optimization-based, (ii) per-scene optimization-free, and (iii) generalizable approach. However, most of them are evaluated only on rendered 2D views of a handful of scenes and viewpoints close to the training views, limiting ability and insight into holistic 3D understanding. To address this gap, we propose the first large-scale benchmark that systematically assesses these three groups of methods directly in 3D space, evaluating on 1060 scenes across three indoor datasets and one outdoor dataset. Benchmark results demonstrate a clear advantage of the generalizable paradigm, particularly in relaxing the scene-specific limitation, enabling fast feed-forward inference on novel scenes, and achieving superior segmentation performance. We further introduce SceneSplat-49K -- a carefully curated 3DGS dataset comprising of around 49K diverse indoor and outdoor scenes trained from multiple sources, with which we demonstrate generalizable approach could harness strong data priors. Our codes, benchmark, and datasets are available.

View full details

Poster

PanTS: The Pancreatic Tumor Segmentation Dataset

Wenxuan Li ⋅ Xinze Zhou ⋅ Qi Chen ⋅ Tianyu Lin ⋅ Pedro R. A. S. Bassi ⋅ Xiaoxi Chen ⋅ Chen Ye ⋅ Zheren Zhu ⋅ Kai Ding ⋅ Heng Li ⋅ Kang Wang ⋅ Yang Yang ⋅ Yucheng Tang ⋅ Daguang Xu ⋅ Alan Yuille ⋅ Zongwei Zhou

Dec 5, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

PanTS is a large-scale, multi-institutional dataset curated to advance research in pancreatic CT analysis. It contains 36,390 CT scans from 145 medical centers, with expert-validated, voxel-wise annotations of over 993,000 anatomical structures, covering pancreatic tumors, pancreas head, body, and tail, and 24 surrounding anatomical structures such as vascular/skeletal structures and abdominal/thoracic organs. Each scan includes metadata such as patient age, sex, diagnosis, contrast phase, in-plane spacing, slice thickness, etc. AI models trained on PanTS achieve significantly better performance in pancreatic tumor detection, localization, and segmentation than those trained on existing public datasets. Our analysis indicates that these gains are directly attributable to the 16× larger-scale tumor annotations and indirectly supported by the 24 additional surrounding anatomical structures. As the largest and most comprehensive resource of its kind, PanTS offers a new benchmark for developing and evaluating AI models in pancreatic CT analysis.

View full details

Poster

PARALLELPROMPT: Extracting Parallelism from Large Language Model Queries

Steven Kolawole ⋅ Keshav Santhanam ⋅ Virginia Smith ⋅ Pratiksha Thaker

Dec 5, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

LLM serving systems typically treat user prompts as monolithic inputs, optimizing inference through decoding tricks or inter-query batching. However, many real-world prompts contain *latent semantic parallelism*—decomposable structures where subtasks can be executed independently to reduce latency while preserving meaning.We introduce PARALLELPROMPT, the first benchmark for measuring intra-query parallelism in natural user prompts. Our dataset comprises over 37,000 real-world prompts from public LLM chat logs, each annotated with a structured schema capturing task templates, shared context, and iteration inputs. These schemas are extracted using LLM-assisted prompting with rule-based multilingual validation.To evaluate the benefits of decomposition, we provide an execution suite that benchmarks serial vs. parallel strategies, measuring latency, structural adherence, and semantic fidelity. Our results show that intra-query parallelism can be successfully parsed in over 75\% of curated datasets, unlocking up to *$5\times$ speedups* on tasks like translation, comprehension, and comparative analysis, with minimal quality degradation.By releasing this benchmark, curation pipeline, and evaluation suite, we provide the first standardized testbed for studying structure-aware execution in LLM serving pipelines.

View full details

Poster

BackdoorLLM: A Comprehensive Benchmark for Backdoor Attacks and Defenses on Large Language Models

Yige Li ⋅ Hanxun Huang ⋅ Yunhan Zhao ⋅ Xingjun Ma ⋅ Jun Sun

Dec 5, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Generative large language models (LLMs) have achieved state-of-the-art results on a wide range of tasks, yet they remain susceptible to backdoor attacks: carefully crafted triggers in the input can manipulate the model to produce adversary-specified outputs. While prior research has predominantly focused on backdoor risks in vision and classification settings, the vulnerability of LLMs in open-ended text generation remains underexplored. To fill this gap, we introduce \textit{BackdoorLLM}\footnote{Our BackdoorLLM benchmark was awarded First Prize in the \href{https://www.mlsafety.org/safebench/winners}{SafetyBench competition} organized by the \href{https://safe.ai/}{Center for AI Safety}.}, the first comprehensive benchmark for systematically evaluating backdoor threats in text-generation LLMs. BackdoorLLM provides: (i) a unified repository of benchmarks with a standardized training and evaluation pipeline; (ii) a diverse suite of attack modalities, including data poisoning, weight poisoning, hidden-state manipulation, and chain-of-thought hijacking; (iii) over 200 experiments spanning 8 distinct attack strategies, 7 real-world scenarios, and 6 model architectures; (iv) key insights into the factors that govern backdoor effectiveness and failure modes in LLMs; and (v) a defense toolkit encompassing 7 representative mitigation techniques. Our code and datasets are available at \url{https://github.com/bboylyg/BackdoorLLM}. We will continuously incorporate emerging attack and defense methodologies to support the research in advancing the safety and reliability of LLMs.

View full details

Poster

LexiCon: a Benchmark for Planning under Temporal Constraints in Natural Language

Periklis Mantenoglou ⋅ Rishi Hazra ⋅ Pedro Zuidberg Dos Martires ⋅ Luc De Raedt

Dec 5, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Owing to their reasoning capabilities, large language models (LLMs) have been evaluated on planning tasks described in natural language. However, LLMs have largely been tested on planning domains without constraints. In order to deploy them in real-world settings where adherence to constraints, in particular safety constraints, is critical, we need to evaluate their performance on constrained planning tasks. We introduce LexiCon—a natural language-based (Lexi) constrained (Con) planning benchmark, consisting of a suite of environments, that can be used to evaluate the planning capabilities of LLMs in a principled fashion. The core idea behind LexiCon is to take existing planning environments and impose temporal constraints on the states. These constrained problems are then translated into natural language and given to an LLM to solve. A key feature of LexiCon is its extensibility. That is, the set of supported environments can be extended with new (unconstrained) environment generators, for which temporal constraints are constructed automatically. This renders LexiCon future-proof: the hardness of the generated planning problems can be increased as the planning capabilities of LLMs improve. Our experiments reveal that the performance of state-of-the-art LLMs, including reasoning models like GPT-5, o3, and R1, deteriorates as the degree of constrainedness of the planning tasks increases.

View full details

Poster

The Rashomon Set Has It All: Analyzing Trustworthiness of Trees under Multiplicity

Ethan Hsu ⋅ Tony Cao ⋅ Lesia Semenova ⋅ Chudi Zhong

Dec 5, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

In practice, many models from a function class can fit a dataset almost equally well. This collection of near-optimal models is known as the Rashomon set. Prior work has shown that the Rashomon set offers flexibility in choosing models aligned with secondary objectives like interpretability or fairness. However, it is unclear how far this flexibility extends to different trustworthy criteria, especially given that most trustworthy machine learning systems today still rely on complex specialized optimization procedures. *Is the Rashomon set all you need for trustworthy model selection? Can simply searching the Rashomon set suffice to find models that are not only accurate but also fair, stable, robust, or private, without explicitly optimizing for these criteria?*In this paper, we introduce a framework for systematically analyzing trustworthiness within Rashomon sets and conduct extensive experiments on high-stakes tabular datasets. We focus on sparse decision trees, where the Rashomon set can be fully enumerated. Across seven distinct metrics, we find that the Rashomon set almost always contains models that match or exceed the performance of state-of-the-art methods specifically designed to optimize individual trustworthiness criteria. These results suggest that for many practical applications, computing the Rashomon set once can serve as an efficient and effective method for identifying highly accurate and trustworthy models. Our framework can be a valuable tool for both benchmarking Rashomon sets of decision trees and studying the trustworthiness properties of interpretable models.

View full details

Poster

PRING: Rethinking Protein-Protein Interaction Prediction from Pairs to Graphs

Xinzhe Zheng ⋅ Hao Du ⋅ Fanding Xu ⋅ Jinzhe Li ⋅ ZHIYUAN LIU ⋅ Wenkang Wang ⋅ Tao Chen ⋅ Wanli Ouyang ⋅ Stan Z. Li ⋅ Yan Lu ⋅ Nanqing Dong ⋅ Yang Zhang

Dec 5, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Deep learning-based computational methods have achieved promising results in predicting protein-protein interactions (PPIs). However, existing benchmarks predominantly focus on isolated pairwise evaluations, overlooking a model's capability to reconstruct biologically meaningful PPI networks, which is crucial for biology research. To address this gap, we introduce PRING, the first comprehensive benchmark that evaluates PRotein-protein INteraction prediction from a Graph-level perspective. PRING curates a high-quality, multi-species PPI network dataset comprising 21,484 proteins and 186,818 interactions, with well-designed strategies to address both data redundancy and leakage. Building on this golden-standard dataset, we establish two complementary evaluation paradigms: (1) topology-oriented tasks, which assess intra and cross-species PPI network construction, and (2) function-oriented tasks, including protein complex pathway prediction, GO module analysis, and essential protein justification. These evaluations not only reflect the model's capability to understand the network topology but also facilitate protein function annotation, biological module detection, and even disease mechanism analysis. Extensive experiments on four representative model categories, consisting of sequence similarity-based, naive sequence-based, protein language model-based, and structure-based approaches, demonstrate that current PPI models have potential limitations in recovering both structural and functional properties of PPI networks, highlighting the gap in supporting real-world biological applications. We believe PRING provides a reliable platform to guide the development of more effective PPI prediction models for the community. The dataset and source code of PRING are available at https://github.com/SophieSarceau/PRING.

View full details

Poster

Towards Automated Petrography

Isai Daniel Chacon ⋅ Paola Ruiz Puentes ⋅ Jillian Pearse ⋅ Pablo Arbelaez

Dec 5, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Petrography is a branch of geology that analyzes the mineralogical composition of rocks from microscopical thin section samples. It is essential for understanding rock properties across geology, archaeology, engineering, mineral exploration, and the oil industry. However, petrography is a labor-intensive task requiring experts to conduct detailed visual examinations of thin section samples through optical polarization microscopes, thus hampering scalability and highlighting the need for automated techniques. To address this challenge, we introduce the Large-scale Imaging and Thin section Optical-polarization Set (LITHOS), the largest and most diverse publicly available experimental framework for automated petrography. LITHOS includes 211,604 high-resolution RGB patches of polarized light and 105,802 expert-annotated grains across 25 mineral categories. Each annotation consists of the mineral class, spatial coordinates, and expert-defined major and minor axes represented as intersecting vector paths, capturing grain geometry and orientation. We evaluate multiple deep learning techniques for mineral classification in LITHOS and propose a dual-encoder transformer architecture that integrates both polarization modalities as a strong baseline for future reference. Our method consistently outperforms single-polarization models, demonstrating the value of polarization synergy in mineral classification. We have made the LITHOS Benchmark publicly available, comprising our dataset, code, and pretrained models, to foster reproducibility and further research in automated petrographic analysis.

View full details

Poster

QCircuitBench: A Large-Scale Dataset for Benchmarking Quantum Algorithm Design

Rui Yang ⋅ Ziruo Wang ⋅ Yuntian Gu ⋅ Yitao Liang ⋅ Tongyang Li

Dec 5, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Quantum computing is an emerging field recognized for the significant speedup it offers over classical computing through quantum algorithms. However, designing and implementing quantum algorithms pose challenges due to the complex nature of quantum mechanics and the necessity for precise control over quantum states. Despite the significant advancements in AI, there has been a lack of datasets specifically tailored for this purpose. In this work, we introduce QCircuitBench, the first benchmark dataset designed to evaluate AI's capability in designing and implementing quantum algorithms in the form of quantum circuit codes. Unlike using AI for writing traditional codes, this task is fundamentally more complicated due to highly flexible design space. Our key contributions include: 1. A general framework which formulates the key features of quantum algorithm design task for Large Language Models.2. Implementation for quantum algorithms from basic primitives to advanced applications, spanning 3 task suites, 25 algorithms, and 120,290 data points.3. Automatic validation and verification functions, allowing for iterative and interactive evaluation without human inspection.4. Promising potential as a training dataset through primitive fine-tuning results.We observed several interesting experimental phenomena: fine-tuning does not always outperform few-shot learning, and LLMs tend to exhibit consistent error patterns. QCircuitBench provides a comprehensive benchmark for AI-driven quantum algorithm design, while also revealing some limitations of LLMs in this domain.

View full details

Poster

NAVIX: Scaling MiniGrid Environments with JAX

Eduardo Pignatelli ⋅ Jarek Liesen ⋅ Robert Lange ⋅ Chris Lu ⋅ Pablo Samuel Castro ⋅ Laura Toni

Dec 5, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

As Deep Reinforcement Learning (Deep RL) research moves towards solving large-scale worlds, efficient environment simulations become crucial for rapid experimentation. However, most existing environments struggle to scale to high throughput, setting back meaningful progress. Interactions are typically computed on the CPU, limiting training speed and throughput, due to slower computation and communication overhead when distributing the task across multiple machines. Ultimately, Deep RL training is CPU-bound, and developing batched, fast, and scalable environments has become a frontier for progress. Among the most used Reinforcement Learning (RL) environments, MiniGrid is at the foundation of several studies on exploration, curriculum learning, representation learning, diversity, meta-learning, credit assignment, and language-conditioned RL, and still suffers from the limitations described above. In this work, we introduce NAVIX, a re-implementation of MiniGrid in JAX. NAVIX achieves over $160\,000\times$ speed improvements in batch mode, supporting up to 2048 agents in parallel on a single Nvidia A100 80 GB. This reduces experiment times from one week to 15 minutes, promoting faster design iterations and more scalable RL model development.

View full details

Poster

A Multi-Task Benchmark for Abusive Language Detection in Low-Resource Settings

Fitsum Gaim ⋅ Hoyun Song ⋅ Huije Lee ⋅ Changgeon Ko ⋅ Euijun Hwang ⋅ Jong C. Park

Dec 5, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Content moderation research has recently made significant advances, but remains limited in serving the majority of the world's languages due to the lack of resources, leaving millions of vulnerable users to online hostility. This work presents a large-scale human-annotated multi-task benchmark dataset for abusive language detection in Tigrinya social media with joint annotations for three tasks: abusiveness, sentiment, and topic classification. The dataset comprises 13,717 YouTube comments annotated by nine native speakers, collected from 7,373 videos with a total of over 1.2 billion views across 51 channels. We developed an iterative term clustering approach for effective data selection. Recognizing that around 64% of Tigrinya social media content uses Romanized transliterations rather than native Ge'ez script, our dataset accommodates both writing systems to reflect actual language use. We establish strong baselines across the tasks in the benchmark, while leaving significant challenges for future contributions. Our experiments demonstrate that small fine-tuned models outperform prompted frontier large language models (LLMs) in the low-resource setting, achieving 86.67% F1 in abusiveness detection (7+ points over best LLM), and maintain stronger performance in all other tasks. The benchmark is made public to promote research on online safety.

View full details

Poster

Multi-SWE-bench: A Multilingual Benchmark for Issue Resolving

Daoguang Zan ⋅ Zhirong Huang ⋅ Wei Liu ⋅ Hanwu Chen ⋅ Shulin Xin ⋅ Linhao Zhang ⋅ Qi Liu ⋅ Li Aoyan ⋅ Lu Chen ⋅ Xiaojian Zhong ⋅ Siyao Liu ⋅ Yongsheng Xiao ⋅ Liangqiang Chen ⋅ Yuyu Zhang ⋅ Jing Su ⋅ Tianyu Liu ⋅ RUI LONG ⋅ Ming Ding ⋅ liang xiang

Dec 5, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

The task of issue resolving aims to modify a codebase to generate a patch that addresses a given issue. However, most existing benchmarks focus almost exclusively on Python, making them insufficient for evaluating Large Language Models (LLMs) across different programming languages. To bridge this gap, we introduce a multilingual issue-resolving benchmark, called Multi-SWE-bench, covering 8 languages of Python, Java, TypeScript, JavaScript, Go, Rust, C, and C++. In particular, this benchmark includes a total of 2,132 high-quality instances, carefully curated by 68 expert annotators, ensuring a reliable and accurate evaluation of LLMs on the issue-resolving task. Based on human-annotated results, the issues are further classified into three difficulty levels. We evaluate a series of state-of-the-art models on Multi-SWE-bench, utilizing both procedural and agent-based frameworks for issue resolving. Our experiments reveal three key findings: (1) Limited generalization across languages: While existing LLMs perform well on Python issues, their ability to generalize across other languages remains limited; (2) Performance aligned with human-annotated difficulty: LLM-based agents' performance closely aligns with human-assigned difficulty, with resolution rates decreasing as issue complexity rises; and (3) Performance drop on cross-file issues: The performance of current methods significantly deteriorates when handling cross-file issues. These findings highlight the limitations of current LLMs and underscore the need for more robust models capable of handling a broader range of programming languages and complex issue scenarios.

View full details

Poster

RSCC: A Large-Scale Remote Sensing Change Caption Dataset for Disaster Events

Zhenyuan Chen ⋅ Chenxi Wang ⋅ Ningyu Zhang ⋅ Feng Zhang

Dec 5, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Remote sensing is critical for disaster monitoring, yet existing datasets lack temporal image pairs and detailed textual annotations. While single-snapshot imagery dominates current resources, it fails to capture dynamic disaster impacts over time. To address this gap, we introduce the Remote Sensing Change Caption (RSCC) dataset, a large-scale benchmark comprising 62,351 pre-/post-disaster image pairs (spanning earthquakes, floods, wildfires, and more) paired with rich, human-like change captions. By bridging the temporal and semantic divide in remote sensing data, RSCC enables robust training and evaluation of vision-language models for disaster-aware bi-temporal understanding. Our results highlight RSCC’s ability to facilitate detailed disaster-related analysis, paving the way for more accurate, interpretable, and scalable vision-language applications in remote sensing. Code and dataset are available at https://github.com/Bili-Sakura/RSCC.

View full details

Poster

Sheetpedia: A 300K-Spreadsheet Corpus for Spreadsheet Intelligence and LLM Fine-Tuning

Zailong Tian ⋅ Zhuoheng Han ⋅ Houfeng Wang ⋅ Lizi Liao

Dec 5, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Spreadsheets are widely used for data analysis and reporting, yet their complex structure and formula logic pose significant challenges for AI systems. We introduce Sheetpedia, a large-scale corpus of over 290,000 diverse spreadsheets (from 324,000+ workbooks) compiled from enterprise email archives and online forums. We detail a rigorous collection and preprocessing pipeline (integrating the Enron email spreadsheet archive and the Fuse web corpus, plus a new crawl of Excel forums) to standardize formats, filter languages, and remove duplicates. Sheetpedia provides extensive coverage of real formulas and annotations – addressing a gap left by prior table datasets (e.g. web tables used in TURL or Text-to-SQL in Spider) which often lack formula semantics. We present comprehensive corpus statistics, highlighting rich formula diversity and a majority (78\%+) of English content. To demonstrate the corpus’s utility, we fine-tune large language models on Sheetpedia for two novel spreadsheet understanding tasks: Natural Language to Semantic Range (NL2SR) and Natural Language to Formula (NL2Formula). Using a rejection-sampling data generation strategy, our fine-tuned models achieve up to 97.5\% accuracy on NL2SR and 71.7\% on NL2Formula – substantially outperforming baseline approaches. Sheetpedia (to be released publicly) fills a crucial need for a large, high-quality spreadsheet benchmark, enabling more effective spreadsheet intelligence and natural language interfaces for spreadsheet tools.

View full details

Poster

PSBench: a large-scale benchmark for estimating the accuracy of protein complex structural models

Pawan Neupane ⋅ Jian Liu ⋅ Jianlin Cheng

Dec 5, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Predicting protein complex structures is essential for protein function analysis, protein design, and drug discovery. While AI methods like AlphaFold can predict accurate structural models for many protein complexes, reliably estimating the quality of these predicted models (estimation of model accuracy, or EMA) for model ranking and selection remains a major challenge. A key barrier to developing effective machine learning-based EMA methods is the lack of large, diverse, and well-annotated datasets for training and evaluation. To address this gap, we introduce PSBench, a benchmark suite comprising five large-scale, labeled datasets, four of which were generated during the 15th and 16th community-wide Critical Assessment of Protein Structure Prediction (CASP15 and CASP16), and one curated for new Protein Data Bank (PDB) entries deposited between July 2024 and August 2025. PSBench includes over 1.4 million structural models covering a wide range of protein sequence lengths, complex stoichiometries, functional classes, and modeling difficulties. Each model is annotated with multiple complementary quality scores at the global, local, and interface levels. PSBench also provides multiple evaluation metrics and baseline EMA methods to facilitate rigorous comparisons. To demonstrate PSBench’s utility, we trained and evaluated GATE, a graph transformer-based EMA method, on the CASP15 data. GATE was blindly tested in CASP16 (2024), where it ranked among the top-performing EMA methods. These results highlight PSBench as a valuable resource for advancing EMA research in protein complex modeling. PSBench is publicly available at: https://github.com/BioinfoMachineLearning/PSBench.

View full details

Poster

OpenUnlearning: Accelerating LLM Unlearning via Unified Benchmarking of Methods and Metrics

Vineeth Dorna ⋅ Anmol Mekala ⋅ Wenlong Zhao ⋅ Andrew McCallum ⋅ Zico Kolter ⋅ Zachary Lipton ⋅ Pratyush Maini

Dec 5, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Robust unlearning is crucial for safely deploying large language models (LLMs) in environments where data privacy, model safety, and regulatory compliance must be ensured. Yet the task is inherently challenging, partly due to difficulties in reliably measuring whether unlearning has truly occurred. Moreover, fragmentation in current methodologies and inconsistent evaluation metrics hinder comparative analysis and reproducibility. To unify and accelerate research efforts, we introduce OpenUnlearning, a standardized and extensible framework designed explicitly for benchmarking both LLM unlearning methods and metrics. OpenUnlearning integrates 13 state-of-the-art unlearning algorithms and 16 diverse evaluations across 3 leading benchmarks (TOFU, MUSE, and WMDP) and also enables analyses of forgetting behaviors across 450+ publicly released checkpoints. Leveraging OpenUnlearning, we propose a novel meta-evaluation benchmark focused specifically on assessing the faithfulness and robustness of evaluation metrics themselves. We also benchmark diverse unlearning methods and provide a comparative analysis against an extensive evaluation suite. Overall, we establish a clear, community-driven pathway toward rigorous development in LLM unlearning research.

View full details

Poster

EDBench: Large-Scale Electron Density Data for Molecular Modeling

Hongxin Xiang ⋅ Ke Li ⋅ Mingquan Liu ⋅ Zhixiang Cheng ⋅ Bin Yao ⋅ Wenjie Du ⋅ Jun Xia ⋅ Li Zeng ⋅ Xin Jin ⋅ xiangxiang Zeng

Dec 5, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Existing molecular machine learning force fields (MLFFs) generally focus on the learning of atoms, molecules, and simple quantum chemical properties (such as energy and force), but ignore the importance of electron density (ED) $\rho(r)$ in accurately understanding molecular force fields (MFFs). ED describes the probability of finding electrons at specific locations around atoms or molecules, which uniquely determines all ground state properties (such as energy, molecular structure, etc.) of interactive multi-particle systems according to the Hohenberg-Kohn theorem. However, the calculation of ED relies on the time-consuming first-principles density functional theory (DFT), which leads to the lack of large-scale ED data and limits its application in MLFFs. In this paper, we introduce EDBench, a large-scale, high-quality dataset of ED designed to advance learning-based research at the electronic scale. Built upon the PCQM4Mv2, EDBench provides accurate ED data, covering 3.3 million molecules. To comprehensively evaluate the ability of models to understand and utilize electronic information, we design a suite of ED-centric benchmark tasks spanning prediction, retrieval, and generation. Our evaluation of several state-of-the-art methods demonstrates that learning from EDBench is not only feasible but also achieves high accuracy. Moreover, we show that learning-based methods can efficiently calculate ED with comparable precision while significantly reducing the computational cost relative to traditional DFT calculations. All data and benchmarks from EDBench will be freely available, laying a robust foundation for ED-driven drug discovery and materials science.

View full details

Poster

Gymnasium: A Standard Interface for Reinforcement Learning Environments

Mark Towers ⋅ Ariel Kwiatkowski ⋅ John Balis ⋅ Gianluca De Cola ⋅ Tristan Deleu ⋅ Manuel Goulão ⋅ Kallinteris Andreas ⋅ Markus Krimmel ⋅ Arjun KG ⋅ Rodrigo Perez-Vicente ⋅ J Terry ⋅ Andrea Pierré ⋅ Sander Schulhoff ⋅ Jun Jet Tai ⋅ Hannah Tan ⋅ Omar G. Younis

Dec 5, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Reinforcement Learning (RL) is a continuously growing field that has the potential to revolutionize many areas of artificial intelligence. However, despite its promise, RL research is often hindered by the lack of standardization in environment and algorithm implementations. This makes it difficult for researchers to compare and build upon each other's work, slowing down progress in the field.Gymnasium is an open-source library that provides a standard API for RL environments, aiming to tackle this issue. Gymnasium's main feature is a set of abstractions that allow for wide interoperability between environments and training algorithms, making it easier for researchers to develop and test RL algorithms. In addition, Gymnasium provides a collection of easy-to-use environments, tools for easily customizing environments, and tools to ensure the reproducibility and robustness of RL research.Through this unified framework, Gymnasium significantly streamlines the process of developing and testing RL algorithms, enabling researchers to focus more on innovation and less on implementation details. By providing a standardized platform for RL research, Gymnasium helps to drive forward the field of reinforcement learning and unlock its full potential. Gymnasium is available online at \url{https://github.com/Farama-Foundation/Gymnasium}.

View full details

Poster

Introducing FOReCAst: The Future Outcome Reasoning and Confidence Assessment Benchmark

Zhangdie Yuan ⋅ Zifeng Ding ⋅ Andreas Vlachos

Dec 5, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Forecasting is an important task in many domains. However, existing forecasting benchmarks lack comprehensive confidence assessment, focusing on limited question types, and often consist of artificial questions that do not reflect real-world needs. To address these gaps, we introduce FOReCAst (Future Outcome Reasoning and Confidence Assessment), a benchmark that evaluates models' ability to make predictions and their confidence in them. FOReCAst spans diverse forecasting scenarios involving Boolean questions, timeframe prediction, and quantity estimation, enabling a comprehensive evaluation of both prediction accuracy and confidence calibration for real-world applications.

View full details

Poster

Parameterized Synthetic Text Generation with SimpleStories

Lennart Finke ⋅ Chandan Sreedhara ⋅ Thomas Dooms ⋅ Mat Allen ⋅ Juan Rodriguez ⋅ Noa Nabeshima ⋅ Thomas Marshall ⋅ Dan Braun

Dec 5, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

We present SimpleStories, a large synthetic story dataset in simple language, consisting of 2 million samples each in English and Japanese. Through parameterizing prompts at multiple levels of abstraction, we achieve control over story characteristics at scale, inducing syntactic and semantic diversity. Ablations on a newly trained tiny model suite then show improved sample efficiency and model interpretability in comparison with the TinyStories dataset. We open-source all constituent parts of model creation, hoping to enable novel ways to study the end-to-end training process. As a byproduct, we move the frontier with regards to the fewest-parameter language model that outputs grammatical English.

View full details

Poster

ICPC-Eval: Probing the Frontiers of LLM Reasoning with Competitive Programming Contests

Shiyi Xu ⋅ Hu Yiwen ⋅ Yingqian Min ⋅ Zhipeng Chen ⋅ Xin Zhao ⋅ Ji-Rong Wen

Dec 5, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

With the significant progress of large reasoning models in complex coding and reasoning tasks, existing benchmarks, like LiveCodeBench and CodeElo, are insufficient to evaluate the coding capabilities of large language models (LLMs) in real competition environments. Moreover, current evaluation metrics such as Pass@K fail to capture the reflective abilities of reasoning models. To address these challenges, we propose ICPC-Eval, a top-level competitive coding benchmark designed to probing the frontiers of LLM reasoning. ICPC-Eval includes 118 carefully curated problems from 11 recent ICPC contests held in various regions of the world, offering three key contributions: 1) A challenging realistic ICPC competition scenario, featuring a problem type and difficulty distribution consistent with actual contests. 2) A robust test case generation method and a corresponding local evaluation toolkit, enabling efficient and accurate local evaluation. 3) An effective test-time scaling evaluation metric, Refine@K, which allows iterative repair of solutions based on execution feedback. The results underscore the significant challenge in evaluating complex reasoning abilities: top-tier reasoning models like DeepSeek-R1 often rely on multi-turn code feedback to fully unlock their in-context reasoning potential when compared to non-reasoning counterparts. Furthermore, despite recent advancements in code generation, these models still lag behind top-performing human teams. We release the benchmark at: https://github.com/RUCAIBox/ICPC-Eval

View full details

Poster

WolBanking77: Wolof Banking Speech Intent Classification Dataset

Abdou Karim KANDJI ⋅ Frederic Precioso ⋅ Cheikh BA ⋅ Samba NDIAYE ⋅ Augustin NDIONE

Dec 5, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Intent classification models have made a significant progress in recent years. However, previous studies primarily focus on high-resource language datasets, which results in a gap for low-resource languages and for regions with high rates of illiteracy, where languages are more spoken than read or written. This is the case in Senegal, for example, where Wolof is spoken by around 90\% of the population, while the national illiteracy rate remains at of 42\%. Wolof is actually spoken by more than 10 million people in West African region. To address these limitations, we introduce the Wolof Banking Speech Intent Classification Dataset (WolBanking77), for academic research in intent classification. WolBanking77 currently contains 9,791 text sentences in the banking domain and more than 4 hours of spoken sentences. Experiments on various baselines are conducted in this work, including text and voice state-of-the-art models. The results are very promising on this current dataset. In addition, this paper presents an in-depth examination of the dataset’s contents. We report baseline F1-scores and word error rates metrics respectively on NLP and ASR models trained on WolBanking77 dataset and also comparisons between models. Dataset and code available at: [wolbanking77](https://github.com/abdoukarim/wolbanking77).

View full details

Poster

NerfBaselines: Consistent and Reproducible Evaluation of Novel View Synthesis Methods

Jonas Kulhanek ⋅ Torsten Sattler

Dec 5, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Novel view synthesis is an important problem with many applications, including AR/VR, gaming, and robotic simulations. With the recent rapid development of Neural Radiance Fields (NeRFs) and 3D Gaussian Splatting (3DGS) methods, it is becoming difficult to keep track of the current state of the art (SoTA) due to methods using different evaluation protocols, codebases being difficult to install and use, and methods not generalizing well to novel 3D scenes. In our experiments, we show that even tiny differences in the evaluation protocols of various methods can artificially boost the performance of these methods. This raises questions about the validity of quantitative comparisons performed in the literature. To address these questions, we propose NerfBaselines, an evaluation framework which provides consistent benchmarking tools, ensures reproducibility, and simplifies the installation and use of various methods. We validate our implementation experimentally by reproducing the numbers reported in the original papers. For improved accessibility, we release a web platform that compares commonly used methods on standard benchmarks. We strongly believe NerfBaselines is a valuable contribution to the community as it ensures that quantitative results are comparable and thus truly measure progress in the field of novel view synthesis.

View full details

Poster

MMAR: A Challenging Benchmark for Deep Reasoning in Speech, Audio, Music, and Their Mix

Ziyang Ma ⋅ Yinghao Ma ⋅ Yanqiao Zhu ⋅ Chen Yang ⋅ Yi-Wen Chao ⋅ Ruiyang Xu ⋅ Wenxi Chen ⋅ Yuanzhe Chen ⋅ Zhuo Chen ⋅ Jian Cong ⋅ Kai Li ⋅ Keliang Li ⋅ Siyou Li ⋅ Xinfeng Li ⋅ Xiquan Li ⋅ Zheng Lian ⋅ Yuzhe Liang ⋅ Minghao Liu ⋅ Zhikang Niu ⋅ Tianrui Wang ⋅ Wang Yuping ⋅ Yuxuan Wang ⋅ Yihao Wu ⋅ Guanrou Yang ⋅ Jianwei Yu ⋅ Ruibin Yuan ⋅ Zhisheng Zheng ⋅ Ziya Zhou ⋅ Haina Zhu ⋅ Wei Xue ⋅ Emmanouil Benetos ⋅ Kai Yu ⋅ Eng-Siong Chng ⋅ Xie Chen

Dec 5, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

We introduce MMAR, a new benchmark designed to evaluate the deep reasoning capabilities of Audio-Language Models (ALMs) across massive multi-disciplinary tasks. MMAR comprises 1,000 meticulously curated audio-question-answer triplets, collected from real-world internet videos and refined through iterative error corrections and quality checks to ensure high quality. Unlike existing benchmarks that are limited to specific domains of sound, music, or speech, MMAR extends them to a broad spectrum of real-world audio scenarios, including mixed-modality combinations of sound, music, and speech. Each question in MMAR is hierarchically categorized across four reasoning layers: Signal, Perception, Semantic, and Cultural, with additional sub-categories within each layer to reflect task diversity and complexity. To further foster research in this area, we annotate every question with a Chain-of-Thought (CoT) rationale to promote future advancements in audio reasoning. Each item in the benchmark demands multi-step deep reasoning beyond surface-level understanding. Moreover, a part of the questions requires graduate-level perceptual and domain-specific knowledge, elevating the benchmark's difficulty and depth. We evaluate MMAR using a broad set of models, including Large Audio-Language Models (LALMs), Large Audio Reasoning Models (LARMs), Omni Language Models (OLMs), Large Language Models (LLMs), and Large Reasoning Models (LRMs), with audio caption inputs. The performance of these models on MMAR highlights the benchmark's challenging nature, and our analysis further reveals critical limitations of understanding and reasoning capabilities among current models. These findings underscore the urgent need for greater research attention in audio-language reasoning, including both data and algorithm innovation. We hope MMAR will serve as a catalyst for future advances in this important but little-explored area.

View full details

Poster

InternScenes: A Large-scale Simulatable Indoor Scene Dataset with Realistic Layouts

Weipeng Zhong ⋅ Peizhou Cao ⋅ Yichen Jin ⋅ Luo Li ⋅ Wenzhe Cai ⋅ Jingli Lin ⋅ Hanqing Wang ⋅ Zhaoyang Lyu ⋅ Tai WANG ⋅ Xudong XU ⋅ Bo Dai ⋅ Jiangmiao Pang

Dec 5, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

The advancement of Embodied AI heavily relies on large-scale, simulatable 3D scene datasets characterized by scene diversity and realistic layouts.However, existing datasets typically suffer from limitations in data scale or diversity, sanitized layouts lacking small items, and severe object collisions.To address these shortcomings, we introduce \textbf{InternScenes}, a novel large-scale simulatable indoor scene dataset comprising approximately 40,000 diverse scenes by integrating three disparate scene sources, \ie, real-world scans, procedurally generated scenes, and designer-created scenes, including 1.96M 3D objects and covering 15 common scene types and 288 object classes.We particularly preserve massive small items in the scenes, resulting in realistic and complex layouts with an average of 41.5 objects per region.Our comprehensive data processing pipeline ensures simulatability by creating real-to-sim replicas for real-world scans, enhances interactivity by incorporating interactive objects into these scenes, and resolves object collisions by physical simulations.We demonstrate the value of InternScenes with two benchmark applications: scene layout generation and point-goal navigation. Both show the new challenges posed by the complex and realistic layouts. More importantly, InternScenes paves the way for scaling up the model training for both tasks, making the generation and navigation in such complex scenes possible. We commit to open-sourcing the data and benchmarks to benefit the whole community.

View full details

Poster

A Multimodal Benchmark for Framing of Oil & Gas Advertising and Potential Greenwashing Detection

Gaku Morio ⋅ Harri Rowlands ⋅ Dominik Stammbach ⋅ Christopher D Manning ⋅ Peter Henderson

Dec 5, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Companies spend large amounts of money on public relations campaigns to project a positive brand image.However, sometimes there is a mismatch between what they say and what they do. Oil & gas companies, for example, are accused of "greenwashing" with imagery of climate-friendly initiatives.Understanding the framing, and changes in framing, at scale can help better understand the goals and nature of public relation campaigns.To address this, we introduce a benchmark dataset of expert-annotated video ads obtained from Facebook and YouTube.The dataset provides annotations for 13 framing types for more than 50 companies or advocacy groups across 20 countries.Our dataset is especially designed for the evaluation of vision-language models (VLMs), distinguishing it from past text-only framing datasets.Baseline experiments show some promising results, while leaving room for improvement for future work: GPT-4.1 can detect environmental messages with 79% F1 score, while our best model only achieves 46% F1 score on identifying framing around green innovation.We also identify challenges that VLMs must address, such as implicit framing, handling videos of various lengths, or implicit cultural backgrounds.Our dataset contributes to research in multimodal analysis of strategic communication in the energy sector.

View full details

Poster

ResearchCodeBench: Benchmarking LLMs on Implementing Novel Machine Learning Research Code

Tianyu Hua ⋅ Harper Hua ⋅ Violet Xiang ⋅ Benjamin Klieger ⋅ Sang Truong ⋅ Weixin Liang ⋅ Fan-Yun Sun ⋅ Nick Haber

Dec 5, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Large language models (LLMs) have shown promise in transforming machine learning research, yet their capability to faithfully implement genuinely novel ideas from recent research papers—ideas unseen during pretraining—remains unclear. We introduce ResearchCodeBench, a benchmark that evaluates LLMs’ ability to translate cutting-edge ML contributions from top 2024-2025 research papers into executable code. We assessed 30+ proprietary and open-source LLMs, finding that even the best models correctly implement less than 40% of the code. We present empirical findings on performance comparison, contamination, and error patterns. By providing a rigorous evaluation platform, ResearchCodeBench enables continuous understanding and advancement of LLM-driven innovation in research code generation.

View full details

Poster

MindGYM: What Matters in Question Synthesis for Thinking-Centric Fine-Tuning?

Zhe Xu ⋅ Daoyuan Chen ⋅ Zhenqing Ling ⋅ Yaliang Li ⋅ Ying Shen

Dec 5, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Large foundation models face challenges in acquiring transferable, structured thinking abilities, especially when supervised with rigid templates or crowd-annotated instruction datasets. Unlike prior approaches, we focus on a thinking-centric data synthesis paradigm that enables models to evolve through self-generated, cognitively guided data. We propose MindGYM, a structured and scalable framework for question synthesis, composed of: (1) Cognitive Thinking Process Injection, which infuses high-level reasoning objectives to shape the model’s synthesis behavior; (2) Seed Single-Hop Question Synthesis, generating atomic questions from diverse semantic types to encourage broader thinking; and (3) Challenging Multi-Hop QA Synthesis, composing more complex multi-hop questions based on QA seeds for deeper reasoning. Detailed analysis shows that synthetic data generated by our method achieves 16.7% higher average quality and 67.91% lower quality variance compared to baseline sources, highlighting that both high-quality and self-contained data are essential for effective, thinking-oriented fine-tuning. MindGYM improves performance on six reasoning benchmarks, achieving gains of up to 16% on MathVision using only 400 data samples, and generalizable improvements across different model sizes and architectures. MindGYM underscores the viability of self-challenging mechanisms in refining large model capabilities while minimizing human intervention and resource demands.Code and data are released to promote data-centric research into self-evolving foundation models driven by their internal reasoning capabilities.

View full details

Poster

The Illusion of Progress? A Critical Look at Test-Time Adaptation for Vision-Language Models

Lijun Sheng ⋅ Jian Liang ⋅ Ran He ⋅ Zilei Wang ⋅ Tieniu Tan

Dec 5, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Test-time adaptation (TTA) methods have gained significant attention for enhancing the performance of vision-language models (VLMs) such as CLIP during inference, without requiring additional labeled data. However, current TTA researches generally suffer from major limitations such as duplication of baseline results, limited evaluation metrics, inconsistent experimental settings, and insufficient analysis. These problems hinder fair comparisons between TTA methods and make it difficult to assess their practical strengths and weaknesses. To address these challenges, we introduce TTA-VLM, a comprehensive benchmark for evaluating TTA methods on VLMs. Our benchmark implements 8 episodic TTA and 7 online TTA methods within a unified and reproducible framework, and evaluates them across 15 widely used datasets. Unlike prior studies focused solely on CLIP, we extend the evaluation to SigLIP—a model trained with a Sigmoid loss—and include training-time tuning methods such as CoOp, MaPLe, and TeCoA to assess generality. Beyond classification accuracy, TTA-VLM incorporates various evaluation metrics, including robustness, calibration, out-of-distribution detection, and stability, enabling a more holistic assessment of TTA methods. Through extensive experiments, we find that 1) existing TTA methods produce limited gains compared to the previous pioneering work; 2) current TTA methods exhibit poor collaboration with training-time fine-tuning methods; 3) accuracy gains frequently come at the cost of reduced model trustworthiness. We release TTA-VLM to provide fair comparison and comprehensive evaluation of TTA methods for VLMs, and we hope it encourages the community to develop more reliable and generalizable TTA strategies. The code is available in https://github.com/TomSheng21/tta-vlm.

View full details

Poster

Establishing Best Practices in Building Rigorous Agentic Benchmarks

Yuxuan Zhu ⋅ Tengjun Jin ⋅ Yada Pruksachatkun ⋅ Andy Zhang ⋅ Shu Liu ⋅ Sasha Cui ⋅ Sayash Kapoor ⋅ Shayne Longpre ⋅ Kevin Meng ⋅ Rebecca Weiss ⋅ Fazl Barez ⋅ Rahul Gupta ⋅ Jwala Dhamala ⋅ Jacob Merizian ⋅ Mario Giulianelli ⋅ Harry Coppock ⋅ Cozmin Ududec ⋅ Antony Kellermann ⋅ Jasjeet Sekhon ⋅ Jacob Steinhardt ⋅ Sarah Schwettmann ⋅ Arvind Narayanan ⋅ Matei A Zaharia ⋅ Ion Stoica ⋅ Percy Liang ⋅ Daniel Kang

Dec 5, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Benchmarks are essential for quantitatively tracking progress in AI. As AI agents become increasingly capable, researchers and practitioners have introduced agentic benchmarks to evaluate agents on complex, real-world tasks. These benchmarks typically measure agent capabilities by evaluating task outcomes via specific reward designs. However, we show that many agentic benchmarks have issues in task setup or reward design. For example, SWE-bench-Verified uses insufficient test cases, while $\tau$-bench counts empty responses as successes. Such issues can lead to under- or overestimation of agents’ performance by up to 100% in relative terms. To make agentic evaluation rigorous, we introduce the Agentic Benchmark Checklist (ABC), a set of guidelines that we synthesized from our benchmark-building experience, a survey of best practices, and previously reported issues. When applied to CVE-Bench, a benchmark with a particularly complex evaluation design, ABC reduces performance overestimation by 33%.

View full details

Poster

Bag of Tricks for Inference-time Computation of LLM Reasoning

Fan LIU ⋅ Wen-Shuo Chao ⋅ Naiqiang Tan ⋅ Hao Liu

Dec 5, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

With the advancement of large language models (LLMs), solving complex tasks (e.g., math problems, code generation, etc.) has garnered increasing attention. Inference-time computation methods (e.g., Best-of-N, MCTS, etc.) are of significant importance, as they have the potential to enhance the reasoning capabilities of LLMs without requiring external training computation. However, due to the inherent challenges of this technique, most existing methods remain proof-of-concept and are not yet sufficiently effective. In this paper, we investigate and benchmark strategies for improving inference-time computation across a wide range of reasoning tasks. Since most current methods rely on a pipeline that first generates candidate solutions (e.g., generating chain-of-thought candidate solutions) and then selects them based on specific reward signals (e.g., RLHF reward, process reward, etc.), our research focuses on strategies for both candidate solution generation (e.g., instructing prompts, hyperparameters: temperature and top-p, etc.) and reward mechanisms (e.g., self-evaluation, reward types, etc.). The experimental results reveal that several previously overlooked strategies can be critical for the success of inference-time computation (e.g., simplifying the temperature can improve general reasoning task performance by up to 5%). Based on extensive experiments (more than 20,000 A100-80G GPU hours with over 1,000 experiments) across a variety of models (e.g., Llama, Qwen, and Mistral families) of various sizes, our proposed strategies outperform the baseline by a substantial margin in most cases, providing a stronger foundation for future research.

View full details

Poster

IR3D-Bench: Evaluating Vision-Language Model Scene Understanding as Agentic Inverse Rendering

Hengyu Liu ⋅ Chenxin Li ⋅ Zhengxin Li ⋅ Yipeng Wu ⋅ Wuyang Li ⋅ Zhiqin Yang ⋅ Zhenyuan Zhang ⋅ Yunlong Lin ⋅ Sirui Han ⋅ Brandon Feng

Dec 5, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Vision-language models (VLMs) excel at descriptive tasks, but whether they truly understand scenes from visual observations remains uncertain. We introduce IR3D-Bench, a benchmark challenging VLMs to demonstrate understanding through active creation rather than passive recognition. Grounded in the analysis-by-synthesis paradigm, IR3D-Bench tasks Vision-Language Agents (VLAs) with actively using programming and rendering tools to recreate the underlying 3D structure of an input image, achieving agentic inverse rendering through tool use. This ''understanding-by-creating'' approach probes the tool-using generative capacity of VLAs, moving beyond the descriptive or conversational capacity measured by traditional scene understanding benchmarks. We provide a comprehensive suite of metrics to evaluate geometric accuracy, spatial relations, appearance attributes, and overall plausibility. Initial experiments on agentic inverse rendering powered by various state-of-the-art VLMs highlight current limitations, particularly in visual precision rather than basic tool usage. IR3D-Bench, including data and evaluation protocols, is released to facilitate systematic study and development of tool-using VLAs towards genuine scene understanding by creating.

View full details

Poster

Is This Tracker On? A Benchmark Protocol for Dynamic Tracking

Ilona Demler ⋅ Saumya Chauhan ⋅ Georgia Gkioxari

Dec 5, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

We introduce ITTO, a challenging new benchmark suite for evaluating and diagnosing the capabilities and limitations of point tracking methods. Our videos are sourced from existing datasets and egocentric real-world recordings, with high-quality human annotations collected through a multi-stage pipeline. ITTO captures the motion complexity, occlusion patterns, and object diversity characteristic of real-world scenes -- factors that are largely absent in current benchmarks. We conduct a rigorous analysis of state-of-the-art tracking methods on ITTO, breaking down performance along key axes of motion complexity. Our findings reveal that existing trackers struggle with these challenges, particularly in re-identifying points after occlusion, highlighting critical failure modes. These results point to the need for new modeling approaches tailored to real-world dynamics. We envision ITTO as a foundation testbed for advancing point tracking and guiding the development of more robust tracking algorithms.

View full details

Poster

THUNDER: Tile-level Histopathology image UNDERstanding benchmark

Pierre Marza ⋅ Leo Fillioux ⋅ Sofiène Boutaj ⋅ KUNAL MAHATHA ⋅ Christian Desrosiers ⋅ Pablo Piantanida ⋅ Jose Dolz ⋅ Stergios Christodoulidis ⋅ Maria Vakalopoulou

Dec 5, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Progress in a research field can be hard to assess, in particular when many concurrent methods are proposed in a short period of time. This is the case in digital pathology, where many foundation models have been released recently to serve as feature extractors for tile-level images, being used in a variety of downstream tasks, both for tile- and slide-level problems. Benchmarking available methods then becomes paramount to get a clearer view of the research landscape. In particular, in critical domains such as healthcare, a benchmark should not only focus on evaluating downstream performance, but also provide insights about the main differences between methods, and importantly, further consider uncertainty and robustness to ensure a reliable usage of proposed models. For these reasons, we introduce THUNDER, a tile-level benchmark for digital pathology foundation models, allowing for efficient comparison of many models on diverse datasets with a series of downstream tasks, studying their feature spaces and assessing the robustness and uncertainty of predictions informed by their embeddings. THUNDER is a fast, easy-to-use, dynamic benchmark that can already support a large variety of state-of-the-art foundation, as well as local user-defined models for direct tile-based comparison. In this paper, we provide a comprehensive comparison of 23 foundation models on 16 different datasets covering diverse tasks, feature analysis, and robustness. The code for THUNDER is publicly available at https://github.com/MICS-Lab/thunder.

View full details

Poster

LC-Opt: Benchmarking Reinforcement Learning and Agentic AI for End-to-End Liquid Cooling Optimization in Data Centers

Avisek Naug ⋅ Antonio Guillen-Perez ⋅ Vineet Kumar ⋅ Scott Greenwood ⋅ Wesley Brewer ⋅ Sahand Ghorbanpour ⋅ Ashwin Ramesh Babu ⋅ Vineet Gundecha ⋅ Ricardo Luna Gutierrez ⋅ Soumyendu Sarkar

Dec 5, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Liquid cooling is critical for thermal management in high-density data centers with the rising AI workloads. However, machine learning-based controllers are essential to unlock greater energy efficiency and reliability, promoting sustainability. We present LC-Opt, a Sustainable Liquid Cooling (LC) benchmark environment, for reinforcement learning (RL) control strategies in energy-efficient liquid cooling of high-performance computing (HPC) systems. Built on the baseline of a high-fidelity digital twin of Oak Ridge National Lab's Frontier Supercomputer cooling system, LC-Opt provides detailed Modelica-based end-to-end models spanning site-level cooling towers to data center cabinets and server blade groups. RL agents optimize critical thermal controls like liquid supply temperature, flow rate, and granular valve actuation at the IT cabinet level, as well as cooling tower (CT) setpoints through a Gymnasium interface, with dynamic changes in workloads. This environment creates a multi-objective real-time optimization challenge balancing local thermal regulation and global energy efficiency, and also supports additional components like a heat recovery unit (HRU). We benchmark centralized and decentralized multi-agent RL approaches, demonstrate policy distillation into decision and regression trees for interpretable control, and explore LLM-based methods that explain control actions in natural language through an agentic mesh architecture designed to foster user trust and simplify system management. LC-Opt democratizes access to detailed, customizable liquid cooling models, enabling the ML community, operators, and vendors to develop sustainable data center liquid cooling control solutions.

View full details

Poster

Diagnosing and Addressing Pitfalls in KG-RAG Datasets: Toward More Reliable Benchmarking

Liangliang Zhang ⋅ Zhuorui Jiang ⋅ Hongliang Chi ⋅ Haoyang Chen ⋅ Mohammed ElKoumy ⋅ Fali Wang ⋅ Qiong Wu ⋅ Zhengyi Zhou ⋅ Shirui Pan ⋅ Suhang Wang ⋅ Yao Ma

Dec 5, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Knowledge Graph Question Answering (KGQA) systems rely on high-quality benchmarks to evaluate complex multi-hop reasoning. However, despite their widespread use, popular datasets such as WebQSP and CWQ suffer from critical quality issues, including inaccurate or incomplete ground-truth annotations, poorly constructed questions that are ambiguous, trivial, or unanswerable, and outdated or inconsistent knowledge. Through a manual audit of 16 popular KGQA datasets—including WebQSP and CWQ—we find that the average factual correctness rate is only 57%. To address these issues, we introduce KGQAGen, an LLM-in-the-loop framework that systematically resolves these pitfalls. KGQAGen combines structured knowledge grounding, LLM-guided generation, and symbolic verification to produce challenging and verifiable QA instances. Using KGQAGen, we construct KGQAGen-10k, a 10K-scale benchmark grounded in Wikidata, and evaluate a diverse set of KG-RAG models. Experimental results demonstrate that even state-of-the-art systems struggle on this benchmark, highlighting its ability to expose limitations of existing models. Our findings advocate for more rigorous benchmark construction and position KGQAGen as a scalable framework for advancing KGQA evaluation.

View full details

Poster

The Automated LLM Speedrunning Benchmark: Reproducing NanoGPT Improvements

Bingchen Zhao ⋅ Despoina Magka ⋅ Minqi Jiang ⋅ Xian Li ⋅ Roberta Raileanu ⋅ Tatiana Shavrina ⋅ Jean-Christophe Gagnon-Audet ⋅ Kelvin Niu ⋅ Shagun Sodhani ⋅ Michael Shvartsman ⋅ Andrei Lupu ⋅ Alisia Lupidi ⋅ Karen Hambardzumyan ⋅ Martin Josifoski ⋅ Edan Toledo ⋅ Thomas Foster ⋅ Lucia Cipolina Kun ⋅ Derek Dunfield ⋅ Abhishek Charnalia ⋅ Alexander Miller ⋅ Oisin Mac Aodha ⋅ Jakob Foerster ⋅ Yoram Bachrach

Dec 5, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Rapidly improving large language models (LLMs) have the potential to assist in scientific progress. One critical skill in this endeavor is the ability to faithfully reproduce existing work. To evaluate the capability of AI agents to reproduce complex code in an active research area, we introduce the Automated LLM Speedrunning Benchmark, leveraging the research community's contributions to the $\textit{NanoGPT speedrun}$, a competition to train a GPT-2 model in the shortest time. Each of the 19 speedrun tasks provides the agent with the previous record's training script, optionally paired with one of three hint formats, ranging from pseudocode to paper-like descriptions of the new record's improvements. Records execute quickly by design and speedrun improvements encompass diverse code-level changes, ranging from high-level algorithmic advancements to hardware-aware optimizations. These features make the benchmark both accessible and realistic for the frontier problem of improving LLM training. We find that recent frontier reasoning LLMs combined with SoTA scaffolds struggle to reimplement already-known innovations in our benchmark, even when given detailed hints. Our benchmark thus provides a simple, non-saturated measure of an LLM's ability to automate scientific reproduction, a necessary (but not sufficient) skill for an autonomous research agent.

View full details

Poster

AneuG-Flow: A Large-Scale Synthetic Dataset of Diverse Intracranial Aneurysm Geometries and Hemodynamics

Wenhao Ding ⋅ Yiying Sheng ⋅ Simão de Castro ⋅ Hwa Leo ⋅ Choon Hwai Yap

Dec 5, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Hemodynamics has a substantial influence on normal cardiovascular growth and disease formation, but requires time-consuming simulations to obtain. Deep Learning algorithms to rapidly predict hemodynamics parameters can be very useful, but their development is hindered by the lack of large dataset on anatomic geometries and associated fluid dynamics. This paper presents a new large-scale dataset of intracranial aneurysm (IA) geometries and hemodynamics to support the development of neural operators to solve geometry-dependent flow governing partial differential equations. The dataset includes 14,000 steady-flow cases and 200 pulsatile-flow cases simulated with computational fluid dynamics. All cases are computed using a laminar flow setup with more than 3 million cells. Boundary conditions are defined as a parabolic velocity profile with a realistic waveform over time at the inlet, and geometry-dependent mass flow split ratios at the two downstream outlets. The geometries are generated by a deep generative model trained on a cohort of 109 real IAs located at the middle cerebral artery bifurcation, capturing a wide range of geometric variations in both aneurysm sacs and parent vessels. Simulation results shows substantial influence of geometry on fluid forces and flow patterns. In addition to surface mesh files, the dataset provides volume data of velocity, pressure, and wall shear stresses (WSS). For transient cases, spatial and temporal gradients of velocity and pressure are also included. The dataset is tested with PointNet and graph U-Nets for WSS prediction, which showed relative L2 loss of 4.67\% for normalized WSS pattern.

View full details

Poster

Security Challenges in AI Agent Deployment: Insights from a Large Scale Public Competition

Andy Zou ⋅ Maxwell Lin ⋅ Eliot Jones ⋅ Micha Nowak ⋅ Mateusz Dziemian ⋅ Nick Winter ⋅ Valent Nathanael ⋅ Ayla Croft ⋅ Xander Davies ⋅ Jai Patel ⋅ Robert Kirk ⋅ Yarin Gal ⋅ Dan Hendrycks ⋅ Zico Kolter ⋅ Matt Fredrikson

Dec 5, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

AI agents are rapidly being deployed across diverse industries, but can they adhere to deployment policies under attacks? We organized a one-month red teaming challenge---the largest of its kind to date---involving expert red teamers attempting to elicit policy violations from AI agents powered by $22$ frontier LLMs. Our challenge collected $1.8$ million prompt injection attacks, resulting in over $60,000$ documented successful policy violations, revealing critical vulnerabilities. Utilizing this extensive data, we construct a challenging AI agent red teaming benchmark, currently achieving near $100\%$ attack success rates across all tested agents and associated policies. Our further analysis reveals high transferability and universality of successful attacks, underscoring the scale and criticality of existing AI agent vulnerabilities. We also observe minimal correlation between agent robustness and factors such as model capability, size, or inference compute budget, highlighting the necessity of substantial improvements in defense. We hope our benchmark and insights drive further research toward more secure and reliable AI agents.

View full details

Poster

Struct-Bench: A Benchmark for Differentially Private Structured Text Generation

Shuaiqi Wang ⋅ Vikas Raunak ⋅ Arturs Backurs ⋅ Victor Reis ⋅ Pei Zhou ⋅ Sihao Chen ⋅ Longqi Yang ⋅ Zinan Lin ⋅ Sergey Yekhanin ⋅ Giulia Fanti

Dec 5, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Differentially private (DP) synthetic data generation is a promising technique for utilizing private datasets that otherwise cannot be exposed for model training or other analytics. While much research literature has focused on generating private unstructured text and image data, in enterprise settings, structured data (e.g., tabular) is more common, often including natural language fields or components. Existing synthetic data evaluation techniques (e.g., FID) struggle to capture the structural properties and correlations of such datasets. In this work, we propose Struct-Bench, a framework and benchmark for evaluating synthetic datasets derived from structured datasets that contain natural language data. The Struct-Bench framework requires users to provide a representation of their dataset structure as a Context-Free Grammar (CFG). Our benchmark comprises 5 real-world and 2 synthetically generated datasets. We show that these datasets demonstrably present a great challenge even for state-of-the-art DP synthetic data generation methods. Struct-Bench provides reference implementations of different metrics and a leaderboard, offering a standardized platform to benchmark and investigate privacy-preserving synthetic data methods. We also present a case study showing how Struct-Bench improves the synthetic data quality of Private Evolution (PE) on structured data. The benchmark and the leaderboard have been publicly made available at https://struct-bench.github.io.

View full details

Poster

LTD-Bench: Evaluating Large Language Models by Letting Them Draw

Liuhao Lin ⋅ Ke Li ⋅ Zihan Xu ⋅ Yuchen Shi ⋅ Yulei Qin ⋅ Yan Zhang ⋅ Xing Sun ⋅ Rongrong Ji

Dec 5, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Current evaluation paradigms for large language models (LLMs) represent a critical blind spot in AI research—relying on opaque numerical metrics that conceal fundamental limitations in spatial reasoning while providing no intuitive understanding of model capabilities. This deficiency creates a dangerous disconnect between reported performance and practical abilities, particularly for applications requiring physical world understanding. We introduce LTD-Bench, a breakthrough benchmark that transforms LLM evaluation from abstract scores to directly observable visual outputs by requiring models to generate drawings through dot matrices or executable code. This approach makes spatial reasoning limitations immediately apparent even to non-experts, bridging the fundamental gap between statistical performance and intuitive assessment. LTD-Bench implements a comprehensive methodology with complementary generation tasks (testing spatial imagination) and recognition tasks (assessing spatial perception) across three progressively challenging difficulty levels, methodically evaluating both directions of the critical language-spatial mapping. Our extensive experiments with state-of-the-art models expose an alarming capability gap: even LLMs achieving impressive results on traditional benchmarks demonstrate profound deficiencies in establishing bidirectional mappings between language and spatial concepts—a fundamental limitation that undermines their potential as genuine world models. Furthermore, LTD-Bench's visual outputs enable powerful diagnostic analysis, offering a potential approach to investigate model similarity. Our dataset and codes are available at https://github.com/walktaster/LTD-Bench.

View full details

Poster

MLLM-ISU: The First-Ever Comprehensive Benchmark for Multimodal Large Language Models based Intrusion Scene Understanding

Fujun Han ⋅ Peng Ye

Dec 5, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Vision-based intrusion detection has multiple applications in practical scenarios, e.g., autonomous driving, intelligent monitoring, and security. Previous works mainly focus on improving the intrusion detection performance, without a comprehensive and in-depth understanding of the intrusion scene. To fill this gap, we explore a novel task called Multimodal Large Language Models based Intrusion Scene Understanding (MLLM-ISU) and report a comprehensive benchmark for the task. Specifically, we first design an effective and automatic visual question-answer generation strategy, constructing a new MLLM-ISU dataset, with 3000 VQA evaluation Pairs, 8925 training Pairs, and six relevant subtasks. Then, we perform a comprehensive assessment on various state-of-the-art proprietary and open-source MLLMs, e.g., DeepSeek-VL2, GPT-4o, Qwen2.5-VL, etc, and find that current MLLMs have weak abilities for this task. Further, in order to improve the intrusion understanding capabilities of current MLLMs, we propose a Post-Training Framework with three sequential training stages, i.e., Intrusion-aware Visual Instruction Pre-training, Intrusion Chain of Thought tuning, and Intrusion-centric VQA tuning, and sufficient experiments and comparisons are conducted to verify the effectiveness of the proposed three-stages training framework. Available datasets and codes: https://github.com/1012537710/MLLM-ISU.

View full details

Poster

CHASM: Unveiling Covert Advertisements on Chinese Social Media

Jingyi Zheng ⋅ Tianyi Hu ⋅ Yule Liu ⋅ Zhen Sun ⋅ Zongmin Zhang ⋅ Zifan Peng ⋅ Wenhan Dong ⋅ Xinlei He

Dec 5, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Current benchmarks for evaluating large language models (LLMs) in social media moderation completely overlook a serious threat: covert advertisements, which disguise themselves as regular posts to deceive and mislead consumers into making purchases, leading to significant ethical and legal concerns. In this paper, we present the CHASM, a first-of-its-kind dataset designed to evaluate the capability of Multimodal Large Language Models (MLLMs) in detecting covert advertisements on social media. CHASM is a high-quality, anonymized, manually curated dataset consisting of 4,992 instances, based on real-world scenarios from the Chinese social media platform Rednote. The dataset was collected and annotated under strict privacy protection and quality control protocols. It includes many product experience sharing posts that closely resemble covert advertisements, making the dataset particularly challenging.The results show that under both zero-shot and in-context learning settings, none of the current MLLMs are sufficiently reliable for detecting covert advertisements.Our further experiments revealed that fine-tuning open-source MLLMs on our dataset yielded noticeable performance gains. However, significant challenges persist, such as detecting subtle cues in comments and differences in visual and textual structures.We provide in-depth error analysis and outline future research directions. We hope our study can serve as a call for the research community and platform moderators to develop more precise defenses against this emerging threat.

View full details

Poster

Señorita-2M: A High-Quality Instruction-based Dataset for General Video Editing by Video Specialists

Bojia Zi ⋅ Penghui Ruan ⋅ Marco Chen ⋅ Xianbiao Qi ⋅ Shaozhe Hao ⋅ Shihao Zhao ⋅ Youze Huang ⋅ Bin Liang ⋅ Rong Xiao ⋅ Kam-Fai Wong

Dec 5, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Video content editing has a wide range of applications. With the advancement of diffusion-based generative models, video editing techniques have made remarkable progress, yet they still remain far from practical usability. Existing inversion-based video editing methods are time-consuming and struggle to maintain consistency in unedited regions. Although instruction-based methods have high theoretical potential, they face significant challenges in constructing high-quality training datasets - current datasets suffer from issues such as editing correctness, frame consistency, and sample diversity. To bridge these gaps, we introduce the **Señorita-2M** dataset, a large-scale, diverse, and high-quality video editing dataset. We systematically categorize editing tasks into 2 classes consisting of 18 subcategories. To build this dataset, we design four new task specialists and employ or modify 14 existing task experts to generate data samples for each subclass. In addition, we design a filtering pipeline at both the visual content and instruction levels to further enhance data quality. This approach ensures the reliability of constructed data. Finally, the **Señorita-2M** dataset comprises 2 million high-fidelity samples with diverse resolutions and frame counts. We trained multiple models using different base video models, i.e., Wan2.1 and CogVideoX-5B, on Señorita-2M, and the results demonstrate that the models exhibit superior visual quality, robust frame-to-frame consistency, and strong instruction following capability. More videos are available at: **https://senorita-2m-dataset.github.io**.

View full details

Poster

Two Causally Related Needles in a Video Haystack

Miaoyu Li ⋅ Qin Chao ⋅ Boyang Li

Dec 5, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Properly evaluating the ability of Video-Language Models (VLMs) to understand long videos remains a challenge. We propose a long-context video understanding benchmark, Causal2Needles, that assesses two crucial abilities insufficiently addressed by existing benchmarks: (1) extracting information from two separate locations (two needles) in a long video and understanding them jointly, and (2) modeling the world in terms of cause and effect in human behaviors. Causal2Needles evaluates these abilities using noncausal one-needle, causal one-needle, and causal two-needle questions. The most complex question type, causal two-needle questions, require extracting information from both the cause and effect events from a long video and the associated narration text. To prevent textual bias, we introduce two complementary question formats: locating the video clip containing the answer, and verbal description of a visual detail from that video clip. Our experiments reveal that models excelling on existing benchmarks struggle with causal 2-needle questions, and the model performance is negatively correlated with the distance between the two needles. These findings highlight critical limitations in current VLMs.

View full details

Poster

Towards Evaluating Proactive Risk Awareness of Multimodal Language Models

Youliang Yuan ⋅ Wenxiang Jiao ⋅ Yuejin Xie ⋅ Chihao Shen ⋅ Menghan Tian ⋅ Wenxuan Wang ⋅ Jen-Tse Huang ⋅ Pinjia He

Dec 5, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Human safety awareness gaps often prevent the timely recognition of everyday risks.In solving this problem, a proactive safety artificial intelligence (AI) system would work better than a reactive one. Instead of just reacting to users' questions, it would actively watch people’s behavior and their environment to detect potential dangers in advance.Our Proactive Safety Bench (PaSBench) evaluates this capability through 416 multimodal scenarios (128 image sequences, 288 text logs) spanning 5 safety-critical domains.Evaluation of 36 advanced models reveals fundamental limitations: Top performers like Gemini-2.5-pro achieve 71\% image and 64\% text accuracy, but miss 45-55\% risks in repeated trials. Through failure analysis, we identify unstable proactive reasoning rather than knowledge deficits as the primary limitation.This work establishes (1) a proactive safety benchmark, (2) systematic evidence of model limitations, and (3) critical directions for developing reliable protective AI. We believe our dataset and findings can promote the development of safer AI assistants that actively prevent harm rather than merely respond to requests.

View full details

Poster

RGB-to-Polarization Estimation: A New Task and Benchmark Study

Beibei Lin ⋅ Zifeng Yuan ⋅ Tingting Chen

Dec 5, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Polarization images provide rich physical information that is fundamentally absent from standard RGB images, benefiting a wide range of computer vision applications such as reflection separation and material classification. However, the acquisition of polarization images typically requires additional optical components, which increases both the cost and the complexity of the applications. To bridge this gap, we introduce a new task: RGB-to-polarization image estimation, which aims to infer polarization information directly from RGB images. In this work, we establish the first comprehensive benchmark for this task by leveraging existing polarization datasets and evaluating a diverse set of state-of-the-art deep learning models, including both restoration-oriented and generative architectures. Through extensive quantitative and qualitative analysis, our benchmark not only establishes the current performance ceiling of RGB-to-polarization estimation, but also systematically reveals the respective strengths and limitations of different model families — such as direct reconstruction versus generative synthesis, and task-specific training versus large-scale pre-training. In addition, we provide some potential directions for future research on polarization estimation. This benchmark is intended to serve as a foundational resource to facilitate the design and evaluation of future methods for polarization estimation from standard RGB inputs.

View full details

Poster

FlySearch: Exploring how vision-language models explore

Adam Pardyl ⋅ Dominik Matuszek ⋅ Mateusz Przebieracz ⋅ Marek Cygan ⋅ Bartosz Zieliński ⋅ Maciej Wolczyk

Dec 5, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

The real world is messy and unstructured. Uncovering critical information often requires active, goal-driven exploration. It remains to be seen whether Vision-Language Models (VLMs), which recently emerged as a popular zero-shot tool in many difficult tasks, can operate effectively in such conditions. In this paper, we answer this question by introducing FlySearch, a 3D, outdoor, photorealistic environment for searching and navigating to objects in complex scenes. We define three sets of scenarios with varying difficulty and observe that state-of-the-art VLMs cannot reliably solve even the simplest exploration tasks, with the gap to human performance increasing as the tasks get harder. We identify a set of central causes, ranging from vision hallucination, through context misunderstanding, to task planning failures, and we show that some of them can be addressed by finetuning. We publicly release the benchmark, scenarios, and the underlying codebase.

View full details

Poster

MARS-VFL: A Unified Benchmark for Vertical Federated Learning with Realistic Evaluation

Wei Shen ⋅ Weiqi Liu ⋅ Mingde Chen ⋅ Wenke Huang ⋅ Mang Ye

Dec 5, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Vertical Federated Learning (VFL) has emerged as a critical privacy-preserving learning paradigm, enabling collaborative model training by leveraging distributed features across clients. However, due to privacy concerns, there are few publicly available real-world datasets for evaluating VFL methods, which poses significant challenges to related research. To bridge this gap, we propose MARS-VFL, a unified benchmark for realistic VFL evaluation. It integrates data from practical applications involving collaboration across different features, maintaining compatibility with the VFL setting. Based on this, we standardize the evaluation of VFL methods from the mainstream aspects of efficiency, robustness, and security. We conduct comprehensive experiments to assess different VFL approaches, providing references for unified evaluation. Furthermore, we are the first to unify the evaluation of robustness challenges in VFL and introduce a new method for addressing robustness challenges, establishing standard baselines for future research.

View full details

Poster

LithoSim: A Large, Holistic Lithography Simulation Benchmark for AI-Driven Semiconductor Manufacturing

Hongquan He ⋅ Zhen Wang ⋅ Jingya Wang ⋅ Tao Wu ⋅ Xuming He ⋅ Bei Yu ⋅ Jingyi Yu ⋅ Hao GENG

Dec 5, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Lithography orchestrates a symphony of light, mask and photochemicals to transfer the integrated circuit patterns onto the wafer. Lithography simulation serves as the critical nexus between circuit design and manufacturing, where its speed and accuracy fundamentally govern the optimization quality of downstream resolution enhancement techniques (RET). While machine learning promises to circumvent computational limitations of lithography process through data-driven or physics-informed approximations of computational lithography, existing simulators suffer from inadequate lithographic awareness due to insufficient training data capturing essential process variations and mask correction rules. We present LithoSim, the most comprehensive lithography simulation benchmark to date, featuring over $4$ million high-resolution input-output pairs with rigorous physical correspondence. The dataset systematically incorporates alterable optical source distributions, metal and via mask topologies with optical proximity correction (OPC) variants, and process windows reflecting fab-realistic variations. By integrating domain-specific metrics spanning AI performance and lithographic fidelity, LithoSim establishes a unified evaluation framework for data-driven and physics-informed computational lithography. The data (https://huggingface.co/datasets/grandiflorum/LithoSim), code (https://dw-hongquan.github.io/LithoSim), and pre-trained models (https://huggingface.co/grandiflorum/LithoSim) are released openly to support the development of hybrid ML-based and high-fidelity lithography simulation for the benefit of semiconductor manufacturing.

View full details

Poster

SutureBot: A Precision Framework & Benchmark For Autonomous End-to-End Suturing

Jesse Haworth ⋅ Juo-Tung Chen ⋅ Nigel Nelson ⋅ Ji Woong Kim ⋅ Masoud Moghani ⋅ Chelsea Finn ⋅ Axel Krieger

Dec 5, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Robotic suturing is a prototypical long-horizon dexterous manipulation task, requiring coordinated needle grasping, precise tissue penetration, and secure knot tying. Despite numerous efforts toward end-to-end autonomy, a fully autonomous suturing pipeline has yet to be demonstrated on physical hardware. We introduce SutureBot: an autonomous suturing benchmark on the da Vinci Research Kit (dVRK), spanning needle pickup, tissue insertion, and knot tying. To ensure repeatability, we release a high-fidelity dataset comprising 1,890 suturing demonstrations. Furthermore, we propose a goal-conditioned framework that explicitly optimizes insertion-point precision, improving targeting accuracy by 59\%-74\% over a task-only baseline. To establish this task as a benchmark for dexterous imitation learning, we evaluate state-of-the-art vision-language-action (VLA) models, including $\pi_0$, GR00T N1, OpenVLA-OFT, and multitask ACT, each augmented with a high-level task-prediction policy. Autonomous suturing is a key milestone toward achieving robotic autonomy in surgery. These contributions support reproducible evaluation and development of precision-focused, long-horizon dexterous manipulation policies necessary for end-to-end suturing. Dataset is available at: \href{https://huggingface.co/datasets/jchen396/suturebot}{Hugging Face}

View full details

Poster

REOBench: Benchmarking Robustness of Earth Observation Foundation Models

Xiang Li ⋅ Yong Tao ⋅ Siyuan Zhang ⋅ Siwei Liu ⋅ Zhitong Xiong ⋅ Chunbo Luo ⋅ Lu Liu ⋅ Mykola Pechenizkiy ⋅ Xiaoxiang Zhu ⋅ Tianjin Huang

Dec 5, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Earth observation foundation models have shown strong generalization across multiple Earth observation tasks, but their robustness under real-world perturbations remains underexplored. To bridge this gap, we introduce REOBench, the first comprehensive benchmark for evaluating the robustness of Earth observation foundation models across six tasks and twelve types of image corruptions, including both appearance-based and geometric perturbations. To ensure realistic and fine-grained evaluation, our benchmark focuses on high-resolution optical remote sensing images, which are widely used in critical applications such as urban planning and disaster response. We conduct a systematic evaluation of a broad range of models trained using masked image modeling, contrastive learning, and vision-language pre-training paradigms. Our results reveal that (1) existing Earth observation foundation models experience significant performance degradation when exposed to input corruptions. (2) The severity of degradation varies across tasks, model architectures, backbone sizes, and types of corruption, with performance drop varying from less than 1% to over 25%. (3) Vision-language models show enhanced robustness, particularly in multimodal tasks. REOBench underscores the vulnerability of current Earth observation foundation models to real-world corruptions and provides actionable insights for developing more robust and reliable models.

View full details

Poster

AGENTIF: Benchmarking Large Language Models Instruction Following Ability in Agentic Scenarios

Yunjia Qi ⋅ Hao Peng ⋅ Xiaozhi Wang ⋅ Amy Xin ⋅ Youfeng Liu ⋅ Bin Xu ⋅ Lei Hou ⋅ Juanzi Li

Dec 5, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Large Language Models (LLMs) have demonstrated advanced capabilities in real-world agentic applications. Growing research efforts aim to develop LLM-based agents to address practical demands, introducing a new challenge: agentic scenarios often involve lengthy instructions with complex constraints, such as extended system prompts and detailed tool specifications. While adherence to such instructions is crucial for agentic applications, whether LLMs can reliably follow them remains underexplored. In this paper, we introduce AgentIF, the first benchmark for systematically evaluating LLM instruction following ability in agentic scenarios. AgentIF features three key characteristics: (1) Realistic, constructed from $50$ real-world agentic applications. (2) Long, averaging $1,723$ words with a maximum of $15,630$ words. (3) Complex, averaging $11.9$ constraints per instruction, covering diverse constraint types, such as tool specifications and condition constraints.To construct AgentIF, we collect $707$ human-annotated instructions across $50$ agentic tasks from industrial application agents and open-source agentic systems. For each instruction, we annotate the associated constraints and corresponding evaluation metrics, including code-based evaluation, LLM-based evaluation, and hybrid code-LLM evaluation.We use AgentIF to systematically evaluate existing advanced LLMs. We observe that current models generally perform poorly, especially in handling complex constraint structures and tool specifications. We further conduct error analysis and analytical experiments on instruction length and meta constraints, providing some findings about the failure modes of existing LLMs. We have released the code and data to facilitate future research.

View full details

Poster

OMEGA: Can LLMs Reason Outside the Box in Math? Evaluating Exploratory, Compositional, and Transformative Generalization

Yiyou Sun ⋅ Shawn Hu ⋅ Georgia Zhou ⋅ Ken Zheng ⋅ Hanna Hajishirzi ⋅ Nouha Dziri ⋅ Dawn Song

Dec 5, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Recent large language models (LLMs) with long-chain-of-thought reasoning—such as DeepSeek-R1—have achieved impressive results on Olympiad-level mathematics benchmarks. However, they often rely on a narrow set of strategies and struggle with problems that require a novel way of thinking. To systematically investigate these limitations, we introduce OMEGA—Out-of-distribution Math Problems Evaluation with 3 Generalization Axes—a controlled yet diverse bench- mark designed to evaluate three axes of out-of-distribution generalization, inspired by Boden’s typology of creativity: (1) Exploratory—applying known problem- solving skills to more complex instances within the same problem domain; (2) Com- positional—combining distinct reasoning skills, previously learned in isolation, to solve novel problems that require integrating these skills in new and coherent ways; and (3) Transformative—adopting novel, often unconventional strategies by moving beyond familiar approaches to solve problems more effectively. OMEGA consists of programmatically generated training–test pairs derived from templated problem generators across geometry, number theory, algebra, combinatorics, logic, and puzzles, with solutions verified using symbolic, numerical, or graphical methods. We evaluate frontier (or top-tier) LLMs and observe sharp performance degradation as problem complexity increases. Moreover, we fine-tune the Qwen-series models across all generalization settings and observe notable improvements in exploratory generalization, while compositional generalization remains limited, and transformative reasoning shows little to no improvement. By isolating and quantifying these fine-grained failures, OMEGA lays the groundwork for advancing LLMs toward genuine mathematical creativity beyond mechanical proficiency. Our code and dataset are available at https://github.com/sunblaze-ucb/omega.

View full details

Poster

UltraVideo: High-Quality UHD Video Dataset with Comprehensive Captions

Xue zhucun ⋅ Jiangning Zhang ⋅ Teng Hu ⋅ Haoyang He ⋅ Yinan Chen ⋅ Yuxuan Cai ⋅ Yabiao Wang ⋅ Chengjie Wang ⋅ Yong Liu ⋅ Xiangtai Li ⋅ Dacheng Tao

Dec 5, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

The quality of the video dataset (image quality, resolution, and fine-grained caption) greatly influences the performance of the video generation model. %The growing demand for video applications sets higher requirements for high-quality video generation models. %For example, the generation of movie-level Ultra-High Definition (UHD) videos and the creation of 4K short video content. %However, the existing public datasets cannot support related research and applications. %In this paper, we first propose a high-quality open-sourced UHD-4K (22.4\% of which are 8K) text-to-video dataset named UltraVideo, which contains a wide range of topics (more than 100 kinds), and each video has 9 structured captions with one summarized caption (average of 824 words). %Specifically, we carefully design a highly automated curation process with four stages to obtain the final high-quality dataset: ***i)*** collection of diverse and high-quality video clips. ***ii*** statistical data filtering. ***iii)*** model-based data purification. ***iv)*** generation of comprehensive, structured captions. %In addition, we expand Wan to UltraWan-1K/-4K, which can natively generate high-quality 1K/4K videos with more consistent text controllability, demonstrating the effectiveness of our data curation.%We believe that this work can make a significant contribution to future research on UHD video generation. UltraVideo dataset and UltraWan models are available at https://xzc-zju.github.io/projects/UltraVideo.

View full details

Poster

HouseLayout3D: A Benchmark and Training-free Baseline for 3D Layout Estimation in the Wild

Valentin Bieri ⋅ Marie-Julie Rakotosaona ⋅ Keisuke Tateno ⋅ Francis Engelmann ⋅ Leonidas Guibas

Dec 5, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Current 3D layout estimation models are predominantly trained on synthetic datasets biased toward simplistic, single-floor scenes. This prevents them from generalizing to complex, multi-floor buildings, often forcing a per-floor processing approach that sacrifices global context. Few works have attempted to holistically address multi-floor layouts. In this work, we introduce HouseLayout3D, a real-world benchmark dataset, which highlights the limitations of existing research when handling expansive, architecturally complex spaces. Additionally, we propose MultiFloor3D, a baseline method leveraging recent advances in 3D reconstruction and 2D segmentation. Our approach significantly outperforms state-of-the-art methods on both our new and existing datasets. Remarkably, it does not require any layout-specific training.

View full details

Poster

Trans-EnV: A Framework for Evaluating the Linguistic Robustness of LLMs Against English Varieties

Jiyoung Lee ⋅ Seungho Kim ⋅ Jieun Han ⋅ Jun-Min Lee ⋅ Kitaek Kim ⋅ Alice Oh ⋅ Edward Choi

Dec 5, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Large Language Models (LLMs) are predominantly evaluated on Standard American English (SAE), often overlooking the diversity of global English varieties.This narrow focus may raise fairness concerns as degraded performance on non-standard varieties can lead to unequal benefits for users worldwide.Therefore, it is critical to extensively evaluate the linguistic robustness of LLMs on multiple non-standard English varieties.We introduce Trans-EnV, a framework that automatically transforms SAE datasets into multiple English varieties to evaluate the linguistic robustness. Our framework combines (1) linguistics expert knowledge to curate variety-specific features and transformation guidelines from linguistic literature and corpora, and (2) LLM-based transformations to ensure both linguistic validity and scalability.Using Trans-EnV, we transform six benchmark datasets into 38 English varieties and evaluate seven state-of-the-art LLMs.Our results reveal significant performance disparities, with accuracy decreasing by up to 46.3% on non-standard varieties.These findings highlight the importance of comprehensive linguistic robustness evaluation across diverse English varieties. Each construction of Trans-EnV was validated through rigorous statistical testing and consultation with a researcher in the field of second language acquisition, ensuring its linguistic validity.Our code and datasets are publicly available.

View full details

Poster

From Flatland to Space: Teaching Vision-Language Models to Perceive and Reason in 3D

Jiahui Zhang ⋅ Yurui Chen ⋅ Yueming Xu ⋅ Ze Huang ⋅ Jilin Mei ⋅ Chunhui Chen ⋅ Yanpeng Zhou ⋅ Yu-Jie Yuan ⋅ Xinyue Cai ⋅ Guowei Huang ⋅ Xingyue Quan ⋅ Hang Xu ⋅ Li Zhang

Dec 5, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Recent advances in LVLMs have improved vision-language understanding, but they still struggle with spatial perception, limiting their ability to reason about complex 3D scenes. Unlike previous approaches that incorporate 3D representations into models to improve spatial understanding, we aim to unlock the potential of VLMs by leveraging spatially relevant image data. To this end, we introduce a novel 2D spatial data generation and annotation pipeline built upon scene data with 3D ground-truth. This pipeline enables the creation of a diverse set of spatial tasks, ranging from basic perception tasks to more complex reasoning tasks. Leveraging this pipeline, we construct SPAR-7M, a large-scale dataset generated from thousands of scenes across multiple public datasets. In addition, we introduce SPAR-Bench, a benchmark designed to offer a more comprehensive evaluation of spatial capabilities compared to existing spatial benchmarks, supporting both single-view and multi-view inputs. Training on both SPAR-7M and large-scale 2D datasets enables our models to achieve state-of-the-art performance on 2D spatial benchmarks. Further fine-tuning on 3D task-specific datasets yields competitive results, underscoring the effectiveness of our dataset in enhancing spatial reasoning.

View full details

Poster

RTV-Bench: Benchmarking MLLM Continuous Perception, Understanding and Reasoning through Real-Time Video

ShuHang Xun ⋅ Sicheng Tao ⋅ Jungang Li ⋅ Yibo Shi ⋅ Zhixin Lin ⋅ Zhanhui Zhu ⋅ Yibo Yan ⋅ Hanqian Li ⋅ LingHao Zhang ⋅ Shikang Wang ⋅ Yixin Liu ⋅ Hanbo Zhang ⋅ Ying Ma ⋅ Xuming Hu

Dec 5, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Multimodal Large Language Models (MLLMs) increasingly excel at perception,understanding, and reasoning. However, current benchmarks inadequately evaluate their ability to perform these tasks continuously in dynamic, real-world environments. To bridge this gap, we introduce RT V-Bench, a fine-grained benchmark for MLLM real-time video analysis. RTV-Bench includes three key principles: (1) Multi-Timestamp Question Answering (MTQA), where answers evolve with scene changes; (2) Hierarchical Question Structure, combining basic and advanced queries; and (3) Multi-dimensional Evaluation, assessing the ability of continuous perception, understanding, and reasoning. RTV-Bench contains 552 diverse videos (167.2 hours) and 4,631 high-quality QA pairs. We evaluated leading MLLMs, including proprietary (GPT-4o, Gemini 2.0), open-source offline (Qwen2.5-VL, VideoLLaMA3), and open-source real-time (VITA-1.5, InternLM-XComposer2.5-OmniLive) models. Experiment results show open-source real-time models largely outperform offline ones but still trail top proprietary models. Our analysis also reveals that larger model size or higher frame sampling rates do not significantly boost RTV-Bench performance, sometimes causing slight decreases.This underscores the need for better model architectures optimized for video stream processing and long sequences to advance real-time video analysis with MLLMs.

View full details

Poster

CoRe: Benchmarking LLMs’ Code Reasoning Capabilities through Static Analysis Tasks

Danning Xie ⋅ Mingwei Zheng ⋅ Xuwei Liu ⋅ Jiannan Wang ⋅ Chengpeng Wang ⋅ Lin Tan ⋅ Xiangyu Zhang

Dec 5, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Large language models (LLMs) have been widely adopted across diverse domains of software engineering, such as code generation, program repair, and vulnerability detection. These applications require understanding beyond surface-level code patterns: value propagation, control flow, and interdependence between program elements. However, existing benchmarks primarily evaluate end-to-end outcomes, such as whether code is correctly repaired or generated, leaving the models' ability of program semantic reasoning underexplored.This work presents CoRe, a high-quality, human-verified benchmark designed to evaluate LLMs on fundamental static analysis tasks. CoRe includes 12,553 task instances spanning data dependency, control dependency, and information flow across programs written in C/C++, Java, and Python. To ensure semantic diversity and reasoning complexity, we propose a semantics-aware diverse sampling strategy that selects targets and task instances based on structural coverage and dependency depth. We evaluate 10 state-of-the-art LLMs and show that, while they perform well at identifying dependencies, models still struggle with tasks that require deeper semantic understanding and multi-step reasoning.We further conduct qualitative analyses to uncover key challenges, such as complex control structures and backward dependency patterns, offering insights into improving LLMs’ code reasoning capabilities.

View full details

Poster

PSI: A Benchmark for Human Interpretation and Response in Traffic Interactions

TAOTAO JING ⋅ Tina Chen ⋅ Renran Tian ⋅ Yaobin Chen ⋅ Joshua Domeyer ⋅ Heishiro Toyoda ⋅ Rini Sherony ⋅ Zhengming Ding

Dec 5, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Accurately modeling pedestrian intention and understanding driver decision-making processes are critical for the development of safe and socially aware autonomous driving systems. However, existing datasets primarily emphasize observable behavior, offering limited insight into the underlying causal reasoning that informs human interpretation and response during traffic interactions. To address this gap, we introduce PSI, a benchmark dataset that captures the dynamic evolution of pedestrian crossing intentions from the driver’s perspective, enriched with human-annotated textual explanations that reflect the reasoning behind intention estimation and driving decision making. These annotations offer a unique foundation for developing and benchmarking models that combine predictive performance with interpretable and human-aligned reasoning. PSI supports standardized tasks and evaluation protocols across multiple dimensions, including pedestrian intention prediction, driver decision modeling, reasoning generation, and trajectory forecasting and more. By enabling causal and interpretable evaluation, PSI advances research toward autonomous systems that can reason, act, and explain in alignment with human cognitive processes.

View full details

Poster

COGNAC: Cooperative Graph-based Networked Agent Challenges for Multi-Agent Reinforcement Learning

Jules Sintes ⋅ Ana Busic

Dec 5, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Many controlled complex systems have an inherent network structure, such as power grids, traffic light systems, or computer networks. Automatically controlling these systems is highly challenging due to their combinatorial complexity. Standard single-agent reinforcement learning (RL) approaches often struggle with the curse of dimensionality in such settings. In contrast, the multi-agent paradigm offers a promising solution by distributing decision-making, thereby addressing both algorithmic and combinatorial challenges. In this paper, we introduce COGNAC (COoperative Graph-based Networked Agent Challenges), a collection of cooperative graph-structured environments designed to facilitate experiments across different graph sizes and topologies. COGNAC bridges the gap between theoretical research in network control and practical multi-agent RL (MARL) applications by offering a flexible, scalable platform with a suite of simple yet highly challenging problems rooted in networked environments. Our benchmarks also support the development and evaluation of decentralized and distributed learning algorithms, motivated by the growing interest in more sustainable and frugal AI systems. Experiments on COGNAC show that independent actor–critic learning (IPPO) yields the highest-quality joint policies while scaling robustly to large network sizes with minimal hyperparameter tuning. Value-based independent learning (IDQL) typically needs substantially more training and is less reliable on combinatorial tasks. In contrast, standard Centralized-Training Decentralized-Execution (CTDE) methods and fully centralized training are slower to converge, less stable, and struggle to generalize to larger, more interdependent networks. These results suggest that CTDE approaches likely need extra information or inter-agent communication to fully capture the underlying network structure of each problem.

View full details

Poster

Common Task Framework For a Critical Evaluation of Scientific Machine Learning Algorithms

Philippe Wyder ⋅ Judah Goldfeder ⋅ Alexey Yermakov ⋅ Yue Zhao ⋅ Stefano Riva ⋅ Jan Williams ⋅ David Zoro ⋅ Amy Rude ⋅ Matteo Tomasetto ⋅ Joe Germany ⋅ Joseph Bakarji ⋅ Georg Maierhofer ⋅ Miles Cranmer ⋅ Nathan Kutz

Dec 5, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Machine learning (ML) is transforming modeling and control in the physical, engineering, and biological sciences. However, rapid development has outpaced the creation of standardized, objective benchmarks—leading to weak baselines, reporting bias, and inconsistent evaluations across methods. This undermines reproducibility, misguides resource allocation, and obscures scientific progress. To address this, we propose a Common Task Framework (CTF) for scientific machine learning. The CTF features a curated set of datasets and task-specific metrics spanning forecasting, state reconstruction, and generalization under realistic constraints, including noise and limited data. Inspired by the success of CTFs in fields like natural language processing and computer vision, our framework provides a structured, rigorous foundation for head-to-head evaluation of diverse algorithms. As a first step, we benchmark methods on two canonical nonlinear systems: Kuramoto-Sivashinsky and Lorenz. These results illustrate the utility of the CTF in revealing method strengths, limitations, and suitability for specific classes of problems and diverse objectives. Next, we are launching a competition around a global real world sea surface temperature dataset with a true holdout dataset to foster community engagement. Our long-term vision is to replace ad hoc comparisons with standardized evaluations on hidden test sets that raise the bar for rigor and reproducibility in scientific ML.

View full details

Poster

Web-Scale Collection of Video Data for 4D Animal Reconstruction

Brian Nlong Zhao ⋅ Jiajun Wu ⋅ Shangzhe Wu

Dec 5, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Computer vision for animals holds great promise for wildlife research but often depends on large-scale data, while existing collection methods rely on controlled capture setups. Recent data-driven approaches show the potential of single-view, non-invasive analysis, yet current animal video datasets are limited—offering as few as 2.4K 15-frame clips and lacking key processing for animal-centric 3D/4D tasks. We introduce an automated pipeline that mines YouTube videos and processes them into object-centric clips, along with auxiliary annotations valuable for downstream tasks like pose estimation, tracking, and 3D/4D reconstruction. Using this pipeline, we amass 30K videos (2M frames)—an order of magnitude more than prior works. To demonstrate its utility, we focus on 4D quadruped animal reconstruction task. To support this task, we present Animal4D, a benchmark of 230 manually filtered sequences with 11K frames showcasing clean, diverse animal motions. We evaluate state-of-the-art model-based and model-free methods on Animal4D, finding that 2D metrics favor the former despite unrealistic 3D shapes, while the latter yields more natural reconstructions but scores lower—revealing a gap in current evaluation. To address this, we enhance a recent model-free approach with sequence-level optimization, establishing the first 4D animal reconstruction baseline. Together, our pipeline, benchmark, and baseline aim to advance large-scale, markerless 4D animal reconstruction and related tasks from in-the-wild videos. Code and datasets are available at https://github.com/briannlongzhao/Animal-Video-Processing.

View full details

Poster

AtmosSci-Bench: Evaluating the Recent Advance of Large Language Model for Atmospheric Science

Chenyue Li ⋅ Wen Deng ⋅ Mengqian Lu ⋅ Binhang Yuan

Dec 5, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

The rapid advancements in large language models (LLMs), particularly in their reasoning capabilities, hold transformative potential for addressing complex challenges and boosting scientific discovery in atmospheric science. However, leveraging LLMs effectively in this domain requires a robust and comprehensive evaluation benchmark. Toward this end, we present AtmosSci-Bench, a novel benchmark designed to systematically assess LLM performance across five core categories of atmospheric science problems: hydrology, atmospheric dynamics, atmospheric physics, geophysics, and physical oceanography.AtmosSci-Bench features a dual-format design comprising both multiple-choice questions (MCQs) and open-ended questions (OEQs), enabling scalable automated evaluation alongside deeper analysis of conceptual understanding. We employ a template-based MCQ generation framework to create diverse, graduate-level problems with symbolic perturbation, while OEQs are used to probe open-ended reasoning.We conduct a comprehensive evaluation of representative LLMs, categorized into four groups: instruction-tuned models, advanced reasoning models, math-augmented models, and domain-specific climate models. Our analysis provides some interesting insights into the reasoning and problem-solving capabilities of LLMs in atmospheric science. We believe AtmosSci-Bench can serve as a critical step toward advancing LLM applications in climate services by offering a standard and rigorous evaluation framework. The source code of AtmosSci-Bench is available at https://github.com/Relaxed-System-Lab/AtmosSci-Bench.

View full details

Poster

Can Large Language Models Help Multimodal Language Analysis? MMLA: A Comprehensive Benchmark

Hanlei Zhang ⋅ zhuohang li ⋅ Hua Xu ⋅ Yeshuang Zhu ⋅ Peiwu Wang ⋅ Haige Zhu ⋅ Jie Zhou ⋅ Jinchao Zhang

Dec 5, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Multimodal language analysis is a rapidly evolving field that leverages multiple modalities to enhance the understanding of high-level semantics underlying human conversational utterances. Despite its significance, little research has investigated the capability of multimodal large language models (MLLMs) to comprehend cognitive-level semantics. In this paper, we introduce MMLA, a comprehensive benchmark specifically designed to address this gap. MMLA comprises over 61K multimodal utterances drawn from both staged and real-world scenarios, covering six core dimensions of multimodal semantics: intent, emotion, dialogue act, sentiment, speaking style, and communication behavior. We evaluate eight mainstream branches of LLMs and MLLMs using three methods: zero-shot inference, supervised fine-tuning, and instruction tuning. Extensive experiments reveal that even fine-tuned models achieve only about 60~70% accuracy, underscoring the limitations of current MLLMs in understanding complex human language. We believe that MMLA will serve as a solid foundation for exploring the potential of large language models in multimodal language analysis and provide valuable resources to advance this field. The datasets and code are open-sourced at https://github.com/thuiar/MMLA.

View full details

Poster

ClinicalLab: Aligning Agents for Multi-Departmental Clinical Diagnostics in the Real World

Weixiang Yan ⋅ Haitian Liu ⋅ Tengxiao Wu ⋅ Qian Chen ⋅ Wen Wang ⋅ Haoyuan Chai ⋅ Jiayi Wang

Dec 5, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Large language models (LLMs) have achieved significant performance progress in various natural language processing applications. However, LLMs still struggle to meet the strict requirements for accuracy and reliability in the medical field and face many challenges in clinical applications. Existing clinical diagnostic evaluation benchmarks for evaluating medical agents powered by LLMs have severe limitations. To address these limitations, we introduce ClinicalLab, a comprehensive clinical diagnosis agent alignment suite. ClinicalLab includes ClinicalBench, an end-to-end multi-departmental clinical diagnostic evaluation benchmark for evaluating medical agents and LLMs. ClinicalBench is based on real cases that cover 24 departments and 150 diseases. We ensure that ClinicalBench does not have data leakage. ClinicalLab also includes four novel metrics (ClinicalMetrics) for evaluating the effectiveness of LLMs in clinical diagnostic tasks. We evaluate 17 general and medical-domain LLMs and find that their performance varies significantly across different departments. Based on these findings, in ClinicalLab, we propose ClinicalAgent, an end-to-end clinical agent that aligns with real-world clinical diagnostic practices. We systematically investigate the performance and applicable scenarios of variants of ClinicalAgent on ClinicalBench. Our findings demonstrate the importance of aligning with modern medical practices in designing medical agents.

View full details

Poster

ModuLM: Enabling Modular and Multimodal Molecular Relational Learning with Large Language Models

Zhuo Chen ⋅ YIZHEN ZHENG ⋅ Huan Yee Koh ⋅ Hongxin Xiang ⋅ Linjiang Chen ⋅ Wenjie Du ⋅ Yang Wang

Molecular Relational Learning (MRL) aims to understand interactions between molecular pairs, playing a critical role in advancing biochemical research. With the recent development of large language models (LLMs), a growing number of studies have explored the integration of MRL with LLMs and achieved promising results. However, the increasing availability of diverse LLMs and molecular structure encoders has significantly expanded the model space, presenting major challenges for benchmarking. Currently, there is no LLM framework that supports both flexible molecular input formats and dynamic architectural switching. To address these challenges, reduce redundant coding, and ensure fair model comparison, we propose ModuLM, a framework designed to support flexible LLM-based model construction and diverse molecular representations. ModuLM provides a rich suite of modular components, including 8 types of 2D molecular graph encoders, 11 types of 3D molecular conformation encoders, 7 types of interaction layers, and 7 mainstream LLM backbones. Owing to its highly flexible model assembly mechanism, ModuLM enables the dynamic construction of over 50,000 distinct model configurations. In addition, we provide comprehensive benchmark results to demonstrate the effectiveness of ModuLM in supporting LLM-based MRL tasks.

View full details

Oral

Envisioning Beyond the Pixels: Benchmarking Reasoning-Informed Visual Editing

Xiangyu Zhao ⋅ Peiyuan Zhang ⋅ Kexian Tang ⋅ Xiaorong Zhu ⋅ Hao Li ⋅ Wenhao Chai ⋅ Zicheng Zhang ⋅ Renqiu Xia ⋅ Guangtao Zhai ⋅ Junchi Yan ⋅ Hua Yang ⋅ Xue Yang ⋅ Haodong Duan

Dec 3, 3:30 PM - 3:50 PM Exhibit Hall F,G,H

Large Multi-modality Models (LMMs) have made significant progress in visual understanding and generation, but they still face challenges in General Visual Editing, particularly in following complex instructions, preserving appearance consistency, and supporting flexible input formats. To study this gap, we introduce RISEBench, the first benchmark for evaluating Reasoning-Informed viSual Editing (RISE). RISEBench focuses on four key reasoning categories: Temporal, Causal, Spatial, and Logical Reasoning. We curate high-quality test cases for each category and propose an robust evaluation framework that assesses Instruction Reasoning, Appearance Consistency, and Visual Plausibility with both human judges and the LMM-as-a-judge approach. We conducted experiments evaluating nine prominent visual editing models, comprising both open-source and proprietary models. The evaluation results demonstrate that current models face significant challenges in reasoning-based editing tasks. Even the most powerful model evaluated, GPT-image-1, achieves an accuracy of merely 28.8%. RISEBench effectively highlights the limitations of contemporary editing models, provides valuable insights, and indicates potential future directions for the field of reasoning-aware visual editing. Our code and data have been released at https://github.com/PhoenixZ810/RISEBench.

View full details

Oral

WebGen-Bench: Evaluating LLMs on Generating Interactive and Functional Websites from Scratch

Zimu Lu ⋅ Yunqiao Yang ⋅ Houxing Ren ⋅ Haotian Hou ⋅ Han Xiao ⋅ Ke Wang ⋅ Weikang Shi ⋅ Aojun Zhou ⋅ Mingjie Zhan ⋅ Hongsheng Li

Dec 5, 10:00 AM - 10:20 AM Upper Level Ballroom 6CDEF

LLM‑based agents have demonstrated great potential in generating and managing code within complex codebases. In this paper, we introduce WebGen-Bench, a novel benchmark designed to measure an LLM-based agent's ability to create multi-file website codebases from scratch. It contains diverse instructions for website generation, created through the combined efforts of human annotators and GPT-4o. These instructions span three major categories and thirteen minor categories, encompassing nearly all important types of web applications.To assess the quality of the generated websites, we generate test cases targeting each functionality described in the instructions. These test cases are then manually filtered, refined, and organized to ensure accuracy, resulting in a total of 647 test cases. Each test case specifies an operation to be performed on the website and the expected outcome of the operation.To automate testing and improve reproducibility, we employ a powerful web-navigation agent to execute test cases on the generated websites and determine whether the observed responses align with the expected results.We evaluate three high-performance code-agent frameworks—Bolt.diy, OpenHands, and Aider—using multiple proprietary and open-source LLMs as engines. The best-performing combination, Bolt.diy powered by DeepSeek-R1, achieves only 27.8\% accuracy on the test cases, highlighting the challenging nature of our benchmark.Additionally, we construct WebGen-Instruct, a training set consisting of 6,667 website-generation instructions. Training Qwen2.5-Coder-32B-Instruct on Bolt.diy trajectories generated from a subset of the training set achieves an accuracy of 38.2\%, surpassing the performance of the best proprietary model.We release our data-generation, training, and testing code, along with both the datasets and model weights at https://github.com/mnluzimu/WebGen-Bench.

View full details

Oral

Artificial Hivemind: The Open-Ended Homogeneity of Language Models (and Beyond)

Liwei Jiang ⋅ Yuanjun Chai ⋅ Margaret Li ⋅ Mickel Liu ⋅ Raymond Fok ⋅ Nouha Dziri ⋅ Yulia Tsvetkov ⋅ Maarten Sap ⋅ Yejin Choi

Dec 3, 10:20 AM - 10:40 AM Exhibit Hall F,G,H

Large language models (LMs) often struggle to generate diverse, human-like creative content, raising concerns about the long-term homogenization of human thought through repeated exposure to similar outputs. Yet scalable methods for evaluating LM output diversity remain limited, especially beyond narrow tasks such as random number or name generation, or beyond repeated sampling from a single model. To address this gap, we introduce Infinity-Chat, a large-scale dataset of 26K diverse, real-world, open-ended user queries that admit a wide range of plausible answers with no single ground truth. We introduce the first comprehensive taxonomy for characterizing the full spectrum of open-ended prompts posed to LMs, comprising 6 top-level categories (e.g., creative content generation, brainstorm & ideation) that further breaks down to 17 subcategories. Using Infinity-Chat, we present a large-scale study of mode collapse in LMs, revealing a pronounced Artificial Hivemind effect in open-ended generation of LMs, characterized by (1) intra-model repetition, where a single model consistently generates similar responses, and more so (2) inter-model homogeneity, where different models produce strikingly similar outputs. Infinity-Chat also includes 31,250 human annotations, across absolute ratings and pairwise preferences, with 25 independent human annotations per example. This enables studying collective and individual-specific human preferences in response to open-ended queries. Our findings show that state-of-the-art LMs, reward models, and LM judges are less well calibrated to human ratings on model generations that elicit differing idiosyncratic annotator preferences, despite maintaining comparable overall quality. Overall, INFINITY-CHAT presents the first large-scale resource for systematically studying real-world open-ended queries to LMs, revealing critical insights to guide future research for mitigating long-term AI safety risks posed by the Artificial Hivemind.

View full details

Oral

CoralVQA: A Large-Scale Visual Question Answering Dataset for Coral Reef Image Understanding

hongyong han ⋅ Wei Wang ⋅ Gaowei Zhang ⋅ Mingjie Li ⋅ Yi Wang

Dec 3, 3:50 PM - 4:10 PM Exhibit Hall F,G,H

Coral reefs are vital yet vulnerable ecosystems that require continuous monitoring to support conservation. While coral reef images provide essential information in coral monitoring, interpreting such images remains challenging due to the need for domain expertise. Visual Question Answering (VQA), powered by Large Vision-Language Models (LVLMs), has great potential in user-friendly interaction with coral reef images. However, applying VQA to coral imagery demands a dedicated dataset that addresses two key challenges: domain-specific annotations and multidimensional questions. In this work, we introduce CoralVQA, the first large-scale VQA dataset for coral reef analysis. It contains 12,805 real-world coral images from 67 coral genera collected from 3 oceans, along with 277,653 question-answer pairs that comprehensively assess ecological and health-related conditions. To construct this dataset, we develop a semi-automatic data construction pipeline in collaboration with marine biologists to ensure both scalability and professional-grade data quality. CoralVQA presents novel challenges and provides a comprehensive benchmark for studying vision-language reasoning in the context of coral reef images. By evaluating several state-of-the-art LVLMs, we reveal key limitations and opportunities. These insights form a foundation for future LVLM development, with a particular emphasis on supporting coral conservation efforts.

View full details

Oral

OrthoLoC: UAV 6-DoF Localization and Calibration Using Orthographic Geodata

Oussema Dhaouadi ⋅ Riccardo Marin ⋅ Johannes Meier ⋅ Jacques Kaiser ⋅ Daniel Cremers

Dec 3, 4:10 PM - 4:30 PM Upper Level Ballroom 6CDEF

Accurate visual localization from aerial views is a fundamental problem with applications in mapping, large-area inspection, and search-and-rescue operations. In many scenarios, these systems require high-precision localization while operating with limited resources (e.g., no internet connection or GNSS/GPS support), making large image databases or heavy 3D models impractical. Surprisingly, little attention has been given to leveraging orthographic geodata as an alternative paradigm, which is lightweight and increasingly available through free releases by governmental authorities (e.g., the European Union). To fill this gap, we propose OrthoLoC, the first large-scale dataset comprising 16,425 UAV images from Germany and the United States with multiple modalities. The dataset addresses domain shifts between UAV imagery and geospatial data. Its paired structure enables fair benchmarking of existing solutions by decoupling image retrieval from feature matching, allowing isolated evaluation of localization and calibration performance. Through comprehensive evaluation, we examine the impact of domain shifts, data resolutions, and covisibility on localization accuracy. Finally, we introduce a refinement technique called AdHoP, which can be integrated with any feature matcher, improving matching by up to 95% and reducing translation error by up to 63%. The dataset and code are available at: https://deepscenario.github.io/OrthoLoC .

View full details

Oral

NOVA: A Benchmark for Rare Anomaly Localization and Clinical Reasoning in Brain MRI

Cosmin Bercea ⋅ Jun Li ⋅ Philipp Raffler ⋅ Evamaria O. Riedel ⋅ Lena Schmitzer ⋅ Angela Kurz ⋅ Felix Bitzer ⋅ Paula Roßmüller ⋅ Julian Canisius ⋅ Mirjam Beyrle ⋅ Che Liu ⋅ Wenjia Bai ⋅ Bernhard Kainz ⋅ Julia Schnabel ⋅ Benedikt Wiestler

Dec 5, 10:40 AM - 11:00 AM Upper Level Ballroom 6CDEF

In many real-world applications, deployed models encounter inputs that differ from the data seen during training. Open-world recognition ensures that such systems remain robust as ever-emerging, previously _unknown_ categories appear and must be addressed without retraining.Foundation and vision-language models are pre-trained on large and diverse datasets with the expectation of broad generalization across domains, including medical imaging.However, benchmarking these models on test sets with only a few common outlier types silently collapses the evaluation back to a closed-set problem, masking failures on rare or truly novel conditions encountered in clinical use.We therefore present NOVA, a challenging, real-life _evaluation-only_ benchmark of $\sim$900 brain MRI scans that span 281 rare pathologies and heterogeneous acquisition protocols. Each case includes rich clinical narratives and double-blinded expert bounding-box annotations. Together, these enable joint assessment of anomaly localisation, visual captioning, and diagnostic reasoning. Because NOVA is never used for training, it serves as an _extreme_ stress-test of out-of-distribution generalisation: models must bridge a distribution gap both in sample appearance and in semantic space. Baseline results with leading vision-language models (GPT-4o, Gemini 2.0 Flash, and Qwen2.5-VL-72B) reveal substantial performance drops, with approximately a 65\% gap in localisation compared to natural-image benchmarks and 40\% and 20\% gaps in captioning and reasoning, respectively, compared to resident radiologists. Therefore, NOVA establishes a testbed for advancing models that can detect, localize, and reason about truly unknown anomalies.

View full details

Oral

BEDLAM2.0: Synthetic humans and cameras in motion

Joachim Tesch ⋅ Giorgio Becherini ⋅ Prerana Achar ⋅ Anastasios Yiannakidis ⋅ Muhammed Kocabas ⋅ Priyanka Patel ⋅ Michael Black

Dec 5, 10:40 AM - 11:00 AM Upper Level Ballroom 20AB

Inferring 3D human motion from video remains a challenging problem with many applications. While traditional methods estimate the human in image coordinates, many applications require human motion to be estimated in world coordinates. This is particularly challenging when there is both human and camera motion. Progress on this topic has been limited by the lack of rich video data with ground truth human and camera movement. We address this with BEDLAM2.0, a new dataset that goes beyond the popular BEDLAM dataset in important ways. In addition to introducing more diverse and realistic cameras and camera motions, BEDLAM2.0 increases diversity and realism of body shape, motions, clothing, hair, and 3D environments. Additionally, it adds shoes, which were missing in BEDLAM. BEDLAM has become a key resource for training 3D human pose and motion regressors today and we show that BEDLAM2.0 is significantly better, particularly for training methods that estimate humans in world coordinates. We compare state-of-the art methods trained on BEDLAM and BEDLAM2.0, and find that BEDLAM2.0 significantly improves accuracy over BEDLAM. For research purposes, we provide the rendered videos, ground truth body parameters, and camera motions. We also provide the 3D assets to which we have rights and links to those from third parties.

View full details