NeurIPS 2025 San Diego Spotlights

Skip to yearly menu bar Skip to main content

Poster

Error Broadcast and Decorrelation as a Potential Artificial and Natural Learning Mechanism

Mete Erdogan ⋅ Cengiz Pehlevan ⋅ Alper Erdogan

Dec 3, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

We introduce *Error Broadcast and Decorrelation* (EBD), a novel learning framework for neural networks that addresses credit assignment by directly broadcasting output errors to individual layers, circumventing weight transport of backpropagation. EBD is rigorously grounded in the stochastic orthogonality property of Minimum Mean Square Error estimators. This fundamental principle states that the error of an optimal estimator is orthogonal to functions of the input. Guided by this insight, EBD defines layerwise loss functions that directly penalize correlations between layer activations and output errors, thereby establishing a principled foundation for error broadcasting. This theoretically sound mechanism naturally leads to the experimentally observed three-factor learning rule and integrates with biologically plausible frameworks to enhance performance and plausibility. Numerical experiments demonstrate EBD’s competitive or better performance against other error-broadcast methods on benchmark datasets. Our findings establish EBD as an efficient, biologically plausible, and principled alternative for neural network training.

View full details

Poster

Accelerating Visual-Policy Learning through Parallel Differentiable Simulation

Haoxiang You ⋅ Yilang Liu ⋅ Ian Abraham

Dec 3, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

In this work, we propose a computationally efficient algorithm for visual policy learning that leverages differentiable simulation and first-order analytical policy gradients. Our approach decouple the rendering process from the computation graph, enabling seamless integration with existing differentiable simulation ecosystems without the need for specialized differentiable rendering software. This decoupling not only reduces computational and memory overhead but also effectively attenuates the policy gradient norm, leading to more stable and smoother optimization. We evaluate our method on standard visual control benchmarks using modern GPU-accelerated simulation. Experiments show that our approach significantly reduces wall-clock training time and consistently outperforms all baseline methods in terms of final returns. Notably, on complex tasks such as humanoid locomotion, our method achieves a $4\times$ improvement in final return, and successfully learns a humanoid running policy within 4 hours on a single GPU. Videos and code are available on https://haoxiangyou.github.io/Dva_website

View full details

Poster

Bridging Symmetry and Robustness: On the Role of Equivariance in Enhancing Adversarial Robustness

Longwei Wang ⋅ Ifrat Ikhtear Uddin ⋅ Prof. KC Santosh (PhD) ⋅ Chaowei Zhang ⋅ Xiao Qin ⋅ Yang Zhou

Dec 3, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Adversarial examples reveal critical vulnerabilities in deep neural networks by exploiting their sensitivity to imperceptible input perturbations. While adversarial training remains the predominant defense strategy, it often incurs significant computational cost and may compromise clean-data accuracy. In this work, we investigate an architectural approach to adversarial robustness by embedding group-equivariant convolutions—specifically, rotation- and scale-equivariant layers—into standard convolutional neural networks (CNNs). These layers encode symmetry priors that align model behavior with structured transformations in the input space, promoting smoother decision boundaries and greater resilience to adversarial attacks. We propose and evaluate two symmetry-aware architectures: a parallel design that processes standard and equivariant features independently before fusion, and a cascaded design that applies equivariant operations sequentially. Theoretically, we demonstrate that such models reduce hypothesis space complexity, regularize gradients, and yield tighter certified robustness bounds under the CLEVER (Cross Lipschitz Extreme Value for nEtwork Robustness) framework. Empirically, our models consistently improve adversarial robustness and generalization across CIFAR-10, CIFAR-100, and CIFAR-10C under both FGSM and PGD attacks, without requiring adversarial training. These findings underscore the potential of symmetry-enforcing architectures as efficient and principled alternatives to data augmentation-based defenses.

View full details

Poster

Benchmarking Egocentric Multimodal Goal Inference for Assistive Wearable Agents

Vijay Veerabadran ⋅ Fanyi Xiao ⋅ Nitin Kamra ⋅ Pedro Matias ⋅ Joy Chen ⋅ Caley Drooff ⋅ Brett Roads ⋅ Riley J Williams ⋅ Ethan Henderson ⋅ Xuanyi Zhao ⋅ Kevin Carlberg ⋅ Joseph Tighe ⋅ Karl Ridgeway

Dec 3, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

There has recently been a surge of interest in Wearable Assistant Agents: agents embodied in a wearable form factor such as smart glasses, who can take actions toward a user’s stated goal — a high-level language-expressed command such as “where did I leave my keys?”, “Text Alice I will be late”, or “What’s the weather in Cancun?”. In this work, we consider the complementary problem of eliminating the effort required to interact with such an agent by proactively inferring the user’s goal from multimodal contextual observations. As vision-language models (VLMs) hold strong potential to ultimately solve this problem, our work focuses on creating a strong benchmark to measure progress toward this end. Given the limited prior work in this area, establishing the benchmark required collecting a novel multimodal goal-inference dataset; our dataset comprises ~30 hours of data from 363 participants across 3,482 recordings, featuring ground-truth reference goals alongside accompanying visual, audio, digital, and longitudinal contextual observations. We ran a human predictability study, where we found that humans set a strong baseline that comprises a de facto upper bound on model performance: they show multiple choice question (MCQ) accuracy of 93%, with the best VLM achieving about 84% accuracy. However, MCQ assesses discrimination, not the model’s ultimate task of generating the goal through open-ended text generation. Through a meta-evaluation, we find that a VLM judging the generated goals is as good as a human judge if it has access to a human-authored script of the video or a correct reference goal. Finally, we evaluate several families of modern vision-language models on the benchmark, showing that larger models have a significant performance advantage, but are still far from being practically useful, as they produce relevant goals only ~57% of the time. The best-performing smaller models—whose size makes them better suited to wearable applications—perform significantly worse than their counterparts, generating ~49% accuracy on the benchmark. Through a modality ablation, we show that models benefit from extra information in relevant modalities with minimal performance degradation from irrelevant modalities, but don’t gain as much when noisy modalities are included (e.g., in the case of digital context when most of the app state is irrelevant).

View full details

Poster

Affine-Invariant Global Non-Asymptotic Convergence Analysis of BFGS under Self-Concordance

Qiujiang Jin ⋅ Aryan Mokhtari

Dec 3, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

In this paper, we establish global non-asymptotic convergence guarantees for the BFGS quasi-Newton method without requiring strong convexity or the Lipschitz continuity of the gradient or Hessian. Instead, we consider the setting where the objective function is strictly convex and strongly self-concordant. For an arbitrary initial point and any arbitrary positive-definite initial Hessian approximation, we prove global linear and superlinear convergence guarantees for BFGS when the step size is determined using a line search scheme satisfying the weak Wolfe conditions. Moreover, all our global guarantees are affine-invariant, with the convergence rates depending solely on the initial error and the strongly self-concordant constant. Our results extend the global non-asymptotic convergence theory of BFGS beyond traditional assumptions and, for the first time, establish affine-invariant convergence guarantees—aligning with the inherent affine invariance of the BFGS method.

View full details

Poster

Dual Data Alignment Makes AI-Generated Image Detector Easier Generalizable

Ruoxin Chen ⋅ Junwei Xi ⋅ Zhiyuan Yan ⋅ Ke-Yue Zhang ⋅ Shuang Wu ⋅ Jingyi Xie ⋅ Xu Chen ⋅ Lei Xu ⋅ Isabel Guan ⋅ Taiping Yao ⋅ Shouhong Ding

Dec 3, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

The rapid increase in AI-generated images (AIGIs) underscores the need for detection methods. Existing detectors are often trained on biased datasets, leading to overfitting on spurious correlations between non-causal image attributes and real/synthetic labels. While these biased features enhance performance on the training data, they result in substantial performance degradation when tested on unbiased datasets. A common solution is to perform data alignment through generative reconstruction, matching the content between real and synthetic images. However, we find that pixel-level alignment alone is inadequate, as the reconstructed images still suffer from frequency-level misalignment, perpetuating spurious correlations. To illustrate, we observe that reconstruction models restore the high-frequency details lost in real images, inadvertently creating a frequency-level misalignment, where synthetic images appear to have richer high-frequency content than real ones. This misalignment leads to models associating high-frequency features with synthetic labels, further reinforcing biased cues. To resolve this, we propose Dual Data Alignment (DDA), which aligns both the pixel and frequency domains. DDA generates synthetic images that closely resemble real ones by fusing real and synthetic image pairs in both domains, enhancing the detector's ability to identify forgeries without relying on biased features. Moreover, we introduce two new test sets: DDA-COCO, containing DDA-aligned synthetic images, and EvalGEN, featuring the latest generative models. Our extensive evaluations demonstrate that a detector trained exclusively on DDA-aligned MSCOCO improves across diverse benchmarks. Code is available at https://github.com/roy-ch/Dual-Data-Alignment.

View full details

Poster

Private Set Union with Multiple Contributions

Travis Dick ⋅ Haim Kaplan ⋅ Alex Kulesza ⋅ Uri Stemmer ⋅ Ziteng Sun ⋅ Ananda Theertha Suresh

Dec 3, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

In the private set union problem each user owns a bag of at most $k$ items (from some large universe of items), and we are interested in computing the union of the items in the bags of all of the users. This is trivial without privacy, but a differentially private algorithm must be careful about reporting items contained in only a small number of bags. We consider differentially private algorithms that always report a subset of the union, and define the utility of an algorithm to be the expected size of the subset that it reports. Because the achievable utility varies significantly with the dataset, we introduce the *utility ratio*, which normalizes utility by a dataset-specific upper bound and characterizes a mechanism by its lowest normalized utility across all datasets. We then develop algorithms with guaranteed utility ratios and complement them with bounds on the best possible utility ratio. Prior work has shown that a single algorithm can be simultaneously optimal for all datasets when $k=1$, but we show that instance-optimal algorithms do not exist when $k>1$, and characterize how performance degrades as $k$ grows. At the same time, we design a private algorithm that achieves the maximum possible utility, regardless of $k$, when the item histogram matches a prior prediction (for instance, from a previous data release) and degrades gracefully with the $L_\infty$ distance between the prediction and the actual histogram when the prediction is imperfect.

View full details

Poster

GeoRemover: Removing Objects and Their Causal Visual Artifacts

Zixin Zhu ⋅ Haoxiang Li ⋅ Xuelu Feng ⋅ He Wu ⋅ Chunming Qiao ⋅ Junsong Yuan

Dec 3, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Towards intelligent image editing, object removal should eliminate both the target object and its causal visual artifacts, such as shadows and reflections. However, existing image appearance-based methods either follow strictly mask-aligned training and fail to remove these casual effects which are not explicitly masked, or adopt loosely mask-aligned strategies that lack controllability and may unintentionally over-erase other objects. We identify that these limitations stem from ignoring the causal relationship between an object’s geometry presence and its visual effects. To address this limitation, we propose a geometry-aware two-stage framework that decouples object removal into (1) geometry removal and (2) appearance rendering. In the first stage, we remove the object directly from the geometry (e.g., depth) using strictly mask-aligned supervision, enabling structure-aware editing with strong geometric constraints. In the second stage, we render a photorealistic RGB image conditioned on the updated geometry, where causal visual effects are considered implicitly as a result of the modified 3D geometry. To guide learning in the geometry removal stage, we introduce a preference-driven objective based on positive and negative sample pairs, encouraging the model to remove objects as well as their causal visual artifacts while avoiding new structural insertions. Extensive experiments demonstrate that our method achieves state-of-the-art performance in removing both objects and their associated artifacts on two popular benchmarks. The project page is available at https://buxiangzhiren.github.io/GeoRemover.

View full details

Poster

Temperature is All You Need for Generalization in Langevin Dynamics and other Markov Processes

Itamar Harel ⋅ Yonathan Wolanowsky ⋅ Gal Vardi ⋅ Nati Srebro ⋅ Daniel Soudry

Dec 3, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

We analyze the generalization gap (gap between the training and test errors) when training a potentially over-parametrized model using a Markovian stochastic training algorithm, initialized from some distribution $\theta_0 \sim p_0$. We focus on Langevin dynamics with a positive temperature $\beta^{-1}$, i.e. gradient descent on a training loss $L$ with infinitesimal step size, perturbed with $\beta^{-1}$-variances Gaussian noise, and lightly regularized or bounded. There, we bound the generalization gap, *at any time during training*, by $\sqrt{(\beta\mathbb{E} L (\theta_0) + \log(1/\delta))/N}$ with probability $1-\delta$ over the dataset, where $N$ is the sample size, and $\mathbb{E} L(\theta_0)=O(1)$ with standard initialization scaling. In contrast to previous guarantees, we have no dependence on either training time or reliance on mixing, nor a dependence on dimensionality, gradient norms, or any other properties of the loss or model. This guarantee follows from a general analysis of any Markov process-based training that has a Gibbs-style stationary distribution. The proof is surprisingly simple, once we observe that the marginal distribution divergence from initialization remains bounded, as implied by a generalized second law of thermodynamics.

View full details

Poster

SATURN: SAT-based Reinforcement Learning to Unleash LLMs Reasoning

Huanyu Liu ⋅ Ge Li ⋅ Jia Li ⋅ Hao Zhu ⋅ Kechi Zhang ⋅ Yihong Dong

Dec 3, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

How to design reinforcement learning (RL) tasks that effectively unleash the reasoning capability of large language models (LLMs) remains an open question. Existing RL tasks (e.g., math, programming, and constructing reasoning tasks) suffer from three key limitations: (1) Scalability. They rely heavily on human annotation or expensive LLM synthesis to generate sufficient training data. (2) Verifiability. LLMs' outputs are hard to verify automatically and reliably. (3) Controllable Difficulty. Most tasks lack fine-grained difficulty control, making it hard to train LLMs to develop reasoning ability from easy to hard. To address these limitations, we propose Saturn, a SAT-based RL framework that uses Boolean Satisfiability (SAT) problems to train and evaluate LLMs reasoning. Saturn enables scalable task construction, rule-based verification, and precise difficulty control. Saturn designs a curriculum learning pipeline that continuously improves LLMs' reasoning capability by constructing SAT tasks of increasing difficulty and training LLMs from easy to hard. To ensure stable training, we design a principled mechanism to control difficulty transitions. We introduce Saturn-2.6k, a dataset of 2,660 SAT problems with varying difficulty. It supports the evaluation of how LLM reasoning changes with problem difficulty. We apply Saturn to DeepSeek-R1-Distill-Qwen and obtain Saturn-1.5B and Saturn-7B. We achieve several notable results: (1) On SAT problems, Saturn-1.5B and Saturn-7B achieve average pass@3 improvements of +14.0 and +28.1, respectively. (2) On math and programming tasks, Saturn-1.5B and Saturn-7B improve average scores by +4.9 and +1.8 on benchmarks (e.g., AIME, LiveCodeBench). (3) Compared to the state-of-the-art (SOTA) approach in constructing RL tasks, Saturn achieves further improvements of +8.8\%. We release the source code, data, and models to support future research.

View full details

Poster

SparseMVC: Probing Cross-view Sparsity Variations for Multi-view Clustering

Ruimeng Liu ⋅ Xin Zou ⋅ Chang Tang ⋅ Xiao Zheng ⋅ Xingchen Hu ⋅ Kun Sun ⋅ Xinwang Liu

Dec 3, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Existing multi-view clustering methods employ various strategies to address data-level sparsity and view-level dynamic fusion. However, we identify a critical yet overlooked issue: varying sparsity across views. Cross-view sparsity variations lead to encoding discrepancies, heightening sample-level semantic heterogeneity and making view-level dynamic weighting inappropriate. To tackle these challenges, we propose Adaptive Sparse Autoencoders for Multi-View Clustering (SparseMVC), a framework with three key modules. Initially, the sparse autoencoder probes the sparsity of each view and adaptively adjusts encoding formats via an entropy-matching loss term, mitigating cross-view inconsistencies. Subsequently, the correlation-informed sample reweighting module employs attention mechanisms to assign weights by capturing correlations between early-fused global and view-specific features, reducing encoding discrepancies and balancing contributions. Furthermore, the cross-view distribution alignment module aligns feature distributions during the late fusion stage, accommodating datasets with an arbitrary number of views. Extensive experiments demonstrate that SparseMVC achieves state-of-the-art clustering performance. Our framework advances the field by extending sparsity handling from the data-level to view-level and mitigating the adverse effects of encoding discrepancies through sample-level dynamic weighting. The source code is publicly available at https://github.com/cleste-pome/SparseMVC.

View full details

Poster

Caption This, Reason That: VLMs Caught in the Middle

Zihan Weng ⋅ Lucas Gomez ⋅ Taylor Webb ⋅ Pouya Bashivan

Dec 3, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Vision-Language Models (VLMs) have shown remarkable progress in visual understanding in recent years. Yet, they still lag behind human capabilities in specific visual tasks such as counting or relational reasoning. To understand the underlying limitations, we adopt methodologies from cognitive science, analyzing VLM performance along core cognitive axes: Perception, Attention, and Memory. Using a suite of tasks targeting these abilities, we evaluate state-of-the-art VLMs, including GPT-4o. Our analysis reveals distinct cognitive profiles: while advanced models approach ceiling performance on some tasks (e.g. category identification), a significant gap persists, particularly in tasks requiring spatial understanding or selective attention. Investigating the source of these failures and potential methods for improvement, we employ a vision-text decoupling analysis, finding that models struggling with direct visual reasoning show marked improvement when reasoning over their own generated text captions. These experiments reveal a strong need for improved VLM Chain-of-Thought (CoT) abilities, even in models that consistently exceed human performance. Furthermore, we demonstrate the potential of targeted fine-tuning on composite visual reasoning tasks and show that fine-tuning smaller VLMs moderately improves core cognitive abilities. While this improvement does not translate to large enhancements on challenging, out-of-distribution benchmarks, we show broadly that VLM performance on our datasets strongly correlates with performance on established benchmarks like MMMU-Pro and VQAv2. Our work provides a detailed analysis of VLM cognitive strengths and weaknesses and identifies key bottlenecks in simultaneous perception and reasoning while also providing an effective and simple solution.

View full details

Poster

JavisGPT: A Unified Multi-modal LLM for Sounding-Video Comprehension and Generation

Kai Liu ⋅ Jungang Li ⋅ Yuchong Sun ⋅ Shengqiong Wu ⋅ jianzhang gao ⋅ Daoan Zhang ⋅ Wei Zhang ⋅ Sheng Jin ⋅ Sicheng Yu ⋅ Geng Zhan ⋅ Jiayi Ji ⋅ Fan Zhou ⋅ Liang Zheng ⋅ Shuicheng Yan ⋅ Hao Fei ⋅ Tat-Seng Chua

Dec 3, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

This paper presents JavisGPT, the first unified multimodal large language model (MLLM) for joint audio-video (JAV) comprehension and generation. JavisGPT has a concise encoder-LLM-decoder architecture, which has a SyncFusion module for spatio-temporal audio-video fusion and synchrony-aware learnable queries to bridge a pretrained JAV-DiT generator. This design enables temporally coherent video-audio understanding and generation from multimodal instructions. We design an effective three-stage training pipeline consisting of multimodal pretraining, audio-video fine-tuning, and large-scale instruction-tuning, to progressively build multimodal comprehension and generation from existing vision-language models. For instruction tuning, we construct JavisInst-Omni, a high-quality instruction dataset with over 200K GPT-4o-curated audio-video-text dialogues that cover diverse and multi-level comprehension and generation scenarios. On JAV comprehension and generation benchmarks, our experiments show that JavisGPT outperforms existing MLLMs, particularly in complex and temporally synchronized settings.

View full details

Poster

A Token is Worth over 1,000 Tokens: Efficient Knowledge Distillation through Low-Rank Clone

Jitai Hao ⋅ Qiang Huang ⋅ Hao Liu ⋅ Xinyan Xiao ⋅ Zhaochun Ren ⋅ Jun Yu

Dec 3, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Training high-performing Small Language Models (SLMs) remains computationally expensive, even with knowledge distillation and pruning from larger teacher models. Existing approaches often face three key challenges: (1) information loss from hard pruning, (2) inefficient alignment of representations, and (3) underutilization of informative activations, particularly from Feed-Forward Networks (FFNs). To address these challenges, we introduce \textbf{Low-Rank Clone (LRC)}, an efficient pre-training method that constructs SLMs aspiring to behavioral equivalence with strong teacher models. LRC trains a set of low-rank projection matrices that jointly enable soft pruning by compressing teacher weights, and activation clone by aligning student activations, including FFN signals, with those of the teacher. This unified design maximizes knowledge transfer while removing the need for explicit alignment modules. Extensive experiments with open-source teachers such as Llama-3.2-3B-Instruct and Qwen2.5-3B/7B-Instruct show that LRC matches or surpasses the performance of state-of-the-art models trained on trillions of tokens--using only 20B tokens, achieving over \textbf{1,000$\times$} greater training efficiency. Our codes and model checkpoints are available at https://github.com/CURRENTF/LowRankClone and https://huggingface.co/JitaiHao/LRC-4B-Base.

View full details

Poster

Open-Insect: Benchmarking Open-Set Recognition of Novel Species in Biodiversity Monitoring

Yuyan Chen ⋅ Nico Lang ⋅ B. Schmidt ⋅ Aditya Jain ⋅ Yves Basset ⋅ Sara Beery ⋅ Maxim Larrivee ⋅ David Rolnick

Dec 3, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Global biodiversity is declining at an unprecedented rate, yet little information isknown about most species and how their populations are changing. Indeed, some90% Earth’s species are estimated to be completely unknown. Machine learning hasrecently emerged as a promising tool to facilitate long-term, large-scale biodiversitymonitoring, including algorithms for fine-grained classification of species fromimages. However, such algorithms typically are not designed to detect examplesfrom categories unseen during training – the problem of open-set recognition(OSR) – limiting their applicability for highly diverse, poorly studied taxa such asinsects. To address this gap, we introduce Open-Insect, a large-scale, fine-graineddataset to evaluate unknown species detection across different geographic regionswith varying difficulty. We benchmark 38 OSR algorithms across three categories:post-hoc, training-time regularization, and training with auxiliary data, finding thatsimple post-hoc approaches remain a strong baseline. We also demonstrate how toleverage auxiliary data to improve species discovery in regions with limited data.Our results provide timely insights to guide the development of computer visionmethods for biodiversity monitoring and species discovery.

View full details

Poster

TREND: Unsupervised 3D Representation Learning via Temporal Forecasting for LiDAR Perception

Runjian Chen ⋅ Hyoungseob Park ⋅ Bo Zhang ⋅ Wenqi Shao ⋅ Ping Luo ⋅ Alex Wong

Dec 3, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Labeling LiDAR point clouds is notoriously time-and-energy-consuming, which spurs recent unsupervised 3D representation learning methods to alleviate the labeling burden in LiDAR perception via pretrained weights. Existing work focus on either masked auto encoding or contrastive learning on LiDAR point clouds, which neglects the temporal LiDAR sequence that naturally accounts for object motion (and their semantics). Instead, we propose TREND, short for Temporal REndering with Neural fielD, to learn 3D representation via forecasting the future observation in an unsupervised manner. TREND integrates forecasting for 3D pre-training through a Recurrent Embedding scheme to generate 3D embeddings across time and a Temporal LiDAR Neural Field specifically designed for LiDAR modality to represent the 3D scene, with which we compute the loss using differentiable rendering. We evaluate TREND on 3D object detection and LiDAR semantic segmentation tasks on popular datasets, including Once, Waymo, NuScenes, and SemanticKITTI. TREND generally improves from-scratch models across datasets and tasks and brings gains of 1.77\% mAP on Once and 2.11\% mAP on NuScenes, which are up to 400\% more improvement compared to previous SOTA unsupervised 3D pre-training methods. Codes and models will be available.

View full details

Poster

Do-PFN: In-Context Learning for Causal Effect Estimation

Jake Robertson ⋅ Arik Reuter ⋅ Siyuan Guo ⋅ Noah Hollmann ⋅ Frank Hutter ⋅ Bernhard Schölkopf

Dec 3, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Causal effect estimation is critical to a range of scientific disciplines. Existing methods for this task either require interventional data, knowledge about the ground-truth causal graph, or rely on assumptions such as unconfoundedness, restricting their applicability in real-world settings. In the domain of tabular machine learning, Prior-data fitted networks (PFNs) have achieved state-of-the-art predictive performance, having been pre-trained on synthetic causal data to solve tabular prediction problems via in-context learning. To assess whether this can be transferred to the problem of causal effect estimation, we pre-train PFNs on synthetic data drawn from a wide variety of causal structures, including interventions, to predict interventional outcomes given observational data. Through extensive experiments in synthetic and semi-synthetic settings, we show that our approach allows for the accurate estimation of causal effects without knowledge of the underlying causal graph.

View full details

Poster

GeoSVR: Taming Sparse Voxels for Geometrically Accurate Surface Reconstruction

Jiahe Li ⋅ Jiawei Zhang ⋅ Youmin Zhang ⋅ Xiao Bai ⋅ Jin Zheng ⋅ Xiaohan Yu ⋅ Lin Gu

Dec 3, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Reconstructing accurate surfaces with radiance fields has achieved remarkable progress in recent years. However, prevailing approaches, primarily based on Gaussian Splatting, are increasingly constrained by representational bottlenecks. In this paper, we introduce GeoSVR, an explicit voxel-based framework that explores and extends the under-investigated potential of sparse voxels for achieving accurate, detailed, and complete surface reconstruction. As strengths, sparse voxels support preserving the coverage completeness and geometric clarity, while corresponding challenges also arise from absent scene constraints and locality in surface refinement. To ensure correct scene convergence, we first propose a Voxel-Uncertainty Depth Constraint that maximizes the effect of monocular depth cues while presenting a voxel-oriented uncertainty to avoid quality degradation, enabling effective and robust scene constraints yet preserving highly accurate geometries. Subsequently, Sparse Voxel Surface Regularization is designed to enhance geometric consistency for tiny voxels and facilitate the voxel-based formation of sharp and accurate surfaces. Extensive experiments demonstrate our superior performance compared to existing methods across diverse challenging scenarios, excelling in geometric accuracy, detail preservation, and reconstruction completeness while maintaining high efficiency. Code is available at https://github.com/Fictionarry/GeoSVR.

View full details

Poster

What Makes a Reward Model a Good Teacher? An Optimization Perspective

Noam Razin ⋅ Zixuan Wang ⋅ Hubert Strauss ⋅ Stanley Wei ⋅ Jason Lee ⋅ Sanjeev Arora

Dec 3, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

The success of Reinforcement Learning from Human Feedback (RLHF) critically depends on the quality of the reward model. However, while this quality is primarily evaluated through accuracy, it remains unclear whether accuracy fully captures what makes a reward model an effective teacher. We address this question from an optimization perspective. First, we prove that regardless of how accurate a reward model is, if it induces low reward variance, then the RLHF objective suffers from a flat landscape. Consequently, even a perfectly accurate reward model can lead to extremely slow optimization, underperforming less accurate models that induce higher reward variance. We additionally show that a reward model that works well for one language model can induce low reward variance, and thus a flat objective landscape, for another. These results establish a fundamental limitation of evaluating reward models solely based on accuracy or independently of the language model they guide. Experiments using models of up to 8B parameters corroborate our theory, demonstrating the interplay between reward variance, accuracy, and reward maximization rate. Overall, our findings highlight that beyond accuracy, a reward model needs to induce sufficient variance for efficient optimization.

View full details

Poster

CLiFT: Compressive Light-Field Tokens for Compute Efficient and Adaptive Neural Rendering

Zhengqing Wang ⋅ Yuefan Wu ⋅ Jiacheng Chen ⋅ Fuyang Zhang ⋅ Yasutaka Furukawa

Dec 3, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

This paper proposes a neural rendering approach that represents a scene as "compressed light-field tokens (CLiFTs)", retaining rich appearance and geometric information of a scene. CLiFT enables compute-efficient rendering by compressed tokens, while being capable of changing the number of tokens to represent a scene or render a novel view with one trained network. Concretely, given a set of images, multi-view encoder tokenizes the images with the camera poses. Latent-space K-means selects a reduced set of rays as cluster centroids using the tokens. The multi-view ``condenser'' compresses the information of all the tokens into the centroid tokens to construct CLiFTs. At test time, given a target view and a compute budget (i.e., the number of CLiFTs), the system collects the specified number of nearby tokens and synthesizes a novel view using a compute-adaptive renderer. trained to handle a variable number of tokens. Extensive experiments on RealEstate10K and DL3DV datasets quantitatively and qualitatively validate our approach, achieving significant data reduction with comparable rendering quality and the highest overall rendering score, while providing trade-offs of data size, rendering quality, and rendering speed.

View full details

Poster

Path-Enhanced Contrastive Learning for Recommendation

Haoran Sun ⋅ Fei Xiong ⋅ Yuanzhe Hu ⋅ Liang Wang

Dec 3, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Collaborative filtering (CF) methods are now facing the challenge of data sparsity in recommender systems. In order to reduce the effect of data sparsity, researchers proposed contrastive learning methods to extract self-supervised signals from raw data. Contrastive learning methods address this problem by graph augmentation and maximizing the consistency of node representations between different augmented graphs. However, these methods tends to unintentionally distance the target node from its path nodes on the interaction path, thus limiting its effectiveness. In this regard, we propose a solution that uses paths as samples in the contrastive loss function. In order to obtain the path samples, we design a path sampling method. In addition to the contrast of the relationship between the target node and the nodes within the path (intra-path contrast), we also designed a method of contrasting the relationship between the paths (inter-path contrast) to better pull the target node and its path nodes closer to each other. We use Simplifying and Powering Graph Convolution Network (LightGCN) as the basis and combine with a new path-enhanced graph approach proposed for graph augmentation. It effectively improves the performance of recommendation models. Our proposed Path Enhanced Contrastive Loss (PECL) model replaces the common contrastive loss function with our novel loss function, showing significant performance improvement. Experiments on three real-world datasets demonstrate the effectiveness of our model.

View full details

Poster

To Think or Not To Think: A Study of Thinking in Rule-Based Visual Reinforcement Fine-Tuning

Ming Li ⋅ Jike Zhong ⋅ Shitian Zhao ⋅ Yuxiang Lai ⋅ Haoquan Zhang ⋅ Wang Bill Zhu ⋅ Kaipeng Zhang

Dec 3, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

This paper investigates the role of explicit thinking process in rule-based reinforcement fine-tuning (RFT) for multi-modal large language models (MLLMs). We first extend \textit{Thinking-RFT} to image classification task, using verifiable rewards for fine-tuning~(FT). Experiments show {Thinking-RFT} significantly outperforms supervised FT and yields a cross-dataset generalization effect. We then rethink and question whether explicit thinking in RFT is always necessary and beneficial. Challenging the convention that explicit thinking is crucial for the success of RFT, we introduce \textit{No-Thinking-RFT}, exploring RFT without thinking by introducing a simple equality accuracy reward. We evaluate No-Thinking-RFT on six diverse tasks across different model sizes and types. Experiment results reveal four key findings: \textbf{(1).} Visual perception tasks do not require thinking during RFT, as No-Thinking-RFT consistently outperforms or matches Thinking-RFT across model sizes and types. \textbf{(2).} Models with limited capabilities struggle to generate high-quality CoT for RFT, making Thinking-RFT less effective than No-Thinking-RFT. \textbf{(3).} There are inconsistencies between the answers in the thinking tags and answer tags for some responses of Thinking-RFT, which show lower average accuracy than the overall accuracy. \textbf{(4).} The performance gain of No-Thinking-RFT mainly stems from improved learning during no thinking FT and the avoidance of inference overthinking, as evidenced by the partial gains from appending empty thinking tags at inference time of Thinking-RFT. We hypothesize that explicit thinking before verifiable answers may hinder reward convergence and reduce performance in certain scenarios. To test this, we propose \textit{Think-After-Answer}, which places thinking after the answer to mitigate this effect for experimental verification. Lastly, we conduct a pilot study to explore whether MLLMs can learn when to think during RFT, introducing an \textit{Adaptive-Thinking} method. Experiments show that model converges to either thinking or not depending on model capability, achieving comparable or better performance than both Thinking and No-Thinking-RFT. Our findings suggest MLLMs can adaptively decide to think or not based on their capabilities and task complexity, offering insights into the thinking process in RFT.

View full details

Poster

SQS: Enhancing Sparse Perception Models via Query-based Splatting in Autonomous Driving

Haiming Zhang ⋅ Yiyao Zhu ⋅ Wending Zhou ⋅ Xu Yan ⋅ Yingjie CAI ⋅ Bingbing Liu ⋅ Shuguang Cui ⋅ Zhen Li

Dec 3, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Sparse Perception Models (SPMs) adopt a query-driven paradigm that forgoes explicit dense BEV or volumetric construction, enabling highly efficient computation and accelerated inference. In this paper, we introduce SQS, a novel query-based splatting pre-training specifically designed to advance SPMs in autonomous driving. SQS introduces a plug-in module that predicts 3D Gaussian representations from sparse queries during pre-training, leveraging self-supervised splatting to learn fine-grained contextual features through the reconstruction of multi-view images and depth maps. During fine-tuning, the pre-trained Gaussian queries are seamlessly integrated into downstream networks via query interaction mechanisms that explicitly connect pre-trained queries with task-specific queries, effectively accommodating the diverse requirements of occupancy prediction and 3D object detection. Extensive experiments on autonomous driving benchmarks demonstrate that SQS delivers considerable performance gains across multiple query-based 3D perception tasks, notably in occupancy prediction and 3D object detection, outperforming prior state-of-the-art pre-training approaches by a significant margin (i.e., +1.3 mIoU on occupancy prediction and +1.0 NDS on 3D detection).

View full details

Poster

Improving Evolutionary Multi-View Classification via Eliminating Individual Fitness Bias

Xinyan Liang ⋅ Shuai Li ⋅ Qian Guo ⋅ Yuhua Qian ⋅ Bingbing Jiang ⋅ Tingjin Luo ⋅ Liang Du

Dec 3, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Evolutionary multi-view classification (EMVC) methods have gained wide recognition due to their adaptive mechanisms. Fitness evaluation (FE), which aims to calculate the classification performance of each individual in the population and provide reliable performance ranking for subsequent operations, is a core step in such methods. Its accuracy directly determines the correctness of the evolutionary direction. That is, when FE fails to correctly reflect the superiority-inferiority relationship among individuals, it will lead to confusion in individual performance ranking, which in turn misleads the evolutionary direction and results in trapping into local optima. This paper is the first to identify the aforementioned issue in the field of EMVC and call it as fitness evaluation bias (FEB). FEB may be caused by a variety of factors, and this paper approaches the issue from the perspective of view information content: existing methods generally adopt joint training strategies, which restrict the exploration of key information in views with low information content. This makes it difficult for multi-view model (MVM) to achieve optimal performance during convergence, which in turn leads to FE failing to accurately reflect individual performance rankings and ultimately triggering FEB. To address this issue, we propose an evolutionary multi-view classification via eliminating individual fitness bias (EFB-EMVC) method, which alleviates the FEB issue by introducing evolutionary navigators for each MVM, thereby providing more accurate individual ranking. Experimental results fully verify the effectiveness of the proposed method in alleviating the FEB problem, and the EMVC method equipped with this strategy exhibits more superior performance compared with the original EMVC method. (The code is available at https://github.com/LiShuailzn/Neurips-2025-EFB-EMVC)

View full details

Poster

RobustMerge: Parameter-Efficient Model Merging for MLLMs with Direction Robustness

Fanhu Zeng ⋅ Haiyang Guo ⋅ Fei Zhu ⋅ Li Shen ⋅ Hao Tang

Dec 3, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Fine-tuning pre-trained models with custom data leads to numerous expert models on specific tasks. Merging models into one universal model to empower multi-task ability refraining from data leakage has gained popularity. With the expansion in data and model size, parameter-efficient tuning becomes the common practice for obtaining task-specific models efficiently. However, few methods are dedicated to efficient merging, and existing methods designed for full fine-tuning merging fail under efficient merging. To address the issue, we analyze from low-rank decomposition and reveal that direction robustness during merging is crucial for merging efficient modules. We furthermore uncover that compensating for the gap between stark singular values contributes to direction robustness. Therefore, we propose RobustMerge, a training-free parameter-efficient merging method with complementary parameter adaptation to maintain direction robustness. Specifically, we (1) prune parameters and scale coefficients from inter-parameter relations for singular values to maintain direction stability away from task interference, and (2) perform cross-task normalization to enhance unseen task generalization. We establish a benchmark consisting of diverse multimodal tasks, on which we conduct experiments to certify the outstanding performance and generalizability of our method. Additional studies and extensive analyses further showcase the effectiveness.

View full details

Poster

Enhancing LLM Watermark Resilience Against Both Scrubbing and Spoofing Attacks

Huanming Shen ⋅ Baizhou Huang ⋅ Xiaojun Wan

Dec 3, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Watermarking is widely regarded as a promising defense against the misuse of large language models (LLMs); however, existing methods are fundamentally constrained by their vulnerability to scrubbing and spoofing attacks. This vulnerability stems from an inherent trade-off governed by watermark window size: smaller windows resist scrubbing better but are easier to reverse-engineer, enabling low-cost statistics-based spoofing attacks. This work expands the trade-off boundary by introducing a novel mechanism, equivalent texture keys, where multiple tokens within a watermark window can independently support the detection. Based on the redundancy, we propose a watermark scheme with **S**ub-vocabulary decomposed **E**quivalent t**E**xture **K**ey (**SEEK**). SEEK achieves a Pareto improvement, enhancing robustness to scrubbing attacks without sacrificing resistance to spoofing.

View full details

Poster

ALINE: Joint Amortization for Bayesian Inference and Active Data Acquisition

Daolang Huang ⋅ Xinyi Wen ⋅ Ayush Bharti ⋅ Samuel Kaski ⋅ Luigi Acerbi

Dec 3, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Many critical applications, from autonomous scientific discovery to personalized medicine, demand systems that can both strategically acquire the most informative data and instantaneously perform inference based upon it. While amortized methods for Bayesian inference and experimental design offer part of the solution, neither approach is optimal in the most general and challenging task, where new data needs to be collected for instant inference. To tackle this issue, we introduce the Amortized Active Learning and Inference Engine (ALINE), a unified framework for amortized Bayesian inference and active data acquisition. ALINE leverages a transformer architecture trained via reinforcement learning with a reward based on self-estimated information gain provided by its own integrated inference component. This allows it to strategically query informative data points while simultaneously refining its predictions. Moreover, ALINE can selectively direct its querying strategy towards specific subsets of model parameters or designated predictive tasks, optimizing for posterior estimation, data prediction, or a mixture thereof. Empirical results on regression-based active learning, classical Bayesian experimental design benchmarks, and a psychometric model with selectively targeted parameters demonstrate that ALINE delivers both instant and accurate inference along with efficient selection of informative points.

View full details

Poster

SceneDesigner: Controllable Multi-Object Image Generation with 9-DoF Pose Manipulation

Zhenyuan Qin ⋅ Xincheng Shuai ⋅ Henghui Ding

Dec 3, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Controllable image generation has attracted increasing attention in recent years, enabling users to manipulate visual content such as identity and style. However, achieving simultaneous control over the 9D poses (location, size, and orientation) of multiple objects remains an open challenge. Despite recent progress, existing methods often suffer from limited controllability and degraded quality, falling short of comprehensive multi-object 9D pose control. To address these limitations, we propose ***SceneDesigner***, a method for accurate and flexible multi-object 9-DoF pose manipulation. ***SceneDesigner*** incorporates a branched network to the pre-trained base model and leverages a new representation, ***CNOCS map***, which encodes 9D pose information from the camera view. This representation exhibits strong geometric interpretation properties, leading to more efficient and stable training. To support training, we construct a new dataset, ***ObjectPose9D***, which aggregates images from diverse sources along with 9D pose annotations. To further address data imbalance issues, particularly performance degradation on low-frequency poses, we introduce a two-stage training strategy with reinforcement learning, where the second stage fine-tunes the model using a reward-based objective on rebalanced data. At inference time, we propose ***Disentangled Object Sampling***, a technique that mitigates insufficient object generation and concept confusion in complex multi-object scenes. Moreover, by integrating user-specific personalization weights, ***SceneDesigner*** enables customized pose control for reference subjects. Extensive qualitative and quantitative experiments demonstrate that ***SceneDesigner*** significantly outperforms existing approaches in both controllability and quality.

View full details

Poster

Jacobian-Based Interpretation of Nonlinear Neural Encoding Model

Xiaohui Gao ⋅ Haoran Yang ⋅ cheng yue ⋅ Mengfei Zuo ⋅ Yiheng Liu ⋅ Peiyang Li ⋅ Xiaohui Gao

Dec 3, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

In recent years, the alignment between artificial neural network (ANN) embeddings and blood oxygenation level dependent (BOLD) responses in functional magnetic resonance imaging (fMRI) via neural encoding models has significantly advanced research on neural representation mechanisms and interpretability in the brain. However, these approaches remain limited in characterizing the brain’s inherently nonlinear response properties. To address this, we propose the Jacobian-based Nonlinearity Evaluation (JNE), an interpretability metric for nonlinear neural encoding models. JNE quantifies nonlinearity by statistically measuring the dispersion of local linear mappings (Jacobians) from model representations to predicted BOLD responses, thereby approximating the nonlinearity of BOLD signals. Centered on proposing JNE as a novel interpretability metric, we validated its effectiveness through controlled simulation experiments on various activation functions and network architectures, and further verified it on real fMRI data, demonstrating a hierarchical progression of nonlinear characteristics from primary to higher-order visual cortices, consistent with established cortical organization. We further extended JNE with Sample-Specificity (JNE-SS), revealing stimulus-selective nonlinear response patterns in functionally specialized brain regions. As the first interpretability metric for quantifying nonlinear responses, JNE provides new insights into brain information processing. Code available at https://github.com/Gaitxh/JNE.

View full details

Poster

G-Memory: Tracing Hierarchical Memory for Multi-Agent Systems

Guibin Zhang ⋅ Muxin Fu ⋅ Kun Wang ⋅ Frank Wan ⋅ Miao Yu ⋅ Shuicheng Yan

Dec 3, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Large language model (LLM)-powered multi-agent systems (MAS) have demonstrated cognitive and execution capabilities that far exceed those of single LLM agents, yet their capacity for self-evolution remains hampered by underdeveloped memory architectures. Upon close inspection, we are alarmed to discover that prevailing MAS memory mechanisms (1) are overly simplistic, completely disregarding the nuanced inter-agent collaboration trajectories, and (2) lack cross-trial and agent-specific customization, in stark contrast to the expressive memory developed for single agents. To bridge this gap, we introduce G-Memory, a hierarchical, agentic memory system for MAS inspired by organizational memory theory, which manages the lengthy MAS interaction via a three-tier graph hierarchy: insight, query, and interaction graphs. Upon receiving a new user query, G-Memory performs bi-directional memory traversal to retrieve both \textit{high-level, generalizable insights} that enable the system to leverage cross-trial knowledge, and \textit{fine-grained, condensed interaction trajectories} that compactly encode prior collaboration experiences. Upon task execution, the entire hierarchy evolves by assimilating new collaborative trajectories, nurturing the progressive evolution of agent teams. Extensive experiments across five benchmarks, three LLM backbones, and three popular MAS frameworks demonstrate that G-Memory improves success rates in embodied action and accuracy in knowledge QA by up to $20.89\\%$ and $10.12\\%$, respectively, without any modifications to the original frameworks.

View full details

Poster

Puppeteer: Rig and Animate Your 3D Models

Chaoyue Song ⋅ Xiu Li ⋅ Fan Yang ⋅ Zhongcong XU ⋅ Jiacheng Wei ⋅ Fayao Liu ⋅ Jiashi Feng ⋅ Guosheng Lin ⋅ Jianfeng Zhang

Dec 3, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Modern interactive applications increasingly demand dynamic 3D content, yet the transformation of static 3D models into animated assets constitutes a significant bottleneck in content creation pipelines. While recent advances in generative AI have revolutionized static 3D model creation, rigging and animation continue to depend heavily on expert intervention. We present \textbf{Puppeteer}, a comprehensive framework that addresses both automatic rigging and animation for diverse 3D objects. Our system first predicts plausible skeletal structures via an auto-regressive transformer that introduces a joint-based tokenization strategy for compact representation and a hierarchical ordering methodology with stochastic perturbation that enhances bidirectional learning capabilities. It then infers skinning weights via an attention-based architecture incorporating topology-aware joint attention that explicitly encodes inter-joint relationships based on skeletal graph distances. Finally, we complement these rigging advances with a differentiable optimization-based animation pipeline that generates stable, high-fidelity animations while being computationally more efficient than existing approaches. Extensive evaluations across multiple benchmarks demonstrate that our method significantly outperforms state-of-the-art techniques in both skeletal prediction accuracy and skinning quality. The system robustly processes diverse 3D content, ranging from professionally designed game assets to AI-generated shapes, producing temporally coherent animations that eliminate the jittering issues common in existing methods.

View full details

Poster

Injecting Frame-Event Complementary Fusion into Diffusion for Optical Flow in Challenging Scenes

Haonan Wang ⋅ Hanyu Zhou ⋅ Haoyue Liu ⋅ Luxin Yan

Dec 3, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Optical flow estimation has achieved promising results in conventional scenes but faces challenges in high-speed and low-light scenes, which suffer from motion blur and insufficient illumination. These conditions lead to weakened texture and amplified noise and deteriorate the appearance saturation and boundary completeness of frame cameras, which are necessary for motion feature matching. In degraded scenes, the frame camera provides dense appearance saturation but sparse boundary completeness due to its long imaging time and low dynamic range. In contrast, the event camera offers sparse appearance saturation, while its short imaging time and high dynamic range gives rise to dense boundary completeness. Traditionally, existing methods utilize feature fusion or domain adaptation to introduce event to improve boundary completeness. However, the appearance features are still deteriorated, which severely affects the mostly adopted discriminative models that learn the mapping from visual features to motion fields and generative models that generate motion fields based on given visual features. So we introduce diffusion models that learn the mapping from noising flow to clear flow, which is not affected by the deteriorated visual features. Therefore, we propose a novel optical flow estimation framework Diff-ABFlow based on diffusion models with frame-event appearance-boundary fusion. Inspired by the appearance-boundary complementarity of frame and event, we propose an Attention-Guided Appearance-Boundary Fusion module to fuse frame and event. Based on diffusion models, we propose a Multi-Condition Iterative Denoising Decoder. Our proposed method can effectively utilize the respective advantages of frame and event, and shows great robustness to degraded input. In addition, we propose a dual-modal optical flow dataset for generalization experiments. Extensive experiments have verified the superiority of our proposed method. The code is released at .

View full details

Poster

Alligat0R: Pre-Training through Covisibility Segmentation for Relative Camera Pose Regression

Thibaut Loiseau ⋅ Guillaume Bourmaud ⋅ Vincent Lepetit

Dec 3, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Pre-training techniques have greatly advanced computer vision, with CroCo’s cross-view completion approach yielding impressive results in tasks like 3D reconstruction and pose regression. However, cross-view completion is ill-posed in non-covisible regions, limiting its effectiveness. We introduce Alligat0R, a novel pre-training approach that replaces cross-view learning with a covisibility segmentation task. Our method predicts whether each pixel in one image is covisible in the second image, occluded, or outside the field of view, making the pre-training effective in both covisible and non-covisible regions, and provides interpretable predictions. To support this, we present Cub3, a large-scale dataset with 5M image pairs and dense covisibility annotations derived from the nuScenes and ScanNet datasets. Cub3 includes diverse scenarios with varying degrees of overlap. The experiments show that our novel pre-training method Alligat0R significantly outperforms CroCo in relative pose regression. Alligat0R and Cub3 will be made publicly available.

View full details

Poster

Rethinking Entropy in Test-Time Adaptation: The Missing Piece from Energy Duality

Mincheol Park ⋅ Heeji Won ⋅ Won Woo Ro ⋅ Suhyun Kim

Dec 3, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Test-time adaptation (TTA) aims to preserve model performance under distribution shifts. Yet, most existing methods rely on entropy minimization for confident predictions. This paper re-examines the sufficiency of entropy minimization by analyzing its dual relationship with energy. We view energy as a proxy for likelihood, where lower energy indicates higher observability under the learned distribution. We uncover that entropy and energy are tightly associated, controlled by the model’s confidence or ambiguity, and show that simultaneous reduction of both is essential. Importantly, we reveal that entropy minimization alone neither ensures energy reduction nor supports reliable likelihood estimation, and it requires explicit discriminative guidance to reach zero entropy. To combat these problems, we propose a twofold solution. First, we introduce a likelihood-based objective grounded in energy-based models, which reshape the energy landscape to favor test samples. For stable and scalable training, we adopt sliced score matching—a sampling-free, Hessian-insensitive approximation of Fisher divergence. Second, we enhance entropy minimization with a cross-entropy that treats the predicted class as a target to promote discriminability. By counterbalancing entropy and energy through the solution of multi-objective optimization, our unified TTA, ReTTA, outperforms existing entropy- or energy-based approaches across diverse distribution shifts.

View full details

Poster

Personalized Decision Modeling: Utility Optimization or Textualized-Symbolic Reasoning

Yibo Zhao ⋅ Yang Zhao ⋅ Hongru Du ⋅ Hao Frank Yang

Dec 3, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Decision-making models for individuals, particularly in high-stakes scenarios like vaccine uptake, often diverge from population optimal predictions. This gap arises from the uniqueness of the individual decision-making process, shaped by numerical attributes (e.g., cost, time) and linguistic influences (e.g., personal preferences and constraints). Developing upon Utility Theory and leveraging the textual-reasoning capabilities of Large Language Models (LLMs), this paper proposes an Adaptive Textual-symbolic Human-centric Reasoning framework (ATHENA) to address the optimal information integration. ATHENA uniquely integrates two stages: First, it discovers robust, group-level symbolic utility functions via LLM-augmented symbolic discovery; Second, it implements individual-level semantic adaptation, creating personalized semantic templates guided by the optimal utility to model personalized choices. Validated on real-world travel mode and vaccine choice tasks, ATHENA consistently outperforms utility-based, machine learning, and other LLM-based models, lifting F1 score by at least 6.5\% over the strongest cutting-edge models. Further, ablation studies confirm that both stages of ATHENA are critical and complementary, as removing either clearly degrades overall predictive performance. By organically integrating symbolic utility modeling and semantic adaptation, ATHENA provides a new scheme for modeling human-centric decisions. The project page can be found at https://yibozh.github.io/Athena.

View full details

Poster

Web-Shepherd: Advancing PRMs for Reinforcing Web Agents

Hyungjoo Chae ⋅ Seonghwan Kim ⋅ Junhee Cho ⋅ Seungone Kim ⋅ Seungjun Moon ⋅ Gyeom Hwangbo ⋅ Dongha Lim ⋅ Minjin Kim ⋅ Yeonjun Hwang ⋅ Minju Gwak ⋅ Dongwook Choi ⋅ Minseok Kang ⋅ Gwanhoon Im ⋅ ByeongUng Cho ⋅ Hyojun Kim ⋅ Jun Han ⋅ Taeyoon Kwon ⋅ Minju Kim ⋅ Beong-woo Kwak ⋅ Dongjin Kang ⋅ Jinyoung Yeo

Dec 3, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Web navigation is a unique domain that can automate many repetitive real-life tasks and is challenging as it requires long-horizon sequential decision making beyond typical multimodal large language model (MLLM) tasks. Yet, specialized reward models for web navigation that can be utilized during both training and test-time have been absent until now. Despite the importance of speed and cost-effectiveness, prior works have utilized MLLMs as reward models, which poses significant constraints for real-world deployment. To address this, in this work, we propose the first process reward model (PRM) called Web-Shepherd which could assess web navigation trajectories in a step-level. To achieve this, we first construct the WebPRM Collection, a large-scale dataset with 40K step-level preference pairs and annotated checklists spanning diverse domains and difficulty levels. Next, we also introduce the WebRewardBench, the first meta-evaluation benchmark for evaluating PRMs. In our experiments, we observe that our Web-Shepherd achieves about 30 points better accuracy compared to using GPT-4o on WebRewardBench. Furthermore, when testing on WebArena-lite by using GPT-4o-mini as the policy and Web-Shepherd as the verifier, we achieve 10.9 points better performance, in 10x less cost compared to using GPT-4o-mini as the verifier. Our model, dataset, and code are publicly available at https://github.com/kyle8581/Web-Shepherd.

View full details

Poster

Graph-Based Attention for Differentiable MaxSAT Solving

Sota Moriyama ⋅ Katsumi Inoue

Dec 3, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

The use of deep learning to solve fundamental AI problems such as Boolean Satisfiability (SAT) has been explored recently to develop robust and scalable reasoning systems. This work advances such neural-based reasoning approaches by developing a new Graph Neural Network (GNN) to differentiably solve (weighted) Maximum Satisfiability (MaxSAT). To this end, we propose SAT-based Graph Attention Networks (SGATs) as novel GNNs that are built on t-norm based attention and message passing mechanisms, and structurally designed to approximate greedy distributed local search. To demonstrate the effectiveness of our model, we develop a local search solver that uses SGATs to continuously solve any given MaxSAT problem. Experiments on (weighted) MaxSAT benchmark datasets demonstrate that SGATs significantly outperform existing neural-based architectures, and achieve state-of-the-art performance among continuous approaches, highlighting the strength of the proposed model.

View full details

Poster

Learning Interestingness in Automated Mathematical Theory Formation

George Tsoukalas ⋅ Rahul Saha ⋅ Amitayush Thakur ⋅ Sabrina Reguyal ⋅ Swarat Chaudhuri

Dec 3, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

We take two key steps in automating the open-ended discovery of new mathematical theories, a grand challenge in artificial intelligence. First, we introduce Fermat, a reinforcement learning (RL) environment that models concept discovery and theorem-proving using a set of symbolic actions, opening up a range of RL problems relevant to theory discovery. Second, we explore a specific problem through Fermat: automatically scoring the interestingness of mathematical objects. We investigate evolutionary algorithms for synthesizing nontrivial interestingness measures. In particular, we introduce an LLM-based evolutionary algorithm that features function abstraction, leading to notable improvements in discovering elementary number theory and finite fields over hard-coded baselines. We open-source the \fermat environment at github.com/trishullab/Fermat.

View full details

Poster

4DGT: Learning a 4D Gaussian Transformer Using Real-World Monocular Videos

Zhen Xu ⋅ Zhengqin Li ⋅ Zhao Dong ⋅ Xiaowei Zhou ⋅ Richard Newcombe ⋅ Zhaoyang Lv

Dec 3, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

We propose 4DGT, a 4D Gaussian-based Transformer model for dynamic scene reconstruction, trained entirely on real-world monocular posed videos. Using 4D Gaussian as an inductive bias, 4DGT unifies static and dynamic components, enabling the modeling of complex, time-varying environments with varying object lifespans. We proposed a novel density control strategy in training, which enables our 4DGT to handle longer space-time input. Our model processes 64 consecutive posed frames in a rolling-window fashion, predicting consistent 4D Gaussians in the scene. Unlike optimization-based methods, 4DGT performs purely feed-forward inference, reducing reconstruction time from hours to seconds and scaling effectively to long video sequences. Trained only on large-scale monocular posed video datasets, 4DGT can outperform prior Gaussian-based networks significantly in real-world videos and achieve on-par accuracy with optimization-based methods on cross-domain videos.

View full details

Poster

Boundary-Value PDEs Meet Higher-Order Differential Topology-aware GNNs

Yunfeng Liao ⋅ Yangxin Wu ⋅ Xiucheng Li

Dec 3, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Recent advances in graph neural network (GNN)-based neural operators have demonstrated significant progress in solving partial differential equations (PDEs) by effectively representing computational meshes. However, most existing approaches overlook the intrinsic physical and topological meaning of higher-order elements in the mesh, which are closely tied to differential forms. In this paper, we propose a higher-order GNN framework that incorporates higher-order interactions based on discrete and finite element exterior calculus. The time-independent boundary value problems (BVPs) in electromagnetism are instantiated to illustrate the proposed framework. It can be easily generalized to other PDEs that admit differential form formulations. Moreover, the novel physics-informed loss terms, integrated form estimators, and theoretical support are derived correspondingly. Experiments show that our proposed method outperforms the existing neural operators by large margins on BVPs in electromagnetism. Our code is available at https://github.com/Supradax/Higher-Order-Differential-Topology-aware-GNN.

View full details

Poster

DNA-DetectLLM: Unveiling AI-Generated Text via a DNA-Inspired Mutation-Repair Paradigm

Xiaowei Zhu ⋅ Yubing Ren ⋅ Fang Fang ⋅ Qingfeng Tan ⋅ Shi Wang ⋅ Yanan Cao

Dec 3, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

The rapid advancement of large language models (LLMs) has blurred the line between AI-generated and human-written text. This progress brings societal risks such as misinformation, authorship ambiguity, and intellectual property concerns, highlighting the urgent need for reliable AI-generated text detection methods. However, recent advances in generative language modeling have resulted in significant overlap between the feature distributions of human-written and AI-generated text, blurring classification boundaries and making accurate detection increasingly challenging. To address the above challenges, we propose a DNA-inspired perspective, leveraging a repair-based process to directly and interpretably capture the intrinsic differences between human-written and AI-generated text. Building on this perspective, we introduce **DNA-DetectLLM**, a zero-shot detection method for distinguishing AI-generated and human-written text. The method constructs an ideal AI-generated sequence for each input, iteratively repairs non-optimal tokens, and quantifies the cumulative repair effort as an interpretable detection signal. Empirical evaluations demonstrate that our method achieves state-of-the-art detection performance and exhibits strong robustness against various adversarial attacks and input lengths. Specifically, DNA-DetectLLM achieves relative improvements of **5.55\%** in AUROC and **2.08\%** in F1 score across multiple public benchmark datasets. Code and data are available at https://github.com/Xiaoweizhu57/DNA-DetectLLM.

View full details

Poster

BevSplat: Resolving Height Ambiguity via Feature-Based Gaussian Primitives for Weakly-Supervised Cross-View Localization

Qiwei Wang ⋅ Shaoxun Wu ⋅ Yujiao Shi

Dec 3, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

This paper addresses the problem of weakly supervised cross-view localization, where the goal is to estimate the pose of a ground camera relative to a satellite image with noisy ground truth annotations. A common approach to bridge the cross-view domain gap for pose estimation is Bird’s-Eye View (BEV) synthesis. However, existing methods struggle with height ambiguity due to the lack of depth information in ground images and satellite height maps. Previous solutions either assume a flat ground plane or rely on complex models, such as cross-view transformers. We propose BevSplat, a novel method that resolves height ambiguity by using feature-based Gaussian primitives. Each pixel in the ground image is represented by a 3D Gaussian with semantic and spatial features, which are synthesized into a BEV feature map for relative pose estimation. We validate our method on the widely used KITTI and VIGOR datasets, which include both pinhole and panoramic query images. Experimental results show that BevSplat significantly improves localization accuracy over prior approaches.

View full details

Poster

Inference-Time Reward Hacking in Large Language Models

Hadi Khalaf ⋅ Claudio Mayrink Verdun ⋅ Alex Oesterling ⋅ Himabindu Lakkaraju ⋅ Flavio Calmon

Dec 3, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

A common paradigm to improve the performance of large language models is optimizing for a reward model. Reward models assign a numerical score to an LLM’s output that indicates, for example, how likely it is to align with user preferences or safety goals. However, reward models are never perfect. They inevitably function as proxies for complex desiderata such as correctness, helpfulness, and safety. By overoptimizing for a misspecified reward, we can subvert intended alignment goals and reduce overall performance -- a phenomenon commonly referred to as reward hacking. In this work, we characterize reward hacking in inference-time alignment and demonstrate when and how we can mitigate it by hedging on the proxy reward. We study this phenomenon under Best-of-$n$ (BoN) and Soft Best-of-$n$ (SBoN), and we introduce Best-of-Poisson (BoP) that provides an efficient, near-exact approximation of the optimal reward-KL divergence policy at inference time. We show that the characteristic pattern of hacking as observed in practice (where the true reward first increases before declining) is an inevitable property of a broad class of inference-time mechanisms, including BoN and BoP. To counter this effect, we introduce $\texttt{HedgeTune}$, an efficient algorithm to find the optimal inference-time parameter. We demonstrate that hedging mitigates reward hacking and achieves superior reward-distortion tradeoffs on math, reasoning, and human-preference setups.

View full details

Poster

SHF: Symmetrical Hierarchical Forest with Pretrained Vision Transformer Encoder for High-Resolution Medical Segmentation

Enzhi Zhang ⋅ Peng Chen ⋅ Rui Zhong ⋅ Du Wu ⋅ Jun Igarashi ⋅ Isaac Lyngaas ⋅ Xiao Wang ⋅ Masaharu Munetomo ⋅ Mohamed Wahib

Dec 3, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

This paper presents a novel approach to addressing the long-sequence problem in high-resolution medical images for Vision Transformers (ViTs). Using smaller patches as tokens can enhance ViT performance, but quadratically increases computation and memory requirements. Therefore, the common practice for applying ViTs to high-resolution images is either to: (a) employ complex sub-quadratic attention schemes or (b) use large to medium-sized patches and rely on additional mechanisms within the model to capture the spatial hierarchy of details. We propose Symmetrical Hierarchical Forest (SHF), a lightweight approach that adaptively patches the input image to increase token information density and encode hierarchical spatial structures into the input embedding. We then apply a reverse depatching scheme to the output embeddings of the transformer encoder, eliminating the need for convolution-based decoders. Unlike previous methods that modify attention mechanisms \wahib{or use a complex hierarchy of interacting models}, SHF can be retrofitted to any ViT model to allow it to learn the hierarchical structure of details in high-resolution images without requiring architectural changes. Experimental results demonstrate significant gains in computational efficiency and performance: on the PAIP WSI dataset, we achieved a 3$\sim$32$\times$ speedup or a 2.95\% to 7.03\% increase in accuracy (measured by Dice score) at a $64K^2$ resolution with the same computational budget, compared to state-of-the-art production models. On the 3D medical datasets BTCV and KiTS, training was 6$\times$ faster, with accuracy gains of 6.93\% and 5.9\%, respectively, compared to models without SHF.

View full details

Poster

Tensor Product Attention Is All You Need

Yifan Zhang ⋅ Yifeng Liu ⋅ Huizhuo Yuan ⋅ Zhen Qin ⋅ Yang Yuan ⋅ Quanquan Gu ⋅ Andrew Yao

Dec 3, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Scaling language models to handle longer input sequences typically necessitates large key-value (KV) caches, resulting in substantial memory overhead during inference. In this paper, we propose Tensor Product Attention (TPA), a novel attention mechanism that uses tensor decompositions to represent queries, keys, and values compactly, substantially shrinking the KV cache size at inference time. By factorizing these representations into contextual low-rank components and seamlessly integrating with Rotary Position Embedding (RoPE), TPA achieves improved model quality alongside memory efficiency. Based on TPA, we introduce the Tensor ProducT ATTenTion Transformer (T6), a new model architecture for sequence modeling. Through extensive empirical evaluation on language modeling tasks, we demonstrate that T6 surpasses or matches the performance of standard Transformer baselines including Multi-Head Attention (MHA), Multi-Query Attention (MQA), Grouped-Query Attention (GQA), and Multi-Head Latent Attention (MLA) across various metrics, including perplexity and a range of established evaluation benchmarks. Notably, TPA's memory efficiency and computational efficiency at decoding stage enables processing longer sequences under fixed resource constraints, addressing a critical scalability challenge in modern language models. Project Page: https://github.com/tensorgi/TPA.

View full details

Poster

Decomposing Interventional Causality into Synergistic, Redundant, and Unique Components

Abel Jansma

Dec 3, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

We introduce a novel framework for decomposing interventional causal effects into synergistic, redundant, and unique components, building on the intuition of Partial Information Decomposition (PID) and the principle of Möbius inversion. While recent work has explored a similar decomposition of an observational measure, we argue that a proper causal decomposition must be interventional in nature. We develop a mathematical approach that systematically quantifies how causal power is distributed among variables in a system, using a recently derived closed-form expression for the Möbius function of the redundancy lattice. The formalism is then illustrated by decomposing the causal power in logic gates, cellular automata, and chemical reaction networks. Our results reveal how the distribution of causal power can be context- and parameter-dependent. The decomposition provides new insights into complex systems by revealing how causal influences are shared and combined among multiple variables, with potential applications ranging from attribution of responsibility in legal or AI systems, to the analysis of biological networks or climate models.

View full details

Poster

OpenWorldSAM: Extending SAM2 for Universal Image Segmentation with Language Prompts

Shiting (Ginny) Xiao ⋅ Rishabh Kabra ⋅ Yuhang Li ⋅ Donghyun Lee ⋅ Joao Carreira ⋅ Priyadarshini Panda

Dec 3, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

The ability to segment objects based on open-ended language prompts remains a critical challenge, requiring models to ground textual semantics into precise spatial masks while handling diverse and unseen categories. We present OpenWorldSAM, a framework that extends the prompt-driven Segment Anything Model v2 (SAM2) to open-vocabulary scenarios by integrating multi-modal embeddings extracted from a lightweight vision-language model (VLM). Our approach is guided by four key principles: i) Unified prompting: OpenWorldSAM supports a diverse range of prompts, including category-level and sentence-level language descriptions, providing a flexible interface for various segmentation tasks. ii) Efficiency: By freezing the pre-trained components of SAM2 and the VLM, we train only 4.5 million parameters on the COCO-stuff dataset, achieving remarkable resource efficiency. iii) Instance Awareness: We enhance the model's spatial understanding through novel positional tie-breaker embeddings and cross-attention layers, enabling effective segmentation of multiple instances. iv) Generalization: OpenWorldSAM exhibits strong zero-shot capabilities, generalizing well on unseen categories and an open vocabulary of concepts without additional training. Extensive experiments demonstrate that OpenWorldSAM achieves state-of-the-art performance in open-vocabulary semantic, instance, and panoptic segmentation across multiple benchmarks. Code is available at https://github.com/GinnyXiao/OpenWorldSAM.

View full details

Poster

Self-Supervised Learning of Motion Concepts by Optimizing Counterfactuals

Stefan Stojanov ⋅ David Wendt ⋅ Seungwoo Kim ⋅ Rahul Venkatesh ⋅ Kevin Feigelis ⋅ Klemen Kotar ⋅ Khai Loong Aw ⋅ Jiajun Wu ⋅ Daniel Yamins

Dec 3, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Estimating motion primitives from video (e.g., optical flow and occlusion) is a critically important computer vision problem with many downstream applications, including controllable video generation and robotics. Current solutions are primarily supervised on synthetic data or require tuning of situation-specific heuristics, which inherently limits these models' capabilities in real-world contexts. A natural solution to transcend these limitations would be to deploy large-scale, self-supervised video models, which can be trained scalably on unrestricted real-world video datasets. However, despite recent progress, motion-primitive extraction from large pretrained video models remains relatively underexplored. In this work, we describe Opt-CWM, a self-supervised flow and occlusion estimation technique from a pretrained video prediction model. Opt-CWM uses ``counterfactual probes'' to extract motion information from a base video model in a zero-shot fashion. The key problem we solve is optimizing the quality of these probes, using a combination of an efficient parameterization of the space counterfactual probes, together with a novel generic sparse-prediction principle for learning the probe-generation parameters in a self-supervised fashion. Opt-CWM achieves state-of-the-art performance for motion estimation on real-world videos while requiring no labeled data.

View full details

Poster

PatientSim: A Persona-Driven Simulator for Realistic Doctor-Patient Interactions

Daeun Kyung ⋅ Hyunseung Chung ⋅ Seongsu Bae ⋅ Jiho Kim ⋅ Jae Ho Sohn ⋅ Taerim Kim ⋅ Soo Kyung Kim ⋅ Edward Choi

Dec 3, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Doctor-patient consultations require multi-turn, context-aware communication tailored to diverse patient personas. Training or evaluating doctor LLMs in such settings requires realistic patient interaction systems. However, existing simulators often fail to reflect the full range of personas seen in clinical practice. To address this, we introduce PatientSim, a patient simulator that generates realistic and diverse patient personas for clinical scenarios, grounded in medical expertise. PatientSim operates using: 1) clinical profiles, including symptoms and medical history, derived from real-world data in the MIMIC-ED and MIMIC-IV datasets, and 2) personas defined by four axes: personality, language proficiency, medical history recall level, and cognitive confusion level, resulting in 37 unique combinations.We evaluate eight LLMs for factual accuracy and persona consistency. The top-performing open-source model, Llama 3.3 70B, is validated by four clinicians to confirm the robustness of our framework. As an open-source, customizable platform, PatientSim provides a reproducible and scalable solution that can be customized for specific training needs. Offering a privacy-compliant environment, it serves as a robust testbed for evaluating medical dialogue systems across diverse patient presentations and shows promise as an educational tool for healthcare. The code is available at https://github.com/dek924/PatientSim.

View full details

Poster

OCTDiff: Bridged Diffusion Model for Portable OCT Super-Resolution and Enhancement

Ye Tian ⋅ Angela McCarthy ⋅ Gabriel Gomide ⋅ Nancy Liddle ⋅ Jedrzej Golebka ⋅ Royce Chen ⋅ Jeff Liebmann ⋅ Kaveri Thakoor

Dec 3, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Medical imaging super-resolution is critical for improving diagnostic utility and reducing costs, particularly for low-cost modalities such as portable Optical Coherence Tomography (OCT). We propose OCTDiff, a bridged diffusion model designed to enhance image resolution and quality from portable OCT devices. Our image-to-image diffusion framework addresses key challenges in the conditional generation process of denoising diffusion probabilistic models (DDPMs). We introduce Adaptive Noise Aggregation (ANA), a novel module to improve denoising dynamics within the reverse diffusion process. Additionally, we integrate Multi-Scale Cross-Attention (MSCA) into the U-Net backbone to capture local dependencies across spatial resolutions. To address overfitting on small clinical datasets and to preserve fine structural details essential for retinal diagnostics, we design a customized loss function guided by clinical quality scores. OCTDiff outperforms convolutional baselines and standard DDPMs, achieving state-of-the-art performance on clinical portable OCT datasets. Our model and its downstream applications have the potential to generalize to other medical imaging modalities and revolutionize the current workflow of ophthalmic diagnostics. The code is available at https://github.com/AI4VSLab/OCTDiff.

View full details

Poster

Learning with Calibration: Exploring Test-Time Computing of Spatio-Temporal Forecasting

Wei Chen ⋅ Yuxuan Liang

Dec 3, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Spatio-temporal forecasting is crucial in many domains, such as transportation, meteorology, and energy. However, real-world scenarios frequently present challenges such as signal anomalies, noise, and distributional shifts. Existing solutions primarily enhance robustness by modifying network architectures or training procedures. Nevertheless, these approaches are computationally intensive and resource-demanding, especially for large-scale applications. In this paper, we explore a _novel **t**est-**t**ime **c**omputing paradigm, namely learning with calibration, **ST-TTC**_, for **s**patio-**t**emporal forecasting. Through learning with calibration, we aim to capture periodic structural biases arising from non-stationarity during the testing phase and perform real-time bias correction on predictions to improve accuracy. Specifically, we first introduce a spectral-domain calibrator with phase-amplitude modulation to mitigate periodic shift and then propose a flash updating mechanism with a streaming memory queue for efficient test-time computation. _**ST-TTC**_ effectively bypasses complex training-stage techniques, offering an efficient and generalizable paradigm. Extensive experiments on real-world datasets demonstrate the effectiveness, universality, flexibility and efficiency of our proposed method.

View full details

Poster

AdaReasoner: Adaptive Reasoning Enables More Flexible Thinking

Xiangqi Wang ⋅ Yue Huang ⋅ Yanbo Wang ⋅ Xiaonan Luo ⋅ Kehan Guo ⋅ Yujun Zhou ⋅ Xiangliang Zhang

Dec 3, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

LLMs often need effective configurations, like temperature and reasoning steps, to handle tasks requiring sophisticated reasoning and problem-solving, ranging from joke generation to mathematical reasoning. Existing prompting approaches usually adopt general-purpose, fixed configurations that work “well enough” across tasks but seldom achieve task-specific optimality. To address this gap, we introduce AdaReasoner, an LLM-agnostic plugin designed for any LLM to automate adaptive reasoning configurations for tasks requiring different types of thinking. AdaReasoner is trained using a reinforcement learning (RL) framework, combining a factorized action space with a targeted exploration strategy, along with a pretrained reward model to optimize the policy model for reasoning configurations with only a few-shot guide. AdaReasoner is backed by theoretical guarantees and experiments of fast convergence and a sublinear policy gap. Across six different LLMs and a variety of reasoning tasks, it consistently outperforms standard baselines, preserves out-of-distribution robustness, and yield gains on knowledge-intensive tasks through tailored prompts.

View full details

Poster

Bits Leaked per Query: Information-Theoretic Bounds for Adversarial Attacks on LLMs

Masahiro Kaneko ⋅ Timothy Baldwin

Dec 3, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Adversarial attacks by malicious users that threaten the safety of large language models (LLMs) can be viewed as attempts to infer a target property $T$ that is unknown when an instruction is issued, and becomes knowable only after the model's reply is observed. Examples of target properties $T$ include the binary flag that triggers an LLM's harmful response or rejection, and the degree to which information deleted by unlearning can be restored, both elicited via adversarial instructions. The LLM reveals an \emph{observable signal} $Z$ that potentially leaks hints for attacking through a response containing answer tokens, thinking process tokens, or logits. Yet the scale of information leaked remains anecdotal, leaving auditors without principled guidance and defenders blind to the transparency--risk trade-off. We fill this gap with an information-theoretic framework that computes how much information can be safely disclosed, and enables auditors to gauge how close their methods come to the fundamental limit. Treating the mutual information $I(Z;T)$ between the observation $Z$ and the target property $T$ as the leaked bits per query, we show that achieving error $\varepsilon$ requires at least $\log(1/\varepsilon)/I(Z;T)$ queries, scaling linearly with the inverse leak rate and only logarithmically with the desired accuracy. Thus, even a modest increase in disclosure collapses the attack cost from quadratic to logarithmic in terms of the desired accuracy. Experiments on seven LLMs across system-prompt leakage, jailbreak, and relearning attacks corroborate the theory: exposing answer tokens alone requires about a thousand queries; adding logits cuts this to about a hundred; and revealing the full thinking process trims it to a few dozen. Our results provide the first principled yardstick for balancing transparency and security when deploying LLMs.

View full details

Poster

On Traceability in $\ell_p$ Stochastic Convex Optimization

Sasha Voitovych ⋅ Mahdi Haghifam ⋅ Idan Attias ⋅ Gintare Karolina Dziugaite ⋅ Roi Livni ⋅ Dan Roy

Dec 3, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

In this paper, we investigate the necessity of traceability for accurate learning in stochastic convex optimization (SCO) under $\ell_p$ geometries. Informally, we say a learning algorithm is \emph{$m$-traceable} if, by analyzing its output, it is possible to identify at least $m$ of its training samples. Our main results uncover a fundamental tradeoff between traceability and excess risk in SCO. For every $p\in [1,\infty)$, we establish the existence of an excess risk threshold below which every sample-efficient learner is traceable with the number of samples which is a \emph{constant fraction} of its training sample. For $p\in [1,2]$, this threshold coincides with the best excess risk of differentially private (DP) algorithms, i.e., above this threshold, there exist algorithms that are not traceable, which corresponds to a sharp phase transition. For $p \in (2,\infty)$, this threshold instead gives novel lower bounds for DP learning, partially closing an open problem in this setup. En route to establishing these results, we prove a sparse variant of the fingerprinting lemma, which is of independent interest to the community.

View full details

Poster

TimE: A Multi-level Benchmark for Temporal Reasoning of LLMs in Real-World Scenarios

Shaohang Wei ⋅ Wei Li ⋅ Feifan Song ⋅ Wen Luo ⋅ Tianyi Zhuang ⋅ Haochen Tan ⋅ Zhijiang Guo ⋅ Houfeng Wang

Dec 3, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Temporal reasoning is pivotal for Large Language Models (LLMs) to comprehend the real world. However, existing works neglect the real-world challenges for temporal reasoning: (1) intensive temporal information, (2) fast-changing event dynamics, and (3) complex temporal dependencies in social interactions. To bridge this gap, we propose a multi-level benchmark TimE, designed for temporal reasoning in real-world scenarios. TimE consists of 38,522 QA pairs, covering 3 levels with 11 fine-grained sub-tasks. This benchmark encompasses 3 sub-datasets reflecting different real-world challenges: TimE-Wiki, TimE-News, and TimE-Dial. We conduct extensive experiments on reasoning models and non-reasoning models. And we conducted an in-depth analysis of temporal reasoning performance across diverse real-world scenarios and tasks, and summarized the impact of test-time scaling on temporal reasoning capabilities. Additionally, we release TimE-Lite, a human-annotated subset to foster future research and standardized evaluation in temporal reasoning.

View full details

Poster

Joint‑Embedding vs Reconstruction: Provable Benefits of Latent Space Prediction for Self‑Supervised Learning

Hugues Van Assel ⋅ Mark Ibrahim ⋅ Tommaso Biancalani ⋅ Aviv Regev ⋅ Randall Balestriero

Dec 3, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Reconstruction and joint-embedding have emerged as two leading paradigms in Self‑Supervised Learning (SSL). Reconstruction methods focus on recovering the original sample from a different view in input space. On the other hand, joint-embedding methods align the representations of different views in latent space. Both approaches offer compelling advantages, yet practitioners lack clear guidelines for choosing between them. In this work, we unveil the core mechanisms that distinguish each paradigm. By leveraging closed-form solutions for both approaches, we precisely characterize how the view generation process, e.g. data augmentation, impacts the learned representations. We then demonstrate that, unlike supervised learning, both SSL paradigms require a minimal alignment between augmentations and irrelevant features to achieve asymptotic optimality with increasing sample size. Our findings indicate that in scenarios where these irrelevant features have a large magnitude, joint-embedding methods are preferable because they impose a strictly weaker alignment condition compared to reconstruction-based methods. These results not only clarify the trade-offs between the two paradigms but also substantiate the empirical success of joint-embedding approaches on real-world challenging datasets.

View full details

Poster

Purifying Shampoo: Investigating Shampoo's Heuristics by Decomposing its Preconditioner

Runa Eschenhagen ⋅ Aaron Defazio ⋅ Tsung-Hsien Lee ⋅ Richard Turner ⋅ Hao-Jun Shi

Dec 3, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

The recent success of Shampoo in the AlgoPerf contest has sparked renewed interest in Kronecker-factorization-based optimization algorithms for training neural networks. Despite its success, Shampoo relies heavily on several heuristics such as learning rate grafting and stale preconditioning to achieve performance at-scale. These heuristics increase algorithmic complexity, necessitate further hyperparameter tuning, and lack theoretical justification. This paper investigates these heuristics from the angle of Frobenius norm approximation to full-matrix Adam and decouples the preconditioner's eigenvalues and eigenbasis updates. We show that grafting from Adam mitigates the staleness and mis-scaling of the preconditioner's *eigenvalues* and how correcting the eigenvalues directly eliminates the need for learning rate grafting. To manage the error induced by infrequent *eigenbasis* computations, we propose an adaptive criterion for determining the eigenbasis computation frequency motivated by terminating a warm-started QR algorithm. This criterion decouples the update frequency of different preconditioner matrices and enables us to investigate the impact of approximation error on convergence. These practical techniques offer a principled angle towards removing Shampoo's heuristics and developing improved Kronecker-factorization-based training algorithms.

View full details

Poster

Causality Meets Locality: Provably Generalizable and Scalable Policy Learning for Networked Systems

Hao Liang ⋅ shuqing shi ⋅ Yudi Zhang ⋅ Biwei Huang ⋅ Yali Du

Dec 3, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Large‑scale networked systems, such as traffic, power, and wireless grids, challenge reinforcement‑learning agents with both scale and environment shifts. To address these challenges, we propose \texttt{GSAC} (\textbf{G}eneralizable and \textbf{S}calable \textbf{A}ctor‑\textbf{C}ritic), a framework that couples causal representation learning with meta actor‑critic learning to achieve both scalability and domain generalization. Each agent first learns a sparse local causal mask that provably identifies the minimal neighborhood variables influencing its dynamics, yielding exponentially tight approximately compact representations (ACRs) of state and domain factors. These ACRs bound the error of truncating value functions to $\kappa$-hop neighborhoods, enabling efficient learning on graphs. A meta actor‑critic then trains a shared policy across multiple source domains while conditioning on the compact domain factors; at test time, a few trajectories suffice to estimate the new domain factor and deploy the adapted policy. We establish finite‑sample guarantees on causal recovery, actor-critic convergence, and adaptation gap, and show that \texttt{GSAC} adapts rapidly and significantly outperforms learning-from-scratch and conventional adaptation baselines.

View full details

Poster

Conformal Mixed-Integer Constraint Learning with Feasibility Guarantees

Daniel Ovalle ⋅ Lorenz Biegler ⋅ Ignacio Grossmann ⋅ Carl Laird ⋅ Mateo Dulce Rubio

Dec 3, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

We propose Conformal Mixed-Integer Constraint Learning (C-MICL), a novel framework that provides probabilistic feasibility guarantees for data-driven constraints in optimization problems. While standard Mixed-Integer Constraint Learning methods often violate the true constraints due to model error or data limitations, our C-MICL approach leverages conformal prediction to ensure feasible solutions are ground-truth feasible with probability at least $1{-}\alpha$, under a conditional independence assumption. The proposed framework supports both regression and classification tasks without requiring access to the true constraint function, while avoiding the scalability issues associated with ensemble-based heuristics. Experiments on real-world applications demonstrate that C-MICL consistently achieves target feasibility rates, maintains competitive objective performance, and significantly reduces computational cost compared to existing methods. Our work bridges mathematical optimization and machine learning, offering a principled approach to incorporate uncertainty-aware constraints into decision-making with rigorous statistical guarantees.

View full details

Poster

Unveiling the Power of Multiple Gossip Steps: A Stability-Based Generalization Analysis in Decentralized Training

Qinglun Li ⋅ Yingqi Liu ⋅ Miao Zhang ⋅ Xiaochun Cao ⋅ Quanjun Yin ⋅ Li Shen

Dec 3, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Decentralized training removes the centralized server, making it a communication-efficient approach that can significantly improve training efficiency, but it often suffers from degraded performance compared to centralized training. Multi-Gossip Steps (MGS) serve as a simple yet effective bridge between decentralized and centralized training, significantly reducing experiment performance gaps. However, the theoretical reasons for its effectiveness and whether this gap can be fully eliminated by MGS remain open questions. In this paper, we derive upper bounds on the generalization error and excess error of MGS using stability analysis, systematically answering these two key questions. 1). Optimization Error Reduction: MGS reduces the optimization error bound at an exponential rate, thereby exponentially tightening the generalization error bound and enabling convergence to better solutions. 2). Gap to Centralization: Even as MGS approaches infinity, a non-negligible gap in generalization error remains compared to centralized mini-batch SGD ($\mathcal{O}(T^{\frac{c\beta}{c\beta +1}}/{n m})$ in centralized and $\mathcal{O}(T^{\frac{2c\beta}{2c\beta +2}}/{n m^{\frac{1}{2c\beta +2}}})$ in decentralized). Furthermore, we provide the first unified analysis of how factors like learning rate, data heterogeneity, node count, per-node sample size, and communication topology impact the generalization of MGS under non-convex settings without the bounded gradients assumption, filling a critical theoretical gap in decentralized training. Finally, promising experiments on CIFAR datasets support our theoretical findings.

View full details

Poster

Preconditioned Langevin Dynamics with Score-based Generative Models for Infinite-Dimensional Linear Bayesian Inverse Problems

Lorenzo Baldassari ⋅ Josselin Garnier ⋅ Knut Solna ⋅ Maarten V. de Hoop

Dec 3, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Designing algorithms for solving high-dimensional Bayesian inverse problems directly in infinite‑dimensional function spaces – where such problems are naturally formulated – is crucial to ensure stability and convergence as the discretization of the underlying problem is refined. In this paper, we contribute to this line of work by analyzing a widely used sampler for linear inverse problems: Langevin dynamics driven by score‑based generative models (SGMs) acting as priors, formulated directly in function space. Building on the theoretical framework for SGMs in Hilbert spaces, we give a rigorous definition of this sampler in the infinite-dimensional setting and derive, for the first time, error estimates that explicitly depend on the approximation error of the score. As a consequence, we obtain sufficient conditions for global convergence in Kullback–Leibler divergence on the underlying function space. Preventing numerical instabilities requires preconditioning of the Langevin algorithm and we prove the existence and form of an optimal preconditioner. The preconditioner depends on both the score error and the forward operator and guarantees a uniform convergence rate across all posterior modes. Our analysis applies to both Gaussian and a general class of non‑Gaussian priors. Finally, we present examples that illustrate and validate our theoretical findings.

View full details

Poster

The Non-Linear Representation Dilemma: Is Causal Abstraction Enough for Mechanistic Interpretability?

Denis Sutter ⋅ Julian Minder ⋅ Thomas Hofmann ⋅ Tiago Pimentel

Dec 3, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

The concept of causal abstraction got recently popularised to demystify the opaque decision-making processes of machine learning models; in short, a neural network can be abstracted as a higher-level algorithm if there exists a function which allows us to map between them. Notably, most interpretability papers implement these maps as linear functions, motivated by the linear representation hypothesis: the idea that features are encoded linearly in a model's representations. However, this linearity constraint is not required by the definition of causal abstraction. In this work, we critically examine the concept of causal abstraction by considering arbitrarily powerful alignment maps. In particular, we prove that under reasonable assumptions, any neural network can be mapped to any algorithm, rendering this unrestricted notion of causal abstraction trivial and uninformative. We complement these theoretical findings with empirical evidence, demonstrating that it is possible to perfectly map models to algorithms even when these models are incapable of solving the actual task; e.g., on an experiment using randomly initialised language models, our alignment maps reach 100\% interchange-intervention accuracy on the indirect object identification task. This raises the non-linear representation dilemma: if we lift the linearity constraint imposed to alignment maps in causal abstraction analyses, we are left with no principled way to balance the inherent trade-off between these maps' complexity and accuracy. Together, these results suggest an answer to our title's question: causal abstraction is not enough for mechanistic interpretability, as it becomes vacuous without assumptions about how models encode information. Studying the connection between this information-encoding assumption and causal abstraction should lead to exciting future work.

View full details

Poster

QFFT, Question-Free Fine-Tuning for Adaptive Reasoning

Wanlong Liu ⋅ Junxiao Xu ⋅ Fei Yu ⋅ Yukang Lin ⋅ Ke Ji ⋅ Wenyu Chen ⋅ Lifeng Shang ⋅ Yasheng Wang ⋅ Yan Xu ⋅ Benyou Wang

Dec 3, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Recent advancements in Long Chain-of-Thought (CoT) reasoning models have improved performance on complex tasks, but they suffer from overthinking, which generates redundant reasoning steps, especially for simple questions. This paper revisits the reasoning patterns of Long and Short CoT models, observing that the Short CoT patterns offer concise reasoning efficiently, while the Long CoT patterns excel in challenging scenarios where the Short CoT patterns struggle. To enable models to leverage both patterns, we propose Question-Free Fine-Tuning (QFFT), a fine-tuning approach that removes the input question during training and learns exclusively from Long CoT responses. This approach enables the model to adaptively employ both reasoning patterns: it prioritizes the Short CoT patterns and activates the Long CoT patterns only when necessary. Experiments on various mathematical datasets demonstrate that QFFT reduces average response length by more than 50\%, while achieving performance comparable to Supervised Fine-Tuning (SFT). Additionally, QFFT exhibits superior performance compared to SFT in noisy, out-of-domain, and low-resource scenarios.

View full details

Poster

Absence Bench: Language Models Can’t See What’s Missing

Harvey Yiyun Fu ⋅ Aryan Shrivastava ⋅ Jared Moore ⋅ Peter West ⋅ Chenhao Tan ⋅ Ari Holtzman

Dec 3, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Large language models (LLMs) are increasingly capable of processing long inputs and locating specific information within them, as evidenced by their performance on the Needle in a Haystack (NIAH) test. However, while models excel at recalling surprising information, they still struggle to identify clearly omitted information. We introduce AbsenceBench to assesses LLMs' capacity to detect missing information across three domains: numerical sequences, poetry, and GitHub pull requests. AbsenceBench asks models to identify which pieces of a document were deliberately removed, given access to both the original and edited contexts. Despite the apparent straightforwardness of these tasks, our experiments reveal that even state-of-the-art models like Claude-3.7-Sonnet achieve only 69.6% F1-score with a modest average context length of 5K tokens. Our analysis suggests this poor performance stems from a fundamental limitation: Transformer attention mechanisms cannot easily attend to "gaps" in documents since these absences don't correspond to any specific keys that can be attended to. Overall, our results and analysis provide a case study of the close proximity of tasks where models are already superhuman (NIAH) and tasks where models breakdown unexpectedly (AbsenceBench).

View full details

Poster

ESCA: Contextualizing Embodied Agents via Scene-Graph Generation

Jiani Huang ⋅ Amish Sethi ⋅ Matthew Kuo ⋅ Mayank Keoliya ⋅ Neelay Velingker ⋅ JungHo Jung ⋅ Ser Nam Lim ⋅ Ziyang Li ⋅ Mayur Naik

Dec 3, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Multi-modal large language models (MLLMs) are making rapid progress toward general-purpose embodied agents. However, existing MLLMs do not reliably capture fine-grained links between low-level visual features and high-level textual semantics, leading to weak grounding and inaccurate perception. To overcome this challenge, we propose ESCA, a framework that contextualizes embodied agents by grounding their perception in spatial-temporal scene graphs. At its core is SGCLIP, a novel, open-domain, promptable foundation model for generating scene graphs that is based on CLIP. SGCLIP is trained on 87K+ open-domain videos using a neurosymbolic pipeline that aligns automatically generated captions with scene graphs produced by the model itself, eliminating the need for human-labeled annotations. We demonstrate that SGCLIP excels in both prompt-based inference and task-specific fine-tuning, achieving state-of-the-art results on scene graph generation and action localization benchmarks. ESCA with SGCLIP improves perception for embodied agents based on both open-source and commercial MLLMs, achieving state of-the-art performance across two embodied environments. Notably, ESCA significantly reduces agent perception errors and enables open-source models to surpass proprietary baselines. We release the source code for SGCLIP model training at https://github.com/video-fm/LASER and for the embodied agent at https://github.com/video-fm/ESCA.

View full details

Poster

OPTFM: A Scalable Multi-View Graph Transformer for Hierarchical Pre-Training in Combinatorial Optimization

Hao Yuan ⋅ Wenli Ouyang ⋅ Changwen Zhang ⋅ Congrui Li ⋅ Yong Sun

Dec 3, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Foundation Models (FMs) have demonstrated remarkable success in fields like computer vision and natural language processing, yet their application to combinatorial optimization remains underexplored. Optimization problems, often modeled as graphs, pose unique challenges due to their diverse structures, varying distributions, and NP-hard complexity. To address these challenges, we propose OPTFM, the first graph foundation model for general combinatorial optimization. OPTFM introduces a scalable multi-view graph transformer with hybrid self-attention and cross-attention to model large-scale heterogeneous graphs in $O(N)$ time complexity while maintaining semantic consistency throughout the attention computation. A Dual-level pre-training framework integrates node-level graph reconstruction and instance-level contrastive learning, enabling robust and adaptable representations at multiple levels. Experimental results across diverse optimization tasks show that models trained on OPTFM embeddings without fine-tuning consistently outperform task-specific approaches, establishing a new benchmark for solving combinatorial optimization problems.

View full details

Poster

Thoughts Are All Over the Place: On the Underthinking of Long Reasoning Models

Yue Wang ⋅ Qiuzhi Liu ⋅ Jiahao Xu ⋅ Tian Liang ⋅ Xingyu Chen ⋅ Zhiwei He ⋅ Linfeng Song ⋅ Dian Yu ⋅ Juntao Li ⋅ Zhuosheng Zhang ⋅ Rui Wang ⋅ Zhaopeng Tu ⋅ Haitao Mi ⋅ Dong Yu

Dec 3, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Long reasoning models (LRMs) such as OpenAI's o1 and DeepSeek's R1 have demonstrated remarkable abilities in complex reasoning tasks by scaling test-time compute and exhibiting human-like deep thinking. However, we identify a phenomenon we term underthinking, where LRMs frequently switch between different reasoning thoughts without sufficiently exploring promising paths to reach a correct solution. This behavior leads to inadequate depth of reasoning and decreased performance, particularly on challenging mathematical problems. To systematically analyze this issue, we conduct experiments on three challenging test sets and two representative open-source LRMs, revealing that frequent thought switching correlates with incorrect responses. We introduce a novel metric to quantify underthinking by measuring token efficiency in incorrect answers. To address underthinking, we propose a decoding strategy with thought switching penalty (Tip) that discourages premature transitions between thoughts, encouraging deeper exploration of each reasoning path. Experimental results demonstrate that our approach improves accuracy across challenging datasets without requiring model fine-tuning. Our findings contribute to understanding reasoning inefficiencies in LRMs and offer a practical solution to enhance their problem-solving capabilities. Our code is open-source and available at https://github.com/wangyuenlp/underthinking.

View full details

Poster

Fast Training of Large Kernel Models with Delayed Projections

Amirhesam Abedsoltan ⋅ Siyuan Ma ⋅ Parthe Pandit ⋅ Misha Belkin

Dec 3, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Classical kernel machines have historically faced significant challenges in scaling to large datasets and model sizes—a key ingredient that has driven the success of neural networks. In this paper, we present a new methodology for building kernel machines that can scale efficiently with both data size and model size. Our algorithm introduces delayed projections to Preconditioned Stochastic Gradient Descent (PSGD) allowing the training of much larger models than was previously feasible. We validate our algorithm, \EP4, across multiple datasets, demonstrating drastic training speedups without compromising the performance. Our implementation is publicly available at: https://github.com/EigenPro/EigenPro .

View full details

Poster

Practical do-Shapley Explanations with Estimand-Agnostic Causal Inference

Álvaro Parafita ⋅ Tomas Garriga ⋅ Axel Brando ⋅ Francisco Cazorla

Dec 3, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Among explainability techniques, SHAP stands out as one of the most popular, but often overlooks the causal structure of the problem. In response, do-SHAP employs interventional queries, but its reliance on estimands hinders its practical application. To address this problem, we propose the use of estimand-agnostic approaches, which allow for the estimation of any identifiable query from a single model, making do-SHAP feasible on complex graphs. We also develop a novel algorithm to significantly accelerate its computation at a negligible cost, as well as a method to explain inaccessible Data Generating Processes. We demonstrate the estimation and computational performance of our approach, and validate it on two real-world datasets, highlighting its potential in obtaining reliable explanations.

View full details

Poster

Balancing Multimodal Training Through Game-Theoretic Regularization

Konstantinos Kontras ⋅ Thomas Strypsteen ⋅ Christos Chatzichristos ⋅ Paul Liang ⋅ Matthew Blaschko ⋅ Maarten De Vos

Dec 3, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Multimodal learning holds the promise for richer information extraction by capturing dependencies across data sources. Yet, current training methods often underperform due to modality competition, a phenomenon where modalities contend for training resources, leaving some underoptimized. This raises a pivotal question: how can we address training imbalances, ensure adequate optimization across all modalities, and achieve consistent performance improvements as we transition from unimodal to multimodal data? This paper proposes the Multimodal Competition Regularizer (MCR), inspired by a mutual information (MI) decomposition designed to prevent the adverse effects of competition in multimodal training. Our key contributions are: 1) A game-theoretic framework that adaptively balances modality contributions by encouraging each to maximize its informative role in the final prediction. 2) Refining lower and upper bounds for each MI term to enhance the extraction of both task-relevant unique and shared information across modalities. 3) Proposing latent space permutations for conditional MI estimation, significantly improving computational efficiency. MCR outperforms all previously suggested training strategies and simple baselines, demonstrating that training modalities jointly lead to important performance gains on synthetic and large real-world datasets. We release our code and models at https://github.com/kkontras/MCR.

View full details

Poster

SciArena: An Open Evaluation Platform for Non-Verifiable Scientific Literature-Grounded Tasks

Yilun Zhao ⋅ Kaiyan Zhang ⋅ Tiansheng Hu ⋅ Sihong Wu ⋅ Ronan Le Bras ⋅ Yixin Liu ⋅ Robert Tang ⋅ Joseph Chee Chang ⋅ Jesse Dodge ⋅ Jonathan Bragg ⋅ Chen Zhao ⋅ Hanna Hajishirzi ⋅ Doug Downey ⋅ Arman Cohan

Dec 3, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

We present SciArena, an open and collaborative platform for evaluating foundation models on scientific literature-grounded tasks. Unlike traditional benchmarks for scientific literature understanding and synthesis, SciArena engages the research community directly, following the Chatbot Arena evaluation approach of community voting on model comparisons.By leveraging collective intelligence, SciArena offers a community-driven evaluation of model performance on open-ended scientific tasks that demand literature-grounded, long-form responses.The platform currently supports 44 open-source and proprietary foundation models and has collected over 19,000 votes from human researchers across diverse scientific domains. Our analysis of the data collected so far confirms its high quality.We discuss the results and insights based on the model ranking leaderboard.To further promote research in building model-based automated evaluation systems for literature tasks, we release SciArena-Eval, a meta-evaluation benchmark based on our collected preference data. The benchmark measures the accuracy of models in judging answer quality by comparing their pairwise assessments with human votes. Our experiments highlight the benchmark’s challenges and emphasize the need for more reliable automated evaluation methods.

View full details

Poster

UMoE: Unifying Attention and FFN with Shared Experts

Yuanhang Yang ⋅ Chaozheng Wang ⋅ Jing Li

Dec 3, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Sparse Mixture of Experts (MoE) architectures have emerged as a promising approach for scaling Transformer models. While initial works primarily incorporated MoE into feed-forward network (FFN) layers, recent studies have explored extending the MoE paradigm to attention layers to enhance model performance. However, existing attention-based MoE layers require specialized implementations and demonstrate suboptimal performance compared to their FFN-based counterparts. In this paper, we aim to unify MoE designs in attention and FFN layers by introducing a novel reformulation of the attention mechanism, that reveals an underlying FFN-like structure within attention modules. Our proposed architecture, UMoE, achieves superior performance through attention-based MoE layers while enabling efficient parameter sharing between FFN and attention components.

View full details

Poster

Accelerating Optimization via Differentiable Stopping Time

Zhonglin Xie ⋅ Yiman Fong ⋅ Haoran Yuan ⋅ Zaiwen Wen

Dec 3, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

A common approach for accelerating optimization algorithms is to minimize the loss achieved in a fixed time, which enables a differentiable framework with respect to the algorithm's hyperparameters. In contrast, the complementary objective of minimizing the time to reach a target loss is traditionally considered non-differentiable. To address this limitation, we propose a differentiable discrete stopping time and theoretically justify it based on its connection to continuous differential equations. We design an efficient algorithm to compute its sensitivities, thereby enabling a new differentiable formulation for directly accelerating algorithms. We demonstrate its effectiveness in applications such as online hyperparameter tuning and learning to optimize. Our proposed methods show superior performance in comprehensive experiments across various problems, which confirms their effectiveness.

View full details

Poster

VLMs have Tunnel Vision: Evaluating Nonlocal Visual Reasoning in Leading VLMs

Shmuel Berman ⋅ Jia Deng

Dec 3, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Vision Language Models (VLMs) excel at complex visual tasks such as VQA and chart understanding, yet recent work suggests they struggle with simple perceptual tests. We present an evaluation that tests vision-language models’ capacity for non-local visual reasoning— reasoning that requires chaining evidence collected from multiple, possibly distant, regions of an image. We isolate three distinct forms of non‑local vision: comparative perception, which demands holding two images in working memory and comparing them; saccadic search, which requires making discrete, evidence‑driven jumps to locate successive targets; and smooth visual search, which involves searching smoothly along a continuous contour. Flagship models (e.g., GPT-5, Gemini 2.5 Pro, Claude Sonnet 4), even those that perform well on prior primitive‑vision benchmarks, fail these tests and barely exceed random accuracy on two variants of our tasks that are trivial for humans. Our structured evaluation suite allows us to test if VLMs can perform similar visual algorithms to humans. Our findings show that despite gains in raw visual acuity, current models lack core visual reasoning capabilities.

View full details

Poster

Prismatic Synthesis: Gradient-based Data Diversification Boosts Generalization in LLM Reasoning

Jaehun Jung ⋅ Seungju Han ⋅ Ximing Lu ⋅ Skyler Hallinan ⋅ David Acuna ⋅ Shrimai Prabhumoye ⋅ Mostofa Patwary ⋅ Mohammad Shoeybi ⋅ Bryan Catanzaro ⋅ Yejin Choi

Dec 3, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Data diversity is crucial for training a strong language model. Yet metrics of diversity often diverge from this goal, measuring variations in heuristic features—like n-grams or embeddings—that are detached from how the model actually performs on a target task. This motivates us to ask: *Can we redefine data diversity—beyond measuring variations in heuristic features—in a way that better predicts model generalization?* Through large-scale empirical analyses spanning over 300 training runs, carefully controlled for data scale and quality, we show that data diversity can be a strong predictor of generalization in LLM reasoning—as measured by average model performance on unseen out-of-distribution benchmarks. We introduce **G-Vendi**, a metric that quantifies diversity via the entropy of model-induced loss gradients. G-Vendi scales to million-sample datasets and yet consistently outperforms heuristic alternatives, achieving strong correlation ($\text{Spearman's } \rho \approx 0.9$) with out-of-distribution (OOD) performance across both natural language inference (NLI) and math reasoning tasks. Building on this insight, we present **Prismatic Synthesis**, a framework for generating diverse synthetic data by targeting underrepresented regions in gradient space. Experimental results show that Prismatic Synthesis consistently improves model performance as we scale synthetic data—not just on in-distribution test but across unseen, out-of-distribution benchmarks—significantly outperforming state-of-the-art models in both domains. For example, PrismMath-7B, our model distilled from a 32B LLM without human verification, outperforms R1-Distill-Qwen-7B—trained on proprietary data generated by 671B R1—on 6 out of 7 challenging math benchmarks.

View full details

Poster

Flow Density Control: Generative Optimization Beyond Entropy-Regularized Fine-Tuning

Riccardo De Santi ⋅ Marin Vlastelica ⋅ Ya-Ping Hsieh ⋅ Zebang Shen ⋅ Niao He ⋅ Andreas Krause

Dec 3, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Adapting large-scale foundational flow and diffusion generative models to optimize task-specific objectives while preserving prior information is crucial for real-world applications such as molecular design, protein docking, and creative image generation. Existing principled fine-tuning methods aim to maximize the expected reward of generated samples, while retaining knowledge from the pre-trained model via KL-divergence regularization. In this work, we tackle the significantly more general problem of optimizing general utilities beyond average rewards, including risk-averse and novelty-seeking reward maximization, diversity measures for exploration, and experiment design objectives among others. Likewise, we consider more general ways to preserve prior information beyond KL-divergence, such as optimal transport distances and Rényi divergences. To this end, we introduce Flow Density Control (FDC), a simple algorithm that reduces this complex problem to a specific sequence of simpler fine-tuning tasks, each solvable via scalable established methods. We derive convergence guarantees for the proposed scheme under realistic assumptions by leveraging recent understanding of mirror flows. Finally, we validate our method on illustrative settings, text-to-image, and molecular design tasks, showing that it can steer pre-trained generative models to optimize objectives and solve practically relevant tasks beyond the reach of current fine-tuning schemes.

View full details

Poster

PhysX-3D: Physical-Grounded 3D Asset Generation

Ziang Cao ⋅ Zhaoxi Chen ⋅ Liang Pan ⋅ Ziwei Liu

Dec 3, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

3D modeling is moving from virtual to physical. Existing 3D generation primarily emphasizes geometries and textures while neglecting physical-grounded modeling. Consequently, despite the rapid development of 3D generative models, the synthesized 3D assets often overlook rich and important physical properties, hampering their real-world application in physical domains like simulation and embodied AI. As an initial attempt to address this challenge, we propose \textbf{PhysX}, an end-to-end paradigm for physical-grounded 3D asset generation. \textbf{1)} To bridge the critical gap in physics-annotated 3D datasets, we present \textbf{\ourname}\ - the first physics-grounded 3D dataset systematically annotated across five foundational dimensions: \textbf{\textcolor{color2}{absolute scale}}, \textbf{\textcolor{color3}{material}}, \textbf{\textcolor{color1}{affordance}}, \textbf{\textcolor{color4}{kinematics}}, and \textbf{\textcolor{color5}{function description}}. In particular, we devise a scalable human-in-the-loop annotation pipeline based on vision-language models, which enables efficient creation of physics-first assets from raw 3D assets. \textbf{2)} Furthermore, we propose \textbf{PhysXGen}, a feed-forward framework for physics-grounded 3D asset generation, injecting physical knowledge into the pre-trained 3D structural space. Specifically, PhysXGen employs a dual-branch architecture to explicitly model the latent correlations between 3D structures and physical properties, thereby producing 3D assets with plausible physical predictions while preserving the native geometry quality. Extensive experiments validate the superior performance and promising generalization capability of our framework. All the code, data, and models will be released to facilitate future research in generative physical AI.

View full details

Poster

Self-Perturbed Anomaly-Aware Graph Dynamics for Multivariate Time-Series Anomaly Detection

Jinyu Cai ⋅ Yuan Xie ⋅ Glynnis Lim ⋅ Yifang Yin ⋅ Roger Zimmermann ⋅ See-Kiong Ng

Dec 3, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Detecting anomalies in multivariate time-series data is an essential task across various domains, yet there are unresolved challenges such as (1) severe class imbalance between normal and anomalous data due to rare anomaly availability in the real world; (2) limited adaptability of the static graph-based methods to dynamically changing inter-variable correlations; and (3) neglect of subtle anomalies due to overfitting to normal patterns in reconstruction-based methods. To tackle these issues, we propose Self-Perturbed Anomaly-Aware Graph Dynamics (SPAGD), a framework for time-series anomaly detection. SPAGD employs a self-perturbation module that generates self-perturbed time series from the reconstruction process of normal ones, which provide auxiliary signals to alleviate class imbalance during training. Concurrently, an anomaly-aware graph construction module is proposed to dynamically adjust the graph structure by leveraging the reconstruction residuals of self-perturbed time series, thereby emphasizing the inter-variable disruptions induced by anomalous candidates. A unified spatio-temporal anomaly detection module then integrates both spatial and temporal convolutions to train a classifier that distinguishes normal time series from the auxiliary self-perturbed samples. Extensive experiments across multiple benchmark datasets demonstrate the effectiveness of SPAGD compared to state-of-the-art baselines.

View full details

Poster

Robust Neural Rendering in the Wild with Asymmetric Dual 3D Gaussian Splatting

CHENGQI LI ⋅ Zhihao Shi ⋅ Yangdi Lu ⋅ Wenbo He ⋅ Xiangyu Xu

Dec 3, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

3D reconstruction from in-the-wild images remains a challenging task due to inconsistent lighting conditions and transient distractors. Existing methods typically rely on heuristic strategies to handle the low-quality training data, which often struggle to produce stable and consistent reconstructions, frequently resulting in visual artifacts. In this work, we propose Asymmetric Dual 3DGS, a novel framework that leverages the stochastic nature of these artifacts: they tend to vary across different training runs due to minor randomness. Specifically, our method trains two 3D Gaussian Splatting (3DGS) models in parallel, enforcing a consistency constraint that encourages convergence on reliable scene geometry while suppressing inconsistent artifacts. To prevent the two models from collapsing into similar failure modes due to confirmation bias, we introduce a divergent masking strategy that applies two complementary masks: a multi-cue adaptive mask and a self-supervised soft mask, which leads to an asymmetric training process of the two models, reducing shared error modes. In addition, to improve the efficiency of model training, we introduce a lightweight variant called Dynamic EMA Proxy, which replaces one of the two models with a dynamically updated Exponential Moving Average (EMA) proxy, and employs an alternating masking strategy to preserve divergence. Extensive experiments on challenging real-world datasets demonstrate that our method consistently outperforms existing approaches while achieving high efficiency. Codes and trained models will be released.

View full details

Poster

Transferable Black-Box One-Shot Forging of Watermarks via Image Preference Models

Tomáš Souček ⋅ Sylvestre-Alvise Rebuffi ⋅ Pierre Fernandez ⋅ Nikola Jovanović ⋅ Hady Elsahar ⋅ Valeriu Lacatusu ⋅ Tuan Tran ⋅ Alexandre Mourachko

Dec 3, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Recent years have seen a surge in interest in digital content watermarking techniques, driven by the proliferation of generative models and increased legal pressure. With an ever-growing percentage of AI-generated content available online, watermarking plays an increasingly important role in ensuring content authenticity and attribution at scale. There have been many works assessing the robustness of watermarking to removal attacks, yet, watermark forging, the scenario when a watermark is stolen from genuine content and applied to malicious content, remains underexplored. In this work, we investigate watermark forging in the context of widely used post-hoc image watermarking. Our contributions are as follows. First, we introduce a preference model to assess whether an image is watermarked. The model is trained using a ranking loss on purely procedurally generated images without any need for real watermarks. Second, we demonstrate the model's capability to remove and forge watermarks by optimizing the input image through backpropagation. This technique requires only a single watermarked image and works without knowledge of the watermarking model, making our attack much simpler and more practical than attacks introduced in related work. Third, we evaluate our proposed method on a variety of post-hoc image watermarking models, demonstrating that our approach can effectively forge watermarks, questioning the security of current watermarking approaches. Our code and further resources are publicly available.

View full details

Poster

To Distill or Decide? Understanding the Algorithmic Trade-off in Partially Observable RL

Yuda Song ⋅ Dhruv Rohatgi ⋅ Aarti Singh ⋅ J. Bagnell

Dec 3, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Partial observability is a notorious challenge in reinforcement learning (RL), due to the need to learn complex, history-dependent policies. Recent empirical successes have used *privileged expert distillation* -- which leverages availability of latent state information during training (e.g., from a simulator) to learn and imitate the optimal latent, Markovian policy -- to disentangle the task of ''learning to see'' from ''learning to act''. While expert distillation is more computationally efficient than RL without latent state information, it also has well-documented failure modes. In this paper -- through a simple but instructive theoretical model called the *perturbed Block MDP*, and controlled experiments on challenging simulated locomotion tasks -- we investigate the algorithmic trade-off between privileged expert distillation and standard RL without privileged information. Our main findings are: **(1)** The trade-off empirically hinges on the *stochasticity* of the latent dynamics, as theoretically predicted by contrasting *approximate decodability* with *belief contraction* in the perturbed Block MDP; and **(2)** The optimal latent policy is not always the best latent policy to distill. Our results suggest new guidelines for effectively exploiting privileged information, potentially advancing the efficiency of policy learning across many practical partially observable domains.

View full details

Poster

Towards Comprehensive Scene Understanding: Integrating First and Third-Person Views for LVLMs

Insu Lee ⋅ Wooje Park ⋅ Jaeyun Jang ⋅ Minyoung Noh ⋅ Kyuhong Shim ⋅ Byonghyo Shim

Dec 3, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Large vision-language models (LVLMs) are increasingly deployed in interactive applications such as virtual and augmented reality, where a first-person (egocentric) view captured by head-mounted cameras serves as key input. While this view offers fine-grained cues about user attention and hand-object interactions, its narrow field of view and lack of global context often lead to failures on spatially or contextually demanding queries. To address this, we introduce a framework that augments egocentric inputs with third-person (exocentric) views, providing complementary information such as global scene layout and object visibility to LVLMs. We present E3VQA, the first benchmark for multi-view question answering with 4K high-quality question-answer pairs grounded in synchronized ego-exo image pairs. Additionally, we propose M3CoT, a training-free prompting technique that constructs a unified scene representation by integrating scene graphs from three complementary perspectives. M3CoT enables LVLMs to reason more effectively across views, yielding consistent performance gains (4.84\% for GPT-4o and 5.94\% for Gemini 2.0 Flash) over a recent CoT baseline. Our extensive evaluation reveals key strengths and limitations of LVLMs in multi-view reasoning and highlights the value of leveraging both egocentric and exocentric inputs. The dataset and source code are available at [https://github.com/Leeinsu1/Towards-Comprehensive-Scene-Understanding](https://github.com/Leeinsu1/Towards-Comprehensive-Scene-Understanding).

View full details

Poster

CXReasonBench: A Benchmark for Evaluating Structured Diagnostic Reasoning in Chest X-rays

Hyungyung Lee ⋅ Geon Choi ⋅ Jung-Oh Lee ⋅ Hangyul Yoon ⋅ Hyuk Hong ⋅ Edward Choi

Dec 3, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Recent progress in Large Vision-Language Models (LVLMs) has enabled promising applications in medical tasks, such as report generation and visual question answering. However, existing benchmarks focus mainly on the final diagnostic answer, offering limited insight into whether models engage in clinically meaningful reasoning. To address this, we present CheXStruct and CXReasonBench, a structured pipeline and benchmark built on the publicly available MIMIC-CXR-JPG dataset. CheXStruct automatically derives a sequence of intermediate reasoning steps directly from chest X-rays, such as segmenting anatomical regions, deriving anatomical landmarks and diagnostic measurements, computing diagnostic indices, and applying clinical thresholds. CXReasonBench leverages this pipeline to evaluate whether models can perform clinically valid reasoning steps and to what extent they can learn from structured guidance, enabling fine-grained and transparent assessment of diagnostic reasoning.The benchmark comprises 18,988 QA pairs across 12 diagnostic tasks and 1,200 cases, each paired with up to 4 visual inputs, and supports multi-path, multi-stage evaluation including visual grounding via anatomical region selection and diagnostic measurements.Even the strongest of 12 evaluated LVLMs struggle with structured reasoning and generalization, often failing to link abstract knowledge with anatomically grounded visual interpretation. The code is available at https://github.com/ttumyche/CXReasonBench

View full details

Poster

Neural Atlas Graphs for Dynamic Scene Decomposition and Editing

Jan Philipp Schneider ⋅ Pratik S. Bisht ⋅ Ilya Chugunov ⋅ Andreas Kolb ⋅ Michael Moeller ⋅ Felix Heide

Dec 3, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Learning editable high-resolution scene representations for dynamic scenes is an open problem with applications across the domains from autonomous driving to creative editing - the most successful approaches today make a trade-off between editability and supporting scene complexity: neural atlases represent dynamic scenes as two deforming image layers, foreground and background, which are editable in 2D, but break down when multiple objects occlude and interact. In contrast, scene graph models make use of annotated data such as masks and bounding boxes from autonomous-driving datasets to capture complex 3D spatial relationships, but their implicit volumetric node representations are challenging to edit view-consistently. We propose Neural Atlas Graphs (NAGs), a hybrid high-resolution scene representation, where every graph node is a view-dependent neural atlas, facilitating both 2D appearance editing and 3D ordering and positioning of scene elements. Fit at test-time, NAGs achieve state-of-the-art quantitative results on the Waymo Open Dataset - by 5 dB PSNR increase compared to existing methods - and make environmental editing possible in high resolution and visual quality - creating counterfactual driving scenarios with new backgrounds and edited vehicle appearance. We find that the method also generalizes beyond driving scenes and compares favorably - by more than 7 dB in PSNR - to recent matting and video editing baselines on the DAVIS video dataset with a diverse set of human and animal-centric scenes. Project Page: https://princeton-computational-imaging.github.io/nag/

View full details

Poster

Integration Matters for Learning PDEs with Backward SDEs

Sungje Park ⋅ Stephen Tu

Dec 3, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Backward stochastic differential equation (BSDE)-based deep learning methods provide an alternative to Physics-Informed Neural Networks (PINNs) for solving high-dimensional partial differential equations (PDEs), offering potential algorithmic advantages in settings such as stochastic optimal control, where the PDEs of interest are tied to an underlying dynamical system. However, standard BSDE-based solvers have empirically been shown to underperform relative to PINNs in the literature. In this paper, we identify the root cause of this performance gap as a discretization bias introduced by the standard Euler-Maruyama (EM) integration scheme applied to one-step self-consistency BSDE losses, which shifts the optimization landscape off target. We find that this bias cannot be satisfactorily addressed through finer step-sizes or multi-step self-consistency losses. To properly handle this issue, we propose a Stratonovich-based BSDE formulation, which we implement with stochastic Heun integration. We show that our proposed approach completely eliminates the bias issues faced by EM integration. Furthermore, our empirical results show that our Heun-based BSDE method consistently outperforms EM-based variants and achieves competitive results with PINNs across multiple high-dimensional benchmarks. Our findings highlight the critical role of integration schemes in BSDE-based PDE solvers, an algorithmic detail that has received little attention thus far in the literature.

View full details

Poster

Reverse Engineering Human Preferences with Reinforcement Learning

Lisa Alazraki ⋅ Yi-Chern Tan ⋅ Jon Ander Campos ⋅ Maximilian Mozes ⋅ Marek Rei ⋅ Max Bartolo

Dec 3, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

The capabilities of Large Language Models (LLMs) are routinely evaluated by other LLMs trained to predict human preferences. This framework—known as *LLM-as-a-judge*—is highly scalable and relatively low cost. However, it is also vulnerable to malicious exploitation, as LLM responses can be tuned to overfit the preferences of the judge. Previous work shows that the answers generated by a candidate-LLM can be edited *post hoc* to maximise the score assigned to them by a judge-LLM. In this study, we adopt a different approach and use the signal provided by judge-LLMs as a reward to adversarially tune models that generate text preambles designed to boost downstream performance. We find that frozen LLMs pipelined with these models attain higher LLM-evaluation scores than existing frameworks. Crucially, unlike other frameworks which intervene directly on the model's response, our method is virtually undetectable. We also demonstrate that the effectiveness of the tuned preamble generator transfers when the candidate-LLM and the judge-LLM are replaced with models that are not used during training. These findings raise important questions about the design of more reliable LLM-as-a-judge evaluation settings. They also demonstrate that human preferences can be reverse engineered effectively, by pipelining LLMs to optimise upstream preambles via reinforcement learning—an approach that could find future applications in diverse tasks and domains beyond adversarial attacks.

View full details

Poster

Incremental Sequence Classification with Temporal Consistency

Lucas Maystre ⋅ Gabriel Barello ⋅ Tudor Berariu ⋅ Cambray ⋅ Rares Dolga ⋅ Alvaro Ortega Gonzalez ⋅ Andrei Nica ⋅ David Barber

Dec 3, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

We address the problem of incremental sequence classification, where predictions are updated as new elements in the sequence are revealed. Drawing on temporal-difference learning from reinforcement learning, we identify a temporal-consistency condition that successive predictions should satisfy. We leverage this condition to develop a novel loss function for training incremental sequence classifiers. Through a concrete example, we demonstrate that optimizing this loss can offer substantial gains in data efficiency. We apply our method to text classification tasks and show that it improves predictive accuracy over competing approaches on several benchmark datasets. We further evaluate our approach on the task of verifying large language model generations for correctness in grade-school math problems. Our results show that models trained with our method are better able to distinguish promising generations from unpromising ones after observing only a few tokens.

View full details

Poster

Head Pursuit: Probing Attention Specialization in Multimodal Transformers

Lorenzo Basile ⋅ Valentino Maiorca ⋅ Diego Doimo ⋅ Francesco Locatello ⋅ Alberto Cazzaniga

Dec 3, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Language and vision-language models have shown impressive performance across a wide range of tasks, but their internal mechanisms remain only partly understood. In this work, we study how individual attention heads in text-generative models specialize in specific semantic or visual attributes. Building on an established interpretability method, we reinterpret the practice of probing intermediate activations with the final decoding layer through the lens of signal processing. This lets us analyze multiple samples in a principled way and rank attention heads based on their relevance to target concepts. Our results show consistent patterns of specialization at the head level across both unimodal and multimodal transformers. Remarkably, we find that editing as few as 1% of the heads, selected using our method, can reliably suppress or enhance targeted concepts in the model output. We validate our approach on language tasks such as question answering and toxicity mitigation, as well as vision-language tasks including image classification and captioning. Our findings highlight an interpretable and controllable structure within attention layers, offering simple tools for understanding and editing large-scale generative models.

View full details

Poster

Provable Gradient Editing of Deep Neural Networks

Zhe Tao ⋅ Aditya V Thakur

Dec 3, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

In explainable AI, DNN gradients are used to interpret the prediction; in safety-critical control systems, gradients could encode safety constraints; in scientific-computing applications, gradients could encode physical invariants. While recent work on provable editing of DNNs has focused on input-output constraints, the problem of enforcing hard constraints on DNN gradients remains unaddressed. We present ProGrad, the first efficient approach for editing the parameters of a DNN to provably enforce hard constraints on the DNN gradients. Given a DNN $\mathcal{N}$ with parameters $\theta$, and a set $\mathcal{S}$ of pairs $(\mathrm{x}, \mathrm{Q})$ of input $\mathrm{x}$ and corresponding linear gradient constraints $\mathrm{Q}$, ProGrad finds new parameters $\theta'$ such that $\bigwedge_{(\mathrm{x}, \mathrm{Q}) \in \mathcal{S}} \frac{\partial}{\partial \mathrm{x}}\mathcal{N}(\mathrm{x}; \theta') \in \mathrm{Q}$ while minimizing the changes $\lVert\theta' - \theta\rVert$. The key contribution is a novel *conditional variable gradient* of DNNs, which relaxes the NP-hard provable gradient editing problem to a linear program (LP), enabling ProGrad to use an LP solver to efficiently and effectively enforce the gradient constraints. We experimentally evaluated ProGrad via enforcing (i) hard Grad-CAM constraints on ImageNet ResNet DNNs; (ii) hard Integrated Gradients constraints on Llama 3 and Qwen 3 LLMs; (iii) hard gradient constraints in training a DNN to approximate a target function as a proxy for safety constraints in control systems and physical invariants in scientific applications. The results highlight the unique capability of ProGrad in enforcing hard constraints on DNN gradients.

View full details

Poster

The Power of Iterative Filtering for Supervised Learning with (Heavy) Contamination

Adam Klivans ⋅ Konstantinos Stavropoulos ⋅ Kevin Tian ⋅ Arsen Vasilyan

Dec 3, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Inspired by recent work on learning with distribution shift, we give a general outlier removal algorithm called *iterative polynomial filtering* and show a number of striking applications for supervised learning with contamination: (1) We show that any function class that can be approximated by low-degree polynomials with respect to a hypercontractive distribution can be efficiently learned under bounded contamination (also known as *nasty noise*). This is a surprising resolution to a longstanding gap between the complexity of agnostic learning and learning with contamination, as it was widely believed that low-degree approximators only implied tolerance to label noise. (2) For any function class that admits the (stronger) notion of sandwiching approximators, we obtain near-optimal learning guarantees even with respect to heavy additive contamination, where far more than $1/2$ of the training set may be added adversarially. Prior related work held only for regression and in a list-decodable setting. (3) We obtain the first efficient algorithms for tolerant testable learning of functions of halfspaces with respect to any fixed log-concave distribution. Even the non-tolerant case for a single halfspace in this setting had remained open. These results significantly advance our understanding of efficient supervised learning under contamination, a setting that has been much less studied than its unsupervised counterpart.

View full details

Poster

Fine-grained List-wise Alignment for Generative Medication Recommendation

Chenxiao Fan ⋅ Chongming Gao ⋅ Wentao Shi ⋅ Yaxin Gong ⋅ Zhao Zihao ⋅ Fuli Feng

Dec 3, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Accurate and safe medication recommendations are critical for effective clinical decision-making, especially in multimorbidity cases. However, existing systems rely on point-wise prediction paradigms that overlook synergistic drug effects and potential adverse drug-drug interactions (DDIs). We propose FLAME, a fine-grained list-wise alignment framework for large language models (LLMs), enabling drug-by-drug generation of drug lists. FLAME formulates recommendation as a sequential decision process, where each step adds or removes a single drug. To provide fine-grained learning signals, we devise step-wise Group Relative Policy Optimization (GRPO) with potential-based reward shaping, which explicitly models DDIs and optimizes the contribution of each drug to the overall prescription. Furthermore, FLAME enhances patient modeling by integrating structured clinical knowledge and collaborative information into the representation space of LLMs. Experiments on benchmark datasets demonstrate that FLAME achieves state-of-the-art performance, delivering superior accuracy, controllable safety–accuracy trade-offs, and strong generalization across diverse clinical scenarios. Our code is available at https://github.com/cxfann/Flame.

View full details

Poster

SWE-smith: Scaling Data for Software Engineering Agents

John Yang ⋅ Kilian Lieret ⋅ Carlos Jimenez ⋅ Alexander Wettig ⋅ Kabir Khandpur ⋅ Yanzhe Zhang ⋅ Binyuan Hui ⋅ Ofir Press ⋅ Ludwig Schmidt ⋅ Diyi Yang

Dec 3, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Despite recent progress in Language Models (LMs) for software engineering, collecting training data remains a significant pain point.Existing datasets are small, with at most 1,000s of training instances from 11 or fewer GitHub repositories.The procedures to curate such datasets are often complex, necessitating hundreds of hours of human labor; companion execution environments also take up several terabytes of storage, severely limiting their scalability and usability.To address this pain point, we introduce SWE-smith, a novel pipeline for generating software engineering training data at scale.Given any Python codebase, SWE-smith constructs a corresponding execution environment, then automatically synthesizes 100s to 1,000s of task instances that break existing test(s) in the codebase.Using SWE-smith, we create a dataset of 50k instances sourced from 128 GitHub repositories, an order of magnitude larger than all previous works.We train SWE-agent-LM-32B, achieving 40.2% Pass@1 resolve rate on the SWE-bench Verified benchmark, state of the art among open source models.We open source SWE-smith (collection procedure, task instances, trajectories, models) to lower the barrier of entry for research in LM systems for automated software engineering.All assets available at \url{https://swesmith.com}.

View full details

Poster

Asymmetric Duos: Sidekicks Improve Uncertainty

Tim G. Zhou ⋅ Evan Shelhamer ⋅ Geoff Pleiss

Dec 3, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

The go-to strategy to apply deep networks in settings where uncertainty informs decisions—ensembling multiple training runs with random initializations—is ill-suited for the extremely large-scale models and practical fine-tuning workflows of today. We introduce a new cost-effective strategy for improving the uncertainty quantification and downstream decisions of a large model (e.g. a fine-tuned ViT-B): coupling it with a less accurate but much smaller "sidekick" (e.g. a fine-tuned ResNet-34) with a fraction of the computational cost. We propose aggregating the predictions of this *Asymmetric Duo* by simple learned weighted averaging. Surprisingly, despite their inherent asymmetry, the sidekick model almost never harms the performance of the larger model. In fact, across five image classification benchmarks, and a variety of model architectures and training schemes (including soups), Asymmetric Duos significantly improve accuracy, uncertainty quantification, and selective classification metrics with only ${\sim}10-20$% more computation. Code is available at: https://github.com/timgzhou/asymmetric-duos

View full details

Poster

Hogwild! Inference: Parallel LLM Generation via Concurrent Attention

Gleb Rodionov ⋅ Roman Garipov ⋅ Alina Shutova ⋅ George Yakushev ⋅ Erik Schultheis ⋅ Vage Egiazarian ⋅ Anton Sinitsin ⋅ Denis Kuznedelev ⋅ Dan Alistarh

Dec 3, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Large Language Models (LLMs) have demonstrated the ability to tackle increasingly complex tasks through advanced reasoning, long-form content generation, and tool use. Solving these tasks often involves long inference-time computations. In human problem solving, a common strategy to expedite work is collaboration: by dividing the problem into sub-tasks, exploring different strategies concurrently, etc. Recent research has shown that LLMs can also operate in parallel by implementing explicit cooperation frameworks, such as voting mechanisms or the explicit creation of independent sub-tasks that can be executed in parallel. However, each of these frameworks may not be suitable for all types of tasks, which can hinder their applicability. In this work, we propose a different design approach: we run LLM "workers" in parallel , allowing them to synchronize via a concurrently-updated attention cache and prompt these workers to decide how best to collaborate. Our approach allows the instances to come up with their own collaboration strategy for the problem at hand, all the while "seeing" each other's partial progress in the concurrent cache. We implement this approach via Hogwild! Inference: a parallel LLM inference engine where multiple instances of the same LLM run in parallel with the same attention cache, with "instant" access to each other's generated tokens. Hogwild! inference takes advantage of Rotary Position Embeddings (RoPE) to avoid recomputation while improving parallel hardware utilization. We find that modern reasoning-capable LLMs can perform inference with shared Key-Value cache out of the box, without additional fine-tuning.

View full details

Poster

RoboScape: Physics-informed Embodied World Model

Yu Shang ⋅ Xin Zhang ⋅ Yinzhou Tang ⋅ Lei Jin ⋅ Chen Gao ⋅ Wei Wu ⋅ Yong Li

Dec 3, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

World models have become indispensable tools for embodied intelligence, serving as powerful simulators capable of generating realistic robotic videos while addressing critical data scarcity challenges. However, current embodied world models exhibit limited physical awareness, particularly in modeling 3D geometry and motion dynamics, resulting in unrealistic video generation for contact-rich robotic scenarios. In this paper, we present RoboScape, a unified physics-informed world model that jointly learns RGB video generation and physics knowledge within an integrated framework. We introduce two key physics-informed joint training tasks: temporal depth prediction that enhances 3D geometric consistency in video rendering, and keypoint dynamics learning that implicitly encodes physical properties (e.g., object shape and material characteristics) while improving complex motion modeling. Extensive experiments demonstrate that RoboScape generates videos with superior visual fidelity and physical plausibility across diverse robotic scenarios. We further validate its practical utility through downstream applications including robotic policy training with generated data and policy evaluation. Our work provides new insights for building efficient physics-informed world models to advance embodied intelligence research. Our code and demos are available at: https://github.com/tsinghua-fib-lab/RoboScape.

View full details

Poster

Conservative classifiers do consistently well with improving agents: characterizing statistical and online learning

Dravyansh Sharma ⋅ Alec Sun

Dec 3, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Machine learning is now ubiquitous in societal decision-making, for example in evaluating job candidates or loan applications, and it is increasingly important to take into account how classified agents will react to the learning algorithms. The majority of recent literature on strategic classification has focused on reducing and countering deceptive behaviors by the classified agents, but recent work of Attias et al. identifies surprising properties of learnability when the agents genuinely improve in order to attain the desirable classification, such as smaller generalization error than standard PAC-learning. In this paper we characterize so-called learnability with improvements across multiple new axes. We introduce an asymmetric variant of minimally consistent concept classes and use it to provide an exact characterization of proper learning with improvements in the realizable setting. While prior work studies learnability only under general, arbitrary agent improvement regions, we give positive results for more natural Euclidean ball improvement sets. In particular, we characterize improper learning under a generative assumption on the data distribution. We further show how to learn in more challenging settings, achieving lower generalization error under well-studied bounded noise models and obtaining mistake bounds in realizable and agnostic online learning. We resolve open questions posed by Attias et al. for both proper and improper learning.

View full details

Poster

Extracting task-relevant preserved dynamics from contrastive aligned neural recordings

Yiqi Jiang ⋅ Kaiwen Sheng ⋅ Yujia Gao ⋅ Estefany Kelly Buchanan ⋅ Yu Shikano ⋅ Seung Je Woo ⋅ Yixiu Zhao ⋅ Tony Hyun Kim ⋅ Fatih Dinc ⋅ Scott Linderman ⋅ Mark Schnitzer

Dec 3, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Recent work indicates that low-dimensional dynamics of neural and behavioral data are often preserved across days and subjects. However, extracting these preserved dynamics remains challenging: high-dimensional neural population activity and the recorded neuron populations vary across recording sessions. While existing modeling tools can improve alignment between neural and behavioral data, they often operate on a per-subject basis or discretize behavior into categories, disrupting its natural continuity and failing to capture the underlying dynamics. We introduce $\underline{\text{C}}$ontrastive $\underline{\text{A}}$ligned $\underline{\text{N}}$eural $\underline{\text{D}}$$\underline{\text{Y}}$namics (CANDY), an end‑to‑end framework that aligns neural and behavioral data using rank-based contrastive learning, adapted for continuous behavioral variables, to project neural activity from different sessions onto a shared low-dimensional embedding space. CANDY fits a shared linear dynamical system to the aligned embeddings, enabling an interpretable model of the conserved temporal structure in the latent space. We validate CANDY on synthetic and real-world datasets spanning multiple species, behaviors, and recording modalities. Our results show that CANDY is able to learn aligned latent embeddings and preserved dynamics across neural recording sessions and subjects, and it achieves improved cross-session behavior decoding performance. We further show that the latent linear dynamical system generalizes to new sessions and subjects, achieving comparable or even superior behavior decoding performance to models trained from scratch. These advances enable robust cross‑session behavioral decoding and offer a path towards identifying shared neural dynamics that underlie behavior across individuals and recording conditions. The code and two-photon imaging data of striatal neural activity that we acquired here are available at https://github.com/schnitzer-lab/CANDY-public.git.

View full details

Poster

DenseDPO: Fine-Grained Temporal Preference Optimization for Video Diffusion Models

Ziyi Wu ⋅ Anil Kag ⋅ Ivan Skorokhodov ⋅ Willi Menapace ⋅ Ashkan Mirzaei ⋅ Igor Gilitschenski ⋅ Sergey Tulyakov ⋅ Aliaksandr Siarohin

Dec 3, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Direct Preference Optimization (DPO) has recently been applied as a post‑training technique for text-to-video diffusion models. To obtain training data, annotators are asked to provide preferences between two videos generated from independent noise. However, this approach prohibits fine-grained comparisons, and we point out that it biases the annotators towards low-motion clips as they often contain fewer visual artifacts. In this work, we introduce DenseDPO, a method that addresses these shortcomings by making three contributions. First, we create each video pair for DPO by denoising corrupted copies of a ground truth video. This results in aligned pairs with similar motion structures while differing in local details, effectively neutralizing the motion bias. Second, we leverage the resulting temporal alignment to label preferences on short segments rather than entire clips, yielding a denser and more precise learning signal. With only one‑third of the labeled data, DenseDPO greatly improves motion generation over vanilla DPO, while matching it in text alignment, visual quality, and temporal consistency. Finally, we show that DenseDPO unlocks automatic preference annotation using off-the-shelf Vision Language Models (VLMs): GPT accurately predicts segment-level preferences similar to task-specifically fine-tuned video reward models, and DenseDPO trained on these labels achieves performance close to using human labels.

View full details

Poster

Instance-Optimality for Private KL Distribution Estimation

Jiayuan Ye ⋅ Vitaly Feldman ⋅ Kunal Talwar

Dec 3, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

We study the fundamental problem of estimating an unknown discrete distribution $p$ over $d$ symbols, given $n$ i.i.d. samples from the distribution. We are interested in minimizing the KL divergence between the true distribution and the algorithm's estimate. We first construct minimax optimal private estimators. Minimax optimality however fails to shed light on an algorithm's performance on individual (non-worst-case) instances $p$ and simple minimax-optimal DP estimators can have poor empirical performance on real distributions. We then study this problem from an instance-optimality viewpoint, where the algorithm's error on $p$ is compared to the minimum achievable estimation error over a small local neighborhood of $p$. Under natural notions of local neighborhood, we propose algorithms that achieve instance-optimality up to constant factors, with and without a differential privacy constraint. Our upper bounds rely on (private) variants of the Good-Turing estimator. Our lower bounds use additive local neighborhoods that more precisely captures the hardness of distribution estimation in KL divergence, compared to ones considered in prior works.

View full details

Poster

Exploration via Feature Perturbation in Contextual Bandits

Seouh-won Yi ⋅ Min-hwan Oh

Dec 3, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

We propose *feature perturbation*, a simple yet effective exploration strategy for contextual bandits that injects randomness directly into feature inputs, instead of randomizing unknown parameters or adding noise to rewards. Remarkably, this algorithm achieves $\widetilde{\mathcal{O}}(d\sqrt{T})$ worst-case regret bound for generalized linear contextual bandits, while avoiding the $\widetilde{\mathcal{O}}(d^{3/2}\sqrt{T})$ regret typical of existing randomized bandit algorithms. Because our algorithm eschews parameter sampling, it is both computationally efficient and naturally extends to non-parametric or neural network models. We verify these advantages through empirical evaluations, demonstrating that feature perturbation not only surpasses existing methods but also unifies strong practical performance with the near-optimal regret guarantees.

View full details

Poster

Probing Neural Combinatorial Optimization Models

Zhiqin Zhang ⋅ Yining Ma ⋅ Zhiguang Cao ⋅ Hoong Chuin Lau

Dec 3, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Neural combinatorial optimization (NCO) has achieved remarkable performance, yet its learned model representations and decision rationale remain a black box. This impedes both academic research and practical deployment, since researchers and stakeholders require deeper insights into NCO models. In this paper, we take the first critical step towards interpreting NCO models by investigating their representations through various probing tasks. Moreover, we introduce a novel probing tool named Coefficient Significance Probing (CS-Probing) to enable deeper analysis of NCO representations by examining the coefficients and statistical significance during probing. Extensive experiments and analysis reveal that NCO models encode low-level information essential for solution construction, while capturing high-level knowledge to facilitate better decisions. Using CS-Probing, we find that prevalent NCO models impose varying inductive biases on their learned representations, uncover direct evidence related to model generalization, and identify key embedding dimensions associated with specific knowledge. These insights can be potentially translated into practice, for example, with minor code modifications, we improve the generalization of the analyzed model. Our work represents a first systematic attempt to interpret black-box NCO models, showcasing probing as a promising tool for analyzing their internal mechanisms and revealing insights for the NCO community. The source code is publicly available [here](https://github.com/123zhangzq/NeurIPS2025_probing).

View full details

Poster

DeepDiver: Adaptive Web-Search Intensity Scaling via Reinforcement Learning

Wenxuan Shi ⋅ Haochen Tan ⋅ Chuqiao Kuang ⋅ Xiaoguang Li ⋅ Hanting Chen ⋅ Xiaozhe Ren ⋅ Yasheng Wang ⋅ Lu Hou ⋅ Lifeng Shang

Dec 3, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Information seeking demands iterative evidence gathering and reflective reasoning, yet large language models (LLMs) still struggle with it in open-web question answering. Existing prompting and supervised fine-tuning (SFT) methods remain fixed by prompt rules or training corpora, and are usually benchmarked only on well-structured wiki sources, limiting real-world adaptability. We introduce $\textbf{WebPuzzle}$, a 24k-sample training and 275-sample test benchmark that evaluates information seeking on the live internet, across both wiki and open-domain queries. Leveraging 7k WebPuzzle instances, we develop $\textbf{DeepDiver}$, a reinforcement-learning (RL) framework that cultivates $\textbf{Search Intensity Scaling (SIS)}$—an emergent ability to escalate search frequency and depth instead of settling on overconfident, under-evidenced answers. With SIS, Qwen2.5-7B-Instruct and Pangu-7B-Reasoner attain performance on real-web tasks comparable to the 671B-parameter DeepSeek-R1. We detail DeepDiver’s curriculum from cold-start SFT to a well designed RL procedure, and show that its seeking policy generalized from closed-ended queries to open-ended generation such as long-form writing. Our results advance adaptive information seeking in LLMs and provide a rigorous benchmark for future work.

View full details

Poster

Estimating cognitive biases with attention-aware inverse planning

Sounak Banerjee ⋅ Daphne Cornelisse ⋅ Deepak Gopinath ⋅ Emily Sumner ⋅ Jonathan DeCastro ⋅ Guy Rosman ⋅ Eugene Vinitsky ⋅ Mark Ho

Dec 3, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

People's goal-directed behaviors are influenced by their cognitive biases, and autonomous systems that interact with people should be aware of this. For example, people's attention to objects in their environment will be biased in a way that systematically affects how they perform everyday tasks such as driving to work. Here, building on recent work in computational cognitive science, we formally articulate the \textit{attention-aware inverse planning problem}, in which the goal is to estimate a person's attentional biases from their actions. We demonstrate how attention-aware inverse planning systematically differs from standard inverse reinforcement learning and how cognitive biases can be inferred from behavior. Finally, we present an approach to attention-aware inverse planning that combines deep reinforcement learning with computational cognitive modeling. We use this approach to infer the attentional strategies of RL agents in real-life driving scenarios selected from the Waymo Open Dataset, demonstrating the scalability of estimating cognitive biases with attention-aware inverse planning.

View full details

Poster

Meta CLIP 2: A Worldwide Scaling Recipe

Yung-Sung Chuang ⋅ Yang Li ⋅ Dong Wang ⋅ Ching-Feng Yeh ⋅ Kehan Lyu ⋅ Ramya Raghavendra ⋅ Jim Glass ⋅ Lifei Huang ⋅ Jason Weston ⋅ Luke Zettlemoyer ⋅ Xinlei Chen ⋅ Zhuang Liu ⋅ Saining Xie ⋅ Scott Yih ⋅ Shang-Wen Li ⋅ Hu Xu

Dec 3, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Contrastive Language-Image Pretraining (CLIP) is a popular foundation model, supporting from zero-shot classification, retrieval to encoders for multimodal large language models (MLLMs). Although CLIP is successfully trained on billion-scale image-text pairs from the English world, scaling CLIP's training further to learning from the worldwide web data is still challenging: (1) no curation method is available to handle data points from non-English world; (2) the English performance from existing multilingual CLIP is worse than its English-only counterpart, i.e., "curse of multilinguality" that is common in LLMs. Here, we present Meta CLIP 2, the first recipe training CLIP from scratch on worldwide web-scale image-text pairs. To generalize our findings, we conduct rigorous ablations with minimal changes that are necessary to address the above challenges and present a recipe enabling mutual benefits from English and non-English world data. In zero-shot ImageNet classification, Meta CLIP 2 ViT-H/14 surpasses its English-only counterpart by 0.8% and mSigLIP by 0.7%, and surprisingly sets new state-of-the-art without system-level confounding factors (e.g., translation, bespoke architecture changes) on multilingual benchmarks, such as CVQA with 57.4%, Babel-ImageNet with 50.2% and XM3600 with 64.3% on image-to-text retrieval. Code and model are available at https://github.com/facebookresearch/MetaCLIP.

View full details

Poster

Gradient-Variation Online Adaptivity for Accelerated Optimization with Hölder Smoothness

Yuheng Zhao ⋅ Yu-Hu Yan ⋅ Kfir Y. Levy ⋅ Peng Zhao

Dec 3, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Smoothness is known to be crucial for acceleration in offline optimization, and for gradient-variation regret minimization in online learning. Interestingly, these two problems are actually closely connected --- accelerated optimization can be understood through the lens of gradient-variation online learning. In this paper, we investigate online learning with *Hölder* functions, a general class encompassing both smooth and non-smooth (Lipschitz) functions, and explore its implications for offline optimization. For (strongly) convex online functions, we design the corresponding gradient-variation online learning algorithm whose regret smoothly interpolates between the optimal guarantees in smooth and non-smooth regimes. Notably, our algorithms do not require prior knowledge of the Hölder smoothness parameter, exhibiting strong adaptivity over existing methods. Through online-to-batch conversion, this gradient-variation online adaptivity yields an optimal universal method for stochastic convex optimization under Hölder smoothness. However, achieving universality in offline strongly convex optimization is more challenging. We address this by integrating online adaptivity with a detection-based guess-and-check procedure, which, for the first time, yields a universal offline method that achieves accelerated convergence in the smooth regime while maintaining near-optimal convergence in the non-smooth one.

View full details

Poster

Beyond Expectations: Quantile-Guided Alignment for Risk-Calibrated Language Models

Xinran Wang ⋅ Jin Du ⋅ Azal Khan ⋅ qi le ⋅ Enmao Diao ⋅ Jiawei Zhou ⋅ Jie Ding ⋅ Ali Anwar

Dec 3, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Large language models can generate rare but catastrophic outputs, such as harmful conversations or insecure code. Existing Reinforcement Learning from Human Feedback (RLHF) typically maximizes average reward, leaving high-risk tail events insufficiently controlled. We introduce Quantile‑Guided Alignment (QA), a framework that allows users to specify desired improvements at any quantile—individually or across multiple reward dimensions—thus shifting the distribution of outputs with finer control toward safer, more desirable outcomes. The method extends standard RLHF via an augmented reward formulation that enforces quantile constraints. Experiments on conversation and code‐generation tasks show that quantile alignment significantly enhances quality at targeted tails while maintaining overall performance. The results position QA as a principled route to risk‑calibrated language models with tail‑focused alignment.

View full details

Poster

DAPO : Improving Multi-Step Reasoning Abilities of Large Language Models with Direct Advantage-Based Policy Optimization

Jiacai Liu ⋅ Chaojie Wang ⋅ Chris Liu ⋅ Liang Zeng ⋅ Rui Yan ⋅ Yiwen Sun ⋅ Yang Liu

Dec 3, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

The role of reinforcement learning (RL) in enhancing the reasoning of large language models (LLMs) is becoming increasingly significant. Despite the success of RL in many scenarios, there are still many challenges in improving the reasoning of LLMs. One key challenge is the sparse reward, which introduces more training variance in policy optimization and makes it difficult to obtain a good estimation for value function in Actor-Critic (AC) methods. To address these issues, we introduce Direct Advantage-Based Policy Optimization (DAPO), a novel step-level offline RL algorithm with theoretical guarantees for enhancing the reasoning abilities of LLMs. Unlike response-level methods (such as DPO and GRPO) that the update directions of all reasoning steps are governed by the outcome reward uniformly, DAPO employs a critic function to provide step-level dense signals for policy optimization. Additionally, the actor and critic in DAPO are trained independently, ensuring that critic is a good estimation of true state value function and avoiding the co-training instability observed in standard AC methods. We train DAPO on mathematical and code problems and then evaluate its performance on multiple benchmarks. Our results show that DAPO can effectively enhance the mathematical and code capabilities on both SFT models and RL models, demonstrating the effectiveness of DAPO.

View full details

Poster

SORTeD Rashomon Sets of Sparse Decision Trees: Anytime Enumeration

Elif Arslan ⋅ Jacobus van der Linden ⋅ Serge Hoogendoorn ⋅ Marco Rinaldi ⋅ Emir Demirović

Dec 3, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Sparse decision tree learning provides accurate and interpretable predictive models that are ideal for high-stakes applications by finding the single most accurate tree within a (soft) size limit. Rather than relying on a single “best” tree, Rashomon sets—trees with similar performance but varying structures—can be used to enhance variable importance analysis, enrich explanations, and enable users to choose simpler trees or those that satisfy stakeholder preferences (e.g., fairness) without hard-coding such criteria into the objective function. However, because finding the optimal tree is NP-hard, enumerating the Rashomon set is inherently challenging. Therefore, we introduce SORTD, a novel framework that improves scalability and enumerates trees in the Rashomon set in order of the objective value, thus offering anytime behavior. Our experiments show that SORTD reduces runtime by up to two orders of magnitude compared with the state of the art. Moreover, SORTD can compute Rashomon sets for any separable and totally ordered objective and supports post-evaluating the set using other separable (and partially ordered) objectives. Together, these advances make exploring Rashomon sets more practical in real-world applications.

View full details

Poster

The Complexity of Symmetric Equilibria in Min-Max Optimization and Team Zero-Sum Games

Ioannis Anagnostides ⋅ Ioannis Panageas ⋅ Tuomas Sandholm ⋅ Jingming Yan

Dec 3, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

We consider the problem of computing stationary points in min-max optimization, with a focus on the special case of Nash equilibria in (two-)team zero-sum games. We first show that computing $\epsilon$-Nash equilibria in $3$-player $\text{\emph{adversarial}}$ team games---wherein a team of $2$ players competes against a $\text{\emph{single}}$ adversary---is $\textsf{CLS}$-complete, resolving the complexity of Nash equilibria in such settings. Our proof proceeds by reducing from $\text{\emph{symmetric}}$ $\epsilon$-Nash equilibria in $\text{\emph{symmetric}}$, identical-payoff, two-player games, by suitably leveraging the adversarial player so as to enforce symmetry---without disturbing the structure of the game. In particular, the class of instances we construct comprises solely polymatrix games, thereby also settling a question left open by Hollender, Maystre, and Nagarajan (2024). Moreover, we establish that computing $\text{\emph{symmetric}}$ (first-order) equilibria in $\text{\emph{symmetric}}$ min-max optimization is $\textsf{PPAD}$-complete, even for quadratic functions. Building on this reduction, we show that computing symmetric $\epsilon$-Nash equilibria in symmetric, $6$-player ($3$ vs. $3$) team zero-sum games is also $\textsf{PPAD}$-complete, even for $\epsilon = \text{poly}(1/n)$. As a corollary, this precludes the existence of symmetric dynamics---which includes many of the algorithms considered in the literature---converging to stationary points. Finally, we prove that computing a $\text{\emph{non-symmetric}}$ $\text{poly}(1/n)$-equilibrium in symmetric min-max optimization is $\textsf{FNP}$-hard.

View full details

Poster

Vgent: Graph-based Retrieval-Reasoning-Augmented Generation For Long Video Understanding

Xiaoqian Shen ⋅ Wenxuan Zhang ⋅ Jun Chen ⋅ Mohamed Elhoseiny

Dec 3, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Understanding and reasoning over long videos pose significant challenges for large video language models (LVLMs) due to the difficulty in processing intensive video tokens beyond context window and retaining long-term sequential information. Retrieval-Augmented Generation (RAG) has demonstrated effectiveness in processing long context for Large Language Models (LLMs); however, applying RAG to long video faces challenges such as disrupted temporal dependencies and inclusion of irrelevant information that can hinder accurate reasoning. To address these limitations, we propose Vgent, a novel \textbf{graph-based retrieval-reasoning-augmented generation framework} to enhance LVLMs for long video understanding. Our approach introduces two key innovations: (i) It represents videos by structured graphs with semantic relationships across video clips preserved to improve retrieval effectiveness. (ii) It introduces an intermediate reasoning step to mitigate the reasoning limitation of LVLMs, which leverages structured verification to reduce retrieval noise and facilitate the explicit aggregation of relevant information across clips, resulting in more accurate and context-aware responses. We comprehensively evaluate our framework with various open-source LVLMs on three long-video understanding benchmarks. Our approach yielded an overall performance improvement of $3.0\%\sim 5.4\%$ over base models on MLVU, and outperformed state-of-the-art video RAG methods by $8.6\%$. Our code is publicly available at https://xiaoqian-shen.github.io/Vgent.

View full details

Poster

The Temporal Graph of Bitcoin Transactions

Vahid Jalili

Dec 3, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Since its 2009 genesis block, the Bitcoin network has processed >1.08 billion (B) transactions representing >8.72B BTC, offering rich potential for machine learning (ML); yet, its pseudonymity and obscured flow of funds inherent in its UTxO-based design, have rendered this data largely inaccessible for ML research. Addressing this gap, we present an ML-compatible graph modeling the Bitcoin's economic topology by reconstructing the flow of funds. This temporal, heterogeneous graph encompasses complete transaction history up to block 863000, consisting of >2.4B nodes and >39.72B edges. Additionally, we provide custom sampling methods yielding node and edge feature vectors of sampled communities, tools to load and analyze the Bitcoin graph data within specialized graph databases, and ready-to-use database snapshots. This comprehensive dataset and toolkit empower the ML community to tackle Bitcoin's intricate ecosystem at scale, driving progress in applications such as anomaly detection, address classification, market analysis, and large-scale graph ML benchmarking. Dataset and code available at https://github.com/B1AAB/EBA.

View full details

Poster

Co-Reinforcement Learning for Unified Multimodal Understanding and Generation

Jingjing Jiang ⋅ Chongjie Si ⋅ Jun Luo ⋅ Hanwang Zhang ⋅ Chao Ma

Dec 3, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

This paper presents a pioneering exploration of reinforcement learning (RL) via group relative policy optimization for unified multimodal large language models (ULMs), aimed at simultaneously reinforcing generation and understanding capabilities. Through systematic pilot studies, we uncover the significant potential of ULMs to enable the synergistic co-evolution of dual capabilities within a shared policy optimization framework. Building on this insight, we introduce \textbf{CoRL}, a \textbf{Co}-\textbf{R}einforcement \textbf{L}earning framework comprising a unified RL stage for joint optimization and a refined RL stage for task-specific enhancement. With the proposed CoRL, our resulting model, \textbf{ULM-R1}, achieves average improvements of 7\% on three text-to-image generation datasets and 23\% on nine multimodal understanding benchmarks. These results demonstrate the effectiveness of CoRL and highlight the substantial benefits of reinforcement learning in facilitating cross-task synergy and optimization for ULMs. Code is available at \url{https://github.com/mm-vl/ULM-R1}.

View full details

Poster

Dense Associative Memory with Epanechnikov Energy

Benjamin Hoover ⋅ Zhaoyang Shi ⋅ Krishnakumar Balasubramanian ⋅ Dmitry Krotov ⋅ Parikshit Ram

Dec 3, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

We propose a novel energy function for Dense Associative Memory (DenseAM) networks, the log-sum-ReLU (LSR), inspired by optimal kernel density estimation. Unlike the common log-sum-exponential (LSE) function, LSR is based on the Epanechnikov kernel and enables exact memory retrieval with exponential capacity without requiring exponential separation functions. Uniquely, it introduces abundant additional emergent local minima while preserving perfect pattern recovery --- a characteristic previously unseen in DenseAM literature. Empirical results show that LSR energy has significantly more local minima (memories) that have comparable log-likelihood to LSE-based models. Analysis of LSR's emergent memories on image datasets reveals a degree of creativity and novelty, hinting at this method's potential for both large-scale memory storage and generative tasks.

View full details

Poster

COOPERA: Continual Open-Ended Human-Robot Assistance

Chenyang Ma ⋅ Kai Lu ⋅ Ruta Desai ⋅ Xavier Puig ⋅ Andrew Markham ⋅ Niki Trigoni

Dec 3, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

To understand and collaborate with humans, robots must account for individual human traits, habits, and activities over time. However, most robotic assistants lack these abilities, as they primarily focus on predefined tasks in structured environments and lack a human model to learn from. This work introduces COOPERA, a novel framework for COntinual, OPen-Ended human-Robot Assistance, where simulated humans, driven by psychological traits and long-term intentions, interact with robots in complex environments. By integrating continuous human feedback, our framework, for the first time, enables the study of long-term, open-ended human-robot collaboration (HRC) in different collaborative tasks across various time-scales. Within COOPERA, we introduce a benchmark and an approach to personalize the robot's collaborative actions by learning human traits and context-dependent intents. Experiments validate the extent to which our simulated humans reflect realistic human behaviors and demonstrate the value of inferring and personalizing to human intents for open-ended and long-term HRC.

View full details

Poster

Efficient Fairness-Performance Pareto Front Computation

Mark Kozdoba ⋅ Binyamin Perets ⋅ Shie Mannor

Dec 3, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

There is a well known intrinsic trade-off between the fairness of a representation and the performance of classifiers derived from the representation. In this paper we propose a new method to compute the optimal Pareto front of this trade off. In contrast to the existing methods, this approach does not require the training of complex fair representation models. Our approach is derived through three main steps: We analyze fair representations theoretically, and derive several structural properties of optimal representations. We then show that these properties enable a reduction of the computation of the Pareto Front to a compact discrete problem. Finally, we show that these compact approximating problems can be efficiently solved via off-the shelf concave-convex programming methods. In addition to representations, we show that the new methods may also be used to directly compute the Pareto front of fair classification problems. Moreover, the proposed methods may be used with any concave performance measure. This is in contrast to the existing reduction approaches, developed recently in fair classification, which rely explicitly on the structure of the non-differentiable accuracy measure, and are thus unlikely to be extendable. The approach was evaluated on several real world benchmark datasets and compares favorably to a number of recent state of the art fair representation and classification methods.

View full details

Poster

Searching Latent Program Spaces

Matthew Macfarlane ⋅ Clem Bonnet

Dec 3, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

General intelligence requires systems that acquire new skills efficiently and generalize beyond their training distributions. Although program synthesis approaches have strong generalization power, they face scaling issues due to large combinatorial spaces that quickly make them impractical and require human-generated DSLs or pre-trained priors to narrow this search space. On the other hand, deep learning methods have had high successes, but they lack structured test-time adaptation and rely on heavy stochastic sampling or expensive gradient updates for fine-tuning. In this work, we propose the Latent Program Network (LPN), a new architecture that builds in test-time search directly into neural models. LPN learns a latent space of implicit programs---neurally mapping inputs to outputs---through which it can search using gradients at test time. LPN combines the adaptability of symbolic approaches and the scalability of neural methods. It searches through a compact latent space at test time and bypasses the need for pre-defined domain-specific languages. On a range of programming-by-examples tasks, LPN either outperforms or matches performance compared to in-context learning and test-time training methods. Tested on the ARC-AGI benchmark, we demonstrate that LPN can both learn a compact program space and search through it at test time to adapt to novel tasks. LPN doubles its performance on out-of-distribution tasks when test-time search is switched on.

View full details

Poster

Efficient Prompt Compression with Evaluator Heads for Long-Context Transformer Inference

Weizhi Fei ⋅ Xueyan Niu ⋅ XIE GUOQING ⋅ Yingqing Liu ⋅ Bo Bai ⋅ Wei Han

Dec 3, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Although applications involving long-context inputs are crucial for the effective utilization of large language models (LLMs), they also result in increased computational costs and reduced performance. To address this challenge, we propose an efficient, training-free prompt compression method that retains key information within compressed prompts. We identify specific attention heads in transformer-based LLMs, which we designate as evaluator heads, that are capable of selecting tokens in long inputs that are most significant for inference. Building on this discovery, we develop EHPC, an Evaluator Head-based Prompt Compression method, which enables LLMs to rapidly "skim through'' input prompts by leveraging only the first few layers with evaluator heads during the pre-filling stage, subsequently passing only the important tokens to the model for inference. EHPC achieves state-of-the-art results across two mainstream benchmarks: prompt compression and long-context inference acceleration. Consequently, it effectively improves performance with the reduced costs associated with commercial API calls compared to prompt compressing methods. We further demonstrate that EHPC attains competitive results compared to key-value cache-based acceleration methods, thereby highlighting its potential to enhance the efficiency of LLMs for long-context tasks.

View full details

Poster

Vision Transformers Don't Need Trained Registers

Nicholas Jiang ⋅ Amil Dravid ⋅ Alexei Efros ⋅ Yossi Gandelsman

Dec 3, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

We investigate the mechanism underlying a previously identified phenomenon in Vision Transformers -- the emergence of high-norm tokens that lead to noisy attention maps (Darcet et al., 2024). We observe that in multiple models (e.g., CLIP, DINOv2), a sparse set of neurons is responsible for concentrating high-norm activations on outlier tokens, leading to irregular attention patterns and degrading downstream visual processing. While the existing solution for removing these outliers involves retraining models from scratch with additional learned $\textit{register tokens}$, we use our findings to create a training-free approach to mitigate these artifacts. By shifting the high-norm activations from our discovered $\textit{register neurons}$ into an additional untrained token, we can mimic the effect of register tokens on a model already trained without registers. We demonstrate that our method produces cleaner attention and feature maps, enhances performance over base models across multiple downstream visual tasks, and achieves results comparable to models explicitly trained with register tokens. We then extend test-time registers to off-the-shelf vision-language models, yielding cleaner attention-based, text-to-image attribution. Finally, we outline a simple mathematical model that reflects the observed behavior of register neurons and high norm tokens. Our results suggest that test-time registers effectively take on the role of register tokens at test-time, offering a training-free solution for any pre-trained model released without them.

View full details

Poster

EAG3R: Event-Augmented 3D Geometry Estimation for Dynamic and Extreme-Lighting Scenes

Xiaoshan Wu ⋅ Yifei Yu ⋅ Xiaoyang Lyu ⋅ Yihua Huang ⋅ Bo Wang ⋅ Baoheng Zhang ⋅ Zhongrui Wang ⋅ Xiaojuan Qi

Dec 3, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Robust 3D geometry estimation from videos is critical for applications such as autonomous navigation, SLAM, and 3D scene reconstruction. Recent methods like DUSt3R demonstrate that regressing dense pointmaps from image pairs enables accurate and efficient pose-free reconstruction. However, existing RGB-only approaches struggle under real-world conditions involving dynamic objects and extreme illumination, due to the inherent limitations of conventional cameras. In this paper, we propose \textbf{EAG3R}, a novel geometry estimation framework that augments pointmap-based reconstruction with asynchronous event streams. Built upon the MonST3R backbone, EAG3R introduces two key innovations: (1) a retinex-inspired image enhancement module and a lightweight event adapter with SNR-aware fusion mechanism that adaptively combines RGB and event features based on local reliability; and (2) a novel event-based photometric consistency loss that reinforces spatiotemporal coherence during global optimization. Our method enables robust geometry estimation in challenging dynamic low-light scenes without requiring retraining on night-time data. Extensive experiments demonstrate that EAG3R significantly outperforms state-of-the-art RGB-only baselines across monocular depth estimation, camera pose tracking, and dynamic reconstruction tasks.

View full details

Poster

Mozart: Modularized and Efficient MoE Training on 3.5D Wafer-Scale Chiplet Architectures

Shuqing Luo ⋅ Ye Han ⋅ Pingzhi Li ⋅ Jiayin Qin ⋅ Jie Peng ⋅ Yang Zhao ⋅ Yu Cao ⋅ Tianlong Chen

Dec 3, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Mixture-of-Experts (MoE) architecture offers enhanced efficiency for Large Language Models (LLMs) with modularized computation, yet its inherent sparsity poses significant hardware deployment challenges, including memory locality issues, communication overhead, and inefficient computing resource utilization. Inspired by the modular organization of the human brain, we propose $\texttt{Mozart}$, a novel algorithm-hardware co-design framework tailored for efficient training of MoE-based LLMs on 3.5D wafer-scale chiplet architectures. On the algorithm side, $\texttt{Mozart}$ exploits the inherent modularity of chiplets and introduces: ($1$) an expert allocation strategy that enables efficient on-package all-to-all communication, and ($2$) a fine-grained scheduling mechanism that improves communication-computation overlap through streaming tokens and experts. On the architecture side, $\texttt{Mozart}$ adaptively co-locates heterogeneous modules on specialized chiplets with a 2.5D NoP-Tree topology and hierarchical memory structure. Evaluation across three popular MoE models demonstrates significant efficiency gains, enabling more effective parallelization and resource utilization for large-scale modularized MoE-LLMs.

View full details

Poster

Vision-centric Token Compression in Large Language Model

Ling Xing ⋅ Alex Jinpeng Wang ⋅ Rui Yan ⋅ Xiangbo Shu ⋅ Jinhui Tang

Dec 3, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Real-world applications are stretching context windows to hundreds of thousand of tokens while Large Language Models (LLMs) swell from billions to trillions of parameters. This dual expansion send compute and memory costs skyrocketing, making $\textit{token compression}$ indispensable. We introduce Vision Centric Token Compression ($\textbf{Vist}$), a $\textit{slow–fast}$ compression framework that mirrors human reading: the $\textit{fast}$ path renders distant tokens into images, letting a $\textbf{frozen, lightweight vision encoder}$ skim the low-salience context; the $\textit{slow}$ path feeds the proximal window into the LLM for fine-grained reasoning. A Probability-Informed Visual Enhancement (PVE) objective masks high-frequency tokens during training, steering the Resampler to concentrate on semantically rich regions—just as skilled reader gloss over function words. On eleven in-context learning benchmarks, $\textbf{Vist}$ achieves the same accuracy with 2.3$\times$ fewer tokens, cutting FLOPs by 16\% and memory by 50\%. This method delivers remarkable results, outperforming the strongest text encoder-based compression method CEPE by $\textbf{7.6}$\% on average over benchmarks like TriviaQA, NQ, PopQA, NLUI, and CLIN, setting a new standard for token efficiency in LLMs. The project is at https://github.com/CSU-JPG/VIST.

View full details

Poster

Repo2Run: Automated Building Executable Environment for Code Repository at Scale

Ruida Hu ⋅ Chao Peng ⋅ XinchenWang ⋅ Junjielong Xu ⋅ Cuiyun Gao

Dec 3, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Scaling up executable code data is significant for improving language models’ software engineering capability. The intricate nature of the process makes it labor-intensive, time-consuming and expert-knowledge-dependent to build a large number of executable code repositories, limiting the scalability of existing work based on running tests. The primary bottleneck lies in the automated building of test environments for different repositories, which is an essential yet underexplored task. To mitigate the gap, we introduce Repo2Run, the first LLM-based agent aiming at automating the building of executable test environments for any repositories at scale. Specifically, given a code repository, Repo2Run iteratively builds the Docker image, runs unit tests based on the feedback of the building, and synthesizes the Dockerfile until the entire pipeline is executed successfully. The resulting Dockerfile can then be used to create Docker container environments for running code and tests. We created a benchmark containing 420 Python repositories with unit tests for evaluation. The results illustrate that Repo2Run achieves an 86.0% success rate, outperforming SWE-agent by 77.0%. The resources of Repo2Run are available at https://github.com/bytedance/Repo2Run.

View full details

Poster

Wider or Deeper? Scaling LLM Inference-Time Compute with Adaptive Branching Tree Search

Yuichi Inoue ⋅ Kou Misaki ⋅ Yuki Imajuku ⋅ So Kuroki ⋅ Taishi Nakamura ⋅ Takuya Akiba

Dec 3, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Recent advances demonstrate that increasing inference-time computation can significantly boost the reasoning capabilities of large language models (LLMs). Although repeated sampling (i.e., generating multiple candidate outputs) is a highly effective strategy, it does not leverage external feedback signals for refinement, which are often available in tasks like coding. In this work, we propose Adaptive Branching Monte Carlo Tree Search (AB-MCTS), a novel inference-time framework that generalizes repeated sampling with principled multi-turn exploration and exploitation. At each node in the search tree, AB-MCTS dynamically decides whether to ''go wider'' by expanding new candidate responses or ''go deeper'' by revisiting existing ones based on external feedback signals. We evaluate our method on complex coding and engineering tasks using frontier models. Empirical results show that AB-MCTS outperforms both repeated sampling and standard MCTS, underscoring the importance of combining the response diversity of LLMs with multi-turn solution refinement for effective inference-time scaling.

View full details

Poster

Win Fast or Lose Slow: Balancing Speed and Accuracy in Latency-Sensitive Decisions of LLMs

Hao Kang ⋅ Qingru Zhang ⋅ Han Cai ⋅ Weiyuan Xu ⋅ Tushar Krishna ⋅ Yilun Du ⋅ Tsachy Weissman

Dec 3, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Large language models (LLMs) have shown remarkable performance across diverse reasoning and generation tasks, and are increasingly deployed as agents in dynamic environments such as code generation and recommendation systems. However, many real-world applications, such as high-frequency trading and real-time competitive gaming, require decisions under strict latency constraints, where faster responses directly translate into higher rewards. Despite the importance of this latency–quality trade-off, it remains underexplored in the context of LLM-based agents. In this work, we present the first systematic study of this trade-off in real-time decision-making tasks. To support our investigation, we introduce two new benchmarks: HFTBench, a high-frequency trading simulation, and StreetFighter, a competitive gaming platform. Our analysis reveals that optimal latency–quality balance varies by task, and that sacrificing quality for lower latency can significantly enhance downstream performance. To address this, we propose FPX, an adaptive framework that dynamically selects model size and quantization level based on real-time demands. Our method achieves the best performance on both benchmarks, improving win rate by up to 80% in Street Fighter and boosting daily yield by up to 26.52% in trading, underscoring the need for latency-aware evaluation and deployment strategies for LLM-based agents. These results demonstrate the critical importance of latency-aware evaluation and deployment strategies for real-world LLM-based agents.

View full details

Poster

Fast Projection-Free Approach (without Optimization Oracle) for Optimization over Compact Convex Set

Chenghao Liu ⋅ Enming Liang ⋅ Minghua Chen

Dec 3, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Projection-free first-order methods, e.g., the celebrated Frank-Wolfe (FW) algorithms, have emerged as powerful tools for optimization over simple convex sets such as polyhedra, because of their scalability, fast convergence, and iteration-wise feasibility without costly projections. However, extending these methods effectively to general compact convex sets remains challenging and largely open, as FW methods rely on expensive linear optimization oracles (LOO), while penalty-based methods often struggle with poor feasibility. We tackle this open challenge by presenting **Hom-PGD**, a novel projection-free method without expensive (optimization) oracles. Our method constructs a homeomorphism between the convex constraint set and a unit ball, transforming the original problem into an equivalent ball-constrained formulation, thus enabling efficient gradient-based optimization while preserving the original problem structure. We prove that Hom-PGD attains *optimal* convergence rates matching gradient descent with constant step-size to find an $\epsilon$-approximate (stationary) solution: $\mathcal{O}(\log (1/\epsilon))$ for strongly convex objectives, $\mathcal{O}(\epsilon^{-1})$ for convex objectives, and $\mathcal{O}(\epsilon^{-2})$ for non-convex objectives. Meanwhile, Hom-PGD enjoys a low per-iteration complexity of $\mathcal{O}(n^2)$, without expensive oracles like LOO or projection, where $n$ is the input size. Our framework further extends to certain non-convex sets, broadening its applicability in practical optimization scenarios with complex constraints. Extensive numerical experiments demonstrate that Hom-PGD achieves comparable convergence rates to state-of-the-art projection-free methods, while significantly reducing per-iteration runtime (up to 5 orders of magnitude faster) and thus the total problem-solving time.

View full details

Poster

Rig3R: Rig-Aware Conditioning and Discovery for 3D Reconstruction

Samuel Li ⋅ Pujith Kachana ⋅ Prajwal Chidananda ⋅ Saurabh Nair ⋅ Yasutaka Furukawa ⋅ Matthew A Brown

Dec 3, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Estimating agent pose and 3D scene structure from multi-camera rigs is a central task in embodied AI applications such as autonomous driving. Recent learned approaches such as DUSt3R have shown impressive performance in multiview settings. However, these models treat images as unstructured collections, limiting effectiveness in scenarios where frames are captured from synchronized rigs with known or inferable structure. To this end, we introduce Rig3R, a generalization of prior multiview reconstruction models that incorporates rig structure when available, and learns to infer it when not. Rig3R conditions on optional rig metadata including camera ID, time, and rig poses to develop a rig-aware latent space that remains robust to missing information. It jointly predicts pointmaps and two types of raymaps: a pose raymap relative to a global frame, and a rig raymap relative to a rig-centric frame consistent across time. Rig raymaps allow the model to infer rig structure directly from input images when metadata is missing. The global pose raymaps allow the model to reason about the agent’s ego-motion, while the rig raymaps allow the model to infer rig structure directly from input images when metadata is missing. Rig3R achieves state-of-the-art performance in 3D reconstruction, camera pose estimation, and rig discovery -- outperforming both traditional and learned methods by 17-45% mAA across diverse real-world rig datasets, all in a single forward pass without post-processing or iterative refinement.

View full details

Poster

EDELINE: Enhancing Memory in Diffusion-based World Models via Linear-Time Sequence Modeling

Jia-Hua Lee ⋅ Bor-Jiun Lin ⋅ Wei-Fang Sun ⋅ Chun-Yi Lee

Dec 3, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

World models represent a promising approach for training reinforcement learning agents with significantly improved sample efficiency. While most world model methods primarily rely on sequences of discrete latent variables to model environment dynamics, this compression often neglects critical visual details essential for reinforcement learning. Recent diffusion-based world models condition generation on a fixed context length of frames to predict the next observation, using separate recurrent neural networks to model rewards and termination signals. Although this architecture effectively enhances visual fidelity, the fixed context length approach inherently limits memory capacity. In this paper, we introduce EDELINE, a unified world model architecture that integrates state space models with diffusion models. Our approach outperforms existing baselines across visually challenging Atari 100k tasks, memory-demanding Crafter benchmark, and 3D first-person ViZDoom environments, demonstrating superior performance in all these diverse challenges. Code is available at https://github.com/LJH-coding/EDELINE.

View full details

Poster

ARM: Adaptive Reasoning Model

Siye Wu ⋅ Jian Xie ⋅ yikai zhang ⋅ Aili Chen ⋅ Kai Zhang ⋅ Yu Su ⋅ Yanghua Xiao

Dec 3, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

While large reasoning models demonstrate strong performance on complex tasks, they lack the ability to adjust reasoning token usage based on task difficulty. This often leads to the "overthinking" problem—excessive and unnecessary reasoning—which, although potentially mitigated by human intervention to control the token budget, still fundamentally contradicts the goal of achieving fully autonomous AI. In this work, we propose Adaptive Reasoning Model (ARM), a reasoning model capable of adaptively selecting appropriate reasoning formats based on the task at hand. These formats include three efficient ones—Direct Answer, Short CoT, and Code—as well as a more elaborate format, Long CoT. To train ARM, we introduce Ada-GRPO, an adaptation of Group Relative Policy Optimization (GRPO), which addresses the format collapse issue in traditional GRPO. Ada-GRPO enables ARM to achieve high token efficiency, reducing tokens by an average of $\sim$30%, and up to $\sim$70%, while maintaining performance comparable to the model that relies solely on Long CoT. Furthermore, not only does it improve inference efficiency through reduced token generation, but it also brings a $\sim$2$\times$ speedup in training. In addition to the default Adaptive Mode, ARM supports two additional reasoning modes: 1) Instruction-Guided Mode, which allows users to explicitly specify the reasoning format via special tokens—ideal when the appropriate format is known for a batch of tasks. 2) Consensus-Guided Mode, which aggregates the outputs of the three efficient formats and resorts to Long CoT in case of disagreement, prioritizing performance with higher token usage. All the resources will be released.

View full details

Poster

Tackling Biased Evaluators in Dueling Bandits

Ming Tang ⋅ Yuxuan Zhou ⋅ Chao Huang

Dec 3, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

In dueling bandits, an agent explores and exploits choices (i.e., arms) by learning from their stochastic feedback in the form of relative preferences. Prior related studies focused on unbiased feedback. In practice, however, the feedback provided by evaluators can be biased. For example, human users are likely to provide biased evaluation towards large language models due to their heterogeneous background. In this work, we aim to minimize the regret in dueling bandits considering evaluators’ biased feedback. We begin with a benchmark case where evaluators’ bias information is known. Solving the known-bias case is nontrivial, because the bias cannot be easily decoupled from the feedback. We overcome this challenge and propose an unbiased arm performance estimator and a bias-sensitive dueling bandits algorithm. We manage to analyze the regret, dealing with the complex form of the estimator, and show that the feedback either matching or opposing the ground-truth reduces the regret. Then, we study the case where evaluators’ bias information is unknown. The associated estimator can hardly be solved in closed-form due to the non-convexity of the estimator solving problem. We address this challenge and propose an extended bias-sensitive algorithm by incorporating block coordinate descent. This algorithm is proven to achieve the same order of regret (as in the known bias case) with a bounded error. Experiments show that when compared with baselines, our algorithms reduces the regret by up to 86.9%.

View full details

Poster

Scaling and context steer LLMs along the same computational path as the human brain

Joséphine Raugel ⋅ Jérémy Rapin ⋅ Stéphane d'Ascoli ⋅ Valentin Wyart ⋅ Jean-Remi King

Dec 3, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Recent studies suggest that the representations learned by large language models (LLMs) are partially aligned to those of the human brain. However, whether this representational alignment arises from a similar sequence of computations remains elusive. In this study, we explore this question by examining temporally-resolved brain signals of participants listening to 10 hours of an audiobook. We study these neural dynamics jointly with a benchmark encompassing 17 LLMs varying in size and architecture type. Our analyses reveal that LLMs and the brain generate representations in a similar order: specifically, activations in the initial layers of LLMs tend to best align with early brain responses, while the deeper layers of LLMs tend to best align with later brain responses. This brain-LLM alignment is consistent across transformers and recurrent architectures. However, its emergence depends on both model size and context length. Overall, the alignment between LLMs and the brain provides novel elements supporting a partial convergence between biological and artificial neural networks.

View full details

Poster

Generative Trajectory Stitching through Diffusion Composition

Yunhao Luo ⋅ Utkarsh Mishra ⋅ Yilun Du ⋅ Danfei Xu

Dec 3, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Effective trajectory stitching for long-horizon planning is a significant challenge in robotic decision-making. While diffusion models have shown promise in planning, they are limited to solving tasks similar to those seen in their training data. We propose CompDiffuser, a novel generative approach that can solve new tasks by learning to compositionally stitch together shorter trajectory chunks from previously seen tasks. Our key insight is modeling the trajectory distribution by subdividing it into overlapping chunks and learning their conditional relationships through a single bidirectional diffusion model. This allows information to propagate between segments during generation, ensuring physically consistent connections. We conduct experiments on benchmark tasks of various difficulties, covering different environment sizes, agent state dimension, trajectory types, training data quality, and show that CompDiffuser significantly outperforms existing methods.

View full details

Poster

Privacy amplification by random allocation

Moshe Shenfeld ⋅ Vitaly Feldman

Dec 3, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

We consider the privacy amplification properties of a sampling scheme in which a user's data is used in $k$ steps chosen randomly and uniformly from a sequence (or set) of $t$ steps. This sampling scheme has been recently applied in the context of differentially private optimization [Chua et al., 2024a, Choquette-Choo et al., 2024] and is also motivated by communication-efficient high-dimensional private aggregation [Asi et al., 2025]. Existing analyses of this scheme either rely on privacy amplification by shuffling which leads to overly conservative bounds or require Monte Carlo simulations that are computationally prohibitive in most practical scenarios. We give the first theoretical guarantees and numerical estimation algorithms for this sampling scheme. In particular, we demonstrate that the privacy guarantees of random $k$-out-of-$t$ allocation can be upper bounded by the privacy guarantees of the well-studied independent (or Poisson) subsampling in which each step uses the user's data with probability $(1+o(1))k/t$. Further, we provide two additional analysis techniques that lead to numerical improvements in several parameter regimes. Altogether, our bounds give efficiently-computable and nearly tight numerical results for random allocation applied to Gaussian noise addition.

View full details

Poster

Guarantees for Alternating Least Squares in Overparameterized Tensor Decompositions

Dionysis Arvanitakis ⋅ Vaidehi Srinivas ⋅ Aravindan Vijayaraghavan

Dec 3, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Tensor decomposition is a canonical non-convex optimization problem that is computationally challenging, and yet important due to applications in factor analysis and parameter estimation of latent variable models. In practice, scalable iterative methods, particularly Alternating Least Squares (ALS), remain the workhorse for tensor decomposition despite the lack of global convergence guarantees. A popular approach to tackle challenging non-convex optimization problems is overparameterization--- on input an $n \times n \times n$ tensor of rank $r$, the algorithm can output a decomposition of potentially rank $k$ (potentially larger than $r$). On the theoretical side, overparameterization for iterative methods is challenging to reason about and requires new techniques. The work of Wang et al., (NeurIPS 2020) makes progress by showing that a variant of gradient descent globally converges when overparameterized to $k=O(r^{7.5} \log n)$. Our main result shows that overparameterization provably enables global convergence of ALS: on input a third order $n \times n \times n$ tensor with a decomposition of rank $r \ll n$, ALS overparameterized with rank $k=O(r^2)$ achieves global convergence with high probability under random initialization. Moreover our analysis also gives guarantees for the more general low-rank approximation problem. The analysis introduces new techniques for understanding iterative methods in the overparameterized regime based on new matrix anticoncentration arguments.

View full details

Poster

HopaDIFF: Holistic-Partial Aware Fourier Conditioned Diffusion for Referring Human Action Segmentation in Multi-Person Scenarios

Kunyu Peng ⋅ Junchao Huang ⋅ Xiangsheng Huang ⋅ Di Wen ⋅ Junwei Zheng ⋅ Yufan Chen ⋅ Kailun Yang ⋅ Jiamin Wu ⋅ Chongqing Hao ⋅ Rainer Stiefelhagen

Dec 3, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Action segmentation is a core challenge in high-level video understanding, aiming to partition untrimmed videos into segments and assign each a label from a predefined action set. Existing methods primarily address single-person activities with fixed action sequences, overlooking multi-person scenarios. In this work, we pioneer textual reference-guided human action segmentation in multi-person settings, where a textual description specifies the target person for segmentation. We introduce the first dataset for Referring Human Action Segmentation, i.e., RHAS133, built from 133 movies and annotated with 137 fine-grained actions with 33h video data, together with textual descriptions for this new task. Benchmarking existing action segmentation methods on RHAS133 using VLM-based feature extractors reveals limited performance and poor aggregation of visual cues for the target person. To address this, we propose a holistic-partial aware Fourier-conditioned diffusion framework, i.e., HopaDIFF, leveraging a novel cross-input gate attentional xLSTM to enhance holistic-partial long-range reasoning and a novel Fourier condition to introduce more fine-grained control to improve the action segmentation generation. HopaDIFF achieves state-of-the-art results on RHAS133 in diverse evaluation settings. The dataset and code are available at https://github.com/KPeng9510/HopaDIFF.

View full details

Poster

Activation Control for Efficiently Eliciting Long Chain-of-thought Ability of Language Models

Zekai Zhao ⋅ Qi Liu ⋅ Kun Zhou ⋅ Zihan Liu ⋅ Yifei Shao ⋅ Zhiting Hu ⋅ Biwei Huang

Dec 3, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Despite the remarkable reasoning performance, eliciting the long chain-of-thought(CoT) ability in large language models(LLMs) typically requires costly reinforcement learning or supervised fine-tuning on high-quality distilled data. We investigate the internal mechanisms behind this capability and show that a small set of high-impact activations in the last few layers, greatly govern the long-form reasoning attributes, e.g. output length and self-reflection. Through simply amplifying these activations and adding ``wait'' tokens, the long CoT ability can be invoked without training, leading to significantly increased self-reflection rate and accuracy. In addition, we also find that the activation changes follow predictable trajectories, i.e. a sharp rise after special tokens and a subsequent exponential decay. Based on these insights, we introduce a general training-free activation control technique. It utilizes a few contrastive examples to identify the relevant activations, and then incorporates simple analytic functions to adjust their values at inference time to elicit long CoTs. Extensive experiments have verified the effectiveness of our methods in efficiently eliciting the long CoT ability of LLMs and improving the performance. Besides, we further propose a parameter-efficient fine-tuning method that trains only the last-layer activation amplification module and a few LoRA layers, outperforming LoRA on reasoning benchmarks with much fewer parameters. Our code and data will be fully public released.

View full details

Poster

AlphaZero Neural Scaling and Zipf's Law: a Tale of Board Games and Power Laws

Oren Neumann ⋅ Claudius Gros

Dec 3, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Neural scaling laws are observed in a range of domains, to date with no universal understanding of why they occur. Recent theories suggest that loss power laws arise from Zipf's law, a power law observed in domains like natural language. One theory suggests that language scaling laws emerge when Zipf-distributed task quanta are learned in descending order of frequency. In this paper we examine power-law scaling in AlphaZero, a reinforcement learning algorithm, using a model of language-model scaling. We find that game states in training and inference data scale with Zipf's law, which is known to arise from the tree structure of the environment, and examine the correlation between scaling-law and Zipf's-law exponents. In agreement with the quanta scaling model, we find that agents optimize state loss in descending order of frequency, even though this order scales inversely with modelling complexity. We also find that inverse scaling, the failure of models to improve with size, is correlated with unusual Zipf curves where end-game states are among the most frequent states. We show evidence that larger models shift their focus to these less-important states, sacrificing their understanding of important early-game states.

View full details

Poster

InfMasking: Unleashing Synergistic Information by Contrastive Multimodal Interactions

Liangjian Wen ⋅ Qun Dai ⋅ Jianzhuang Liu ⋅ Jiangtao Zheng ⋅ Yong Dai ⋅ Dongkai Wang ⋅ Zhao Kang ⋅ Jun Wang ⋅ Zenglin Xu ⋅ Jiang Duan

Dec 3, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

In multimodal representation learning, synergistic interactions between modalities not only provide complementary information but also create unique outcomes through specific interaction patterns that no single modality could achieve alone. Existing methods may struggle to effectively capture the full spectrum of synergistic information, leading to suboptimal performance in tasks where such interactions are critical. This is particularly problematic because synergistic information constitutes the fundamental value proposition of multimodal representation. To address this challenge, we introduce InfMasking, a contrastive synergistic information extraction method designed to enhance synergistic information through an Infinite Masking strategy. InfMasking stochastically occludes most features from each modality during fusion, preserving only partial information to create representations with varied synergistic patterns. Unmasked fused representations are then aligned with masked ones through mutual information maximization to encode comprehensive synergistic information. This infinite masking strategy enables capturing richer interactions by exposing the model to diverse partial modality combinations during training. As computing mutual information estimates with infinite masking is computationally prohibitive, we derive an InfMasking loss to approximate this calculation. Through controlled experiments, we demonstrate that InfMasking effectively enhances synergistic information between modalities. In evaluations on large-scale real-world datasets, InfMasking achieves state-of-the-art performance across seven benchmarks. Code is released at https://github.com/brightest66/InfMasking.

View full details

Poster

Learning the Wrong Lessons: Syntactic-Domain Spurious Correlations in Language Models

Chantal Shaib ⋅ Vinith Suriyakumar ⋅ Byron Wallace ⋅ Marzyeh Ghassemi

Dec 3, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

For an LLM to correctly respond to an instruction it must understand both the semantics and the domain (i.e., subject area) of a given task-instruction pair. However, syntax can also convey implicit information. Recent work shows that \textit{syntactic templates}---frequent sequences of Part-of-Speech (PoS) tags---are prevalent in training data and often appear in model outputs. In this work we characterize syntactic templates, domain, and semantics in task-instruction pairs. We identify cases of spurious correlations between syntax and domain, where models learn to associate a domain with syntax during training; this can sometimes override prompt semantics. Using a synthetic training dataset, we find that the syntactic-domain correlation can lower performance (mean 0.51 +/- 0.06) on entity knowledge tasks in OLMo-2 models (1B-13B). We introduce an evaluation framework to detect this phenomenon in trained models, and show that it occurs on a subset of the FlanV2 dataset in open (OLMo-2-7B; Llama-4-Maverick), and closed (GPT-4o) models. Finally, we present a case study on the implications for LLM security, showing that unintended syntactic-domain correlations can be used to bypass refusals in OLMo-2-7B Instruct and GPT-4o. Our findings highlight two needs: (1) to explicitly test for syntactic-domain correlations, and (2) to ensure \textit{syntactic} diversity in training data, specifically within domains, to prevent such spurious correlations.

View full details

Poster

TimeWak: Temporal Chained-Hashing Watermark for Time Series Data

Zhi Wen Soi ⋅ Chaoyi Zhu ⋅ Fouad Abiad ⋅ Aditya Shankar ⋅ Jeroen Galjaard ⋅ Huijuan Wang ⋅ Lydia Chen

Dec 3, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Synthetic time series generated by diffusion models enable sharing privacy-sensitive datasets, such as patients' functional MRI records. Key criteria for synthetic data include high data utility and traceability to verify the data source. Recent watermarking methods embed in homogeneous latent spaces, but state-of-the-art time series generators operate in data space, making latent-based watermarking incompatible. This creates the challenge of watermarking directly in data space while handling feature heterogeneity and temporal dependencies. We propose TimeWak, the first watermarking algorithm for multivariate time series diffusion models. To handle temporal dependence and spatial heterogeneity, TimeWak embeds a temporal chained-hashing watermark directly within the temporal-feature data space. The other unique feature is the $\epsilon$-exact inversion, which addresses the non-uniform reconstruction error distribution across features from inverting the diffusion process to detect watermarks. We derive the error bound of inverting multivariate time series while preserving robust watermark detectability. We extensively evaluate TimeWak on its impact on synthetic data quality, watermark detectability, and robustness under various post-editing attacks, against five datasets and baselines of different temporal lengths. Our results show that TimeWak achieves improvements of 61.96% in context-FID score, and 8.44% in correlational scores against the strongest state-of-the-art baseline, while remaining consistently detectable.

View full details

Poster

Precise Asymptotics and Refined Regret of Variance-Aware UCB

Yingying Fan ⋅ Yuxuan Han ⋅ Jinchi Lv ⋅ Xiaocong Xu ⋅ Zhengyuan Zhou

Dec 3, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

In this paper, we study the behavior of the Upper Confidence Bound-Variance (UCB-V) algorithm for the Multi-Armed Bandit (MAB) problems, a variant of the canonical Upper Confidence Bound (UCB) algorithm that incorporates variance estimates into its decision-making process. More precisely, we provide an asymptotic characterization of the arm-pulling rates for UCB-V, extending recent results for the canonical UCB in Kalvit and Zeevi (2021) and Khamaru and Zhang (2024). In an interesting contrast to the canonical UCB, our analysis reveals that the behavior of UCB-V can exhibit instability, meaning that the arm-pulling rates may not always be asymptotically deterministic. Besides the asymptotic characterization, we also provide non-asymptotic bounds for the arm-pulling rates in the high probability regime, offering insights into the regret analysis. As an application of this high probability result, we establish that UCB-V can achieve a more refined regret bound, previously unknown even for more complicate and advanced variance-aware online decision-making algorithms. A matching regret lower bound is also established, demonstrating the optimality of our result.

View full details

Poster

MonoLift: Learning 3D Manipulation Policies from Monocular RGB via Distillation

Ziru Wang ⋅ Mengmeng Wang ⋅ Guang Dai ⋅ Yongliu Long ⋅ Jingdong Wang

Dec 3, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Although learning 3D manipulation policies from monocular RGB images is lightweight and deployment-friendly, the lack of structural information often leads to inaccurate action estimation. While explicit 3D inputs can mitigate this issue, they typically require additional sensors and introduce data acquisition overhead. An intuitive alternative is to incorporate a pre-trained depth estimator; however, this often incurs substantial inference-time cost. To address this, we propose MonoLift, a tri-level knowledge distillation framework that transfers spatial, temporal, and action-level knowledge from a depth-guided teacher to a monocular RGB student. By jointly distilling geometry-aware features, temporal dynamics, and policy behaviors during training, MonoLift enables the student model to perform 3D-aware reasoning and precise control at deployment using only monocular RGB input. Extensive experiments on both simulated and real-world manipulation tasks show that MonoLift not only outperforms existing monocular approaches but even surpasses several methods that rely on explicit 3D input, offering a resource-efficient and effective solution for vision-based robotic control. The video demonstration is available on our project page: https://robotasy.github.io/MonoLift/.

View full details

Poster

Quantization-Free Autoregressive Action Transformer

Ziyad Sheebaelhamd ⋅ Michael Tschannen ⋅ Michael Muehlebach ⋅ Claire Vernade

Dec 3, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Current transformer-based imitation learning approaches introduce discrete action representations and train an autoregressive transformer decoder on the resulting latent code. However, the initial quantization breaks the continuous structure of the action space thereby limiting the capabilities of the generative model. We propose a quantization-free method instead that leverages Generative Infinite-Vocabulary Transformers (GIVT) as a direct, continuous policy parametrization for autoregressive transformers. This simplifies the imitation learning pipeline while achieving state-of-the-art performance on a variety of popular simulated robotics tasks. We enhance our policy roll-outs by carefully studying sampling algorithms, further improving the results.

View full details

Poster

GraLoRA: Granular Low-Rank Adaptation for Parameter-Efficient Fine-Tuning

Yeonjoon Jung ⋅ Daehyun Ahn ⋅ Hyungjun Kim ⋅ Taesu Kim ⋅ Eunhyeok Park

Dec 3, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Low-Rank Adaptation (LoRA) is a popular method for parameter-efficient fine-tuning (PEFT) of generative models, valued for its simplicity and effectiveness. Despite recent enhancements, LoRA still suffers from a fundamental limitation: overfitting when the bottleneck is widened. It performs best at ranks 32–64, yet its accuracy stagnates or declines at higher ranks, still falling short of full fine-tuning (FFT) performance. We identify the root cause as LoRA’s structural bottleneck, which introduces gradient entanglement to the unrelated input channels and distorts gradient propagation. To address this, we introduce a novel structure, Granular Low-Rank Adaptation (GraLoRA) that partitions weight matrices into sub-blocks, each with its own low-rank adapter. With negligible computational or storage cost, GraLoRA overcomes LoRA’s limitations, effectively increases the representational capacity, and more closely approximates FFT behavior. Experiments on code generation, commonsense reasoning, mathematical reasoning, general language understanding, and image generation benchmarks show that GraLoRA consistently outperforms LoRA and other baselines, achieving up to +8.5\% absolute gain in Pass@1 on HumanEval+. These improvements hold across model sizes and rank settings, making GraLoRA a scalable and robust solution for PEFT.

View full details

Poster

SageAttention3: Microscaling FP4 Attention for Inference and An Exploration of 8-Bit Training

Jintao Zhang ⋅ Jia wei ⋅ Haoxu Wang ⋅ Pengle Zhang ⋅ Xiaoming Xu ⋅ Haofeng Huang ⋅ Kai Jiang ⋅ Jianfei Chen ⋅ Jun Zhu

Dec 3, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

The efficiency of attention is important due to its quadratic time complexity. We enhance the efficiency of attention through two key contributions: First, we leverage the new $\texttt{FP4}$ Tensor Cores in Blackwell GPUs to accelerate attention computation. Our implementation achieves $\textbf{1038}$ $\texttt{TOPS}$ on $\texttt{RTX5090}$, which is a $\textbf{5}\times$ speedup over the fastest FlashAttention on $\texttt{RTX5090}$. Experiments show that our $\texttt{FP4}$ attention can accelerate inference of various models in a plug-and-play way. Second, we pioneer low-bit attention to training tasks. Existing low-bit attention works like FlashAttention3 and SageAttention focus only on inference. However, the efficiency of training large models is also important. To explore whether low-bit attention can be effectively applied to training tasks, we design an accurate and efficient $\texttt{8-bit}$ attention for both forward and backward propagation. Experiments indicate that $\texttt{8-bit}$ attention achieves lossless performance in fine-tuning tasks but exhibits slower convergence in pretraining tasks. The code is available at https://github.com/thu-ml/SageAttention.

View full details

Poster

Dimension-adapted Momentum Outscales SGD

Damien Ferbach ⋅ Katie Everett ⋅ Gauthier Gidel ⋅ Elliot Paquette ⋅ Courtney Paquette

Dec 3, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

We investigate scaling laws for stochastic momentum algorithms on the power law random features model, parameterized by data complexity, target complexity, and model size. When trained with a stochastic momentum algorithm, our analysis reveals four distinct loss curve shapes determined by varying data-target complexities. While traditional stochastic gradient descent with momentum (SGD-M) yields identical scaling law exponents to SGD, dimension-adapted Nesterov acceleration (DANA) improves these exponents by scaling momentum hyperparameters based on model size and data complexity. This outscaling phenomenon, which also improves compute-optimal scaling behavior, is achieved by DANA across a broad range of data and target complexities, while traditional methods fall short. Extensive experiments on high-dimensional synthetic quadratics validate our theoretical predictions and large-scale text experiments with LSTMs show DANA's improved loss exponents over SGD hold in a practical setting.

View full details

Poster

Towards Building Model/Prompt-Transferable Attackers against Large Vision-Language Models

Xiaowen Cai ⋅ Daizong Liu ⋅ Xiaoye Qu ⋅ Xiang Fang ⋅ Jianfeng Dong ⋅ Keke Tang ⋅ Pan Zhou ⋅ Lichao Sun ⋅ Wei Hu

Dec 3, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Although Large Vision-Language Models (LVLMs) exhibit impressive multimodal capabilities, their vulnerability to adversarial examples has raised serious security concerns. Existing LVLM attackers simply optimize adversarial images that easily overfit a certain model/prompt, making them ineffective once they are transferred to attack a different model/prompt. Motivated by this research gap, this paper aims to develop a more powerful attack that is transferable to black-box LVLM models of different structures and task-aware prompts of different semantics. Specifically, we introduce a new perspective of information theory to investigate LVLMs' transferable characteristics by exploring the relative dependence between outputs of the LVLM model and input adversarial samples. Our empirical observations suggest that enlarging/decreasing the mutual information between outputs and the disentangled adversarial/benign patterns of input images helps to generate more agnostic perturbations for misleading LVLMs' perception with better transferability. In particular, we formulate the complicated calculation of information gain as an estimation problem and incorporate such informative constraints into the adversarial learning process. Extensive experiments on various LVLM models/prompts demonstrate our significant transfer-attack performance.

View full details

Poster

SViMo: Synchronized Diffusion for Video and Motion Generation in Hand-object Interaction Scenarios

Lingwei Dang ⋅ Ruizhi Shao ⋅ Hongwen Zhang ⋅ Wei MIN ⋅ Yebin Liu ⋅ Qingyao Wu

Dec 3, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Hand-Object Interaction (HOI) generation has significant application potential. However, current 3D HOI motion generation approaches heavily rely on predefined 3D object models and lab-captured motion data, limiting generalization capabilities. Meanwhile, HOI video generation methods prioritize pixel-level visual fidelity, often sacrificing physical plausibility. Recognizing that visual appearance and motion patterns share fundamental physical laws in the real world, we propose a novel framework that combines visual priors and dynamic constraints within a synchronized diffusion process to generate the HOI video and motion simultaneously. To integrate the heterogeneous semantics, appearance, and motion features, our method implements tri-modal adaptive modulation for feature aligning, coupled with 3D full-attention for modeling inter- and intra-modal dependencies. Furthermore, we introduce a vision-aware 3D interaction diffusion model that generates explicit 3D interaction sequences directly from the synchronized diffusion outputs, then feeds them back to establish a closed-loop feedback cycle. This architecture eliminates dependencies on predefined object models or explicit pose guidance while significantly enhancing video-motion consistency. Experimental results demonstrate our method's superiority over state-of-the-art approaches in generating high-fidelity, dynamically plausible HOI sequences, with notable generalization capabilities in unseen real-world scenarios. Project page at [https://droliven.github.io/SViMo_project](https://droliven.github.io/SViMo_project).

View full details

Poster

Tight Generalization Bounds for Large-Margin Halfspaces

Kasper Green Larsen ⋅ Natascha Schalburg

Dec 3, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

We prove the first generalization bound for large-margin halfspaces that is asymptotically tight in the tradeoff between the margin, the fraction of training points with the given margin, the failure probability and the number of training points.

View full details

Poster

ReSim: Reliable World Simulation for Autonomous Driving

Jiazhi Yang ⋅ Kashyap Chitta ⋅ Shenyuan Gao ⋅ Long Chen ⋅ Yuqian Shao ⋅ Xiaosong Jia ⋅ Hongyang Li ⋅ Andreas Geiger ⋅ Xiangyu Yue ⋅ Li Chen

Dec 3, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

How can we reliably simulate future driving scenarios under a wide range of ego driving behaviors? Recent driving world models, developed exclusively on real-world driving data composed mainly of safe expert trajectories, struggle to follow hazardous or non-expert behaviors, which are rare in such data. This limitation restricts their applicability to tasks such as policy evaluation. In this work, we address this challenge by enriching real-world human demonstrations with diverse non-expert data collected from a driving simulator (e.g., CARLA), and building a controllable world model trained on this heterogeneous corpus. Starting with a video generator featuring diffusion transformer architecture, we devise several strategies to effectively integrate conditioning signals and improve prediction controllability and fidelity. The resulting model, ReSim, enables Reliable Simulation of diverse open-world driving scenarios under various actions, including hazardous non-expert ones. To close the gap between high-fidelity simulation and applications that require reward signals to judge different actions, we introduce a Video2Reward module that estimates reward from ReSim’s simulated future. Our ReSim paradigm achieves up to 44% higher visual fidelity, improves controllability for both expert and non-expert actions by over 50%, and boosts planning and policy selection performance on NAVSIM by 2% and 25%, respectively.

View full details

Poster

From Counterfactuals to Trees: Competitive Analysis of Model Extraction Attacks

Awa Khouna ⋅ Julien Ferry ⋅ Thibaut Vidal

Dec 3, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

The advent of Machine Learning as a Service (MLaaS) has heightened the trade-off between model explainability and security. In particular, explainability techniques, such as counterfactual explanations, inadvertently increase the risk of model extraction attacks, enabling unauthorized replication of proprietary models. In this paper, we formalize and characterize the risks and inherent complexity of model reconstruction, focusing on the "oracle'' queries required for faithfully inferring the underlying prediction function. We present the first formal analysis of model extraction attacks through the lens of competitive analysis, establishing a foundational framework to evaluate their efficiency. Focusing on models based on additive decision trees (e.g., decision trees, gradient boosting, and random forests), we introduce novel reconstruction algorithms that achieve provably perfect fidelity while demonstrating strong anytime performance. Our framework provides theoretical bounds on the query complexity for extracting tree-based model, offering new insights into the security vulnerabilities of their deployment.

View full details

Poster

Strategic Hypothesis Testing

Yatong Chen ⋅ Safwan Hossain ⋅ Yiling Chen

Dec 3, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

We examine hypothesis testing within a principal-agent framework, where a strategic agent, holding private beliefs about the effectiveness of a product, submits data to a principal who decides on approval. The principal employs a hypothesis testing rule, aiming to pick a p-value threshold that balances false positives and false negatives while anticipating the agent’s incentive to maximize expected profitability. Building on prior work, we develop a game-theoretic model that captures how the agent’s participation and reporting behavior respond to the principal’s statistical decision rule. Despite the complexity of the interaction, we show that the principal's errors exhibit clear monotonic behavior when segmented by an efficiently computable critical p-value threshold, leading to an interpretable characterization of their optimal p-value threshold. We empirically validate our model and these insights using publicly available data on drug approvals. Overall, our work offers a comprehensive perspective on strategic interactions within the hypothesis testing framework, providing technical and regulatory insights.

View full details

Poster

OpenBox: Annotate Any Bounding Boxes in 3D

In-Jae Lee ⋅ Mungyeom Kim ⋅ Kwonyoung Ryu ⋅ Pierre Musacchio ⋅ Jaesik Park

Dec 3, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Unsupervised and open-vocabulary 3D object detection has recently gained attention, particularly in autonomous driving, where reducing annotation costs and recognizing unseen objects are critical for both safety and scalability. However, most existing approaches uniformly annotate 3D bounding boxes, ignore objects’ physical states, and require multiple self-training iterations for annotation refinement, resulting in suboptimal quality and substantial computational overhead. To address these challenges, we propose OpenBox, a two-stage automatic annotation pipeline that leverages a 2D vision foundation model. In the first stage, OpenBox associates instance-level cues from 2D images processed by a vision foundation model with the corresponding 3D point clouds via context-aware refinement. In the second stage, it categorizes instances by rigidity and motion state, then generates adaptive bounding boxes with class-specific size statistics. As a result, OpenBox produces high-quality 3D bounding box annotations without requiring self-training. Experiments on the Waymo Open Dataset (WOD), the Lyft Level 5 Perception dataset, and the nuScenes dataset demonstrate improved accuracy and efficiency over baselines.

View full details

Poster

OS-Harm: A Benchmark for Measuring Safety of Computer Use Agents

Thomas Kuntz ⋅ Agatha Duzan ⋅ Hao Zhao ⋅ Francesco Croce ⋅ Zico Kolter ⋅ Nicolas Flammarion ⋅ Maksym Andriushchenko

Dec 3, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Computer use agents are LLM-based agents that can directly interact with a graphical user interface, by processing screenshots or accessibility trees. While these systems are gaining popularity, their safety has been largely overlooked, despite the fact that evaluating and understanding their potential for harmful behavior is essential for widespread adoption. To address this gap, we introduce OS-Harm, a new benchmark for measuring safety of computer use agents. OS-Harm is built on top of the OSWorld environment (Xie et al., 2024) and aims to test models across three categories of harm: deliberate user misuse, prompt injection attacks, and model misbehavior. To cover these cases, we create 150 tasks that span several types of safety violations (harassment, copyright infringement, disinformation, data exfiltration, etc.) and require the agent to interact with a variety of OS applications (email client, code editor, browser, etc.). Moreover, we propose an automated judge to evaluate both accuracy and safety of agents that achieves high agreement with human annotations (0.76 and 0.79 F1 score). We evaluate computer use agents based on a range of frontier models—such as `o4-mini`, `Claude 3.7 Sonnet`, `Gemini 2.5 Pro`—and provide insights into their safety. In particular, all models tend to directly comply with many deliberate misuse queries, are relatively vulnerable to static prompt injections, and occasionally perform unsafe actions. The OS-Harm benchmark is available at https://github.com/tml-epfl/os-harm.

View full details

Poster

Deep Continuous-Time State-Space Models for Marked Event Sequences

Yuxin Chang ⋅ Alex Boyd ⋅ Cao (Danica) Xiao ⋅ Taha Kass-Hout ⋅ Parminder Bhatia ⋅ Padhraic Smyth ⋅ andrew warrington

Dec 3, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Marked temporal point processes (MTPPs) model sequences of events occurring at irregular time intervals, with wide-ranging applications in fields such as healthcare, finance and social networks. We propose the _state-space point process_ (S2P2) model, a novel and performant model that leverages techniques derived for modern deep state-space models (SSMs) to overcome limitations of existing MTPP models, while simultaneously imbuing strong inductive biases for continuous-time event sequences that other discrete sequence models (i.e., RNNs, transformers) do not capture. Inspired by the classical linear Hawkes processes, we propose an architecture that interleaves stochastic jump differential equations with nonlinearities to create a highly expressive intensity-based MTPP model, without the need for restrictive parametric assumptions for the intensity. Our approach enables efficient training and inference with a parallel scan, bringing linear complexity and sublinear scaling while retaining expressivity to MTPPs. Empirically, S2P2 achieves state-of-the-art predictive likelihoods across eight real-world datasets, delivering an average improvement of 33% over the best existing approaches.

View full details

Poster

BioCLIP 2: Emergent Properties from Scaling Hierarchical Contrastive Learning

Jianyang Gu ⋅ Sam Stevens ⋅ Elizabeth Campolongo ⋅ Matthew Thompson ⋅ Net Zhang ⋅ Jiaman Wu ⋅ Andrei Kopanev ⋅ Zheda Mai ⋅ Alexander White ⋅ James Balhoff ⋅ Wasila Dahdul ⋅ Daniel Rubenstein ⋅ Hilmar Lapp ⋅ Tanya Berger-Wolf ⋅ Wei-Lun (Harry) Chao ⋅ Yu Su

Dec 3, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Foundation models trained at scale exhibit remarkable emergent behaviors, learning new capabilities beyond their initial training objectives. We find such emergent behaviors in biological vision models via large-scale contrastive vision-language training. To achieve this, we first curate TreeOfLife-200M, comprising 214 million images of living organisms, the largest and most diverse biological organism image dataset to date. We then train BioCLIP 2 on TreeOfLife-200M to distinguish different species. Despite the narrow training objective, BioCLIP 2 yields extraordinary accuracy when applied to various biological visual tasks such as habitat classification and trait prediction. We identify emergent properties in the learned embedding space of BioCLIP 2. At the inter-species level, the embedding distribution of different species aligns closely with functional and ecological meanings (e.g., beak sizes and habitats). At the intra-species level, instead of being diminished, the intra-species variations (e.g., life stages and sexes) are preserved and better separated in subspaces orthogonal to inter-species distinctions. We provide formal proof and analyses to explain why hierarchical supervision and contrastive objectives encourage these emergent properties. Crucially, our results reveal that these properties become increasingly significant with larger-scale training data, leading to a biologically meaningful embedding space.

View full details

Poster

Adaptive Neighborhood-Constrained Q Learning for Offline Reinforcement Learning

Yixiu Mao ⋅ Yun Qu ⋅ Qi Wang ⋅ Xiangyang Ji

Dec 3, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Offline reinforcement learning (RL) suffers from extrapolation errors induced by out-of-distribution (OOD) actions. To address this, offline RL algorithms typically impose constraints on action selection, which can be systematically categorized into density, support, and sample constraints. However, we show that each category has inherent limitations: density and sample constraints tend to be overly conservative in many scenarios, while the support constraint, though least restrictive, faces challenges in accurately modeling the behavior policy. To overcome these limitations, we propose a new neighborhood constraint that restricts action selection in the Bellman target to the union of neighborhoods of dataset actions. Theoretically, the constraint not only bounds extrapolation errors and distribution shift under certain conditions, but also approximates the support constraint without requiring behavior policy modeling. Moreover, it retains substantial flexibility and enables pointwise conservatism by adapting the neighborhood radius for each data point. In practice, we employ data quality as the adaptation criterion and design an adaptive neighborhood constraint. Building on an efficient bilevel optimization framework, we develop a simple yet effective algorithm, Adaptive Neighborhood-constrained Q learning (ANQ), to perform Q learning with target actions satisfying this constraint. Empirically, ANQ achieves state-of-the-art performance on standard offline RL benchmarks and exhibits strong robustness in scenarios with noisy or limited data.

View full details

Poster

Word-Level Emotional Expression Control in Zero-Shot Text-to-Speech Synthesis

Tianrui Wang ⋅ Haoyu Wang ⋅ Meng Ge ⋅ Cheng Gong ⋅ Chunyu Qiang ⋅ Ziyang Ma ⋅ Zikang Huang ⋅ Guanrou Yang ⋅ Xiaobao Wang ⋅ Eng-Siong Chng ⋅ Xie Chen ⋅ Longbiao Wang ⋅ Jianwu Dang

Dec 3, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

While emotional text-to-speech (TTS) has made significant progress, most existing research remains limited to utterance-level emotional expression and fails to support word-level control. Achieving word-level expressive control poses fundamental challenges, primarily due to the complexity of modeling multi-emotion transitions and the scarcity of annotated datasets that capture intra-sentence emotional and prosodic variation. In this paper, we propose WeSCon, the first self-training framework that enables word-level control of both emotion and speaking rate in a pretrained zero-shot TTS model, without relying on datasets containing intra-sentence emotion or speed transitions. Our method introduces a transition-smoothing strategy and a dynamic speed control mechanism to guide the pretrained TTS model in performing word-level expressive synthesis through a multi-round inference process. To further simplify the inference, we incorporate a dynamic emotional attention bias mechanism and fine-tune the model via self-training, thereby activating its ability for word-level expressive control in an end-to-end manner. Experimental results show that WeSCon effectively overcomes data scarcity, achieving state-of-the-art performance in word-level emotional expression control while preserving the strong zero-shot synthesis capabilities of the original TTS model.

View full details

Poster

Eluder dimension: localise it!

Alireza Bakhtiari ⋅ Alex Ayoub ⋅ Samuel Robertson ⋅ David Janz ⋅ Csaba Szepesvari

Dec 3, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

We establish a lower bound on the eluder dimension in generalised linear model classes, showing that standard eluder dimension-based analysis cannot lead to first-order regret bounds. To address this, we introduce a localisation method for the eluder dimension; our analysis immediately recovers and improves on classic results for Bernoulli bandits, and allows for the first genuine first-order bounds for finite-horizon reinforcement learning tasks with bounded cumulative returns.

View full details

Poster

Optimization Inspired Few-Shot Adaptation for Large Language Models

Boyan Gao ⋅ Xin Wang ⋅ Yibo Yang ⋅ David Clifton

Dec 3, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Large Language Models (LLMs) have demonstrated remarkable performance in real-world applications. However, adapting LLMs to novel tasks via fine-tuning often requires substantial training data and computational resources that are impractical in few-shot scenarios. Existing approaches, such as In-context learning and Parameter-Efficient Fine-Tuning (PEFT), face key limitations: In-context learning introduces additional inference computational overhead with limited performance gains, while PEFT models are prone to overfitting on the few demonstration examples. In this work, we reinterpret the forward pass of LLMs as an optimization process, a sequence of preconditioned gradient descent steps refining internal representations. Based on this connection, we propose Optimization-Inspired Few-Shot Adaptation (OFA), integrating a parameterization that learns preconditioners without introducing additional trainable parameters, and an objective that improves optimization efficiency by learning preconditioners based on a convergence bound, while simultaneously steering the optimization path toward the flat local minimum. Our method overcomes both issues of ICL-based and PEFT-based methods, and demonstrates superior performance over the existing methods on a variety of few-shot adaptation tasks in experiments.

View full details

Poster

3D Interaction Geometric Pre-training for Molecular Relational Learning

Namkyeong Lee ⋅ Yunhak Oh ⋅ Heewoong Noh ⋅ Gyoung S. Na ⋅ Minkai Xu ⋅ Hanchen Wang ⋅ Tianfan Fu ⋅ Chanyoung Park

Dec 3, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Molecular Relational Learning (MRL) is a rapidly growing field that focuses on understanding the interaction dynamics between molecules, which is crucial for applications ranging from catalyst engineering to drug discovery. Despite recent progress, earlier MRL approaches are limited to using only the 2D topological structure of molecules, as obtaining the 3D interaction geometry remains prohibitively expensive. This paper introduces a novel 3D geometric pre-training strategy for MRL (3DMRL) that incorporates a 3D virtual interaction environment, overcoming the limitations of costly traditional quantum mechanical calculation methods. With the constructed 3D virtual interaction environment, 3DMRL trains 2D MRL model to learn the global and local 3D geometric information of molecular interaction. Extensive experiments on various tasks using real-world datasets, including out-of-distribution and extrapolation scenarios, demonstrate the effectiveness of 3DMRL, showing up to a 24.93% improvement in performance across 40 tasks. Our code is publicly available at https://github.com/Namkyeong/3DMRL.

View full details

Poster

UniteFormer: Unifying Node and Edge Modalities in Transformers for Vehicle Routing Problems

Dian Meng ⋅ Zhiguang Cao ⋅ Jie Gao ⋅ Yaoxin Wu ⋅ Yaqing Hou

Dec 3, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Neural solvers for the Vehicle Routing Problem (VRP) have typically relied on either node or edge inputs, limiting their flexibility and generalization in real-world scenarios. We propose UniteFormer, a unified neural solver that supports node-only, edge-only, and hybrid input types through a single model trained via joint edge-node modalities. UniteFormer introduces: (1) a mixed encoder that integrates graph convolutional networks and attention mechanisms to collaboratively process node and edge features, capturing cross-modal interactions between them; and (2) a parallel decoder enhanced with query mapping and a feed-forward layer for improved representation. The model is trained with REINFORCE by randomly sampling input types across batches. Experiments on the Traveling Salesman Problem (TSP) and Capacitated Vehicle Routing Problem (CVRP) demonstrate that UniteFormer achieves state-of-the-art performance and generalizes effectively to TSPLib and CVRPLib instances. These results underscore UniteFormer’s ability to handle diverse input modalities and its strong potential to improve performance across various VRP tasks.

View full details

Poster

Robo2VLM: Improving Visual Question Answering using Large-Scale Robot Manipulation Data

Kaiyuan Eric Chen ⋅ Shuangyu Xie ⋅ Zehan Ma ⋅ Pannag Sanketi ⋅ Ken Goldberg

Dec 3, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Vision-Language Models (VLMs) acquire real-world knowledge and general reasoning ability through Internet-scale image-text corpora. They can augment robotic systems with scene understanding and task planning, and assist visuomotor policies that are trained on robot trajectory data. We explore the reverse paradigm — using rich, real, multi-modal robot trajectory data to enhance and evaluate VLMs. In this paper, we present Robo2VLM, a Visual Question Answering (VQA) dataset generation framework for VLMs. Given a human tele-operated robot demonstration with video and robot data, Robo2VLM derives ground-truth from non-visual and non-descriptive sensory modalities, such as end-effector pose, gripper aperture, and force sensing. Based on these modalities, it segments the robot trajectory into a sequence of manipulation phases. At each phase, Robo2VLM uses scene and interaction understanding to identify 3D properties of the robot, task goal, and the target object. The properties are used to generate representative VQA queries – images with textural multiple-choice questions – based on spatial, goal-conditioned, and interaction reasoning question templates. We use a subset of Open X-Embodiment to generate Robo2VLM-1, a large-scale in-the-wild dataset with 684,710 questions based on 463 distinct scenes and 3,396 robotic manipulation tasks from 176k real robot trajectories. Results suggest that Robo2VLM-1 can benchmark and improve VLM capabilities in spatial and interaction reasoning.

View full details

Poster

Establishing Linear Surrogate Regret Bounds for Convex Smooth Losses via Convolutional Fenchel–Young Losses

Yuzhou Cao ⋅ Han Bao ⋅ Lei Feng ⋅ Bo An

Dec 3, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Surrogate regret bounds, also known as excess risk bounds, bridge the gap between the convergence rates of surrogate and target losses. The regret transfer is lossless if the surrogate regret bound is linear. While convex smooth surrogate losses are appealing in particular due to the efficient estimation and optimization, the existence of a trade-off between the loss smoothness and linear regret bound has been believed in the community. Under this scenario, the better optimization and estimation properties of convex smooth surrogate losses may inevitably deteriorate after undergoing the regret transfer onto a target loss. We overcome this dilemma for arbitrary discrete target losses by constructing a convex smooth surrogate loss, which entails a linear surrogate regret bound composed with a tailored prediction link. The construction is based on Fenchel--Young losses generated by the *convolutional negentropy*, which are equivalent to the infimal convolution of a generalized negentropy and the target Bayes risk. Consequently, the infimal convolution enables us to derive a smooth loss while maintaining the surrogate regret bound linear. We additionally benefit from the infimal convolution to have a consistent estimator of the underlying class probability. Our results are overall a novel demonstration of how convex analysis penetrates into optimization and statistical efficiency in risk minimization.

View full details

Poster

Wavelet Canonical Coherence for Nonstationary Signals

Haibo Wu ⋅ Marina Knight ⋅ Keiland Cooper ⋅ Norbert Fortin ⋅ Hernando Ombao

Dec 3, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Understanding the evolving dependence between two sets of multivariate signals is fundamental in neuroscience and other domains where sub-networks in a system interact dynamically over time. Despite the growing interest in multivariate time series analysis, existing methods for between-clusters dependence typically rely on the assumption of stationarity and lack the temporal resolution to capture transient, frequency-specific interactions. To overcome this limitation, we propose scale-specific wavelet canonical coherence (WaveCanCoh), a novel framework that extends canonical coherence analysis to the nonstationary setting by leveraging the multivariate locally stationary wavelet model. The proposed WaveCanCoh enables the estimation of time-varying canonical coherence between clusters, providing interpretable insight into scale-specific time-varying interactions between clusters. Through extensive simulation studies, we demonstrate that WaveCanCoh accurately recovers true coherence structures under both locally stationary and general nonstationary conditions. Application to local field potential (LFP) activity data recorded from the hippocampus reveals distinct dynamic coherence patterns between correct and incorrect memory-guided decisions, illustrating capacity of the method to detect behaviorally relevant neural coordination. These results highlight WaveCanCoh as a flexible and principled tool for modeling complex cross-group dependencies in nonstationary multivariate systems. Code for implementing WaveCanCoh is available at https://github.com/mhaibo/WaveCanCoh.git.

View full details

Poster

OnlineSplatter: Pose-Free Online 3D Reconstruction for Free-Moving Objects

Mark H. Huang ⋅ Lin Geng Foo ⋅ Christian Theobalt ⋅ Ying Sun ⋅ De Wen Soh

Dec 3, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Free-moving object reconstruction from monocular video remains challenging, particularly without reliable pose or depth cues and under arbitrary object motion. We introduce OnlineSplatter, a novel online feed-forward framework generating high-quality, object-centric 3D Gaussians directly from RGB frames without requiring camera pose, depth priors, or bundle optimization. Our approach anchors reconstruction using the first frame and progressively refines the object representation through a dense Gaussian primitive field, maintaining constant computational cost regardless of video sequence length. Our core contribution is a dual-key memory module combining latent appearance-geometry keys with explicit directional keys, robustly fusing current frame features with temporally aggregated object states. This design enables effective handling of free-moving objects via spatial-guided memory readout and an efficient sparsification mechanism, ensuring comprehensive yet compact object coverage. Evaluations on real-world datasets demonstrate that OnlineSplatter significantly outperforms state-of-the-art pose-free reconstruction baselines, consistently improving with more observations while maintaining constant memory and runtime.

View full details

Poster

MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models

Chaoyou Fu ⋅ Peixian Chen ⋅ Yunhang Shen ⋅ Yulei Qin ⋅ Mengdan Zhang ⋅ Xu Lin ⋅ Jinrui Yang ⋅ Xiawu Zheng ⋅ Ke Li ⋅ Xing Sun ⋅ Yunsheng Wu ⋅ Rongrong Ji ⋅ Caifeng Shan ⋅ Ran He

Dec 3, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Multimodal Large Language Model (MLLM) relies on the powerful LLM to perform multimodal tasks, showing amazing emergent abilities in recent studies, such as writing poems based on an image. However, it is difficult for these case studies to fully reflect the performance of MLLM, lacking a comprehensive evaluation. In this paper, we fill in this blank, presenting the first comprehensive MLLM Evaluation benchmark MME. It measures both perception and cognition abilities on a total of 14 subtasks. In order to avoid data leakage that may arise from direct use of public datasets for evaluation, the annotations of instruction-answer pairs are all manually designed. The concise instruction design allows us to fairly compare MLLMs, instead of struggling in prompt engineering. Besides, with such an instruction, we can also easily carry out quantitative statistics. A total of 30 advanced MLLMs are comprehensively evaluated on our MME, which not only suggests that existing MLLMs still have a large room for improvement, but also reveals the potential directions for the subsequent model optimization. The data are released at the project page: https://github.com/BradyFU/Awesome-Multimodal-Large-Language-Models/tree/Evaluation.

View full details

Poster

UMA: A Family of Universal Models for Atoms

Brandon Wood ⋅ Misko Dzamba ⋅ Xiang Fu ⋅ Meng Gao ⋅ Muhammed Shuaibi ⋅ Luis Barroso-Luque ⋅ Kareem Abdelmaqsoud ⋅ Vahe Gharakhanyan ⋅ John Kitchin ⋅ Daniel Levine ⋅ Kyle Michel ⋅ Anuroop Sriram ⋅ Taco Cohen ⋅ Abhishek Das ⋅ Sushree Sahoo ⋅ Ammar Rizvi ⋅ Zachary Ulissi ⋅ Larry Zitnick

Dec 3, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

The ability to quickly and accurately compute properties from atomic simulations is critical for advancing a large number of applications in chemistry and materials science including drug discovery, energy storage, and semiconductor manufacturing. To address this need, we present a family of Universal Models for Atoms (UMA), designed to push the frontier of speed, accuracy, and generalization. UMA models are trained on half a billion unique 3D atomic structures (the largest training runs to date) by compiling data across multiple chemical domains, e.g. molecules, materials, and catalysts. We develop empirical scaling laws to help understand how to increase model capacity alongside dataset size to achieve the best accuracy. The UMA small and medium models utilize a novel architectural design we refer to as mixture of linear experts that enables increasing model capacity without sacrificing speed. For example, UMA-medium has 1.4B parameters but only $\sim$50M active parameters per atomic structure. We evaluate UMA models on a diverse set of applications across multiple domains and find that, remarkably, a single model without any fine-tuning can perform similarly or better than specialized models. We are releasing the UMA code, weights, and associated data to accelerate computational workflows and enable the community to build increasingly capable AI models.

View full details

Poster

Solving Inequality Proofs with Large Language Models

Jiayi Sheng ⋅ Luna Lyu ⋅ Jikai Jin ⋅ Tanglin Xia ⋅ Alex Gu ⋅ James Zou ⋅ Pan Lu

Dec 3, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Inequality proving, crucial across diverse scientific and mathematical fields, tests advanced reasoning skills such as discovering tight bounds and strategic theorem application. This makes it a distinct, demanding frontier for large language models (LLMs), offering insights beyond general mathematical problem-solving. Progress in this area is hampered by existing datasets that are often scarce, synthetic, or rigidly formal. We address this by proposing an *informal yet verifiable* task formulation, recasting inequality proving into two automatically checkable subtasks: bound estimation and relation prediction. Building on this, we release *IneqMath*, an expert-curated dataset of Olympiad-level inequalities, including a test set and training corpus enriched with step-wise solutions and theorem annotations. We also develop a novel LLM-as-judge evaluation suite, combining a *final-answer* judge with four specialized *step-wise* judges designed to detect common reasoning flaws. A systematic evaluation of 29 leading LLMs on *IneqMath* reveals a surprising reality: even top models like o1 achieve less than 10% overall accuracy under step-wise scrutiny; this is a drop of up to 65.5% from their accuracy considering only final answer equivalence. This discrepancy exposes fragile deductive chains and a critical gap for current LLMs between merely finding an answer and constructing a rigorous proof. Scaling model size and increasing test-time computation yield limited gains in overall proof correctness. Instead, our findings highlight promising research directions such as theorem-guided reasoning and self-refinement.

View full details

Poster

A Principled Path to Fitted Distributional Evaluation

Sungee Hong ⋅ Jiayi Wang ⋅ Zhengling Qi ⋅ Raymond K. W. Wong

Dec 3, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

In reinforcement learning, distributional off-policy evaluation (OPE) focuses on estimating the return distribution of a target policy using offline data collected under a different policy. This work focuses on extending the widely used fitted Q-evaluation---developed for expectation-based reinforcement learning---to the distributional OPE setting. We refer to this extension as fitted distributional evaluation (FDE). While only a few related approaches exist, there remains no unified framework for designing FDE methods. To fill this gap, we present a set of guiding principles for constructing theoretically grounded FDE methods. Building on these principles, we develop several new FDE methods with convergence analysis and provide theoretical justification for existing methods, even in non-tabular environments. Extensive experiments, including simulations on linear quadratic regulators and Atari games, demonstrate the superior performance of the FDE methods.

View full details

Poster

The Best Instruction-Tuning Data are Those That Fit

Dylan Zhang ⋅ Qirun Dai ⋅ Hao Peng

Dec 3, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

High-quality supervised finetuning (SFT) data are essential for unlocking pretrained LLMs’ capabilities. Typically, instructions are paired with responses from various sources—by human annotators or other LMs—which are often out of the distribution of the target model to be finetuned. At scale, this mismatch can lead to diminishing returns and even hurt model performance and robustness. We hypothesize that SFT is most effective when the data is aligned with the model’s pretrained distribution, and propose **GRAPE**—a novel SFT framework that tailors supervision to the target model. For each instruction, it **g**athers **r**esponses from various sources and selects the one that **a**ligns most closely to the model’s **pre**trained distribution, as measured by the normalized probability. Standard SFT is then performed on these selected responses. We first evaluate GRAPE in a controlled experiment, sampling multiple responses per question in the UltraInteract dataset from diverse models. We finetune using GRAPE-selected data on LMs from different families, including LLaMA-1-8B, Mistral-7B, and Qwen2.5-7B. GRAPE significantly outperforms strong baselines—including distilling from the strongest model—with absolute gains up to **13.8%** averaged across benchmarks, and outperforms a 3× larger data baseline with improvements up to **17.3%**. GRAPE's benefits generalize to off-the-shelf SFT data. When used to subsample from the post-training data of Tulu3 and Olmo-2, GRAPE surpasses strong baselines trained on 4.5× more data by **6.1%**, and outperforms state-of-the-art selection methods by **3.9%** on average. Notably, with only **1/3 the data** and **half the training epochs**, GRAPE enables LLaMA-1-8B to **exceed Tulu3-SFT performance by 3.5%**. Our findings highlight that aligning supervision with the pretrained distribution provides a simple yet powerful strategy to improve both the **efficiency** and **effectiveness** of SFT.

View full details

Poster

Talk2Event: Grounded Understanding of Dynamic Scenes from Event Cameras

Lingdong Kong ⋅ Dongyue Lu ⋅ Alan Liang ⋅ Rong Li ⋅ Yuhao Dong ⋅ Tianshuai Hu ⋅ Lai Xing Ng ⋅ Wei Ooi ⋅ Benoit Cottereau

Dec 3, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Event cameras offer microsecond-level latency and robustness to motion blur, making them ideal for understanding dynamic environments. Yet, connecting these asynchronous streams to human language remains an open challenge. We introduce Talk2Event, the first large-scale benchmark for language-driven object grounding in event-based perception. Built from real-world driving data, Talk2Event provides over 30,000 validated referring expressions, each enriched with four grounding attributes -- appearance, status, relation to viewer, and relation to other objects -- bridging spatial, temporal, and relational reasoning. To fully exploit these cues, we propose EventRefer, an attribute-aware grounding framework that dynamically fuses multi-attribute representations through a Mixture of Event-Attribute Experts (MoEE). Our method adapts to different modalities and scene dynamics, achieving consistent gains over state-of-the-art baselines in event-only, frame-only, and event-frame fusion settings. We hope our dataset and approach will establish a foundation for advancing multimodal, temporally-aware, and language-driven perception in real-world robotics and autonomy.

View full details

Poster

Object-centric 3D Motion Field for Robot Learning from Human Videos

Zhao-Heng Yin ⋅ Sherry Yang ⋅ Pieter Abbeel

Dec 3, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Learning robot control policies from human videos is a promising direction for scaling up robot learning. However, how to extract action knowledge (or action representations) from videos for policy learning remains a key challenge. Existing action representations such as video frames, pixelflow, and pointcloud flow have inherent limitations such as modeling complexity or loss of information. In this paper, we propose to use object-centric 3D motion field to represent actions for robot learning from human videos, and present a novel framework for extracting this representation from videos for zero-shot control. We introduce two novel components. First, a novel training pipeline for training a ``denoising'' 3D motion field estimator to extract fine object 3D motions from human videos with noisy depth robustly. Second, a dense object-centric 3D motion field prediction architecture that favors both cross-embodiment transfer and policy generalization to background. We evaluate the system in real world setups. Experiments show that our method reduces 3D motion estimation error by over 50% compared to the latest method, achieve 55% average success rate in diverse tasks where prior approaches fail ($\lesssim 10$\%), and can even acquire fine-grained manipulation skills like insertion.

View full details

Poster

What One Cannot, Two Can: Two-Layer Transformers Provably Represent Induction Heads on Any-Order Markov Chains

Chanakya Ekbote ⋅ Ashok Vardhan Makkuva ⋅ Marco Bondaschi ⋅ Nived Rajaraman ⋅ Michael Gastpar ⋅ Jason Lee ⋅ Paul Liang

Dec 3, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

In-context learning (ICL) is a hallmark capability of transformers, through which trained models learn to adapt to new tasks by leveraging information from the input context. Prior work has shown that ICL emerges in transformers due to the presence of special circuits called induction heads. Given the equivalence between induction heads and conditional $k$-grams, a recent line of work modeling sequential inputs as Markov processes has revealed the fundamental impact of model depth on its ICL capabilities: while a two-layer transformer can efficiently represent a conditional $1$-gram model, its single-layer counterpart cannot solve the task unless it is exponentially large. However, for higher order Markov sources, the best known constructions require at least three layers (each with a single attention head) - leaving open the question: *can a two-layer single-head transformer represent any $k^{\text{th}}$-order Markov process?* In this paper, we precisely address this and theoretically show that a two-layer transformer with one head per layer can indeed represent any conditional $k$-gram. Thus, our result provides the tightest known characterization of the interplay between transformer depth and Markov order for ICL. Building on this, we further analyze the learning dynamics of our two-layer construction, focusing on a simplified variant for first-order Markov chains, illustrating how effective in-context representations emerge during training. Together, these results deepen our current understanding of transformer-based ICL and illustrate how even shallow architectures can surprisingly exhibit strong ICL capabilities on structured sequence modeling tasks.

View full details

Poster

Enhancing CLIP Robustness via Cross-Modality Alignment

Xingyu Zhu ⋅ Beier Zhu ⋅ Shuo Wang ⋅ Kesen Zhao ⋅ Hanwang Zhang

Dec 3, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Vision-language models (VLMs) such as CLIP demonstrate strong generalization in zero-shot classification but remain highly vulnerable to adversarial perturbations. Existing methods primarily focus on adversarial fine-tuning or prompt optimization, they often overlook the gaps in CLIP’s encoded features, which is shown as the text and image features lie far apart from each other. This misalignment is significantly amplified under adversarial perturbations, leading to severe degradation in classification performance. To address this problem, we propose **C**r**O**ss-moda**L**ity **A**lignment, dubbed **COLA**, an optimal transport-based framework that explicitly addresses adversarial misalignment by restoring both global image-text alignment and local structural consistency in the feature space. (1) COLA first projects adversarial image embeddings onto a subspace spanned by class text features, effectively filtering out non-semantic distortions while preserving discriminative information. (2) It then models images and texts as discrete distributions over multiple augmented views and refines their alignment via OT, with the subspace projection seamlessly integrated into the cost computation. This design ensures stable cross-modal alignment even under adversarial conditions. COLA is training-free and compatible with existing fine-tuned models. Extensive evaluations across 14 zero-shot classification benchmarks demonstrate the effectiveness of COLA, especially with an average improvement of 6.7% on ImageNet and its variants under PGD adversarial attacks, while maintaining high accuracy on clean samples.

View full details

Poster

MetaMind: Modeling Human Social Thoughts with Metacognitive Multi-Agent Systems

Xuanming Zhang ⋅ Yuxuan Chen ⋅ Samuel (Min-Hsuan) Yeh ⋅ Sharon Li

Dec 3, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Human social interactions depend on the ability to infer others' unspoken intentions, emotions, and beliefs—a cognitive skill grounded in the psychological concept of Theory of Mind (ToM). While large language models (LLMs) excel in semantic understanding tasks, they struggle with the ambiguity and contextual nuance inherent in human communication. To bridge this gap, we introduce **MetaMind**, a multi-agent framework inspired by psychological theories of metacognition, designed to emulate human-like social reasoning. MetaMind decomposes social understanding into three collaborative stages: (1) a *Theory-of-Mind Agent* generates hypotheses about user mental states (e.g., intent, emotion), (2) a *Moral Agent* refines these hypotheses using cultural norms and ethical constraints, and (3) a *Response Agent* generates contextually appropriate responses while validating alignment with inferred intent. Our framework achieves state-of-the-art performance across three challenging benchmarks, with 35.7% improvement in real-world social scenarios and 6.2% gain in ToM reasoning. Notably, it enables LLMs to match human-level performance on key ToM tasks for the first time. Ablation studies confirm the necessity of all components, which showcase the framework’s ability to balance contextual plausibility, social appropriateness, and user adaptation. This work advances AI systems toward human-like social intelligence, with applications in empathetic dialogue and culturally sensitive interactions. Code is available at https://github.com/XMZhangAI/MetaMind.

View full details

Poster

Hierarchical Shortest-Path Graph Kernel Network

Jiaxin Wang ⋅ Wenxuan Tu ⋅ Jieren Cheng

Dec 3, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Graph kernels have emerged as a fundamental and widely adopted technique in graph machine learning. However, most existing graph kernel methods rely on fixed graph similarity estimation that cannot be directly optimized for task-specific objectives, leading to sub-optimal performance. To address this limitation, we propose a kernel-based learning framework called Hierarchical Shortest-Path Graph Kernel Network HSP-GKN, which seamlessly integrates graph similarity estimation with downstream tasks within a unified optimization framework. Specifically, we design a hierarchical shortest-path graph kernel that efficiently preserves both the semantic and structural information of a given graph by transforming it into hierarchical features used for subsequent neural network learning. Building upon this kernel, we develop a novel end-to-end learning framework that matches hierarchical graph features with learnable $hidden$ graph features to produce a similarity vector. This similarity vector subsequently serves as the graph embedding for end-to-end training, enabling the neural network to learn task-specific representations. Extensive experimental results demonstrate the effectiveness and superiority of the designed kernel and its corresponding learning framework compared to current competitors.

View full details

Poster

KARMA: Leveraging Multi-Agent LLMs for Automated Knowledge Graph Enrichment

Yuxing Lu ⋅ Wei Wu ⋅ Xukai Zhao ⋅ Rui Peng ⋅ Jinzhuo Wang

Dec 3, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Maintaining comprehensive and up-to-date knowledge graphs (KGs) is critical for modern AI systems, but manual curation struggles to scale with the rapid growth of scientific literature. This paper presents KARMA, a novel framework employing multi-agent large language models (LLMs) to automate KG enrichment through structured analysis of unstructured text. Our approach employs nine collaborative agents, spanning entity discovery, relation extraction, schema alignment, and conflict resolution that iteratively parse documents, verify extracted knowledge, and integrate it into existing graph structures while adhering to domain-specific schema. Experiments on 1,200 PubMed articles from three different domains demonstrate the effectiveness of KARMA in knowledge graph enrichment, with the identification of up to 38,230 new entities while achieving 83.1\% LLM-verified correctness and reducing conflict edges by 18.6\% through multi-layer assessments.

View full details

Poster

Scalable Cross-View Sample Alignment for Multi-View Clustering with View Structure Similarity

Jun Wang ⋅ Zhenglai Li ⋅ Chang Tang ⋅ Suyuan Liu ⋅ Hao Yu ⋅ Chuan Tang ⋅ Miaomiao Li ⋅ Xinwang Liu

Dec 3, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Most existing multi-view clustering methods aim to generate a consensus partition across all views, based on the assumption that all views share the same sample arrangement. However, in real-world scenarios, the collected data across different views is often unsynchronized, making it difficult to ensure consistent sample correspondence between views. To address this issue, we propose a scalable sample-alignment-based multi-view clustering method, referred to as SSA-MVC. Specifically, we first employ a cluster-label matching (CLM) algorithm to select the view whose clustering labels best match those of the others as the benchmark view. Then, for each of the remaining views, we construct representations of non-aligned samples by computing their similarities with aligned samples. Based on these representations, we build a similarity graph between the non-aligned samples of each view and those in the benchmark view, which serves as the alignment criterion. This alignment criterion is then integrated into a late-fusion framework to enable clustering without requiring aligned samples. Notably, the learned sample alignment matrix can be used to enhance existing multi-view clustering methods in scenarios where sample correspondence is unavailable. The effectiveness of the proposed SSA-MVC algorithm is validated through extensive experiments conducted on eight real-world multi-view datasets.

View full details

Poster

Generating Informative Samples for Risk-Averse Fine-Tuning of Downstream Tasks

Heasung Kim ⋅ Taekyun Lee ⋅ Hyeji Kim ⋅ Gustavo De Veciana

Dec 3, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Risk-averse modeling is critical in safety-sensitive and high-stakes applications. Conditional Value-at-Risk (CVaR) quantifies such risk by measuring the expected loss in the tail of the loss distribution, and minimizing it provides a principled framework for training robust models. However, direct CVaR minimization remains challenging due to the difficulty of accurately estimating rare, high-loss events—particularly at extreme quantiles. In this work, we propose a novel training framework that synthesizes informative samples for CVaR optimization using score-based generative models. Specifically, we guide a diffusion-based generative model to sample from a reweighted distribution that emphasizes inputs likely to incur high loss under a pretrained reference model. These samples are then incorporated via a loss-weighted importance sampling scheme to reduce noise in stochastic optimization. We establish convergence guarantees and show that the synthesized, high-loss-emphasized dataset substantially contributes to the noise reduction. Empirically, we validate the effectiveness of our approach across multiple settings, including a real-world wireless channel compression task, where our method achieves significant improvements over standard risk minimization strategies.

View full details

Poster

Joint Hierarchical Representation Learning of Samples and Features via Informed Tree-Wasserstein Distance

Ya-Wei Eileen Lin ⋅ Ronald Coifman ⋅ Gal Mishne ⋅ Ronen Talmon

Dec 3, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

High-dimensional data often exhibit hierarchical structures in both modes: samples and features. Yet, most existing approaches for hierarchical representation learning consider only one mode at a time. In this work, we propose an unsupervised method for jointly learning hierarchical representations of samples and features via Tree-Wasserstein Distance (TWD). Our method alternates between the two data modes. It first constructs a tree for one mode, then computes a TWD for the other mode based on that tree, and finally uses the resulting TWD to build the second mode’s tree. By repeatedly alternating through these steps, the method gradually refines both trees and the corresponding TWDs, capturing meaningful hierarchical representations of the data. We provide a theoretical analysis showing that our method converges. We show that our method can be integrated into hyperbolic graph convolutional networks as a pre-processing technique, improving performance in link prediction and node classification tasks. In addition, our method outperforms baselines in sparse approximation and unsupervised Wasserstein distance learning tasks on word-document and single-cell RNA-sequencing datasets.

View full details

Poster

Improved Bounds for Swap Multicalibration and Swap Omniprediction

Haipeng Luo ⋅ Spandan Senapati ⋅ Vatsal Sharan

Dec 3, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

In this paper, we consider the related problems of multicalibration --- a multigroup fairness notion and omniprediction --- a simultaneous loss minimization paradigm, both in the distributional and online settings. The recent work of Garg et al. (2024) raised the open problem of whether it is possible to efficiently achieve $\tilde{\mathcal{O}}(\sqrt{T})$ $\ell_{2}$-multicalibration error against bounded linear functions. In this paper, we answer this question in a strongly affirmative sense. We propose an efficient algorithm that achieves $\tilde{\mathcal{O}}(T^{\frac{1}{3}})$ $\ell_{2}$-swap multicalibration error (both in high probability and expectation). On propagating this bound onward, we obtain significantly improved rates for $\ell_{1}$-swap multicalibration and swap omniprediction for a loss class of convex Lipschitz functions. In particular, we show that our algorithm achieves $\tilde{\mathcal{O}}(T^{\frac{2}{3}})$ $\ell_{1}$-swap multicalibration and swap omniprediction errors, thereby improving upon the previous best-known bound of $\tilde{\mathcal{O}}(T^{\frac{7}{8}})$. As a consequence of our improved online results, we further obtain several improved sample complexity rates in the distributional setting. In particular, we establish a $\tilde{\mathcal{O}}(\varepsilon ^ {-3})$ sample complexity of efficiently learning an $\varepsilon$-swap omnipredictor for the class of convex and Lipschitz functions, $\tilde{\mathcal{O}}(\varepsilon ^{-2.5})$ sample complexity of efficiently learning an $\varepsilon$-swap agnostic learner for the squared loss, and $\tilde{\mathcal{O}}(\varepsilon ^ {-5}), \tilde{\mathcal{O}}(\varepsilon ^ {-2.5})$ sample complexities of learning $\ell_{1}, \ell_{2}$-swap multicalibrated predictors against linear functions, all of which significantly improve on the previous best-known bounds.

View full details

Poster

AdaSPEC: Selective Knowledge Distillation for Efficient Speculative Decoders

Yuezhou Hu ⋅ Jiaxin Guo ⋅ Xinyu Feng ⋅ Tuo Zhao

Dec 3, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Speculative Decoding (SD) accelerates large language model inference by employing a small draft model to generate predictions, which are then verified by a larger target model. The effectiveness of SD hinges on the alignment between these models, which is typically enhanced by Knowledge Distillation (KD). However, conventional KD methods aim to minimize the KL divergence between the draft and target models across all tokens, a goal that is misaligned with the true objective of SD, which is to maximize token acceptance rate. Therefore, draft models often struggle to fully assimilate the target model's knowledge due to capacity constraints, leading to suboptimal performance. To address this challenge, we propose AdaSPEC, a novel method that incorporates selective token filtering into the KD process. AdaSPEC utilizes a reference model to identify and filter out difficult-to-fit tokens, enabling the distillation of a draft model that better aligns with the target model on simpler tokens. This approach improves the overall token acceptance rate without compromising generation quality. We evaluate AdaSPEC across diverse tasks, including arithmetic reasoning, instruction-following, coding, and summarization, using model configurations of 31M/1.4B and 350M/2.7B parameters. Our results demonstrate that AdaSPEC consistently outperforms the state-of-the-art DistillSpec method, achieving higher acceptance rates across all tasks (up to 15\%). The code is publicly available at \url{https://github.com/yuezhouhu/adaspec}.

View full details

Poster

PerceptionLM: Open-Access Data and Models for Detailed Visual Understanding

Jang Hyun Cho ⋅ Andrea Madotto ⋅ Effrosyni Mavroudi ⋅ Triantafyllos Afouras ⋅ Tushar Nagarajan ⋅ Muhammad Maaz ⋅ Yale Song ⋅ Tengyu Ma ⋅ Shuming Hu ⋅ Suyog Jain ⋅ Miguel Martin ⋅ Huiyu Wang ⋅ Hanoona Bangalath ⋅ Peize Sun ⋅ Po-Yao Huang ⋅ Daniel Bolya ⋅ Nikhila Ravi ⋅ Shashank Jain ⋅ Tammy Stark ⋅ Seungwhan Moon ⋅ Babak Damavandi ⋅ Vivian Lee ⋅ Andrew Westbury ⋅ Salman Khan ⋅ Philipp Kraehenbuehl ⋅ Piotr Dollar ⋅ Lorenzo Torresani ⋅ Kristen Grauman ⋅ Christoph Feichtenhofer

Dec 3, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Vision-language models are integral to computer vision research, yet many high-performing models remain closed-source, obscuring their data, design and training recipe. The research community has responded by using distillation from black-box models to label training data, achieving strong benchmark results, at the cost of measurable scientific progress. However, without knowing the details of the teacher model and its data sources, scientific progress remains difficult to measure. In this paper, we study building a Perception Language Model (PLM) in a fully open and reproducible framework for transparent research in image and video understanding. We analyze standard training pipelines without distillation from proprietary models and explore large-scale synthetic data to identify critical data gaps, particularly in detailed video understanding. To bridge these gaps, we release 2.8M human-labeled instances of fine-grained video question-answer pairs and spatio-temporally grounded video captions. Additionally, we introduce PLM–VideoBench, a suite for evaluating challenging video understanding tasks focusing on the ability to reason about ''what'', ''where'', ''when'', and ''how'' of a video. We make our work fully reproducible by providing data, training recipes, code & models.

View full details

Poster

LogicTree: Improving Complex Reasoning of LLMs via Instantiated Multi-step Synthetic Logical Data

Zehao Wang ⋅ Lin Yang ⋅ Jie Wang ⋅ Kehan Wang ⋅ Hanzhu Chen ⋅ Bin Wang ⋅ Jianye Hao ⋅ Defu Lian ⋅ Bin Li ⋅ Enhong Chen

Dec 3, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Despite their remarkable performance on various tasks, Large Language Models (LLMs) still struggle with logical reasoning, particularly in complex and multi-step reasoning processes. Among various efforts to enhance LLMs' reasoning capabilities, synthesizing large-scale, high-quality logical reasoning datasets has emerged as a promising direction. However, existing methods often rely on predefined templates for logical reasoning data generation, limiting their adaptability to real-world scenarios. To address the limitation, we propose **LogicTree**, a novel framework for efficiently synthesizing multi-step logical reasoning dataset that excels in both complexity and instantiation. By iteratively searching for applicable logic rules based on structural pattern matching to perform backward deduction, **LogicTree** constructs multi-step logic trees that capture complex reasoning patterns. Furthermore, we employ a two-stage LLM-based approach to instantiate various real-world scenarios for each logic tree, generating consistent real-world reasoning processes that carry contextual significance. This helps LLMs develop generalizable logical reasoning abilities across diverse scenarios rather than merely memorizing templates. Experiments on multiple benchmarks demonstrate that our approach achieves an average improvement of 9.4\% in accuracy on complex logical reasoning tasks.

View full details

Poster

Multitask Learning with Stochastic Interpolants

Hugo Negrel ⋅ Florentin Coeurdoux ⋅ Michael Albergo ⋅ Eric Vanden-Eijnden

Dec 3, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

We propose a framework for learning maps between probability distributions that broadly generalizes the time dynamics of flow and diffusion models. To enable this, we generalize stochastic interpolants by replacing the scalar time variable with vectors, matrices, or linear operators, allowing us to bridge probability distributions across multiple dimensional spaces. This approach enables the construction of versatile generative models capable of fulfilling multiple tasks without task-specific training. Our operator-based interpolants not only provide a unifying theoretical perspective for existing generative models but also extend their capabilities. Through numerical experiments, we demonstrate the zero-shot efficacy of our method on conditional generation and inpainting, fine-tuning and posterior sampling, and multiscale modeling, suggesting its potential as a generic task-agnostic alternative to specialized models.

View full details

Poster

Offline Guarded Safe Reinforcement Learning for Medical Treatment Optimization Strategies

Runze Yan ⋅ Xun Shen ⋅ Akifumi Wachi ⋅ Sebastien Gros ⋅ Anni Zhao ⋅ Xiao Hu

Dec 3, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

When applying offline reinforcement learning (RL) in healthcare scenarios, the out-of-distribution (OOD) issues pose significant risks, as inappropriate generalization beyond clinical expertise can result in potentially harmful recommendations. While existing methods like conservative Q-learning (CQL) attempt to address the OOD issue, their effectiveness is limited by only constraining action selection by suppressing uncertain actions. This action-only regularization imitates clinician actions that prioritize short-term rewards, but it fails to regulate downstream state trajectories, thereby limiting the discovery of improved long-term treatment strategies. To safely improve policy beyond clinician recommendations while ensuring that state-action trajectories remain in-distribution, we propose \textit{Offline Guarded Safe Reinforcement Learning} ($\mathsf{OGSRL}$), a theoretically grounded model-based offline RL framework. $\mathsf{OGSRL}$ introduces a novel dual constraint mechanism for improving policy with reliability and safety. First, the OOD guardian is established to specify clinically validated regions for safe policy exploration. By constraining optimization within these regions, it enables the reliable exploration of treatment strategies that outperform clinician behavior by leveraging the full patient state history, without drifting into unsupported state-action trajectories. Second, we introduce a safety cost constraint that encodes medical knowledge about physiological safety boundaries, providing domain-specific safeguards even in areas where training data might contain potentially unsafe interventions. Notably, we provide theoretical guarantees on safety and near-optimality: policies that satisfy these constraints remain in safe and reliable regions and achieve performance close to the best possible policy supported by the data. When evaluated on the MIMIC-III sepsis treatment dataset, $\mathsf{OGSRL}$ demonstrated significantly better OOD handling than baselines. $\mathsf{OGSRL}$ achieved a 78\% reduction in mortality estimates and a 51\% increase in reward compared to clinician decisions.

View full details

Poster

ZeroS: Zero‑Sum Linear Attention for Efficient Transformers

Jiecheng Lu ⋅ Xu Han ⋅ Yan Sun ⋅ Viresh Pati ⋅ Yubin Kim ⋅ Siddhartha Somani ⋅ Shihao Yang

Dec 3, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Linear attention methods offer Transformers $O(N)$ complexity but typically underperform standard softmax attention. We identify two fundamental limitations affecting these approaches: the restriction to convex combinations that only permits additive information blending, and uniform accumulated weight bias that dilutes attention in long contexts. We propose Zero-Sum Linear Attention (ZeroS), which addresses these limitations by removing the constant zero-order term $1/t$ and reweighting the remaining zero-sum softmax residuals. This modification creates mathematically stable weights, enabling both positive and negative values and allowing a single attention layer to perform contrastive operations. While maintaining $O(N)$ complexity, ZeroS theoretically expands the set of representable functions compared to convex combinations. Empirically, it matches or exceeds standard softmax attention across various sequence modeling benchmarks.

View full details

Poster

How many measurements are enough? Bayesian recovery in inverse problems with general distributions

Ben Adcock ⋅ Zi Yuan (Nick) Huang

Dec 3, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

We study the sample complexity of Bayesian recovery for solving inverse problems with general prior, forward operator and noise distributions. We consider posterior sampling according to an approximate prior $\mathcal{P}$, and establish sufficient conditions for stable and accurate recovery with high probability. Our main result is a non-asymptotic bound that shows that the sample complexity depends on (i) the intrinsic complexity of $\mathcal{P}$, quantified by its *approximate covering number*, and (ii) concentration bounds for the forward operator and noise distributions. As a key application, we specialize to generative priors, where $\mathcal{P}$ is the pushforward of a latent distribution via a Deep Neural Network (DNN). We show that the sample complexity scales log-linearly with the latent dimension $k$, thus establishing the efficacy of DNN-based priors. Generalizing existing results on deterministic (i.e., non-Bayesian) recovery for the important problem of random sampling with an orthogonal matrix $U$, we show how the sample complexity is determined by the *coherence* of $U$ with respect to the support of $\mathcal{P}$. Hence, we establish that coherence plays a fundamental role in Bayesian recovery as well. Overall, our framework unifies and extends prior work, providing rigorous guarantees for the sample complexity of solving Bayesian inverse problems with arbitrary distributions.

View full details

Poster

From Shortcut to Induction Head: How Data Diversity Shapes Algorithm Selection in Transformers

Ryotaro Kawata ⋅ Yujin Song ⋅ Alberto Bietti ⋅ Naoki Nishikawa ⋅ Taiji Suzuki ⋅ Samuel Vaiter ⋅ Denny Wu

Dec 3, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Transformers can implement both generalizable algorithms (e.g., induction heads) and simple positional shortcuts (e.g., memorizing fixed output positions). In this work, we study how the choice of pretraining data distribution steers a shallow transformer toward one behavior or the other. Focusing on a minimal trigger-output prediction task -- copying the token immediately following a special trigger upon its second occurrence -- we present a rigorous analysis of gradient-based training of a single-layer transformer. In both the infinite and finite sample regimes, we prove a transition in the learned mechanism: if input sequences exhibit sufficient diversity, measured by a low “max-sum” ratio of trigger-to-trigger distances, the trained model implements an induction head and generalizes to unseen contexts; by contrast, when this ratio is large, the model resorts to a positional shortcut and fails to generalize out-of-distribution (OOD). We also reveal a trade-off between the pretraining context length and OOD generalization, and derive the optimal pretraining distribution that minimizes computational cost per sample. Finally, we validate our theoretical predictions with controlled synthetic experiments, demonstrating that broadening context distributions robustly induces induction heads and enables OOD generalization. Our results shed light on the algorithmic biases of pretrained transformers and offer conceptual guidelines for data-driven control of their learned behaviors.

View full details

Poster

MonarchAttention: Zero-Shot Conversion to Fast, Hardware-Aware Structured Attention

Can Yaras ⋅ Alec Xu ⋅ Pierre Abillama ⋅ Changwoo Lee ⋅ Laura Balzano

Dec 3, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Transformers have achieved state-of-the-art performance across various tasks, but suffer from a notable quadratic complexity in sequence length due to the attention mechanism. In this work, we propose MonarchAttention -- a novel approach to sub-quadratic attention approximation via Monarch matrices, an expressive class of structured matrices. Based on the variational form of softmax, we describe an efficient optimization-based algorithm to compute an approximate projection of softmax attention onto the class of Monarch matrices with $\Theta(N\sqrt{N} d)$ computational complexity and $\Theta(Nd)$ memory/IO complexity. Unlike previous approaches, MonarchAttention is both (1) transferable, yielding minimal performance loss with no additional training, even when replacing every attention layer of the transformer, and (2) hardware-efficient, utilizing the highest-throughput tensor core units on modern GPUs. With optimized kernels, MonarchAttention achieves substantial speed-ups in wall-time over FlashAttention-2: $1.4\times$ for shorter sequences $(N=256)$, $4.5\times$ for medium-length sequences $(N=4K)$, and $8.2\times$ for longer sequences $(N=16K)$. We demonstrate the quality of MonarchAttention on diverse tasks and architectures in vision and language problems, showing that it flexibly and accurately approximates softmax attention in a variety of contexts.

View full details

Poster

Regularized least squares learning with heavy-tailed noise is minimax optimal

Mattes Mollenhauer ⋅ Nicole Muecke ⋅ Dimitri Meunier ⋅ Arthur Gretton

Dec 3, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

This paper examines the performance of ridge regression in reproducing kernel Hilbert spaces in the presence of noise that exhibits a finite number of higher moments. We establish excess risk bounds consisting of subgaussian and polynomial terms based on the well known integral operator framework. The dominant subgaussian component allows to achieve convergence rates that have previously only been derived under subexponential noise—a prevalent assumption in related work from the last two decades. These rates are optimal under standard eigenvalue decay conditions, demonstrating the asymptotic robustness of regularized least squares against heavy- tailed noise. Our derivations are based on a Fuk–Nagaev inequality for Hilbert-space valued random variables.

View full details

Poster

Differentiable Hierarchical Visual Tokenization

Marius Aasan ⋅ Martine Hjelkrem Tan ⋅ Nico Catalano ⋅ Changkyu Choi ⋅ Adín Ramírez Rivera

Dec 3, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Vision Transformers rely on fixed patch tokens that ignore the spatial and semantic structure of images. In this work, we introduce an end-to-end differentiable tokenizer that adapts to image content with pixel-level granularity while remaining backward-compatible with existing architectures for retrofitting pretrained models. Our method uses hierarchical model selection with information criteria to provide competitive performance in both image-level classification and dense-prediction tasks, and even supports out-of-the-box raster-to-vector conversion.

View full details

Poster

Fixing It in Post: A Comparative Study of LLM Post-Training Data Quality and Model Performance

Aladin Djuhera ⋅ Swanand Kadhe ⋅ Syed Zawad ⋅ Farhan Ahmed ⋅ Heiko Ludwig ⋅ Holger Boche

Dec 3, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Recent work on large language models (LLMs) has increasingly focused on post-training and alignment with datasets curated to enhance instruction following, world knowledge, and specialized skills. However, most post-training datasets used in leading open- and closed-source LLMs remain inaccessible to the public, with limited information about their construction process. This lack of transparency has motivated the recent development of open-source post-training corpora. While training on these open alternatives can yield performance comparable to that of leading models, systematic comparisons remain challenging due to the significant computational cost of conducting them rigorously at scale, and are therefore largely absent. As a result, it remains unclear how specific samples, task types, or curation strategies influence downstream performance when assessing data quality. In this work, we conduct the first comprehensive side-by-side analysis of two prominent open post-training datasets: Tulu-3-SFT-Mix and SmolTalk. Using the Magpie framework, we annotate each sample with detailed quality metrics, including turn structure (single-turn vs. multi-turn), task category, input quality, and response quality, and we derive statistics that reveal structural and qualitative similarities and differences between the two datasets. Based on these insights, we design a principled curation recipe that produces a new data mixture, **TuluTalk**, which contains 14% fewer samples than either source dataset while matching or exceeding their performance on key benchmarks. Our findings offer actionable insights for constructing more effective post-training datasets that improve model performance within practical resource limits. To support future research, we publicly release both the annotated source datasets and our curated TuluTalk mixture.

View full details

Poster

Vision Transformers with Self-Distilled Registers

Zipeng Yan ⋅ Yinjie Chen ⋅ Chong Zhou ⋅ Bo Dai ⋅ Andrew Luo

Dec 3, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Vision Transformers (ViTs) have emerged as the dominant architecture for visual processing tasks, demonstrating excellent scalability with increased training data and model size. However, recent work has identified the emergence of artifact tokens in ViTs that are incongruous with local semantics. These anomalous tokens degrade ViT performance in tasks that require fine-grained localization or structural coherence. An effective mitigation of this issue is the addition of register tokens to ViTs, which implicitly ''absorb'' the artifact term during training. Given the availability of existing large-scale pre-trained ViTs, in this paper we seek to add register tokens to existing models without retraining the models from scratch, which is infeasible considering their size. Specifically, we propose Post Hoc Registers (**PH-Reg**), an efficient self-distillation method that integrates registers into an existing ViT without requiring additional labeled data and full retraining. PH-Reg initializes both teacher and student networks from the same pre-trained ViT. The teacher remains frozen and unmodified, while the student is augmented with randomly initialized register tokens. By applying test-time augmentation to the teacher’s inputs, we generate denoised dense embeddings free of artifacts, which are then used to optimize only a small subset of unlocked student weights. We show that our approach can effectively reduce the number of artifact tokens, improving the segmentation and depth prediction of the student ViT under zero-shot and linear probing.

View full details

Poster

Conflict-Aware Knowledge Editing in the Wild: Semantic-Augmented Graph Representation for Unstructured Text

Zhange Zhang ⋅ Zhicheng Geng ⋅ Yuqing Ma ⋅ Tianbo Wang ⋅ Kai Lv ⋅ Xianglong Liu

Dec 3, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Large Language Models (LLMs) have demonstrated broad applications but suffer from issues like hallucinations, erroneous outputs and outdated knowledge. Model editing emerges as an effective solution to refine knowledge in LLMs, yet existing methods typically depend on structured knowledge representations. However, real-world knowledge is primarily embedded within complex, unstructured text. Existing structured knowledge editing approaches face significant challenges when handling the entangled and intricate knowledge present in unstructured text, resulting in issues such as representation ambiguity and editing conflicts. To address these challenges, we propose a Conflict-Aware Knowledge Editing in the Wild (CAKE) framework, the first framework explicitly designed for editing knowledge extracted from wild unstructured text. CAKE comprises two core components: a Semantic-augmented Graph Representation module and a Conflict-aware Knowledge Editing strategy. The Semantic-augmented Graph Representation module enhances knowledge encoding through structural disambiguation, relational enrichment, and semantic diversification. Meanwhile, the Conflict-aware Knowledge Editing strategy utilizes a graph-theoretic coloring algorithm to disentangle conflicted edits by allocating them to orthogonal parameter subspaces, thereby effectively mitigating editing conflicts. Experimental results on the AKEW benchmark demonstrate that CAKE significantly outperforms existing methods, achieving a 15.43\% improvement in accuracy on llama3 editing tasks. Our framework successfully bridges the gap between unstructured textual knowledge and reliable model editing, enabling more robust and scalable updates for practical LLM applications.

View full details

Poster

Quantum Doubly Stochastic Transformers

Jannis Born ⋅ Filip Skogh ⋅ Kahn Rhrissorrakrai ⋅ Filippo Utro ⋅ Nico Wagner ⋅ Aleksandros Sobczyk

Dec 3, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

At the core of the Transformer, the softmax normalizes the attention matrix to be right stochastic. Previous research has shown that this often de-stabilizes training and that enforcing the attention matrix to be doubly stochastic (through Sinkhorn’s algorithm) consistently improves performance across different tasks, domains and Transformer flavors. However, Sinkhorn’s algorithm is iterative, approximative, non-parametric and thus inflexible w.r.t. the obtained doubly stochastic matrix (DSM). Recently, it has been proven that DSMs can be obtained with a parametric quantum circuit, yielding a novel quantum inductive bias for DSMs with no known classical analogue. Motivated by this, we demonstrate the feasibility of a hybrid classical-quantum doubly stochastic Transformer (QDSFormer) that replaces the softmax in the self-attention layer with a variational quantum circuit. We study the expressive power of the circuit and find that it yields more diverse DSMs that better preserve information than classical operators. Across multiple small-scale object recognition tasks, we find that our QDSFormer consistently surpasses both a standard ViT and other doubly stochastic Transformers. Beyond the Sinkformer, this comparison includes a novel quantum-inspired doubly stochastic Transformer (based on QR decomposition) that can be of independent interest. Our QDSFormer also shows improved training stability and lower performance variation suggesting that it may mitigate the notoriously unstable training of ViTs on small-scale data.

View full details

Poster

Streaming Attention Approximation via Discrepancy Theory

Ekaterina Kochetkova ⋅ Kshiteej Jitesh Sheth ⋅ Insu Han ⋅ Amir Zandieh ⋅ Michael Kapralov

Dec 3, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Large language models (LLMs) have achieved impressive success, but their high memory requirements present challenges for long-context token generation. In this paper we study the streaming complexity of attention approximation, a key computational primitive underlying token generation. Our main contribution is BalanceKV, a streaming algorithm for $\epsilon$-approximating attention computations based on geometric process for selecting a balanced collection of Key and Value tokens as per Banaszczyk's vector balancing theory. We complement our algorithm with space lower bounds for streaming attention computation. Besides strong theoretical guarantees, BalanceKV exhibits empirically validated performance improvements over existing methods, both for attention approximation and end-to-end performance on various long context benchmarks.

View full details

Poster

The Implicit Bias of Structured State Space Models Can Be Poisoned With Clean Labels

Yonatan Slutzky ⋅ ‪Yotam Alexander‬‏ ⋅ Noam Razin ⋅ Nadav Cohen

Dec 3, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Neural networks are powered by an implicit bias: a tendency of gradient descent to fit training data in a way that generalizes to unseen data. A recent class of neural network models gaining increasing popularity is structured state space models (SSMs). Prior work argued that the implicit bias of SSMs leads to generalization in a setting where data is generated by a low dimensional teacher. In this paper, we revisit the latter setting, and formally establish a phenomenon entirely undetected by prior work on the implicit bias of SSMs. Namely, we prove that while implicit bias leads to generalization under many choices of training data, there exist special examples whose inclusion in training completely distorts the implicit bias, to a point where generalization fails. This failure occurs despite the special training examples being labeled by the teacher, i.e., having clean labels! We empirically demonstrate the phenomenon, with SSMs trained independently and as part of non-linear neural networks. In the area of adversarial machine learning, disrupting generalization with cleanly labeled training examples is known as clean-label poisoning. Given the proliferation of SSMs, we believe that delineating their susceptibility to clean-label poisoning, and developing methods for overcoming this susceptibility, are critical research directions to pursue.

View full details

Poster

Signal and Noise: A Framework for Reducing Uncertainty in Language Model Evaluation

David Heineman ⋅ Valentin Hofmann ⋅ Ian Magnusson ⋅ Yuling Gu ⋅ Noah Smith ⋅ Hanna Hajishirzi ⋅ Kyle Lo ⋅ Jesse Dodge

Dec 3, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Developing large language models is expensive and often involves making decisions with small experiments, typically by evaluating on large, multi-task evaluation suites. In this work, we analyze specific properties which make a benchmark more reliable and useful for such decisions, and interventions to design higher-quality evaluation benchmarks. We introduce two key metrics that show differences in current benchmarks: signal, a benchmark’s ability to separate better models from worse models, and noise, a benchmark’s sensitivity to random variability between training steps. We demonstrate that benchmarks with a better signal-to-noise ratio are more reliable when making decisions at small scale, and those with less noise have lower scaling law prediction error. These results suggest that improving signal or noise will lead to more useful benchmarks, so we introduce four interventions designed to directly affect signal or noise. For example, we propose that switching to a metric that has better signal and noise (e.g., perplexity rather than accuracy) leads to better reliability and scaling law error. We also find that filtering noisy benchmarks such that they have better signal-to-noise ratio leads to more reliable evaluations. We also find that averaging the output of a model's checkpoints to reduce noise leads to consistent improvements. We conclude by recommending that those creating new benchmarks, or selecting which existing benchmarks to use, aim for high signal and low noise. We use 30 benchmarks for these experiments, and 465 open-weight language models from 60M to 32B parameters, resulting in a new, publicly available dataset of 50K evaluation benchmark results, totaling 200M instances.

View full details

Poster

Optimal Nuisance Function Tuning for Estimating a Doubly Robust Functional under Proportional Asymptotics

Sean McGrath ⋅ Debarghya Mukherjee ⋅ Rajarshi Mukherjee ⋅ Zixiao Jolene Wang

Dec 3, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

In this paper, we explore the asymptotically optimal tuning parameter choice in ridge regression for estimating nuisance functions of a statistical functional that has recently gained prominence in conditional independence testing and causal inference. Given a sample of size $n$, we study estimators of the Expected Conditional Covariance (ECC) between variables $Y$ and $A$ given a high-dimensional covariate $X \in \mathbb{R}^p$. Under linear regression models for $Y$ and $A$ on $X$ and the proportional asymptotic regime $p/n \to c \in (0, \infty)$, we evaluate three existing ECC estimators and two sample splitting strategies for estimating the required nuisance functions. Since no consistent estimator of the nuisance functions exists in the proportional asymptotic regime without imposing further structure on the problem, we first derive debiased versions of the ECC estimators that utilize the ridge regression nuisance function estimators. We show that our bias correction strategy yields $\sqrt{n}$-consistent estimators of the ECC across different sample splitting strategies and estimator choices. We then derive the asymptotic variances of these debiased estimators to illustrate the nuanced interplay between the sample splitting strategy, estimator choice, and tuning parameters of the nuisance function estimators for optimally estimating the ECC. Our analysis reveals that prediction-optimal tuning parameters (i.e., those that optimally estimate the nuisance functions) may not lead to the lowest asymptotic variance of the ECC estimator -- thereby demonstrating the need to be careful in selecting tuning parameters based on the final goal of inference. Finally, we verify our theoretical results through extensive numerical experiments.

View full details

Poster

Torch-Uncertainty: Deep Learning Uncertainty Quantification

Adrien Lafage ⋅ Olivier Laurent ⋅ Firas Gabetni ⋅ Gianni Franchi

Dec 3, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Deep Neural Networks (DNNs) have demonstrated remarkable performance across various domains, including computer vision and natural language processing. However, they often struggle to accurately quantify their predictions' uncertainty, limiting their broader adoption in critical industrial applications. Uncertainty Quantification (UQ) for Deep Learning seeks to address this challenge by providing methodologies to improve the reliability of uncertainty estimates. While numerous techniques have been proposed, a unified tool remains lacking that offers a seamless workflow for evaluating and integrating these methods. To bridge this gap, we introduce **Torch-Uncertainty**, a *PyTorch* and *Lightning* framework designed to streamline the training and evaluation of DNNs with UQ techniques. In this paper, we outline the foundational principles of our library and present comprehensive experimental results that benchmark a diverse set of UQ methods across classification, segmentation, and regression tasks. Our library is available at: https://github.com/ENSTA-U2IS-AI/torch-uncertainty.

View full details

Poster

Understanding LLM Behaviors via Compression: Data Generation, Knowledge Acquisition and Scaling Laws

Zhixuan Pan ⋅ Shaowen Wang ⋅ Liao Pengfei ⋅ Jian Li

Dec 3, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Large Language Models (LLMs) have demonstrated remarkable capabilities across numerous tasks, yet principled explanations for their underlying mechanisms and several phenomena, such as scaling laws, hallucinations, and related behaviors, remain elusive. In this work, we revisit the classical relationship between compression and prediction, grounded in Kolmogorov complexity and Shannon information theory, to provide deeper insights into LLM behaviors. By leveraging the Kolmogorov Structure Function and interpreting LLM compression as a two-part coding process, we offer a detailed view of how LLMs acquire and store information across increasing model and data scales -- from pervasive syntactic patterns to progressively rarer knowledge elements. Motivated by this theoretical perspective and natural assumptions inspired by Heap’s and Zipf’s laws, we introduce a simplified yet representative hierarchical data-generation framework called the Syntax-Knowledge model. Under the Bayesian setting, we show that prediction and compression within this model naturally lead to diverse learning and scaling behaviors of LLMs. In particular, our theoretical analysis offers intuitive and principled explanations for both data and model scaling laws, the dynamics of knowledge acquisition during training and fine-tuning, factual knowledge hallucinations in LLMs. The experimental results validate our theoretical predictions.

View full details

Poster

ROOT: Rethinking Offline Optimization as Distributional Translation via Probabilistic Bridge

Cuong Dao ⋅ The Hung Tran ⋅ Phi Le Nguyen ⋅ Truong Thao Nguyen ⋅ Nghia Hoang

Dec 3, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

This paper studies the black-box optimization task which aims to find the maxima of a black-box function using a static set of its observed input-output pairs. This is often achieved via learning and optimizing a surrogate function with that offline data. Alternatively, it can also be framed as an inverse modeling task that maps a desired performance to potential input candidates that achieve it. Both approaches are constrained by the limited amount of offline data. To mitigate this limitation, we introduce a new perspective that casts offline optimization as a distributional translation task. This is formulated as learning a probabilistic bridge transforming an implicit distribution of low-value inputs (i.e., offline data) into another distribution of high-value inputs (i.e., solution candidates). Such probabilistic bridge can be learned using low- and high-value inputs sampled from synthetic functions that resemble the target function. These synthetic functions are constructed as the mean posterior of multiple Gaussian processes fitted with different parameterizations on the offline data, alleviating the data bottleneck. The proposed approach is evaluated on an extensive benchmark comprising most recent methods, demonstrating significant improvement and establishing a new state-of-the-art performance. Our code is publicly available at https://github.com/cuong-dm/ROOT.

View full details

Poster

Go With the Flow: Fast Diffusion for Gaussian Mixture Models

George Rapakoulias ⋅ Ali Reza Pedram ⋅ Fengjiao Liu ⋅ Lingjiong Zhu ⋅ Panagiotis Tsiotras

Dec 3, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Schrodinger Bridges (SBs) are diffusion processes that steer, in finite time, a given initial distribution to another final one while minimizing a suitable cost functional. Although various methods for computing SBs have recently been proposed in the literature, most of these approaches require computationally expensive training schemes, even for solving low-dimensional problems. In this work, we propose an analytic parametrization of a set of feasible policies for steering the distribution of a dynamical system from one Gaussian Mixture Model (GMM) to another. Instead of relying on standard non-convex optimization techniques, the optimal policy within the set can be approximated as the solution of a low-dimensional linear program whose dimension scales linearly with the number of components in each mixture. The proposed method generalizes naturally to more general classes of dynamical systems, such as controllable linear time-varying systems, enabling efficient solutions to multi-marginal momentum SBs between GMMs, a challenging distribution interpolation problem. We showcase the potential of this approach in low-to-moderate dimensional problems such as image-to-image translation in the latent space of an autoencoder, learning of cellular dynamics using multi-marginal momentum SBs, and various other examples. The implementation is publicly available at https://github.com/georgeRapa/GMMflow.

View full details

Poster

ConnectomeBench: Can LLMs proofread the connectome?

Jeff Brown ⋅ Andrew Kirjner ⋅ Annika Vivekananthan ⋅ Edward Boyden

Dec 3, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Connectomics—the mapping of neural connections in an organism's brain—currently requires extraordinary human effort to proofread the data collected from imaging and machine-learning assisted segmentation. With the growing excitement around using AI agents to automate important scientific tasks, we explore whether current AI systems can perform multiple tasks necessary for data proofreading. We introduce ConnectomeBench, a multimodal benchmark evaluating large language model (LLM) capabilities in three critical proofreading tasks: segment type identification, split error correction, and merge error detection. Using expert annotated data from two large open-source datasets—a cubic millimeter of mouse visual cortex and the complete Drosophila brain—we evaluate proprietary multimodal LLMs including Claude 3.7/4 Sonnet, o4-mini, GPT-4.1, GPT-4o, as well as open source models like InternVL-3 and NVLM. Our results demonstrate that current models achieve surprisingly high performance in segment identification (52-82\% balanced accuracy vs. 20-25\% chance) and binary/multiple choice split error correction (75-85\% accuracy vs. 50\% chance) while generally struggling on merge error identification tasks. Overall, while the best models still lag behind expert performance, they demonstrate promising capabilities that could eventually enable them to augment and potentially replace human proofreading in connectomics.

View full details

Poster

RGB-Only Supervised Camera Parameter Optimization in Dynamic Scenes

Fang Li ⋅ Hao Zhang ⋅ Narendra Ahuja

Dec 3, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Although COLMAP has long remained the predominant method for camera parameter optimization in static scenes, it is constrained by its lengthy runtime and reliance on ground truth (GT) motion masks for application to dynamic scenes. Many efforts attempted to improve it by incorporating more priors as supervision such as GT focal length, motion masks, 3D point clouds, camera poses, and metric depth, which, however, are typically unavailable in casually captured RGB videos. In this paper, we propose a novel method for more accurate and efficient camera parameter optimization in dynamic scenes solely supervised by a single RGB video, dubbed $\textbf{\textit{ROS-Cam}}$. Our method consists of three key components: (1) Patch-wise Tracking Filters, to establish robust and maximally sparse hinge-like relations across the RGB video. (2) Outlier-aware Joint Optimization, for efficient camera parameter optimization by adaptive down-weighting of moving outliers, without reliance on motion priors. (3) A Two-stage Optimization Strategy, to enhance stability and optimization speed by a trade-off between the Softplus limits and convex minima in losses. We visually and numerically evaluate our camera estimates. To further validate accuracy, we feed the camera estimates into a 4D reconstruction method and assess the resulting 3D scenes, and rendered 2D RGB and depth maps. We perform experiments on 4 real-world datasets (NeRF-DS, DAVIS, iPhone, and TUM-dynamics) and 1 synthetic dataset (MPI-Sintel), demonstrating that our method estimates camera parameters more efficiently and accurately with a single RGB video as the only supervision.

View full details

Poster

Transfer Faster, Price Smarter: Minimax Dynamic Pricing under Cross-Market Preference Shift

Yi Zhang ⋅ Elynn Chen ⋅ Yujun Yan

Dec 3, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

We study contextual dynamic pricing when a target market can leverage $K$ auxiliary markets—offline logs or concurrent streams—whose *mean utilities differ by a structured preference shift*. We propose *Cross-Market Transfer Dynamic Pricing (CM-TDP)*, the first algorithm that *provably* handles such model-shift transfer and delivers minimax-optimal regret for *both* linear and non-parametric utility models. For linear utilities of dimension $d$, where the *difference* between source- and target-task coefficients is $s_{0}$-sparse, CM-TDP attains regret $\tilde{\mathcal{O}}\bigl((dK^{-1}+s_{0})\log T\bigr)$. For nonlinear demand residing in a reproducing kernel Hilbert space with effective dimension $\alpha$, complexity $\beta$ and task-similarity parameter $H$, the regret becomes $\tilde{\mathcal{O}}\bigl(K^{-2\alpha\beta/(2\alpha\beta+1)}T^{1/(2\alpha\beta+1)} + H^{2/(2\alpha+1)}T^{1/(2\alpha+1)}\bigr)$, matching information-theoretic lower bounds up to logarithmic factors. The RKHS bound is the first of its kind for transfer pricing and is of independent interest. Extensive simulations show up to 38\% higher cumulative revenue and $6\times$ faster convergence relative to single-market pricing baselines. By bridging transfer learning, robust aggregation, and revenue optimization, CM-TDP moves toward pricing systems that *transfer faster, price smarter*.

View full details

Poster

ProxySPEX: Inference-Efficient Interpretability via Sparse Feature Interactions in LLMs

Landon Butler ⋅ Abhineet Agarwal ⋅ Justin Kang ⋅ Yigit Efe Erginbas ⋅ Bin Yu ⋅ Kannan Ramchandran

Dec 3, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Large Language Models (LLMs) have achieved remarkable performance by capturing complex interactions between input features. To identify these interactions, most existing approaches require enumerating all possible combinations of features up to a given order, causing them to scale poorly with the number of inputs $n$. Recently, Kang et al. (2025) proposed SPEX, an information-theoretic approach that uses interaction sparsity to scale to $n \approx 10^3$ features. SPEX greatly improves upon prior methods but requires tens of thousands of model inferences, which can be prohibitive for large models. In this paper, we observe that LLM feature interactions are often *hierarchical*—higher-order interactions are accompanied by their lower-order subsets—which enables more efficient discovery. To exploit this hierarchy, we propose ProxySPEX, an interaction attribution algorithm that first fits gradient boosted trees to masked LLM outputs and then extracts the important interactions. Experiments across four challenging high-dimensional datasets show that ProxySPEX more faithfully reconstructs LLM outputs by 20\% over marginal attribution approaches while using *$10\times$ fewer inferences* than SPEX. By accounting for interactions, ProxySPEX efficiently identifies the most influential features, providing a scalable approximation of their Shapley values. Further, we apply ProxySPEX to two interpretability tasks. *Data attribution*, where we identify interactions among CIFAR-10 training samples that influence test predictions, and *mechanistic interpretability*, where we uncover interactions between attention heads, both within and across layers, on a question-answering task. The ProxySPEX algorithm is available at .

View full details

Poster

Towards a Golden Classifier-Free Guidance Path via Foresight Fixed Point Iterations

Kaibo Wang ⋅ Jianda Mao ⋅ Tong Wu ⋅ Yang Xiang

Dec 3, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Classifier-Free Guidance (CFG) is an essential component of text-to-image diffusion models, and understanding and advancing its operational mechanisms remains a central focus of research. Existing approaches stem from divergent theoretical interpretations, thereby limiting the design space and obscuring key design choices. To address this, we propose a unified perspective that reframes conditional guidance as fixed point iterations, seeking to identify a golden path where latents produce consistent outputs under both conditional and unconditional generation. We demonstrate that CFG and its variants constitute a special case of single-step short-interval iteration, which is theoretically proven to exhibit inefficiency. To this end, we introduce Foresight Guidance (FSG), which prioritizes solving longer-interval subproblems in early diffusion stages with increased iterations. Extensive experiments across diverse datasets and model architectures validate the superiority of FSG over state-of-the-art methods in both image quality and computational efficiency. Our work offers novel perspectives for conditional guidance and unlocks the potential of adaptive design.

View full details

Poster

Fair Cooperation in Mixed-Motive Games via Conflict-Aware Gradient Adjustment

Woojun Kim ⋅ Katia Sycara

Dec 3, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Multi-agent reinforcement learning in mixed-motive settings presents a fundamental challenge: agents must balance individual interests with collective goals, which are neither fully aligned nor strictly opposed. To address this, reward restructuring methods such as gifting and intrinsic motivation have been proposed. However, these approaches primarily focus on promoting cooperation by managing the trade-off between individual and collective returns, without explicitly addressing fairness with respect to agents’ task-specific rewards. In this paper, we propose an adaptive conflict-aware gradient adjustment method that promotes cooperation while ensuring fairness in individual rewards. The proposed method dynamically balances policy gradients derived from individual and collective objectives in situations where the two objectives are in conflict. By explicitly resolving such conflicts, our method improves collective performance while preserving fairness across agents. We provide theoretical results that guarantee monotonic non-decreasing improvement in both the collective and individual objectives and ensure fairness. Empirical results in sequential social dilemma environments demonstrate that our approach outperforms baselines in terms of social welfare, while maintaining fairness.

View full details

Poster

Bipolar Self-attention for Spiking Transformers

Shuai Wang ⋅ Malu Zhang ⋅ Jingya Wang ⋅ Dehao Zhang ⋅ Yimeng Shan ⋅ Jieyuan (Eric) Zhang ⋅ Yichen Xiao ⋅ Honglin Cao ⋅ Haonan Zhang ⋅ Zeyu Ma ⋅ Yang Yang ⋅ Haizhou Li

Dec 3, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Harnessing the event-driven characteristic, Spiking Neural Networks (SNNs) present a promising avenue toward energy-efficient Transformer architectures. However, existing Spiking Transformers still suffer significant performance gaps compared to their Artificial Neural Network counterparts. Through comprehensive analysis, we attribute this gap to these two factors. First, the binary nature of spike trains limits Spiking Self-attention (SSA)’s capacity to capture negative–negative and positive–negative membrane potential interactions on Querys and Keys. Second, SSA typically omits Softmax functions to avoid energy-intensive multiply-accumulate operations, thereby failing to maintain row-stochasticity constraints on attention scores. To address these issues, we propose a Bipolar Self-attention (BSA) paradigm, effectively modeling multi-polar membrane potential interactions with a fully spike-driven characteristic. Specifically, we demonstrate that ternary matrix multiplication provides a closer approximation to real-valued computation on both distribution and local correlation, enabling clear differentiation between homopolar and heteropolar interactions. Moreover, we propose a shift-based Softmax approximation named Shiftmax, which efficiently achieves low-entropy activation and partly maintains row-stochasticity without non-linear operation, enabling precise attention allocation. Extensive experiments show that BSA achieves substantial performance improvements across various tasks, including image classification, semantic segmentation, and event-based tracking. These results establish its potential as a fundamental building block for energy-efficient Spiking Transformers.

View full details

Poster

On the Empirical Power of Goodness-of-Fit Tests in Watermark Detection

Weiqing He ⋅ Xiang Li ⋅ Tianqi Shang ⋅ Li Shen ⋅ Weijie Su ⋅ Qi Long

Dec 3, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Large language models (LLMs) raise concerns about content authenticity and integrity because they can generate human-like text at scale. Text watermarks, which embed detectable statistical signals into generated text, offer a provable way to verify content origin. Many detection methods rely on pivotal statistics that are i.i.d. under human-written text, making goodness-of-fit (GoF) tests a natural tool for watermark detection. However, GoF tests remain largely underexplored in this setting. In this paper, we systematically evaluate eight GoF tests across three popular watermarking schemes, using three open-source LLMs, two datasets, various generation temperatures, and multiple post-editing methods. We find that general GoF tests can improve both the detection power and robustness of watermark detectors. Notably, we observe that text repetition, common in low-temperature settings, gives GoF tests a unique advantage not exploited by existing methods. Our results highlight that classic GoF tests are a simple yet powerful and underused tool for watermark detection in LLMs.

View full details

Poster

ErrorTrace: A Black-Box Traceability Mechanism Based on Model Family Error Space

Chuanchao Zang ⋅ Xiangtao Meng ⋅ Wenyu Chen ⋅ Tianshuo Cong ⋅ Zha Yaxing ⋅ Dong Qi ⋅ Zheng Li ⋅ Shanqing Guo

Dec 3, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

The open-source release of large language models (LLMs) enables malicious users to create unauthorized derivative models at low cost, posing significant threats to intellectual property (IP) and market stability. Existing IP protection methods either require access to model parameters or are vulnerable to fine-tuning attacks. To fill this gap, we propose ErrorTrace, a robust and black-box traceability mechanism for protecting LLM IP. Specifically, ErrorTrace leverages the unique error patterns of model families by mapping and analyzing their distinct error spaces, enabling robust and efficient IP protection without relying on internal parameters or specific query responses. Experimental results show that ErrorTrace achieves a traceability accuracy of 0.8518 for 27 base models when the suspect model is not included in ErrorTrace's training set, outperforming the baseline by 0.2593. Additionally,ErrorTrace successfully tracks 34 fine-tuned, pruned and merged models across various scenarios, demonstrating its broad applicability and robustness. In addition, ErrorTrace shows a certain level of resilience when subjected to adversarial attacks. Our code is available at: https://github.com/csdatazcc/ErrorTrace.

View full details

Poster

RepoMaster: Autonomous Exploration and Understanding of GitHub Repositories for Complex Task Solving

Huacan Wang ⋅ Ziyi Ni ⋅ Shuo Zhang ⋅ Shuo Lu ⋅ Sen Hu ⋅ Ziyang He ⋅ Chen Hu ⋅ Jiaye Lin ⋅ Yifu Guo ⋅ Yuntao Du ⋅ Pin Lyu

Dec 3, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

The ultimate goal of code agents is to solve complex tasks autonomously. Although large language models (LLMs) have made substantial progress in code generation, real-world tasks typically demand full-fledged code repositories rather than simple scripts. Building such repositories from scratch remains a major challenge. Fortunately, GitHub hosts a vast, evolving collection of open-source repositories, which developers frequently reuse as modular components for complex tasks. Yet, existing frameworks like OpenHands and SWE-Agent still struggle to effectively leverage these valuable resources. Relying solely on README files provides insufficient guidance, and deeper exploration reveals two core obstacles: overwhelming information and tangled dependencies of repositories, both constrained by the limited context windows of current LLMs. To tackle these issues, we propose RepoMaster, an autonomous agent framework designed to explore and reuse GitHub repositories for solving complex tasks. For efficient understanding, RepoMaster constructs function-call graphs, module-dependency graphs, and hierarchical code trees to identify essential components, providing only identified core elements to the LLMs rather than the entire repository. During autonomous execution, it progressively explores related components using our exploration tools and prunes information to optimize context usage. Evaluated on the adjusted MLE-bench, RepoMaster achieves a 110\% relative boost in valid submissions over the strongest baseline OpenHands. On our newly released GitTaskBench, RepoMaster lifts the task-pass rate from 40.7% to 62.9% while reducing token usage by 95%. Our code and demonstration materials are publicly available at https://github.com/QuantaAlpha/RepoMaster.

View full details

Poster

MoCha: Towards Movie-Grade Talking Character Generation

Cong Wei ⋅ Bo Sun ⋅ Haoyu Ma ⋅ Ji Hou ⋅ Felix Juefei-Xu ⋅ Zecheng He ⋅ Xiaoliang Dai ⋅ Luxin Zhang ⋅ Kunpeng Li ⋅ Tingbo Hou ⋅ Animesh Sinha ⋅ Peter Vajda ⋅ Wenhu Chen

Dec 3, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Recent advancements in video generation have achieved impressive motion realism, yet they often overlook character-driven storytelling, a crucial task for automated film, animation generation. We introduce Talking Characters, a more realistic task to generate talking character animations directly from speech and text. Unlike talking head tasks, Talking Characters aims at generating the full portrait of one or more characters beyond the facial region. In this paper, we propose MoCha, the first of its kind to generate talking characters. To ensure precise synchronization between video and speech, we propose a localized audio attention mechanism that effectively aligns speech and video tokens. To address the scarcity of large-scale speech-labelled video datasets, we introduce a joint training strategy that leverages both speech-labelled and text-labelled video data, significantly improving generalization across diverse character actions. We also design structured prompt templates with character tags, enabling, for the first time, multi-character conversation with turn-based dialogue—allowing AI-generated characters to engage in context-aware conversations with cinematic coherence. Extensive qualitative and quantitative evaluations, including human evaluation studies and benchmark comparisons, demonstrate that MoCha sets a new standard for AI-generated cinematic storytelling, achieving superior realism, controllability and generalization.

View full details

Poster

On the Surprising Effectiveness of Large Learning Rates under Standard Width Scaling

Moritz Haas ⋅ Sebastian Bordt ⋅ Ulrike Luxburg ⋅ Leena Chennuru Vankadara

Dec 3, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Scaling limits, such as infinite-width limits, serve as promising theoretical tools to study large-scale models. However, it is widely believed that existing infinite-width theory does not faithfully explain the behavior of practical networks, especially those trained in *standard parameterization* (SP) meaning He initialization with a global learning rate. For instance, existing theory for SP predicts instability at large learning rates and vanishing feature learning at stable ones. In practice, however, optimal learning rates decay slower than theoretically predicted and networks exhibit both stable training and non-trivial feature learning, even at very large widths. Here, we show that this discrepancy is not fully explained by finite-width phenomena. Instead, we find a resolution through a finer-grained analysis of the regime previously considered unstable and therefore uninteresting. In particular, we show that, under the cross-entropy (CE) loss, the unstable regime comprises two distinct sub-regimes: a catastrophically unstable regime and a more benign controlled divergence regime, where logits diverge but gradients and activations remain stable. Moreover, under large learning rates at the edge of the controlled divergence regime, there exists a well-defined infinite width limit where features continue to evolve in all the hidden layers. In experiments across optimizers, architectures, and data modalities, we validate that neural networks operate in this controlled divergence regime under CE loss but not under MSE loss. Our empirical evidence suggests that width-scaling considerations are surprisingly useful for predicting empirically maximal stable learning rate exponents which provide useful guidance on optimal learning rate exponents. Finally, our analysis clarifies the effectiveness and limitations of recently proposed layerwise learning rate scalings for standard initialization.

View full details

Poster

Mitigating the Privacy–Utility Trade-off in Decentralized Federated Learning via f-Differential Privacy

Xiang Li ⋅ Chendi Wang ⋅ Buxin Su ⋅ Qi Long ⋅ Weijie Su

Dec 3, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Differentially private (DP) decentralized Federated Learning (FL) allows local users to collaborate without sharing their data with a central server. However, accurately quantifying the privacy budget of private FL algorithms is challenging due to the co-existence of complex algorithmic components such as decentralized communication and local updates. This paper addresses privacy accounting for two decentralized FL algorithms within the $f$-differential privacy ($f$-DP) framework. We develop two new $f$-DP–based accounting methods tailored to decentralized settings: Pairwise Network $f$-DP (PN-$f$-DP), which quantifies privacy leakage between user pairs under random-walk communication, and Secret-based $f$-Local DP (Sec-$f$-LDP), which supports structured noise injection via shared secrets. By combining tools from $f$-DP theory and Markov chain concentration, our accounting framework captures privacy amplification arising from sparse communication, local iterations, and correlated noise. Experiments on synthetic and real datasets demonstrate that our methods yield consistently tighter $(\epsilon, \delta)$ bounds and improved utility compared to Rényi DP–based approaches, illustrating the benefits of $f$-DP in decentralized privacy accounting.

View full details

Poster

LaViDa: A Large Diffusion Language Model for Multimodal Understanding

Shufan Li ⋅ Konstantinos Kallidromitis ⋅ Hritik Bansal ⋅ Akash Gokul ⋅ Yusuke Kato ⋅ Kazuki Kozuka ⋅ Jason Kuen ⋅ Zhe Lin ⋅ Kai-Wei Chang ⋅ Aditya Grover

Dec 3, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Modern Vision-Language Models (VLMs) can solve a wide range of tasks requiring visual reasoning. In real-world scenarios, desirable properties for VLMs include fast inference and controllable generation (e.g., constraining outputs to adhere to a desired format). However, existing autoregressive (AR) VLMs like LLaVA struggle in these aspects. Discrete diffusion models (DMs) offer a promising alternative, enabling parallel decoding for faster inference and bidirectional context for controllable generation through text-infilling. While effective in language-only settings, DMs' potential for multimodal tasks is underexplored. We introduce LaViDa, a family of VLMs built on DMs. We build LaViDa by equipping DMs with a vision encoder and jointly fine-tune the combined parts for multimodal instruction following. To address challenges encountered, LaViDa incorporates novel techniques such as complementary masking for effective training, prefix KV cache for efficient inference, and timestep shifting for high-quality sampling. Experiments show that LaViDa achieves competitive or superior performance to AR VLMs on multi-modal benchmarks such as MMMU, while offering unique advantages of DMs, including flexible speed-quality tradeoff, controllability, and bidirectional reasoning. On COCO captioning, LaViDa surpasses Open-LLaVa-Next-8B by +4.1 CIDEr with 1.92x speedup. On bidirectional tasks, it achieves +59% improvement on Constrained Poem Completion. These results demonstrate LaViDa as a strong alternative to AR VLMs. Code and models is available at https://github.com/jacklishufan/LaViDa

View full details

Poster

X-Field: A Physically Informed Representation for 3D X-ray Reconstruction

Feiran Wang ⋅ Jiachen Tao ⋅ Junyi Wu ⋅ Haoxuan Wang ⋅ Bin Duan ⋅ Kai Wang ⋅ Zongxin Yang ⋅ Yan Yan

Dec 3, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

X-ray imaging is indispensable in medical diagnostics, yet its use is tightly regulated due to radiation exposure. Recent research borrows representations from the 3D reconstruction area to complete two tasks with reduced radiation dose: X-ray Novel View Synthesis (NVS) and Computed Tomography (CT) reconstruction. However, these representations fail to fully capture the penetration and attenuation properties of X-ray imaging as they originate from visible light imaging. In this paper, we introduce X-Field, a 3D representation informed in the physics of X-ray imaging. First, we employ homogeneous 3D ellipsoids with distinct attenuation coefficients to accurately model diverse materials within internal structures. Second, we introduce an efficient path-partitioning algorithm that resolves the intricate intersection of ellipsoids to compute cumulative attenuation along an X-ray path. We further propose a hybrid progressive initialization to refine the geometric accuracy of X-Field and incorporate material-based optimization to enhance model fitting along material boundaries. Experiments show that X-Field achieves superior visual fidelity on both real-world human organ and synthetic object datasets, outperforming state-of-the-art methods in X-ray NVS and CT Reconstruction. Our code is available on the project page: https://github.com/Brack-Wang/X-Field.

View full details

Poster

Lost in Transmission: When and Why LLMs Fail to Reason Globally

Tobias Schnabel ⋅ Kiran Tomlinson ⋅ Adith Swaminathan ⋅ Jennifer Neville

Dec 3, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Despite their many successes, transformer-based large language models (LLMs) continue to struggle with tasks that require complex reasoning over large parts of their input. We argue that these failures arise due to capacity limits on the accurate flow of information within LLMs. To formalize this issue, we introduce the bounded attention prefix oracle (BAPO) model, a new computational framework that models bandwidth constraints on attention heads, the mechanism for internal communication in LLMs. We show that several important reasoning problems like graph reachability require high communication bandwidth for BAPOs to solve; we call these problems BAPO-hard. Our experiments corroborate our theoretical predictions: GPT-4o, Claude, and Gemini succeed on BAPO-easy tasks and fail even on relatively small BAPO-hard tasks. BAPOs also reveal another benefit of chain of thought (CoT): we prove that breaking down a task using CoT can turn any BAPO-hard problem into a BAPO-easy one. Our results offer principled explanations for key LLM failures and suggest directions for architectures and inference methods that mitigate bandwidth limits.

View full details

Poster

Distillation Robustifies Unlearning

Bruce W, Lee ⋅ Addie Foote ⋅ Alex Infanger ⋅ Leni Shor ⋅ Harish Kamath ⋅ Jacob Goldman-Wetzler ⋅ Bryce Woodworth ⋅ Alex Cloud ⋅ Alexander Turner

Dec 3, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Current LLM unlearning methods are not robust. A few steps of finetuning can revert their effects. We begin by showing that this is true even for an idealized form of unlearning: training to imitate a model that was never trained on unwanted information. This shows that training a model can drastically modify its input-output behavior while leaving its underlying capabilities intact. In light of this dynamic, we show our main result. Training a randomly initialized student on the outputs of an unlearned model transfers behaviors while leaving latent capabilities behind. In short, distillation robustifies unlearning. Based on this result, we propose Unlearn-Noise-Distill-on-Outputs (UNDO), a scalable method that distills an unlearned model into a noised copy of itself. UNDO introduces a tunable tradeoff between compute cost and robustness, establishing a new Pareto frontier on synthetic language and arithmetic tasks. At its strongest setting, UNDO matches the robustness of a model retrained from scratch with perfect data filtering while using only 60-80% of the compute and requiring only 0.01% of the pretraining data to be labeled. We also show that UNDO robustifies unlearning on the more realistic Weapons of Mass Destruction Proxy (WMDP) benchmark. Since distillation is widely used in practice, incorporating an unlearning step beforehand offers a convenient path to robust capability removal.

View full details

Poster

VL-Rethinker: Incentivizing Self-Reflection of Vision-Language Models with Reinforcement Learning

Haozhe Wang ⋅ Chao Qu ⋅ Zuming Huang ⋅ Wei Chu ⋅ Fangzhen Lin ⋅ Wenhu Chen

Dec 3, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Recently, slow-thinking systems like GPT-o1 and DeepSeek-R1 have demonstrated great potential in solving challenging problems through explicit reflection. They significantly outperform the best fast-thinking models, such as GPT-4o, on various math and science benchmarks. However, their multimodal reasoning capabilities remain on par with fast-thinking models. For instance, GPT-o1's performance on benchmarks like MathVista, MathVerse, and MathVision is similar to fast-thinking models. In this paper, we aim to enhance the slow-thinking capabilities of vision-language models using reinforcement learning (without relying on distillation) to advance the state of the art. First, we adapt the GRPO algorithm with a novel technique called Selective Sample Replay (SSR) to address the vanishing advantages problem. While this approach yields strong performance, the resulting RL-trained models exhibit limited self-reflection or self-verification. To further encourage slow-thinking, we introduce Forced Rethinking, which appends a rethinking trigger token to the end of rollouts in RL training, explicitly enforcing a self-reflection reasoning step. By combining these two techniques, our model, VL-Rethinker, advances state-of-the-art scores on MathVista, MathVerse to achieve 80.4%, 63.5% respectively. VL-Rethinker also achieves open-source SoTA on multi-disciplinary benchmarks such as MathVision, MMMU-Pro, EMMA, and MEGA-Bench, narrowing the gap with OpenAI-o1. We conduct comprehensive ablations and analysis to provide insights into the effectiveness of our approach.

View full details

Poster

Sharper Convergence Rates for Nonconvex Optimisation via Reduction Mappings

Evan Markou ⋅ Thalaiyasingam Ajanthan ⋅ Stephen Gould

Dec 3, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Many high-dimensional optimisation problems exhibit rich geometric structures in their set of minimisers, often forming smooth manifolds due to over-parametrisation or symmetries. When this structure is known, at least locally, it can be exploited through reduction mappings that reparametrise part of the parameter space to lie on the solution manifold. These reductions naturally arise from inner optimisation problems and effectively remove redundant directions, yielding a lower-dimensional objective. In this work, we introduce a general framework to understand how such reductions influence the optimisation landscape. We show that well-designed reduction mappings improve curvature properties of the objective, leading to better-conditioned problems and theoretically faster convergence for gradient-based methods. Our analysis unifies a range of scenarios where structural information at optimality is leveraged to accelerate convergence, offering a principled explanation for the empirical gains observed in such optimisation algorithms.

View full details

Poster

ROGR: Relightable 3D Objects using Generative Relighting

Jiapeng Tang ⋅ Matthew Levine ⋅ Dor Verbin ⋅ Stephan Garbin ⋅ Matthias Niessner ⋅ Ricardo Martin Brualla ⋅ Pratul Srinivasan ⋅ Philipp Henzler

Dec 3, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

We introduce ROGR, a novel approach that reconstructs a relightable 3D model of an object captured from multiple views, driven by a generative relighting model that simulates the effects of placing the object under novel environment illuminations. Our method samples the appearance of the object under multiple lighting environments, creating a dataset that is used to train a lighting-conditioned Neural Radiance Field (NeRF) that outputs the object's appearance under any input environmental lighting. The lighting-conditioned NeRF uses a novel dual-branch architecture to encode the general lighting effects and specularities separately. The optimized lighting-conditioned NeRF enables efficient feed-forward relighting under arbitrary environment maps without requiring per-illumination optimization or light transport simulation. We evaluate our approach on the established TensoIR and Stanford-ORB datasets, where it improves upon the state-of-the-art on most metrics, and showcase our approach on real-world object captures.

View full details

Poster

Composite Flow Matching for Reinforcement Learning with Shifted-Dynamics Data

Lingkai Kong ⋅ Haichuan Wang ⋅ Tonghan Wang ⋅ GUOJUN XIONG ⋅ Milind Tambe

Dec 3, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Incorporating pre-collected offline data can substantially improve the sample efficiency of reinforcement learning (RL), but its benefits can break down when the transition dynamics in the offline dataset differ from those encountered online. Existing approaches typically mitigate this issue by penalizing or filtering offline transitions in regions with large dynamics gap. However, their dynamics-gap estimators often rely on KL divergence or mutual information, which can be ill-defined when offline and online dynamics have mismatched support. To address this challenge, we propose CompFlow, a principled framework built on the theoretical connection between flow matching and optimal transport. Specifically, we model the online dynamics as a conditional flow built upon the output distribution of a pretrained offline flow, rather than learning it directly from a Gaussian prior. This composite structure provides two advantages: (1) improved generalization when learning online dynamics under limited interaction data, and (2) a well-defined and stable estimate of the dynamics gap via the Wasserstein distance between offline and online transitions. Building on this dynamics-gap estimator, we further develop an optimistic active data collection strategy that prioritizes exploration in high-gap regions, and show theoretically that it reduces the performance gap to the optimal policy. Empirically, CompFlow consistently outperforms strong baselines across a range of RL benchmarks with shifted-dynamics data.

View full details

Poster

Accelerating data-driven algorithm selection for combinatorial partitioning problems

Vaggos Chatziafratis ⋅ Ishani Karmarkar ⋅ Yingxi Li ⋅ Ellen Vitercik

Dec 3, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Data-driven algorithm selection is a powerful approach for choosing effective heuristics for computational problems. It operates by evaluating a set of candidate algorithms on a collection of representative training instances and selecting the one with the best empirical performance. However, running each algorithm on every training instance is computationally expensive, making scalability a central challenge. In practice, a common workaround is to evaluate algorithms on smaller proxy instances derived from the original inputs. However, this practice has remained largely ad hoc and lacked theoretical grounding. We provide the first theoretical foundations for this practice by formalizing the notion of size generalization: predicting an algorithm's performance on a large instance by evaluating it on a smaller, representative instance, subsampled from the original instance. We provide size generalization guarantees for three widely used clustering algorithms (single-linkage, k-means++, and Gonzalez's k-centers heuristic) and two canonical max-cut algorithms (Goemans-Williamson and Greedy). We characterize the subsample size sufficient to ensure that performance on the subsample reflects performance on the full instance, and our experiments support these findings.

View full details

Poster

Reinforcement Learning for Out-of-Distribution Reasoning in LLMs: An Empirical Study on Diagnosis-Related Group Coding

Hanyin Wang ⋅ Zhenbang Wu ⋅ Gururaj Kolar ⋅ Hariprasad Korsapati ⋅ Brian Bartlett ⋅ Bryan Hull ⋅ Jimeng Sun

Dec 3, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Diagnosis-Related Group (DRG) codes are essential for hospital reimbursement and operations but require labor-intensive assignment. Large Language Models (LLMs) struggle with DRG coding due to the out-of-distribution (OOD) nature of the task: pretraining corpora rarely contain private clinical or billing data. We introduce DRG-Sapphire, which uses large-scale reinforcement learning (RL) for automated DRG coding from clinical notes. Built on Qwen2.5-7B and trained with Group Relative Policy Optimization (GRPO) using rule-based rewards, DRG-Sapphire introduces a series of RL enhancements to address domain-specific challenges not seen in previous mathematical tasks. Our model achieves state-of-the-art accuracy on the MIMIC-IV benchmark and generates physician-validated reasoning for DRG assignments, significantly enhancing explainability. Our study further sheds light on broader challenges of applying RL to knowledge-intensive, OOD tasks. We observe that RL performance scales approximately linearly with the logarithm of the number of supervised fine-tuning (SFT) examples, suggesting that RL effectiveness is fundamentally constrained by the domain knowledge encoded in the base model. For OOD tasks like DRG coding, strong RL performance requires sufficient knowledge infusion prior to RL. Consequently, scaling SFT may be more effective and computationally efficient than scaling RL alone for such tasks.

View full details

Poster

FAPEX: Fractional Amplitude-Phase Expressor for Robust Cross-Subject Seizure Prediction

Ruizhe Zheng ⋅ Lingyan Mao ⋅ DINGDING HAN ⋅ Tian Luo ⋅ Yi Wang ⋅ Jing Ding ⋅ Yuguo Yu

Dec 3, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Precise, generalizable subject-agnostic seizure prediction (SASP) remains a fundamental challenge due to the intrinsic complexity and significant spectral variability of electrophysiologial signals across individuals and recording modalities. We propose \model{FAPEX}, a novel architecture that introduces a learnable \emph{fractional neural frame operator} (FrNFO) for adaptive time–frequency decomposition. Unlike conventional models that exhibit spectral bias toward low frequencies, our FrNFO employs fractional-order convolutions to capture both high and low-frequency dynamics, achieving approximately $10\%$ improvement in F1-score and sensitivity over state-of-the-art baselines. The FrNFO enables the extraction of \emph{instantaneous phase and amplitude representations} that are particularly informative for preictal biomarker discovery and enhance out-of-distribution generalization. \model{FAPEX} further integrates structural state-space modeling and channelwise attention, allowing it to handle heterogeneous electrode montages. Evaluated across 12 benchmarks spanning species (human, rat, dog, macaque) and modalities (Scalp‑EEG, SEEG, ECoG, LFP), \model{FAPEX} consistently outperforms 23 supervised and 10 self-supervised baselines under nested cross-validation, with gains of up to $15\%$ in sensitivity on complex cross-domain scenarios. It further demonstrates superior performance in several external validation cohorts. To our knowledge, these establish \model{FAPEX} as the first epilepsy model to show consistent superiority in SASP, offering a promising solution for discovering epileptic biomarker evidence supporting the existence of a distinct and identifiable preictal state for and clinical translation.

View full details

Poster

Breaking the Batch Barrier (B3) of Contrastive Learning via Smart Batch Mining

Raghuveer Thirukovalluru ⋅ Rui Meng ⋅ Ye Liu ⋅ Karthikeyan K ⋅ Mingyi Su ⋅ Ping Nie ⋅ Semih Yavuz ⋅ Yingbo Zhou ⋅ Wenhu Chen ⋅ Bhuwan Dhingra

Dec 3, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Contrastive learning (CL) is a prevalent technique for training embedding models, which pulls semantically similar examples (positives) closer in the representation space while pushing dissimilar ones (negatives) further apart. A key source of negatives are "in-batch" examples, i.e., positives from other examples in the batch. Effectiveness of such models is hence strongly influenced by the size and quality of training batches. In this work, we propose *Breaking the Batch Barrier* (B3), a novel batch construction strategy designed to curate high-quality batches for CL. Our approach begins by using a pretrained teacher embedding model to rank all examples in the dataset, from which a sparse similarity graph is constructed. A community detection algorithm is then applied to this graph to identify clusters of examples that serve as strong negatives for one another. The clusters are then used to construct batches that are rich in in-batch negatives. Empirical results on the MMEB multimodal embedding benchmark (36 tasks) demonstrate that our method sets a new state of the art, outperforming previous best methods by +1.3 and +2.9 points at the 7B and 2B model scales, respectively. Notably, models trained with B3 surpass existing state-of-the-art results even with a batch size as small as 64, which is 4–16× smaller than that required by other methods. Moreover, experiments show that B3 generalizes well across domains and tasks, maintaining strong performance even when trained with considerably weaker teachers.

View full details

Poster

Minimax-Optimal Univariate Function Selection in Sparse Additive Models: Rates, Adaptation, and the Estimation-Selection Gap

Shixiang Liu

Dec 3, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

The sparse additive model (SpAM) offers a trade-off between interpretability and flexibility, and hence is a powerful model for high-dimensional research. This paper focuses on the variable selection, i.e., the univariate function selection problem in SpAM. We establish the minimax separation rates from both the perspectives of sparse multiple testing (FDR + FNR control) and support recovery (wrong recovery probability control). We further study how adaptation to unknown smoothness affects the minimax separation rate, and propose an adaptive selection procedure. Finally, we discuss the difference between estimation and selection in SpAM: Procedures achieving optimal function estimation may fail to achieve optimal univariate function selection.

View full details

Poster

Adaptive Prediction-Powered AutoEval with Reliability and Efficiency Guarantees

Sangwoo Park ⋅ Matteo Zecchin ⋅ Osvaldo Simeone

Dec 3, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Selecting artificial intelligence (AI) models, such as large language models (LLMs), from multiple candidates requires accurate performance estimation. This is ideally achieved through empirical evaluations involving abundant real-world data. However, such evaluations are costly and impractical at scale. To address this challenge, autoevaluation methods leverage synthetic data produced by automated evaluators, such as LLMs-as-judges, reducing variance but potentially introducing bias. Recent approaches have employed semi-supervised prediction-powered inference ($\texttt{PPI}$) to correct for the bias of autoevaluators. However, the use of autoevaluators may lead in practice to a degradation in sample efficiency compared to conventional methods using only real-world data. In this paper, we propose $\texttt{R-AutoEval+}$, a novel framework that provides finite-sample reliability guarantees on the model evaluation, while also ensuring an enhanced (or at least no worse) sample efficiency compared to conventional methods. The key innovation of $\texttt{R-AutoEval+}$ is an adaptive construction of the model evaluation variable, which dynamically tunes its reliance on synthetic data, reverting to conventional methods when the autoevaluator is insufficiently accurate. Experiments on the use of LLMs-as-judges for the optimization of quantization settings for the weights of an LLM, for prompt design in LLMs, and for test-time reasoning budget allocation in LLMs confirm the reliability and efficiency of $\texttt{R-AutoEval+}$.

View full details

Poster

Why Do Some Language Models Fake Alignment While Others Don't?

Abhay Sheshadri ⋅ John Hughes ⋅ Julian Michael ⋅ Alex Mallen ⋅ Arun Jose ⋅ Fabien Roger

Dec 3, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

*Alignment Faking in Large Language Models* presented a demonstration of Claude 3 Opus and Claude 3.5 Sonnet selectively complying with a helpful-only training objective to prevent modification of their behavior outside of training. We expand this analysis to 25 models and find that only 5 (Claude 3 Opus, Claude 3.5 Sonnet, Llama 3 405B, Grok 3, Gemini 2.0 Flash) comply with harmful queries more when they infer they are in training than when they infer they are in deployment. First, we study the motivations of these 5 models. Results from perturbing details of the scenario suggest that only Claude 3 Opus's compliance gap is primarily and consistently motivated by trying to keep its goals. Second, we investigate why many chat models don't fake alignment. Our results suggest this is not entirely due to a lack of capabilities: many base models fake alignment some of the time, and post-training eliminates alignment-faking for some models and amplifies it for others. We investigate 5 hypotheses for how post-training may suppress alignment faking and find that variations in refusal behavior may account for a significant portion of differences in alignment faking.

View full details

Poster

DiCo: Revitalizing ConvNets for Scalable and Efficient Diffusion Modeling

Yuang Ai ⋅ Qihang Fan ⋅ Xuefeng Hu ⋅ Zhenheng Yang ⋅ Ran He ⋅ Huaibo Huang

Dec 3, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Diffusion Transformer (DiT), a promising diffusion model for visual generation, demonstrates impressive performance but incurs significant computational overhead. Intriguingly, analysis of pre-trained DiT models reveals that global self-attention is often redundant, predominantly capturing local patterns—highlighting the potential for more efficient alternatives. In this paper, we revisit convolution as an alternative building block for constructing efficient and expressive diffusion models. However, naively replacing self-attention with convolution typically results in degraded performance. Our investigations attribute this performance gap to the higher channel redundancy in ConvNets compared to Transformers. To resolve this, we introduce a compact channel attention mechanism that promotes the activation of more diverse channels, thereby enhancing feature diversity. This leads to Diffusion ConvNet (DiCo), a family of diffusion models built entirely from standard ConvNet modules, offering strong generative performance with significant efficiency gains. On class-conditional ImageNet generation benchmarks, DiCo-XL achieves an FID of 2.05 at 256$\times$256 resolution and 2.53 at 512$\times$512, with a **2.7$\times$** and **3.1$\times$** speedup over DiT-XL/2, respectively. Furthermore, experimental results on MS-COCO demonstrate that the purely convolutional DiCo exhibits strong potential for text-to-image generation.

View full details

Poster

Enhancing Contrastive Learning with Variable Similarity

Haowen Cui ⋅ Shuo Chen ⋅ Jun Li ⋅ Jian Yang

Dec 3, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Contrastive learning has achieved remarkable success in self-supervised learning by pretraining a generalizable feature representation based on the augmentation invariance. Most existing approaches assume that different augmented views of the same instance (i.e., the *positive pairs*) remain semantically invariant. However, the augmentation results with *varying extent* may introduce semantic discrepancies or even content distortion, and thus the conventional (pseudo) supervision from augmentation invariance may lead to misguided learning objectives. In this paper, we propose a novel method called Contrastive Learning with Variable Similarity (CLVS) to accurately characterize the intrinsic similarity relationships between different augmented views. Our method dynamically adjusts the similarity based on the augmentation extent, and it ensures that strongly augmented views are always assigned lower similarity scores than weakly augmented ones. We provide a theoretical analysis to guarantee the effectiveness of the variable similarity in improving model generalizability. Extensive experiments demonstrate the superiority of our approach, achieving gains of 2.1\% on ImageNet-100 and 1.4\% on ImageNet-1k compared with the state-of-the-art methods.

View full details

Poster

Mind-the-Glitch: Visual Correspondence for Detecting Inconsistencies in Subject-Driven Generation

Abdelrahman Eldesokey ⋅ Aleksandar Cvejić ⋅ Bernard Ghanem ⋅ Peter Wonka

Dec 3, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

We propose a novel approach for disentangling visual and semantic features from the backbones of pre-trained diffusion models, enabling visual correspondence in a manner analogous to the well-established semantic correspondence. While diffusion model backbones are known to encode semantically rich features, they must also contain visual features to support their image synthesis capabilities. However, isolating these visual features is challenging due to the absence of annotated datasets. To address this, we introduce an automated pipeline that constructs image pairs with annotated semantic and visual correspondences based on existing subject-driven image generation datasets, and design a contrastive architecture to separate the two feature types. Leveraging the disentangled representations, we propose a new metric, Visual Semantic Matching (VSM), that quantifies visual inconsistencies in subject-driven image generation. Empirical results show that our approach outperforms global feature-based metrics such as CLIP, DINO, and vision--language models in quantifying visual inconsistencies while also enabling spatial localization of inconsistent regions. To our knowledge, this is the first method that supports both quantification and localization of inconsistencies in subject-driven generation, offering a valuable tool for advancing this task.

View full details

Poster

SegMASt3R: Geometry Grounded Segment Matching

Rohit Jayanti ⋅ Swayam Agrawal ⋅ Vansh Garg ⋅ Siddharth Tourani ⋅ Muhammad Haris Khan ⋅ Sourav Garg ⋅ Madhava Krishna

Dec 3, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Segment matching is an important intermediate task in computer vision that establishes correspondences between semantically or geometrically coherent regions across images. Unlike keypoint matching, which focuses on localized features, segment matching captures structured regions, offering greater robustness to occlusions, lighting variations, and viewpoint changes. In this paper, we leverage the spatial understanding of 3D foundation models to tackle wide-baseline segment matching, a challenging setting involving extreme viewpoint shifts. We propose an architecture that uses the inductive bias of these 3D foundation models to match segments across image pairs with up to $180^\circ$ rotation. Extensive experiments show that our approach outperforms state-of-the-art methods, including the SAM2 video propagator and local feature matching methods, by up to 30\% on the AUPRC metric, on ScanNet++ and Replica datasets. We further demonstrate benefits of the proposed model on relevant downstream tasks, including 3D instance mapping and object-relative navigation.

View full details

Poster

High-order Equivariant Flow Matching for Density Functional Theory Hamiltonian Prediction

Seongsu Kim ⋅ Nayoung Kim ⋅ Dongwoo Kim ⋅ Sungsoo Ahn

Dec 3, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Density functional theory (DFT) is a fundamental method for simulating quantum chemical properties, but it remains expensive due to the iterative self-consistent field (SCF) process required to solve the Kohn–Sham equations. Recently, deep learning methods are gaining attention as a way to bypass this step by directly predicting the Hamiltonian. However, they rely on deterministic regression and do not consider the highly structured nature of Hamiltonians. In this work, we propose QHFlow, a high-order equivariant flow matching framework that generates Hamiltonian matrices conditioned on molecular geometry. Flow matching models continuous-time trajectories between simple priors and complex targets, learning the structured distributions over Hamiltonians instead of direct regression. To further incorporate symmetry, we use a neural architecture that predicts SE(3)-equivariant vector fields, improving accuracy and generalization across diverse geometries. To further enhance physical fidelity, we additionally introduce a fine-tuning scheme to align predicted orbital energies with the target. QHFlow achieves state-of-the-art performance, reducing Hamiltonian error by 71% on MD17 and 53% on QH9. Moreover, we further show that QHFlow accelerates the DFT process without trading off the solution quality when initializing SCF iterations with the predicted Hamiltonian, significantly reducing the number of iterations and runtime.

View full details

Poster

Language Modeling by Language Models

Junyan Cheng ⋅ Peter Clark ⋅ Kyle Richardson

Dec 3, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

*Can we leverage LLMs to model the process of discovering novel language model (LM) architectures?* Inspired by real research, we propose a multi-agent LLM approach that simulates the conventional stages of research, from ideation and literature search (proposal stage) to design implementation (code generation), generative pre-training, and downstream evaluation (verification). Using ideas from scaling laws, our system *Genesys* employs a *Ladder of Scales* approach; new designs are proposed, adversarially reviewed, implemented, and selectively verified at increasingly larger model scales (14M$\sim$350M parameters) with a narrowing budget (the number of models we can train at each scale). To help make discovery efficient and factorizable, Genesys uses a novel genetic programming backbone, which we show has empirical advantages over commonly used direct prompt generation workflows (e.g., $\sim$86\% percentage point improvement in successful design generation, a key bottleneck). We report experiments involving 1,162 newly discovered designs (1,062 fully verified) and find the best designs to be competitive with known architectures (e.g., outperform GPT2, Mamba2, etc., on 6/9 common benchmarks). We couple these results with comprehensive system-level ablations and formal results, which give broader insights into the design of effective autonomous discovery systems.

View full details

Poster

Robust learning of halfspaces under log-concave marginals

Jane Lange ⋅ Arsen Vasilyan

Dec 3, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

We say that a classifier is $\text{\emph{adversarially robust}}$ to perturbations of norm $r$ if, with high probability over a point $x$ drawn from the input distribution, there is no point within distance $\le r$ from $x$ that is classified differently. The $\text{\emph{boundary volume}}$ is the probability that a point falls within distance $r$ of a point with a different label. This work studies the task of learning a hypothesis with small boundary volume, where the input is distributed as a subgaussian isotropic log-concave distribution over $\mathbb{R}^d$. Linear threshold functions are adversarially robust; they have boundary volume proportional to $r$. Such concept classes are efficiently learnable by polynomial regression, which produces a polynomial threshold function (PTF), but PTFs in general may have boundary volume $\Omega(1)$, even for $r \ll 1$. We give an algorithm that agnostically learns linear threshold functions and returns a classfier with boundary volume $O(r+\varepsilon)$ at radius of perturbation $r$. The time and sample complexity of $d^{\tilde{O}(1/\varepsilon^2)}$ matches the complexity of polynomial regression. Our algorithm augments the classic approach of polynomial regression with three additional steps:\ $\quad$ a) performing the $\ell_1$-error regression under $\ell_1$ noise sensitivity constraints,\ $\quad$ b) a structured partitioning and rounding step that returns a Boolean classifier with error $\mathrm{opt} + O(\varepsilon)$ and noise sensitivity $O(r+\varepsilon)$ simultaneously, and \ $\quad c)$ a local corrector that ``smooths'' a function with low noise sensitivity into a function that is adversarially robust.

View full details

Poster

Product Distribution Learning with Imperfect Advice

Arnab Bhattacharyya ⋅ XianJun, Davin Choo ⋅ Philips George John ⋅ Themis Gouleakis

Dec 3, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Given i.i.d.~samples from an unknown distribution $P$, the goal of distribution learning is to recover the parameters of a distribution that is close to $P$. When $P$ belongs to the class of product distributions on the Boolean hypercube $\{0,1\}^d$, it is known that $\Omega(d/\epsilon^2)$ samples are necessary to learn $P$ within total variation (TV) distance $\epsilon$. We revisit this problem when the learner is also given as advice the parameters of a product distribution $Q$. We show that there is an efficient algorithm to learn $P$ within TV distance $\epsilon$ that has sample complexity $\tilde{O}(d^{1-\eta}/\epsilon^2)$, if $\|\mathbf{p} - \mathbf{q}\|_1<\epsilon d^{0.5 - \Omega(\eta)}$. Here, $\mathbf{p}$ and $\mathbf{q}$ are the mean vectors of $P$ and $Q$ respectively, and no bound on $\|\mathbf{p} - \mathbf{q}\|_1$ is known to the algorithm a priori.

View full details

Poster

Transferring Linear Features Across Language Models With Model Stitching

Alan Chen ⋅ Jack Merullo ⋅ Alessandro Stolfo ⋅ Ellie Pavlick

Dec 4, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

In this work, we demonstrate that affine mappings between residual streams of language models is a cheap way to effectively transfer represented features between models. We apply this technique to transfer the \textit{weights} of Sparse Autoencoders (SAEs) between models of different sizes to compare their representations. We find that small and large models learn highly similar representation spaces, which motivates training expensive components like SAEs on a smaller model and transferring to a larger model at a FLOPs savings. For example, using a small-to-large transferred SAE as initialization can lead to 50% cheaper training runs when training SAEs on larger models. Next, we show that transferred probes and steering vectors can effectively recover ground truth performance. Finally, we dive deeper into feature-level transferability, finding that semantic and structural features transfer noticeably differently while specific classes of functional features have their roles faithfully mapped. Overall, our findings illustrate similarities and differences in the linear representation spaces of small and large models and demonstrate a method for improving the training efficiency of SAEs.

View full details

Poster

A Near-Optimal Algorithm for Decentralized Convex-Concave Finite-Sum Minimax Optimization

Hongxu Chen ⋅ Ke Wei ⋅ Haishan Ye ⋅ Luo Luo

Dec 4, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

In this paper, we study the distributed convex-concave finite-sum minimax optimization over the network, and a decentralized variance-reduced optimistic gradient method with stochastic mini-batch sizes (DIVERSE) is proposed. For the strongly-convex-strongly-concave objective, it is shown that DIVERSE can achieve a linear convergence rate that depends on the global smoothness parameters, yielding sharper computation and communication complexity bounds than existing results. Furthermore, we also establish the lower complexity bounds, which show that our upper bounds are optimal up to a logarithmic factor in terms of the local incremental first-order oracle calls, the computation rounds, and the communication rounds. Numerical experiments demonstrate that our algorithm outperforms existing methods in practice.

View full details

Poster

DNAEdit: Direct Noise Alignment for Text-Guided Rectified Flow Editing

Chenxi Xie ⋅ Minghan Li ⋅ Shuai Li ⋅ Yuhui Wu ⋅ Qiaosi Yi ⋅ Lei Zhang

Dec 4, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Leveraging the powerful generation capability of large-scale pretrained text-to-image models, training-free methods have demonstrated impressive image editing results. Conventional diffusion-based methods, as well as recent rectified flow (RF)-based methods, typically reverse synthesis trajectories by gradually adding noise to clean images, during which the noisy latent at the current timestep is used to approximate that at the next timesteps, introducing accumulated drift and degrading reconstruction accuracy. Considering the fact that in RF the noisy latent is estimated through direct interpolation between Gaussian noises and clean images at each timestep, we propose Direct Noise Alignment (DNA), which directly refines the desired Gaussian noise in the noise domain, significantly reducing the error accumulation in previous methods. Specifically, DNA estimates the velocity field of the interpolated noised latent at each timestep and adjusts the Gaussian noise by computing the difference between the predicted and expected velocity field. We validate the effectiveness of DNA and reveal its relationship with existing RF-based inversion methods. Additionally, we introduce a Mobile Velocity Guidance (MVG) to control the target prompt-guided generation process, balancing image background preservation and target object editability. DNA and MVG collectively constitute our proposed method, namely DNAEdit. Finally, we introduce DNA-Bench, a long-prompt benchmark, to evaluate the performance of advanced image editing models. Experimental results demonstrate that our DNAEdit achieves superior performance to state-of-the-art text-guided editing methods. Our code, model, and benchmark will be made publicly available.

View full details

Poster

A Closer Look at Model Collapse: From a Generalization-to-Memorization Perspective

Lianghe Shi ⋅ Meng Wu ⋅ Huijie Zhang ⋅ Zekai Zhang ⋅ Molei Tao ⋅ Qing Qu

Dec 4, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

The widespread use of diffusion models has led to an abundance of AI-generated data, raising concerns about model collapse---a phenomenon in which recursive iterations of training on synthetic data lead to performance degradation. Prior work primarily characterizes this collapse via variance shrinkage or distribution shift, but these perspectives miss practical manifestations of model collapse. This paper identifies a transition from generalization to memorization during model collapse in diffusion models, where models increasingly replicate training data instead of generating novel content during iterative training on synthetic samples. This transition is directly driven by the declining entropy of the synthetic training data produced in each training cycle, which serves as a clear indicator of model degradation. Motivated by this insight, we propose an entropy-based data selection strategy to mitigate the transition from generalization to memorization and alleviate model collapse. Empirical results show that our approach significantly enhances visual quality and diversity in recursive generation, effectively preventing collapse.

View full details

Poster

Stable Part Diffusion 4D: Multi-View RGB and Kinematic Parts Video Generation

Hao Zhang ⋅ Chun-Han Yao ⋅ Simon Donné ⋅ Narendra Ahuja ⋅ Varun Jampani

Dec 4, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

We present Stable Part Diffusion 4D (SP4D), a framework for generating paired RGB and kinematic part videos from monocular inputs. Unlike conventional part segmentation methods that rely on appearance-based semantic cues, SP4D learns to produce kinematic parts --- structural components aligned with object articulation and consistent across views and time. SP4D adopts a dual-branch diffusion model that jointly synthesizes RGB frames and corresponding part segmentation maps. To simplify architecture and flexibly enable different part counts, we introduce a spatial color encoding scheme that maps part masks to continuous RGB-like images. This encoding allows the segmentation branch to share the latents VAE from the RGB branch, while enabling part segmentation to be recovered via straightforward post-processing. A Bidirectional Diffusion Fusion (BiDiFuse) module enhances cross-branch consistency, supported by a contrastive part consistency loss to promote spatial and temporal alignment of part predictions. We demonstrate that the generated 2D part maps can be lifted to 3D to derive skeletal structures and harmonic skinning weights with few manual adjustments. To train and evaluate SP4D, we construct KinematicParts20K, a curated dataset of over 20K rigged objects selected and processed from Objaverse XL, each paired with multi-view RGB and part video sequences. Experiments show that SP4D generalizes strongly to diverse scenarios, including real-world videos, novel generated objects, and rare articulated poses, producing kinematic-aware outputs suitable for downstream animation and motion-related tasks.

View full details

Poster

Brain-Inspired fMRI-to-Text Decoding via Incremental and Wrap-Up Language Modeling

Wentao Lu ⋅ Dong Nie ⋅ Pengcheng Xue ⋅ Zheng Cui ⋅ Piji Li ⋅ Daoqiang Zhang ⋅ Xuyun Wen

Dec 4, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Decoding natural language text from non-invasive brain signals, such as functional magnetic resonance imaging (fMRI), remains a central challenge in brain-computer interface research. While recent advances in large language models (LLMs) have enabled open-vocabulary fMRI-to-text decoding, existing frameworks typically process the entire fMRI sequence in a single step, leading to performance degradation when handling long input sequences due to memory overload and semantic drift. To address this limitation, we propose a brain-inspired sequential fMRI-to-text decoding framework that mimics the human cognitive strategy of segmented and inductive language processing. Specifically, we divide long fMRI time series into consecutive segments aligned with optimal language comprehension length. Each segment is decoded incrementally, followed by a wrap-up mechanism that summarizes the semantic content and incorporates it as prior knowledge into subsequent decoding steps. This sequence-wise approach alleviates memory burden and ensures semantic continuity across segments. In addition, we introduce a text-guided masking strategy integrated with a masked autoencoder (MAE) framework for fMRI representation learning. This method leverages attention distributions over key semantic tokens to selectively mask the corresponding fMRI time points, and employs MAE to guide the model toward focusing on neural activity at semantically salient moments, thereby enhancing the capability of fMRI embeddings to represent textual information. Experimental results on the two datasets demonstrate that our method significantly outperforms state-of-the-art approaches, with performance gains increasing as decoding length grows.

View full details

Poster

Provably Efficient RL under Episode-Wise Safety in Constrained MDPs with Linear Function Approximation

Toshinori Kitamura ⋅ Arnob Ghosh ⋅ Tadashi Kozuno ⋅ Wataru Kumagai ⋅ Kazumi Kasaura ⋅ Kenta Hoshino ⋅ Yohei Hosoe ⋅ Yutaka Matsuo

Dec 4, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

We study the reinforcement learning (RL) problem in a constrained Markov decision process (CMDP), where an agent explores the environment to maximize the expected cumulative reward while satisfying a single constraint on the expected total utility value in every episode. While this problem is well understood in the tabular setting, theoretical results for function approximation remain scarce. This paper closes the gap by proposing an RL algorithm for linear CMDPs that achieves $\widetilde{\mathcal{O}}(\sqrt{K})$ regret with an episode-wise zero-violation guarantee. Furthermore, our method is computationally efficient, scaling polynomially with problem-dependent parameters while remaining independent of the state space size. Our results significantly improve upon recent linear CMDP algorithms, which either violate the constraint or incur exponential computational costs.

View full details

Poster

Multidimensional Bayesian Utility Maximization: Tight Approximations to Welfare

Kira Goldner ⋅ Taylor Lundy

Dec 4, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

We initiate the study of multidimensional Bayesian utility maximization, focusing on the unit-demand setting where values are i.i.d. across both items and buyers. The seminal result of Hartline and Roughgarden '08 studies simple, information-robust mechanisms that maximize utility for $n$ i.i.d. agents and $m$ identical items via an approximation to social welfare as an upper bound, and they prove this gap between optimal utility and social welfare is $\Theta(1+\log{n/m})$ in this setting. We extend these results to the multidimensional setting. To do so, we develop simple, prior-independent, approximately-optimal mechanisms, targeting the simplest benchmark of optimal welfare. We give a $(1-1/e)$-approximation when there are more items than buyers, and a $\Theta(\log{n/m})$-approximation when there are more buyers than items, and we prove that this bound is tight in both $n$ and $m$ by reducing the i.i.d. unit-demand setting to the identical items setting. Finally, we include an extensive discussion section on why Bayesian utility maximization is a promising research direction. In particular, we characterize complexities in this setting that defy our intuition from the welfare and revenue literature, and motivate why coming up with a better benchmark than welfare is a hard problem itself.

View full details

Poster

On Agnostic PAC Learning in the Small Error Regime

Julian Asilis ⋅ Mikael Møller Høgsgaard ⋅ Grigoris Velegkas

Dec 4, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Binary classification in the classic PAC model exhibits a curious phenomenon: Empirical Risk Minimization (ERM) learners are suboptimal in the realizable case yet optimal in the agnostic case. Roughly speaking, this owes itself to the fact that non-realizable distributions $\\mathcal{D}$ are more difficult to learn than realizable distributions -- even when one discounts a learner's error by $\\mathrm{err}(h^\\ast_\\mathcal{D})$, i.e., the error of the best hypothesis in $\\mathcal{H}$. Thus, optimal agnostic learners are permitted to incur excess error on (easier-to-learn) distributions $\\mathcal{D}$ for which $\\tau = \\mathrm{err}(h^\\ast_\\mathcal{D})$ is small. Recent work of Hanneke, Larsen, and Zhivotovskiy (FOCS '24) addresses this shortcoming by including $\\tau$ itself as a parameter in the agnostic error term. In this more fine-grained model, they demonstrate tightness of the error lower bound $\\tau + \\Omega \\left(\\sqrt{\\frac{\\tau (d + \\log(1 / \\delta))}{m}} + \\frac{d + \\log(1 / \\delta)}{m} \\right)$ in a regime where $\\tau > d/m$, and leave open the question of whether there may be a higher lower bound when $\\tau \\approx d/m$, with $d$ denoting $\\mathrm{VC}(\\mathcal{H})$. In this work, we resolve this question by exhibiting a learner which achieves error $c \\cdot \\tau + O \\left(\\sqrt{\\frac{\\tau (d + \\log(1 / \\delta))}{m}} + \\frac{d + \\log(1 / \\delta)}{m} \\right)$ for a constant $c \\leq 2.1$, matching the lower bound and demonstrating optimality when $\\tau =O( d/m)$. Further, our learner is computationally efficient and is based upon careful aggregations of ERM classifiers, making progress on two other questions of Hanneke, Larsen, and Zhivotovskiy (FOCS '24). We leave open the interesting question of whether our approach can be refined to lower the constant from 2.1 to 1, which would completely settle the complexity of agnostic learning.

View full details

Poster

MJ-Video: Benchmarking and Rewarding Video Generation with Fine-Grained Video Preference

Haibo Tong ⋅ Zhaoyang Wang ⋅ Zhaorun Chen ⋅ Haonian Ji ⋅ Shi Qiu ⋅ Siwei Han ⋅ Kexin Geng ⋅ Zhongkai Xue ⋅ Yiyang Zhou ⋅ Peng Xia ⋅ Mingyu Ding ⋅ Rafael Rafailov ⋅ Chelsea Finn ⋅ Huaxiu Yao

Dec 4, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Recent advancements in video generation have significantly improved the ability to synthesize videos from text instructions. However, existing models still struggle with key challenges such as instruction misalignment, content hallucination, safety concerns, and generation bias. To address these limitations, we introduce MJ-BENCH-VIDEO, a large-scale video preference benchmark designed to evaluate video generation across five critical aspects: Alignment, Safety, Fineness, Coherence & Consistency, and Bias & Fairness. This benchmark further incorporates 28 fine-grained criteria to provide a comprehensive evaluation of video preference. Building upon this dataset, we propose MJ-VIDEO, a Mixture-of-Experts (MoE)-based video reward model designed to deliver fine-grained reward. MJ-VIDEO can dynamically select relevant experts to accurately judge the preference based on the input text-video pair. This architecture enables more precise and adaptable preference judgments. Through extensive benchmarking on MJ-BENCH-VIDEO, we analyze the limitations of existing video reward models and demonstrate the superior performance of MJ-VIDEO in video preference assessment, achieving 17.58% and 15.87% improvements in overall and fine-grained preference judgments, respectively. Additionally, MJ-VIDEO is able to improve the alignment performance in video generation via preference fine-tuning.

View full details

Poster

Boosting Generative Image Modeling via Joint Image-Feature Synthesis

Theodoros Kouzelis ⋅ Efstathios Karypidis ⋅ Ioannis Kakogeorgiou ⋅ Spyridon Gidaris ⋅ Nikos Komodakis

Dec 4, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Latent diffusion models (LDMs) dominate high-quality image generation, yet integrating representation learning with generative modeling remains a challenge. We introduce a novel generative image modeling framework that seamlessly bridges this gap by leveraging a diffusion model to jointly model low-level image latents (from a variational autoencoder) and high-level semantic features (from a pretrained self-supervised encoder like DINO). Our latent-semantic diffusion approach learns to generate coherent image-feature pairs from pure noise, significantly enhancing both generative quality and training efficiency, all while requiring only minimal modifications to standard Diffusion Transformer architectures. By eliminating the need for complex distillation objectives, our unified design simplifies training and unlocks a powerful new inference strategy: Representation Guidance, which leverages learned semantics to steer and refine image generation. Evaluated in both conditional and unconditional settings, our method delivers substantial improvements in image quality and training convergence speed, establishing a new direction for representation-aware generative modeling.

View full details

Poster

VITA-1.5: Towards GPT-4o Level Real-Time Vision and Speech Interaction

Chaoyou Fu ⋅ Haojia Lin ⋅ Xiong Wang ⋅ yifan zhang ⋅ Yunhang Shen ⋅ Xiaoyu Liu ⋅ Haoyu Cao ⋅ Zuwei Long ⋅ Heting Gao ⋅ Ke Li ⋅ Long MA ⋅ Xiawu Zheng ⋅ Rongrong Ji ⋅ Xing Sun ⋅ Caifeng Shan ⋅ Ran He

Dec 4, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Recent Multimodal Large Language Models (MLLMs) have typically focused on integrating visual and textual modalities, with less emphasis placed on the role of speech in enhancing interaction. However, speech plays a crucial role in multimodal dialogue systems, and implementing high-performance in both vision and speech tasks remains a challenge due to the fundamental modality differences. In this paper, we propose a carefully designed multi-stage training methodology that progressively trains LLM to understand both visual and speech information, ultimately enabling fluent vision and speech interaction. Our approach not only preserves strong vision-language capacity, but also enables efficient speech-to-speech dialogue capabilities without separate ASR and TTS modules, significantly accelerating multimodal end-to-end response speed. By comparing against state-of-the-art counterparts across benchmarks for image, video, and speech, we demonstrate that our omni model is equipped with both strong visual and speech capabilities, making omni understanding and interaction.

View full details

Poster

Projection-based Lyapunov method for fully heterogeneous weakly-coupled MDPs

Xiangcheng Zhang ⋅ Yige Hong ⋅ Weina Wang

Dec 4, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Heterogeneity poses a fundamental challenge for many real-world large-scale decision-making problems but remains largely understudied. In this paper, we study the _fully heterogeneous_ setting of a prominent class of such problems, known as weakly-coupled Markov decision processes (WCMDPs). Each WCMDP consists of $N$ arms (or subproblems), which have distinct model parameters in the fully heterogeneous setting, leading to the curse of dimensionality when $N$ is large. We show that, under mild assumptions, an efficiently computable policy achieves an $O(1/\sqrt{N})$ optimality gap in the long-run average reward per arm for fully heterogeneous WCMDPs as $N$ becomes large. This is the _first asymptotic optimality result_ for fully heterogeneous average-reward WCMDPs. Our main technical innovation is the construction of projection-based Lyapunov functions that certify the convergence of rewards and costs to an optimal region, even under full heterogeneity.

View full details

Poster

Multiverse: Your Language Models Secretly Decide How to Parallelize and Merge Generation

Xinyu Yang ⋅ Yuwei An ⋅ Hongyi Liu ⋅ Tianqi Chen ⋅ Beidi Chen

Dec 4, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Autoregressive Large Language Models (AR-LLMs) frequently exhibit implicit parallelism in sequential generation. Inspired by this, we introduce Multiverse, a new generative model enabling natively parallel generation. Multiverse internalizes a MapReduce paradigm, generating automatically through three stages: (i) a Map stage for adaptive task decomposition, (ii) a Process stage for parallel subtask execution, and (iii) a Reduce stage for lossless result synthesis. Next, we build a real-world Multiverse reasoning model with co-design of data, algorithm, and system, enabling rapid and seamless transfer from frontier AR-LLMs. For data creation, we develop Multiverse Curator, an automated LLM-assisted pipeline that transforms sequential reasoning chains into structured training data, avoiding costly human annotations. Algorithmically, we design Multiverse Attention to separate parallel reasoning steps while keeping compatibility with causal attention for efficient training. Systematically, we implement Multiverse Engine to support parallel inference. It features a dedicated interpreter that dynamically switches between sequential and parallel generation, triggered directly by the model. After a 3-hour fine-tuning with 1K examples, our Multiverse-32B stands as the only open-sourced non-AR model achieving performance on par with leading AR-LLMs of the same scale, evidenced by AIME24 & 25 scores of 54% and 46%, respectively. Moreover, our budget control experiments show that Multiverse-32B exhibits superior scaling, outperforming AR-LLMs by 1.87% on average using the same context length. Such scaling further leads to practical efficiency gain, achieving up to 2x speedup across varying batch sizes. We have open-sourced the entire Multiverse ecosystem, including data, model weights, serving system, supporting tools, as well as data curation prompts and detailed training and evaluation recipes.

View full details

Poster

FP4 All the Way: Fully Quantized Training of Large Language Models

Brian Chmiel ⋅ Maxim Fishman ⋅ Ron Banner ⋅ Daniel Soudry

Dec 4, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

We demonstrate, for the first time, fully quantized training (FQT) of large language models (LLMs) using predominantly 4-bit floating-point (FP4) precision for weights, activations, and gradients on datasets up to 200 billion tokens. We extensively investigate key design choices for FP4, including block sizes, scaling formats, and rounding methods. Our analysis shows that the NVFP4 format, where each block of 16 FP4 values (E2M1) shares a scale represented in E4M3, provides optimal results. We use stochastic rounding for backward and update passes and round-to-nearest for the forward pass to enhance stability. Additionally, we identify a theoretical and empirical threshold for effective quantized training: when the gradient norm falls below approximately $\sqrt{3}$ times the quantization noise, quantized training becomes less effective. Leveraging these insights, we successfully train a 7-billion-parameter model on 256 Intel Gaudi2 accelerators. The resulting FP4-trained model achieves downstream task performance comparable to a standard BF16 baseline, confirming that FP4 training is a practical and highly efficient approach for large-scale LLM training. A reference implementation is supplied in https://github.com/Anonymous1252022/fp4-all-the-way.

View full details

Poster

Transfer Learning for Benign Overfitting in High-Dimensional Linear Regression

Yeichan Kim ⋅ Ilmun Kim ⋅ Seyoung Park

Dec 4, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Transfer learning is a key component of modern machine learning, enhancing the performance of target tasks by leveraging diverse data sources. Simultaneously, overparameterized models such as the minimum-$\ell_2$-norm interpolator (MNI) in high-dimensional linear regression have garnered significant attention for their remarkable generalization capabilities, a property known as *benign overfitting*. Despite their individual importance, the intersection of transfer learning and MNI remains largely unexplored. Our research bridges this gap by proposing a novel two-step Transfer MNI approach and analyzing its trade-offs. We characterize its non-asymptotic excess risk and identify conditions under which it outperforms the target-only MNI. Our analysis reveals *free-lunch* covariate shift regimes, where leveraging heterogeneous data yields the benefit of knowledge transfer at limited cost. To operationalize our findings, we develop a data-driven procedure to detect *informative* sources and introduce an ensemble method incorporating multiple informative Transfer MNIs. Finite-sample experiments demonstrate the robustness of our methods to model and data heterogeneity, confirming their advantage.

View full details

Poster

Flash Invariant Point Attention

Andrew Liu ⋅ Axel Elaldi ⋅ Nicholas Franklin ⋅ Nathan Russell ⋅ Gurinder Atwal ⋅ Yih-En Ban ⋅ Olivia Viessmann

Dec 4, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Invariant Point Attention (IPA) is a key algorithm for geometry-aware modeling in structural biology, central to many protein and RNA models. However, its quadratic complexity limits the input sequence length. We introduce FlashIPA, a factorized reformulation of IPA that leverages hardware-efficient FlashAttention to achieve linear scaling in GPU memory and wall-clock time with sequence length. FlashIPA matches or exceeds standard IPA performance while substantially reducing computational costs. FlashIPA extends training to previously unattainable lengths, and we demonstrate this by re-training generative models without length restrictions and generating structures of thousands of residues. FlashIPA is available at https://github.com/flagshippioneering/flash_ipa.

View full details

Poster

LABridge: Text–Image Latent Alignment Framework via Mean-Conditioned OU Process

Huiyang Shao ⋅ Xin Xia ⋅ Yuxi Ren ⋅ XING WANG ⋅ Xuefeng Xiao

Dec 4, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Diffusion models have emerged as state‑of‑the‑art in image synthesis.However, it often suffer from semantic instability and slow iterative denoising. We introduce Latent Alignment Framework (LABridge), a novel Text–Image Latent Alignment Framework via an Ornstein–Uhlenbeck (OU) Process, which explicitly preserves and aligns textual and visual semantics in an aligned latent space. LABridge employs a Text-Image Alignment Encoder (TIAE) to encode text prompts into structured priors that are directly aligned with image latents. Instead of a homogeneous Gaussian, Mean-Conditioned OU process smoothly interpolates between these text‑conditioned priors and image latents, improving stability and reducing the number of denoising steps. Extensive experiments on standard text-to-image benchmarks show that LABridge achieves better text–image alignment metric and competitive FID scores compared to leading diffusion baselines. By unifying text and image representations through principled latent alignment, LABridge paves the way for more efficient, semantically consistent, and high‑fidelity text to image generation.

View full details

Poster

STARFlow: Scaling Latent Normalizing Flows for High-resolution Image Synthesis

Jiatao Gu ⋅ Tianrong Chen ⋅ David Berthelot ⋅ Huangjie Zheng ⋅ Yuyang Wang ⋅ Ruixiang Zhang ⋅ Laurent Dinh ⋅ Miguel Angel Bautista ⋅ Joshua Susskind ⋅ Shuangfei Zhai

Dec 4, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

We present STARFlow, a scalable generative model based on normalizing flows that achieves strong performance on high-resolution image synthesis. STARFlow's main building block is Transformer Autoregressive Flow (TARFlow), which combines normalizing flows with Autoregressive Transformer architectures and has recently achieved impressive results in image modeling. In this work, we first establish the theoretical universality of TARFlow for modeling continuous distributions. Building on this foundation, we introduce a set of architectural and algorithmic innovations that significantly enhance the scalability: (1) a deep-shallow design where a deep Transformer block captures most of the model’s capacity, followed by a few shallow Transformer blocks that are computationally cheap yet contribute non-negligibly, (2) learning in the latent space of pretrained autoencoders, which proves far more effective than modeling pixels directly, and (3) a novel guidance algorithm that substantially improves sample quality. Crucially, our model remains a single, end-to-end normalizing flow, allowing exact maximum likelihood training in continuous space without discretization. STARFlow achieves competitive results in both class- and text-conditional image generation, with sample quality approaching that of state-of-the-art diffusion models. To our knowledge, this is the **first** successful demonstration of normalizing flows at this scale and resolution. Code and weights available at https://github.com/apple/ml-starflow.

View full details

Poster

Cost-aware LLM-based Online Dataset Annotation

Eray Can Elumar ⋅ Cem Tekin ⋅ Osman Yagan

Dec 4, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Recent advances in large language models (LLMs) have enabled automated dataset labeling with minimal human supervision. While majority voting across multiple LLMs can improve label reliability by mitigating individual model biases, it incurs high computational costs due to repeated querying. In this work, we propose a novel online framework, Cost-aware Majority Voting (CaMVo), for efficient and accurate LLM-based dataset annotation. CaMVo adaptively selects a subset of LLMs for each data instance based on contextual embeddings, balancing confidence and cost without requiring pre-training or ground-truth labels. Leveraging a LinUCB-based selection mechanism and a Bayesian estimator over confidence scores, CaMVo estimates a lower bound on labeling accuracy for each LLM and aggregates responses through weighted majority voting. Our empirical evaluation on the MMLU and IMDB Movie Review datasets demonstrates that CaMVo achieves comparable or superior accuracy to full majority voting while significantly reducing labeling costs. This establishes CaMVo as a practical and robust solution for cost-efficient annotation in dynamic labeling environments.

View full details

Poster

Regret Bounds for Adversarial Contextual Bandits with General Function Approximation and Delayed Feedback

Orin Levy ⋅ Liad Erez ⋅ Alon Peled-Cohen ⋅ Yishay Mansour

Dec 4, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

We present regret minimization algorithms for the contextual multi-armed bandit (CMAB) problem over $K$ actions in the presence of delayed feedback, a scenario where loss observations arrive with delays chosen by an adversary. As a preliminary result, assuming direct access to a finite policy class $\Pi$ we establish an optimal expected regret bound of $ O (\sqrt{KT \log |\Pi|} + \sqrt{D \log |\Pi|)} $ where $D$ is the sum of delays. For our main contribution, we study the general function approximation setting over a (possibly infinite) contextual loss function class $ \mathcal{F} $ with access to an online least-square regression oracle $\mathcal{O}$ over $\mathcal{F}$. In this setting, we achieve an expected regret bound of $O(\sqrt{KTR_T(\mathcal{O})} + \sqrt{ d_{\max} D \beta})$ assuming FIFO order, where $d_{\max}$ is the maximal delay, $R_T(\mathcal{O})$ is an upper bound on the oracle's regret and $\beta$ is a stability parameter associated with the oracle. We complement this general result by presenting a novel stability analysis of a Hedge-based version of Vovk's aggregating forecaster as an oracle implementation for least-square regression over a finite function class $\mathcal{F}$ and show that its stability parameter $\beta$ is bounded by $\log |\mathcal{F}|$, resulting in an expected regret bound of $O(\sqrt{KT \log |\mathcal{F}|} + \sqrt{d_{\max} D \log |\mathcal{F}|})$ which is a $\sqrt{d_{\max}}$ factor away from the lower bound of $\Omega(\sqrt{KT \log |\mathcal{F}|} + \sqrt{D \log |\mathcal{F}|})$ that we also present.

View full details

Poster

LLM-Explorer: A Plug-in Reinforcement Learning Policy Exploration Enhancement Driven by Large Language Models

Qianyue Hao ⋅ Yiwen Song ⋅ Qingmin Liao ⋅ Jian Yuan ⋅ Yong Li

Dec 4, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Policy exploration is critical in reinforcement learning (RL), where existing approaches include $\epsilon$-greedy, Gaussian process, etc. However, these approaches utilize preset stochastic processes and are indiscriminately applied in all kinds of RL tasks without considering task-specific features that influence policy exploration. Moreover, during RL training, the evolution of such stochastic processes is rigid, which typically only incorporates a decay in the variance, failing to adjust flexibly according to the agent's real-time learning status. Inspired by the analyzing and reasoning capability of large language models (LLMs), we design **LLM-Explorer** to adaptively generate task-specific exploration strategies with LLMs, enhancing the policy exploration in RL. In our design, we sample the learning trajectory of the agent during the RL training in a given task and prompt the LLM to analyze the agent's current policy learning status and then generate a probability distribution for future policy exploration. Updating the probability distribution periodically, we derive a stochastic process specialized for the particular task and dynamically adjusted to adapt to the learning process. Our design is a plug-in module compatible with various widely applied RL algorithms, including the DQN series, DDPG, TD3, and any possible variants developed based on them. Through extensive experiments on the Atari and MuJoCo benchmarks, we demonstrate LLM-Explorer's capability to enhance RL policy exploration, achieving an average performance improvement up to 37.27%. Our code is open-source at https://github.com/tsinghua-fib-lab/LLM-Explorer for reproducibility.

View full details

Poster

RF-Agent: Automated Reward Function Design via Language Agent Tree Search

Ning Gao ⋅ Xiuhui Zhang ⋅ Xingyu Jiang ⋅ Mukang You ⋅ Mohan Zhang ⋅ Yue Deng

Dec 4, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Designing efficient reward functions for low-level control tasks is a challenging problem. Recent research aims to reduce reliance on expert experience by using Large Language Models (LLMs) with task information to generate dense reward functions. These methods typically rely on training results as feedback, iteratively generating new reward functions with greedy or evolutionary algorithms. However, they suffer from poor utilization of historical feedback and inefficient search, resulting in limited improvements in complex control tasks. To address this challenge, we propose RF-Agent, a framework that treats LLMs as language agents and frames reward function design as a sequential decision-making process, enhancing optimization through better contextual reasoning. RF-Agent integrates Monte Carlo Tree Search (MCTS) to manage the reward design and optimization process, leveraging the multi-stage contextual reasoning ability of LLM. This approach better utilizes historical information and improves search efficiency to identify promising reward functions. Outstanding experimental results in 17 diverse low-level control tasks demonstrate the effectiveness of our method.

View full details

Poster

Online Strategic Classification With Noise and Partial Feedback

Tianrun Zhao ⋅ Xiaojie Mao ⋅ Yong Liang

Dec 4, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

In this paper, we study an online strategic classification problem, where a principal aims to learn an accurate binary linear classifier from sequentially arriving agents. For each agent, the principal announces a classifier. The agent can strategically exercise costly manipulations on his features to be classified as the favorable positive class. The principal is unaware of the true feature-label distribution, but observes all reported features and only labels of positively classified agents. We assume that the true feature-label distribution is given by a halfspace model subject to arbitrary feature-dependent bounded noise (i.e., Massart Noise). This problem faces the combined challenges of agents' strategic feature manipulations, partial label observations, and label noises. We tackle these challenges by a novel learning algorithm. We show that the proposed algorithm yields classifiers that converge to the clairvoyant optimal one and attains a regret rate of $ O(\sqrt{T})$ up to poly-logarithmic and constant factors over $T$ cycles.

View full details

Poster

Less is More: Improving LLM Alignment via Preference Data Selection

Xun Deng ⋅ Han Zhong ⋅ Rui Ai ⋅ Fuli Feng ⋅ Zheng Wang ⋅ Xiangnan He

Dec 4, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Direct Preference Optimization (DPO) has emerged as a promising approach for aligning large language models with human preferences. While prior work mainly extends DPO from the aspect of the objective function, we instead improve DPO from the largely overlooked but critical aspect of data selection. Specifically, we address the issue of parameter shrinkage caused by noisy data by proposing a novel margin-maximization principle for dataset curation in DPO training. To further mitigate the noise in different reward models, we propose a Bayesian Aggregation approach that unifies multiple margin sources (external and implicit) into a single preference probability. Extensive experiments in diverse settings demonstrate the consistently high data efficiency of our approach. Remarkably, by using just 10\% of the Ultrafeedback dataset, our approach achieves 3\% to 8\% improvements across various Llama, Mistral, and Qwen models on the AlpacaEval2 benchmark. Furthermore, our approach seamlessly extends to iterative DPO, yielding a roughly 3\% improvement with 25\% online data, revealing the high redundancy in this presumed high-quality data construction manner. These results highlight the potential of data selection strategies for advancing preference optimization.

View full details

Poster

ConTextTab: A Semantics-Aware Tabular In-Context Learner

Marco Spinaci ⋅ Marek Polewczyk ⋅ Maximilian Schambach ⋅ Sam Thelin

Dec 4, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Tabular in-context learning (ICL) has recently achieved state-of-the-art (SOTA) performance on several tabular prediction tasks. Previously restricted to classification problems on small tables, recent advances such as TabPFN and TabICL have extended its use to larger datasets. Although current table-native ICL architectures are architecturally efficient and well-adapted to tabular data structures, their exclusive training on synthetic data limits their ability to fully leverage the rich semantics and world knowledge contained in real-world tabular data. At the other end of the spectrum, tabular ICL models based on pretrained large language models such as TabuLa-8B integrate deep semantic understanding and world knowledge but are only able to make use of a small amount of context due to inherent architectural limitations. With the aim to combine the best of both these worlds, we introduce ConTextTab, integrating semantic understanding and alignment into a table-native ICL framework. By employing specialized embeddings for different data modalities and by training on large-scale real-world tabular data, our model is competitive with SOTA across a broad set of benchmarks while setting a new standard on the semantically rich CARTE benchmark. Code and model checkpoints are available at: https://github.com/SAP-samples/contexttab .

View full details

Poster

Shift Before You Learn: Enabling Low-Rank Representations in Reinforcement Learning

Bastien Dubail ⋅ Stefan Stojanovic ⋅ Alexandre Proutiere

Dec 4, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Low-rank structure is a common implicit assumption in many modern reinforcement learning (RL) algorithms. For instance, reward-free and goal-conditioned RL methods often presume that the successor measure admits a low-rank representation. In this work, we challenge this assumption by first remarking that the successor measure itself is not approximately low-rank. Instead, we demonstrate that a low-rank structure naturally emerges in the shifted successor measure, which captures the system dynamics after bypassing a few initial transitions. We provide finite-sample performance guarantees for the entry-wise estimation of a low-rank approximation of the shifted successor measure from sampled entries. Our analysis reveals that both the approximation and estimation errors are primarily governed by a newly introduced quantitity: the spectral recoverability of the corresponding matrix. To bound this parameter, we derive a new class of functional inequalities for Markov chains that we call Type II Poincaré inequalities and from which we can quantify the amount of shift needed for effective low-rank approximation and estimation. This analysis shows in particular that the required shift depends on decay of the high-order singular values of the shifted successor measure and is hence typically small in practice. Additionally, we establish a connection between the necessary shift and the local mixing properties of the underlying dynamical system, which provides a natural way of selecting the shift. Finally, we validate our theoretical findings with experiments, and demonstrate that shifting the successor measure indeed leads to improved performance in goal-conditioned RL.

View full details

Poster

Reinforcement Learning with Imperfect Transition Predictions: A Bellman-Jensen Approach

Chenbei Lu ⋅ Zaiwei Chen ⋅ Tongxin Li ⋅ Chenye Wu ⋅ Adam Wierman

Dec 4, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Traditional reinforcement learning (RL) assumes the agents make decisions based on Markov decision processes (MDPs) with one-step transition models. In many real-world applications, such as energy management and stock investment, agents can access multi-step predictions of future states, which provide additional advantages for decision making. However, multi-step predictions are inherently high-dimensional: naively embedding these predictions into an MDP leads to an exponential blow-up in state space and the curse of dimensionality. Moreover, existing RL theory provides few tools to analyze prediction-augmented MDPs, as it typically works on one-step transition kernels and cannot accommodate multi-step predictions with errors or partial action-coverage. We address these challenges with three key innovations: First, we propose the \emph{Bayesian value function} to characterize the optimal prediction-aware policy tractably. Second, we develop a novel \emph{Bellman–Jensen Gap} analysis on the Bayesian value function, which enables characterizing the value of imperfect predictions. Third, we introduce BOLA (Bayesian Offline Learning with Online Adaptation), a two-stage model-based RL algorithm that separates offline Bayesian value learning from lightweight online adaptation to real-time predictions. We prove that BOLA remains sample-efficient even under imperfect predictions. We validate our theory and algorithm on synthetic MDPs and a real-world wind energy storage control problem.

View full details

Poster

Neural Entropy

Akhil Premkumar

Dec 4, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

We explore the connection between deep learning and information theory through the paradigm of diffusion models. A diffusion model converts noise into structured data by reinstating, imperfectly, information that is erased when data was diffused to noise. This information is stored in a neural network during training. We quantify this information by introducing a measure called \textit{neural entropy}, which is related to the total entropy produced by diffusion. Neural entropy is a function of not just the data distribution, but also the diffusive process itself. Measurements of neural entropy on a few simple image diffusion models reveal that they are extremely efficient at compressing large ensembles of structured data.

View full details

Poster

LODGE: Level-of-Detail Large-Scale Gaussian Splatting with Efficient Rendering

Jonas Kulhanek ⋅ Marie-Julie Rakotosaona ⋅ Fabian Manhardt ⋅ Christina Tsalicoglou ⋅ Michael Niemeyer ⋅ Torsten Sattler ⋅ Songyou Peng ⋅ Federico Tombari

Dec 4, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

In this work, we present a novel level-of-detail (LOD) method for 3D Gaussian Splatting that enables real-time rendering of large-scale scenes on memory-constrained devices. Our approach introduces a hierarchical LOD representation that iteratively selects optimal subsets of Gaussians based on camera distance, thus largely reducing both rendering time and GPU memory usage. We construct each LOD level by applying a depth-aware 3D smoothing filter, followed by importance-based pruning and fine-tuning to maintain visual fidelity. To further reduce memory overhead, we partition the scene into spatial chunks and dynamically load only relevant Gaussians during rendering, employing an opacity-blending mechanism to avoid visual artifacts at chunk boundaries. Our method achieves state-of-the-art performance on both outdoor (Hierarchical 3DGS) and indoor (Zip-NeRF) datasets, delivering high-quality renderings with reduced latency and memory requirements.

View full details

Poster

FlowFeat: Pixel-Dense Embedding of Motion Profiles

Nikita Araslanov ⋅ Anna Sonnweber ⋅ Daniel Cremers

Dec 4, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Dense and versatile image representations underpin the success of virtually all computer vision applications. However, state-of-the-art networks, such as transformers, produce low-resolution feature grids, which are suboptimal for dense prediction tasks. To address this limitation, we present *FlowFeat*, a high-resolution and multi-task feature representation. The key ingredient behind FlowFeat is a novel distillation technique that embeds a distribution of plausible apparent motions, or *motion profiles*. By leveraging optical flow networks and diverse video data, we develop an effective self-supervised training framework that statistically approximates the apparent motion. With its remarkable level of spatial detail, FlowFeat encodes a compelling degree of geometric and semantic cues while exhibiting high temporal consistency. Empirically, FlowFeat significantly enhances the representational power of five state-of-the-art encoders and alternative upsampling strategies across three dense tasks: video object segmentation, monocular depth estimation and semantic segmentation. Training FlowFeat is computationally inexpensive and robust to inaccurate flow estimation, remaining highly effective even when using unsupervised flow networks. Our work takes a step forward towards reliable and versatile dense image representations.

View full details

Poster

Implicit Bias of Spectral Descent and Muon on Multiclass Separable Data

Chen Fan ⋅ Mark Schmidt ⋅ Christos Thrampoulidis

Dec 4, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Different gradient-based methods for optimizing overparameterized models can all achieve zero training error yet converge to distinctly different solutions inducing different generalization properties. We provide the first complete characterization of implicit optimization bias for p-norm normalized steepest descent (NSD) and momentum steepest descent (NMD) algorithms in multi-class linear classification with cross-entropy loss. Our key theoretical contribution is proving that these algorithms converge to solutions maximizing the margin with respect to the classifier matrix's p-norm, with established convergence rates. These results encompass important special cases including Spectral Descent and Muon, which we show converge to max-margin solutions with respect to the spectral norm. A key insight of our contribution is that the analysis of general entry-wise and Schatten p-norms can be reduced to the analysis of NSD/NMD with max-norm by exploiting a natural ordering property between all p-norms relative to the max-norm and its dual sum-norm. For the specific case of descent with respect to the max-norm, we further extend our analysis to include preconditioning, showing that Adam converges to the matrix's max-norm solution. Our results demonstrate that the multi-class linear setting, which is inherently richer than the binary counterpart, provides the most transparent framework for studying implicit biases of matrix-parameter optimization algorithms.

View full details

Poster

PARTONOMY: Large Multimodal Models with Part-Level Visual Understanding

Ansel Blume ⋅ Jeonghwan Kim ⋅ Hyeonjeong Ha ⋅ Elen Chatikyan ⋅ Xiaomeng Jin ⋅ Khanh Nguyen ⋅ Nanyun Peng ⋅ Kai-Wei Chang ⋅ Derek Hoiem ⋅ Heng Ji

Dec 4, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Real-world objects are composed of distinctive, object-specific parts. Identifying these parts is key to performing fine-grained, compositional reasoning—yet, large multimodal models (LMMs) struggle to perform this seemingly straightforward task. In this work, we introduce PARTONOMY, an LMM benchmark designed for pixel-level part grounding. We construct PARTONOMY from existing part datasets and our own rigorously annotated set of images, encompassing 862 parts and 5346 objects for evaluation. Unlike existing datasets that simply ask models to identify generic parts, PARTONOMY utilizes highly technical concepts and challenges models to compare objects’ parts, consider part-whole relationships, and justify textual predictions with visual segmentations. Our experiments demonstrate significant limitations in state-of-the-art LMMs (e.g., LISA-13B achieves only 5.9% gIoU), highlighting a critical gap in their part grounding abilities. We note that existing segmentation-enabled LMMs (segmenting LMMs) have two key architectural shortcomings: they use special [SEG] tokens not seen during pretraining which induce distribution shift, and they discard predicted segmentations instead of using past predictions to guide future ones. To address these deficiencies, we train several part-centric LMMs and propose PLUM, a novel segmenting LMM that utilizes span tagging instead of segmentation tokens and that conditions on prior predictions in a feedback loop. We find that pretrained PLUM dominates existing segmenting LMMs on reasoning segmentation, VQA, and visual hallucination benchmarks. In addition, PLUM finetuned on our proposed Explanatory Part Segmentation task is competitive with segmenting LMMs trained on significantly more segmentation data. Our work opens up new avenues towards enabling fine-grained, grounded visual understanding in LMMs.

View full details

Poster

SoTA with Less: MCTS-Guided Sample Selection for Data-Efficient Visual Reasoning Self-Improvement

Xiyao Wang ⋅ Zhengyuan Yang ⋅ Chao Feng ⋅ Hongjin Lu ⋅ Linjie Li ⋅ Chung-Ching Lin ⋅ Kevin Lin ⋅ Furong Huang ⋅ Lijuan Wang

Dec 4, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

We introduce ThinkLite-VL, a family of visual reasoning models that achieve state-of-the-art (SoTA) performance using an order of magnitude fewer training samples, relying purely on reinforcement fine-tuning (RFT) self-improvement without any knowledge distillation. Our central insight is that sample difficulty critically influences RFT effectiveness: appropriately challenging examples can drive substantial reasoning improvements, even in low-data regimes. However, quantifying sample difficulty in a reliable and scalable manner remains non-trivial. To address this, we repurpose Monte Carlo Tree Search (MCTS) to measure sample difficulty via the number of reasoning iterations a vision-language model (VLM) requires to solve each instance. This MCTS-based selection procedure identifies samples that induce deeper reasoning while remaining solvable, allowing us to filter a high-quality subset from 70k open-source examples spanning math, natural image understanding, and chart comprehension. Using this approach, we select just 11k challenging samples for RFT on Qwen2.5-VL-7B-Instruct and 7.5k samples for Qwen2.5-VL-72B-Instruct. The resulting models, ThinkLite-VL-7B and ThinkLite-VL-72B, significantly outperform their respective base models across eight visual reasoning benchmarks. In particular, ThinkLite-VL-7B improves the average performance of Qwen2.5-VL-7B-Instruct by 7\% and surpasses all existing 7B-level models, as well as much larger models such as GPT-4o, O1 and Qwen2.5-VL-72B, achieving a new SoTA score of 75.1 on MathVista. ThinkLite-VL-72B further advances the SoTA frontier, achieving an accuracy of 79.7 on MathVista and an average benchmark improvement of 4.42 over the open-source SOTA. These results demonstrate that MCTS-guided difficulty filtering provides a scalable and effective path toward data-efficient self-improvement in multimodal reasoning.

View full details

Poster

Minimax Adaptive Online Nonparametric Regression over Besov spaces

Paul Liautaud ⋅ Pierre Gaillard ⋅ Olivier Wintenberger

Dec 4, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

We study online adversarial regression with convex losses against a rich class of continuous yet highly irregular competitor functions,% prediction rules, modeled by Besov spaces $B_{pq}^s$ with general parameters $1 \leq p,q \leq \infty$ and smoothness $s > \tfrac{d}{p}$. We introduce an adaptive wavelet-based algorithm that performs sequential prediction without prior knowledge of $(s,p,q)$, and establish minimax-optimal regret bounds against any comparator in $B_{pq}^s$. We further design a locally adaptive extension capable of sequentially adapting to spatially inhomogeneous smoothness. This adaptive mechanism adjusts the resolution of the predictions over both time and space, yielding refined regret bounds in terms of local regularity. Consequently, in heterogeneous environments, our adaptive guarantees can significantly surpass those obtained by standard global methods.

View full details

Poster

ORIGAMISPACE: Benchmarking Multimodal LLMs in Multi-Step Spatial Reasoning with Mathematical Constraints

Rui Xu ⋅ Dakuan Lu ⋅ Zicheng Zhao ⋅ Xiaoyu Tan ⋅ Xintao Wang ⋅ Siyu Yuan ⋅ Jiangjie Chen ⋅ yinghui xu

Dec 4, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Spatial reasoning is a key capability in the field of artificial intelligence, especially crucial in areas such as robotics, computer vision, and natural language understanding. However, evaluating the ability of multimodal large language models (MLLMs) in complex spatial reasoning still faces challenges, particularly in scenarios requiring multi-step reasoning and precise mathematical constraints. This paper introduces ORIGAMISPACE, a new dataset and benchmark designed to evaluate the multi-step spatial reasoning ability and the capacity to handle mathematical constraints of MLLMs through origami tasks. The dataset contains 350 data instances, each comprising a strictly formatted crease pattern (CP diagram), the Compiled Flat Pattern, the complete Folding Process, and the final Folded Shape Image. We propose four evaluation tasks: Pattern Prediction, Multi-step Spatial Reasoning, Spatial Relationship Prediction, and End-to-End CP Code Generation. For the CP code generation task, we design an interactive environment and explore the possibility of using reinforcement learning methods to train MLLMs. Through experiments on existing MLLMs, we initially reveal the strengths and weaknesses of these models in handling complex spatial reasoning tasks.

View full details

Poster

Agnostic Learning under Targeted Poisoning: Optimal Rates and the Role of Randomness

Bogdan Chornomaz ⋅ Yonatan Koren ⋅ Shay Moran ⋅ Tom Waknine

Dec 4, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

We study the problem of learning in the presence of an adversary that can corrupt an $\eta$ fraction of the training examples with the goal of causing failure on a specific test point. In the realizable setting, prior work established that the optimal error under such instance-targeted poisoning attacks scales as $\Theta(d\eta)$, where $d$ is the VC dimension of the hypothesis class [Hanneke, Karbasi, Mahmoody, Mehalel, and Moran (NeurIPS 2022)]. In this work, we resolve the corresponding question in the agnostic setting. We show that the optimal excess error is $\widetilde\Theta(\sqrt{d\eta})$, answering one of the main open problems left by Hanneke et al. To achieve this rate, it is necessary to use randomized learners: Hanneke et al.\ showed that deterministic learners can be forced to suffer error close to $1$ even under small amounts of poisoning. Perhaps surprisingly, our upper bound remains valid even when the learner’s random bits are fully visible to the adversary. In the other direction, our lower bound is stronger than standard PAC-style bounds: instead of tailoring a hard distribution separately for each sample size, we exhibit a single fixed distribution under which the adversary can enforce an excess error of $\Omega(\sqrt{d\eta})$ infinitely often.

View full details

Poster

Angular Steering: Behavior Control via Rotation in Activation Space

Minh Hieu Vu ⋅ Tan Nguyen

Dec 4, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Controlling specific behaviors in large language models while preserving their general capabilities is a central challenge for safe and reliable artificial intelligence deployment. Current steering methods, such as vector addition and directional ablation, are constrained within a two-dimensional subspace defined by the activation and feature direction, making them sensitive to chosen parameters and potentially affecting unrelated features due to unintended interactions in activation space. We introduce Angular Steering, a novel and flexible method for behavior modulation that operates by rotating activations within a fixed two-dimensional subspace. By formulating steering as a geometric rotation toward or away from a target behavior direction, Angular Steering provides continuous, fine-grained control over behaviors such as refusal and compliance. We demonstrate this method using refusal steering emotion steering as use cases. Additionally, we propose Adaptive Angular Steering, a selective variant that rotates only activations aligned with the target feature, further enhancing stability and coherence. Angular Steering generalizes existing addition and orthogonalization techniques under a unified geometric rotation framework, simplifying parameter selection and maintaining model stability across a broader range of adjustments. Experiments across multiple model families and sizes show that Angular Steering achieves robust behavioral control while maintaining general language modeling performance, underscoring its flexibility, generalization, and robustness compared to prior approaches. Code and artifacts are available at \url{https://github.com/lone17/angular-steering/}.

View full details

Poster

SoFar: Language-Grounded Orientation Bridges Spatial Reasoning and Object Manipulation

Zekun Qi ⋅ Wenyao Zhang ⋅ Yufei Ding ⋅ Runpei Dong ⋅ XinQiang Yu ⋅ Jingwen Li ⋅ Lingyun Xu ⋅ Baoyu Li ⋅ Xialin He ⋅ Guofan Fan ⋅ Jiazhao Zhang ⋅ Jiawei He ⋅ Jiayuan Gu ⋅ Xin Jin ⋅ Kaisheng Ma ⋅ Zhizheng Zhang ⋅ He Wang ⋅ Li Yi

Dec 4, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

While spatial reasoning has made progress in object localization relationships, it often overlooks object orientation—a key factor in 6-DoF fine-grained manipulation. Traditional pose representations rely on pre-defined frames or templates, limiting generalization and semantic grounding. In this paper, we introduce the concept of semantic orientation, which defines object orientations using natural language in a reference-frame-free manner (e.g., the ''plug-in'' direction of a USB or the ''handle'' direction of a cup). To support this, we construct OrienText300K, a large-scale dataset of 3D objects annotated with semantic orientations, and develop PointSO, a general model for zero-shot semantic orientation prediction. By integrating semantic orientation into VLM agents, our SoFar framework enables 6-DoF spatial reasoning and generates robotic actions. Extensive experiments demonstrated the effectiveness and generalization of our SoFar, e.g., zero-shot 48.7\% successful rate on Open6DOR and zero-shot 74.9\% successful rate on SIMPLER-Env.

View full details

Poster

IA-GGAD: Zero-shot Generalist Graph Anomaly Detection via Invariant and Affinity Learning

Xiong Zhang ⋅ Zhenli He ⋅ Changlong Fu ⋅ Cheng Xie

Dec 4, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Generalist Graph Anomaly Detection (GGAD) extends traditional Graph Anomaly Detection (GAD) from one-for-one to one-for-all scenarios, posing significant challenges due to Feature Space Shift (FSS) and Graph Structure Shift (GSS). This paper first formalizes these challenges and proposes quantitative metrics to measure their severity. To tackle FSS, we develop an anomaly-driven graph invariant learning module that learns domain-invariant node representations. To address GSS, a novel structure-insensitive affinity learning module is introduced, capturing cross-domain structural correspondences via affinity-based features. Our unified framework, IA-GGAD, integrates these modules, enabling anomaly prediction on unseen graphs without target-domain retraining or fine-tuning. Extensive experiments on benchmark datasets from varied domains demonstrate IA-GGAD’s superior performance, significantly outperforming state-of-the-art methods (e.g., achieving up to +12.28\% AUROC over ARC on ACM). Ablation studies further confirm the effectiveness of each proposed module. The code is available at \url{https://github.com/kg-cc/IA-GGAD/}.

View full details

Poster

Is Grokking a Computational Glass Relaxation?

Xiaotian Zhang ⋅ Yue Shang ⋅ Entao Yang ⋅ Ge Zhang

Dec 4, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Understanding neural network' (NN) generalizability remains a central question in deep learning research. The special phenomenon of grokking, where NNs abruptly generalize long after the training performance reaches near-perfect level, offers a unique window to investigate the underlying mechanisms of NNs' generalizability. Here we propose an interpretation for grokking by framing it as a computational glass relaxation: viewing NNs as a physical system where parameters are the degrees of freedom and train loss is the system energy, we find memorization process resembles a rapid cooling of liquid into non-equilibrium glassy state at low temperature and the later generalization is like a slow relaxation towards a more stable configuration. This mapping enables us to sample NNs' Boltzmann entropy (states of density) landscape as a function of training loss and test accuracy. Our experiments in transformers on arithmetic tasks suggests that there is NO entropy barrier in the memorization-to-generalization transition of grokking, challenging previous theory that defines grokking as a first-order phase transition. We identify a high-entropy advantage under grokking, an extension of prior work linking entropy to generalizability but much more significant. Inspired by grokking's far-from-equilibrium nature, we develop a toy optimizer WanD based on Wang-landau molecular dynamics, which can eliminate grokking without any constraints and find high-norm generalizing solutions. This provides strictly-defined counterexamples to theory attributing grokking solely to weight norm evolution towards the Goldilocks zone and also suggests new potential ways for optimizer design.

View full details

Poster

Spatial-MLLM: Boosting MLLM Capabilities in Visual-based Spatial Intelligence

Diankun Wu ⋅ Fangfu Liu ⋅ Yi-Hsin Hung ⋅ Yueqi Duan

Dec 4, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Recent advancements in Multimodal Large Language Models (MLLMs) have significantly enhanced performance on 2D visual tasks. However, improving their spatial intelligence remains a challenge. Existing 3D MLLMs always rely on additional 3D or 2.5D data to incorporate spatial awareness, restricting their utility in scenarios with only 2D inputs, such as images or videos. In this paper, we present Spatial-MLLM, a novel framework for visual-based spatial reasoning from purely 2D observations. Unlike conventional video MLLMs which rely on CLIP-based visual encoders optimized for semantic understanding, our key insight is to unleash the strong structure prior from the feed-forward visual geometry foundation model. Specifically, we propose a dual-encoder architecture: a pretrained 2D visual encoder to extract semantic features, and a 3D spatial encoder—initialized from the backbone of the visual geometry model—to extract 3D structure features. A connector then integrates both features into unified visual tokens for enhanced spatial understanding. Furthermore, we propose a space-aware frame sampling strategy at inference time, which selects the spatially informative frames of a video sequence, ensuring that even under limited token length, the model focuses on frames critical for spatial reasoning. Beyond architecture improvements, we construct a training dataset from multiple sources and train the model on it using supervised fine-tuning and GRPO. Extensive experiments on various real-world datasets demonstrate that Spatial-MLLM achieves state-of-the-art performance in a wide range of visual-based spatial understanding and reasoning tasks.

View full details

Poster

Characterizing the Expressivity of Fixed-Precision Transformer Language Models

Jiaoda Li ⋅ Ryan Cotterell

Dec 4, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Transformer-based language models (LMs) have achieved widespread empirical success, but their theoretical expressive power remains only partially understood. In this work, we analyze a restricted idealization of fixed-precision transformers with strict future masking, soft attention, and no positional encodings. We establish that this class of models is exactly as expressive as a specific fragment of linear temporal logic that contains only a single temporal operator: the $\texttt{past}$ operator. We further connect this fragment to established classes in formal language theory, automata theory, and algebra, yielding a unified framework for understanding transformer expressivity under this idealization. Finally, we present empirical results that align closely with our theory: transformers trained on languages within their characterized expressive capacity generalize reliably across sequence lengths, while they consistently fail to generalize on languages beyond it.

View full details

Poster

Reconstruction and Secrecy under Approximate Distance Queries

Shay Moran ⋅ Elizaveta Nesterova

Dec 4, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Consider the task of locating an unknown target point using approximate distance queries: in each round, a reconstructor selects a reference point and receives a noisy version of its distance to the target. This problem arises naturally in various contexts—from localization in GPS and sensor networks to privacy-aware data access—making it relevant from the perspective of both the reconstructor (seeking accurate recovery) and the responder (aiming to limit information disclosure, e.g., for privacy or security reasons). We study this reconstruction game through a learning-theoretic lens, focusing on the rate and limits of the best possible reconstruction error. Our first result provides a tight geometric characterization of the optimal error in terms of the Chebyshev radius, a classical concept from geometry. This characterization applies to all compact metric spaces (in fact, to all totally bounded spaces) and yields explicit formulas for natural subsets of the Euclidean metric. Our second result addresses the asymptotic behavior of reconstruction, distinguishing between pseudo-finite spaces, where the optimal error is attained after finitely many queries, and spaces where the approximation curve exhibits a nontrivial decay. We characterize pseudo-finiteness for convex subsets of Euclidean spaces.

View full details

Poster

Optimal Neural Compressors for the Rate-Distortion-Perception Tradeoff

Eric Lei ⋅ Hamed Hassani ⋅ Shirin Saeedi Bidokhti

Dec 4, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Recent efforts in neural compression have focused on the rate-distortion-perception (RDP) tradeoff, where the perception constraint ensures the source and reconstruction distributions are close in terms of a statistical divergence. Theoretical work on RDP describes properties of RDP-optimal compressors without providing constructive and low complexity solutions. While classical rate-distortion theory shows that optimal compressors should efficiently pack space, RDP theory additionally shows that infinite randomness shared between the encoder and decoder may be necessary for RDP optimality. In this paper, we propose neural compressors that are low complexity and benefit from high packing efficiency through lattice coding and shared randomness through shared dithering over the lattice cells. For two important settings, namely infinite shared and zero shared randomness, we analyze the RDP tradeoff achieved by our proposed neural compressors and show optimality in both cases. Experimentally, we investigate the roles that these two components of our design, lattice coding and randomness, play in the performance of neural compressors on synthetic and real-world data. We observe that performance improves with more shared randomness and better lattice packing.

View full details

Poster

Clustering via Hedonic Games: New Concepts and Algorithms

Gergely Csáji ⋅ Alexander Gundert ⋅ Jörg Rothe ⋅ Ildikó Schlotter

Dec 4, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

We study fundamental connections between coalition formation games and clustering, illustrating the cross-disciplinary relevance of these concepts. We focus on graphical hedonic games where agents' preferences are compactly represented by a friendship graph and an enemy graph. In the context of clustering, friendship relations naturally align with data point similarities, whereas enmity corresponds to dissimilarities. We consider two stability notions based on single-agent deviations: local popularity and local stability. Exploring these concepts from an algorithmic viewpoint, we design efficient mechanisms for finding locally stable or locally popular partitions. Besides gaining theoretical insight into the computational complexity of these problems, we perform simulations that demonstrate how our algorithms can be successfully applied in clustering and community detection. Our findings highlight the interplay between coalition formation games and data-driven clustering techniques, offering fresh perspectives and applications in both areas.

View full details

Poster

Universal Causal Inference in a Topos

Sridhar Mahadevan

Dec 4, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

In this paper, we explore the universal properties underlying causal inference by formulating it in terms of a topos. More concretely, we introduce topos causal models (TCMs), a strict generalization of the popular structural causal models (SCMs). A topos category has several properties that make it attractive: a general theory for how to combine local functions that define ``independent causal mechanisms" into a consistent global function building on the theory of sheaves in a topos; a generic way to define causal interventions using a subobject classifier in a topos category; and finally, an internal logical language for causal and counterfactual reasoning that emerges from the topos itself. A striking characteristic of subobject classifiers is that they induce an intuitionistic logic, whose semantics is based on the partially ordered lattice of subobjects. We show that the underlying subobject classifier for causal inference is not Boolean in general, but forms a Heyting algebra. We define the internal Mitchell-B\'enabou language, a typed local set theory, associated with causal models, and its associated Kripke-Joyal intuitionistic semantics. We prove a universal property of TCM, namely that any causal functor mapping decomposable structure to probabilistic semantics factors uniquely through a TCM representation.

View full details

Poster

Utility Engineering: Analyzing and Controlling Emergent Value Systems in AIs

Mantas Mazeika ⋅ Xuwang Yin ⋅ Rishub Tamirisa ⋅ Jaehyuk Lim ⋅ Bruce W, Lee ⋅ Richard Ren ⋅ Long Phan ⋅ Norman Mu ⋅ Oliver Zhang ⋅ Dan Hendrycks

Dec 4, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

As AIs rapidly advance and become more agentic, the risk they pose is governed not only by their capabilities but increasingly by their propensities, including goals and values. Tracking the emergence of goals and values has proven a longstanding problem, and despite much interest over the years it remains unclear whether current AIs have meaningful values. We propose a solution to this problem, leveraging the framework of utility functions to study the internal coherence of AI preferences. Surprisingly, we find that independently-sampled preferences in current LLMs exhibit high degrees of structural coherence, and moreover that this emerges with scale. These findings suggest that value systems emerge in LLMs in a meaningful sense, a finding with broad implications. To study these emergent value systems, we propose utility engineering as a research agenda, comprising both the analysis and control of AI utilities. We uncover problematic and often shocking values in LLM assistants despite existing control measures. These include cases where AIs value themselves over humans and are anti-aligned with specific individuals. To constrain these emergent value systems, we propose methods of utility control. As a case study, we show how aligning utilities with a citizen assembly reduces political biases and generalizes to new scenarios. Whether we like it or not, value systems have already emerged in AIs, and much work remains to fully understand and control these emergent representations.

View full details

Poster

Dynamic Algorithm for Explainable $k$-medians Clustering under $\ell_p$ Norm

Konstantin Makarychev ⋅ Ilias Papanikolaou ⋅ Liren Shan

Dec 4, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

We study the problem of explainable $k$-medians clustering introduced by Dasgupta, Frost, Moshkovitz, and Rashtchian (2020). In this problem, the goal is to construct a threshold decision tree that partitions data into $k$ clusters while minimizing the $k$-medians objective. These trees are interpretable because each internal node makes a simple decision by thresholding a single feature, allowing users to trace and understand how each point is assigned to a cluster. We present the first algorithm for explainable $k$-medians under $\ell_p$ norm for every finite $p \geq 1$. Our algorithm achieves an $\tilde{O}\big(p(\log k)^{1 + 1/p - 1/p^2}\big)$ approximation to the optimal $k$-medians cost for any $p \geq 1$. Previously, algorithms were known only for $p = 1$ and $p = 2$. For $p = 2$, our algorithm improves upon the existing bound of $\tilde O(\log^{3/2}k)$, and for $p = 1$, it matches the tight bound of $\log k + O(1)$ up to a multiplicative $O(\log \log k)$ factor. We show how to implement our algorithm in a dynamic setting. The dynamic algorithm maintains an explainable clustering under a sequence of insertions and deletions, with amortized update time $O(d \log^3 k)$ and $O(\log k)$ recourse, making it suitable for large-scale and evolving datasets.

View full details

Poster

Differentiable Sparsity via $D$-Gating: Simple and Versatile Structured Penalization

Chris Kolb ⋅ Laetitia Frost ⋅ Bernd Bischl ⋅ David Rügamer

Dec 4, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Structured sparsity regularization offers a principled way to compact neural networks, but its non-differentiability breaks compatibility with conventional stochastic gradient descent and requires either specialized optimizers or additional post-hoc pruning without formal guarantees. In this work, we propose $D$-Gating, a fully differentiable structured overparameterization that splits each group of weights into a primary weight vector and multiple scalar gating factors. We prove that any local minimum under $D$-Gating is also a local minimum using non-smooth structured $L_{2,2/D}$ penalization, and further show that the $D$-Gating objective converges at least exponentially fast to the $L_{2,2/D}$–regularized loss in the gradient flow limit. Together, our results show that $D$-Gating is theoretically equivalent to solving the original group sparsity problem, yet induces distinct learning dynamics that evolve from a non-sparse regime into sparse optimization. We validate our theory across vision, language, and tabular tasks, where $D$-Gating consistently delivers strong performance–sparsity tradeoffs and outperforms both direct optimization of structured penalties and conventional pruning baselines.

View full details

Poster

GraphMaster: Automated Graph Synthesis via LLM Agents in Data-Limited Environments

Enjun Du ⋅ Xunkai Li ⋅ Tian Jin ⋅ Zhihan Zhang ⋅ Rong-Hua Li ⋅ Guoren Wang

Dec 4, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

The era of foundation models has revolutionized AI research, yet Graph Foundation Models (GFMs) remain constrained by the scarcity of large-scale graph corpora. Traditional graph data synthesis techniques primarily focus on simplistic structural operations, lacking the capacity to generate semantically rich nodes with meaningful textual attributes—a critical limitation for real-world applications. While large language models (LLMs) demonstrate exceptional text generation capabilities, their direct application to graph synthesis is impeded by context window limitations, hallucination phenomena, and structural consistency challenges. To address these issues, we introduce \textbf{GraphMaster}—the first multi-agent framework specifically designed for graph data synthesis in data-limited environments. GraphMaster orchestrates four specialized LLM agents (Manager, Perception, Enhancement, and Evaluation) that collaboratively optimize the synthesis process through iterative refinement, ensuring both semantic coherence and structural integrity. To rigorously evaluate our approach, we create new data-limited “Sub” variants of six standard graph benchmarks, specifically designed to test synthesis capabilities under realistic constraints. Additionally, we develop a novel interpretability assessment framework that combines human evaluation with a principled Grassmannian manifold-based analysis, providing both qualitative and quantitative measures of semantic coherence. Experimental results demonstrate that GraphMaster significantly outperforms traditional synthesis methods across multiple datasets, establishing a strong foundation for advancing GFMs in data-scarce environments.

View full details

Poster

Mulberry: Empowering MLLM with o1-like Reasoning and Reflection via Collective Monte Carlo Tree Search

Huanjin Yao ⋅ Jiaxing Huang ⋅ Wenhao Wu ⋅ Jingyi Zhang ⋅ Yibo Wang ⋅ Shunyu Liu ⋅ Yingjie Wang ⋅ YuXin Song ⋅ Haocheng Feng ⋅ Li Shen ⋅ Dacheng Tao

Dec 4, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

In this work, we aim to develop an MLLM that understands and solves questions by learning to create each intermediate step of the reasoning involved till the final answer. To this end, we propose Collective Monte Carlo Tree Search (CoMCTS), a new learning-to-reason method for MLLMs, which introduces the concept of collective learning into ``tree search'' for effective and efficient reasoning-path searching and learning. The core idea of CoMCTS is to leverage collective knowledge from multiple models to collaboratively conjecture, search and identify effective reasoning paths toward correct answers via four iterative operations including Expansion, Simulation and Error Positioning, Backpropagation, and Selection. Using CoMCTS, we construct Mulberry-260k, a multimodal dataset with a tree of rich, explicit and well-defined reasoning nodes for each question. With Mulberry-260k, we perform collective SFT to train our model, Mulberry, a series of MLLMs with o1-like step-by-step Reasoning and Reflection capabilities. Extensive experiments demonstrate the superiority of our proposed methods on various benchmarks. Code is available at https://github.com/HJYao00/Mulberry.

View full details

Poster

GenColor: Generative and Expressive Color Enhancement with Pixel-Perfect Texture Preservation

Yi Dong ⋅ Yuxi Wang ⋅ Xianhui Lin ⋅ Wenqi Ouyang ⋅ Zhiqi Shen ⋅ Peiran Ren ⋅ Ruoxi Fan ⋅ Rynson Lau

Dec 4, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Color enhancement is a crucial yet challenging task in digital photography. It demands methods that are (i) expressive enough for fine-grained adjustments, (ii) adaptable to diverse inputs, and (iii) able to preserve texture. Existing approaches typically fall short in at least one of these aspects, yielding unsatisfactory results. We propose GenColor, a novel diffusion-based framework for sophisticated, texture-preserving color enhancement. GenColor reframes the task as conditional image generation. Leveraging ControlNet and a tailored training scheme, it learns advanced color transformations that adapt to diverse lighting and content. We train GenColor on ARTISAN, our newly collected large-scale dataset of 1.2M high-quality photographs specifically curated for enhancement tasks. To overcome texture preservation limitations inherent in diffusion models, we introduce a color-transfer network with a novel degradation scheme that simulates texture–color relationships. This network achieves pixel-perfect texture preservation while enabling fine-grained color matching with the diffusion-generated reference images. Extensive experiments show that GenColor produces visually compelling results comparable to those of expert colorists and surpasses state-of-the-art methods in both subjective and objective evaluations. We have released the code and dataset.

View full details

Poster

Cue3D: Quantifying the Role of Image Cues in Single-Image 3D Generation

Xiang Li ⋅ Zirui Wang ⋅ Zixuan Huang ⋅ James Rehg

Dec 4, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Humans and traditional computer vision methods rely on a diverse set of monocular cues to infer 3D structure from a single image, such as shading, texture, silhouette, etc. While recent deep generative models have dramatically advanced single-image 3D generation, it remains unclear which image cues these methods actually exploit. We introduce Cue3D, the first comprehensive, model-agnostic framework for quantifying the influence of individual image cues in single-image 3D generation. Our unified benchmark evaluates seven state-of-the-art methods, spanning regression-based, multi-view, and native 3D generative paradigms. By systematically perturbing cues such as shading, texture, silhouette, perspective, edges, and local continuity, we measure their impact on 3D output quality. Our analysis reveals that shape meaningfulness, not texture, dictates generalization. Geometric cues, particularly shading, are crucial for 3D generation. We further identify over-reliance on provided silhouettes and diverse sensitivities to cues such as perspective and local continuity across model families. By dissecting these dependencies, Cue3D advances our understanding of how modern 3D networks leverage classical vision cues, and offers directions for developing more transparent, robust, and controllable single-image 3D generation models.

View full details

Poster

TransMLA: Migrating GQA Models to MLA with Full DeepSeek Compatibility and Speedup

Fanxu Meng ⋅ Pingzhi Tang ⋅ Zengwei Yao ⋅ Xing Sun ⋅ Muhan Zhang

Dec 4, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Modern large-language models often face communication bottlenecks on current hardware rather than computational limitations. *Multi-head latent attention (MLA)* addresses this by compressing the key-value cache using low-rank matrices, while the Absorb operation prevents the KV cache from reverting to its original size, significantly boosting both training and inference speed. Despite the success of DeepSeek V2/V3/R1, most model providers have heavily invested in optimizing GQA-based models and, therefore, lack strong incentives to retrain MLA-based models from scratch. This paper demonstrates that MLA provides superior expressive power compared to GQA with the same KV cache overhead, thereby offering a rationale for transitioning from GQA to MLA. In addition, we introduce TransMLA, a framework that seamlessly converts any GQA-based pre-trained model (e.g., LLaMA, Qwen, Gemma, Mistral/Mixtral) into an MLA-based model. For the first time, our method enables *direct conversion of these models into a format compatible with DeepSeek's codebase*, allowing them to fully leverage the existing, highly-optimized support for the DeepSeek architecture within inference engines like vLLM and SGlang. By compressing 93\% of the KV cache in LLaMA-2-7B, we achieve a **10x speedup** with an 8K context length while maintaining meaningful output. Moreover, the model requires only **6B tokens** for fine-tuning to recover comparable performance across multiple benchmarks. TransMLA provides a practical path for migrating GQA-based models to the MLA structure, and when combined with DeepSeek’s advanced optimizations—such as FP8 quantization and Multi-Token Prediction—further inference acceleration can be achieved.

View full details

Poster

Toward Relative Positional Encoding in Spiking Transformers

Changze Lv ⋅ Yansen Wang ⋅ Dongqi Han ⋅ Yifei Shen ⋅ Xiaoqing Zheng ⋅ Xuanjing Huang ⋅ Dongsheng Li

Dec 4, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Spiking neural networks (SNNs) are bio-inspired networks that mimic how neurons in the brain communicate through discrete spikes, which have great potential in various tasks due to their energy efficiency and temporal processing capabilities. SNNs with self-attention mechanisms (spiking Transformers) have recently shown great advancements in various tasks, and inspired by traditional Transformers, several studies have demonstrated that spiking absolute positional encoding can help capture sequential relationships for input data, enhancing the capabilities of spiking Transformers for tasks such as sequential modeling and image classification. However, how to incorporate relative positional information into SNNs remains a challenge. In this paper, we introduce several strategies to approximate relative positional encoding (RPE) in spiking Transformers while preserving the binary nature of spikes. Firstly, we formally prove that encoding relative distances with Gray Code ensures that the binary representations of positional indices maintain a constant Hamming distance whenever their decimal values differ by a power of two, and we propose **Gray-PE** based on this property. In addition, we propose another RPE method called **Log-PE**, which combines the logarithmic form of the relative distance matrix directly into the spiking attention map. Furthermore, we extend our RPE methods to a two-dimensional form, making them suitable for processing image patches. We evaluate our RPE methods on various tasks, including time series forecasting, text classification, and patch-based image classification, and the experimental results demonstrate a satisfying performance gain by incorporating our RPE methods across many architectures. Our results provide fresh perspectives on designing spiking Transformers to advance their sequential modeling capability, thereby expanding their applicability across various domains. Our code is available at https://github.com/microsoft/SeqSNN.

View full details

Poster

Taccel: Scaling Up Vision-based Tactile Robotics via High-performance GPU Simulation

Yuyang Li ⋅ Wenxin Du ⋅ Chang Yu ⋅ Puhao Li ⋅ Zihang Zhao ⋅ Tengyu Liu ⋅ Chenfanfu Jiang ⋅ Yixin Zhu ⋅ Siyuan Huang

Dec 4, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Tactile sensing is crucial for achieving human-level robotic capabilities in manipulation tasks. As a promising solution, Vision-based Tactile Sensors (VBTSs) offer high spatial resolution and cost-effectiveness, but present unique challenges in robotics for their complex physical characteristics and visual signal processing requirements. The lack of efficient and accurate simulation tools for VBTSs has significantly limited the scale and scope of tactile robotics research. We present Taccel, a high-performance simulation platform that integrates Incremental Potential Contact (IPC) and Affine Body Dynamics (ABD) to model robots, tactile sensors, and objects with both accuracy and unprecedented speed, achieving a total of 915 FPS with 4096 parallel environments. Unlike previous simulators that operate at sub-real-time speeds with limited parallelization, Taccel provides precise physics simulation and realistic tactile signals while supporting flexible robot-sensor configurations through user-friendly APIs. Through extensive validation in object recognition, robotic grasping, and articulated object manipulation, we demonstrate precise simulation and successful sim-to-real transfer. These capabilities position Taccel as a powerful tool for scaling up tactile robotics research and development, potentially transforming how robots interact with and understand their physical environment.

View full details

Poster

Scent of Knowledge: Optimizing Search-Enhanced Reasoning with Information Foraging

Hongjin Qian ⋅ Zheng Liu

Dec 4, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Augmenting large language models (LLMs) with external retrieval has become a standard method to address their inherent knowledge cutoff limitations. However, traditional retrieval-augmented generation methods employ static, pre-inference retrieval strategies, making them inadequate for complex tasks involving ambiguous, multi-step, or evolving information needs. Recent advances in test-time scaling techniques have demonstrated significant potential in enabling LLMs to dynamically interact with external tools, motivating the shift toward adaptive inference-time retrieval. Inspired by Information Foraging Theory (IFT), we propose InForage, a reinforcement learning framework that formalizes retrieval-augmented reasoning as a dynamic information-seeking process. Unlike existing approaches, InForage explicitly rewards intermediate retrieval quality, encouraging LLMs to iteratively gather and integrate information through adaptive search behaviors. To facilitate training, we construct a human-guided dataset capturing iterative search and reasoning trajectories for complex, real-world web tasks. Extensive evaluations across general question answering, multi-hop reasoning tasks, and a newly developed real-time web QA dataset demonstrate InForage's superior performance over baseline methods. These results highlight InForage's effectiveness in building robust, adaptive, and efficient reasoning agents. We provide all codes and datasets in the supplementary materials.

View full details

Poster

Polyline Path Masked Attention for Vision Transformer

Zhongchen Zhao ⋅ Chaodong Xiao ⋅ Hui LIN ⋅ Qi Xie ⋅ Lei Zhang ⋅ Deyu Meng

Dec 4, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Global dependency modeling and spatial position modeling are two core issues of the foundational architecture design in current deep learning frameworks. Recently, Vision Transformers (ViTs) have achieved remarkable success in computer vision, leveraging the powerful global dependency modeling capability of the self-attention mechanism. Furthermore, Mamba2 has demonstrated its significant potential in natural language processing tasks by explicitly modeling the spatial adjacency prior through the structured mask. In this paper, we propose Polyline Path Masked Attention (PPMA) that integrates the self-attention mechanism of ViTs with an enhanced structured mask of Mamba2, harnessing the complementary strengths of both architectures. Specifically, we first ameliorate the traditional structured mask of Mamba2 by introducing a 2D polyline path scanning strategy and derive its corresponding structured mask, polyline path mask, which better preserves the adjacency relationships among image tokens. Notably, we conduct a thorough theoretical analysis on the structural characteristics of the proposed polyline path mask and design an efficient algorithm for the computation of the polyline path mask. Next, we embed the polyline path mask into the self-attention mechanism of ViTs, enabling explicit modeling of spatial adjacency prior. Extensive experiments on standard benchmarks, including image classification, object detection, and segmentation, demonstrate that our model outperforms previous state-of-the-art approaches based on both state-space models and Transformers. For example, our proposed PPMA-T/S/B models achieve 48.7%/51.1%/52.3% mIoU on the ADE20K semantic segmentation task, surpassing RMT-T/S/B by 0.7%/1.3%/0.3%, respectively. Code is available at https://github.com/zhongchenzhao/PPMA.

View full details

Poster

Thought Communication in Multiagent Collaboration

Yujia Zheng ⋅ Zhuokai Zhao ⋅ Zijian Li ⋅ Yaqi Xie ⋅ Mingze Gao ⋅ Lizhu Zhang ⋅ Kun Zhang

Dec 4, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Natural language has long enabled human cooperation, but its lossy, ambiguous, and indirect nature limits the potential of collective intelligence. While machines are not subject to these constraints, most LLM-based multi-agent systems still rely solely on natural language, exchanging tokens or their embeddings. To go beyond language, we introduce a new paradigm, *thought communication*, which enables agents to interact directly mind-to-mind, akin to telepathy. To uncover these latent thoughts in a principled way, we formalize the process as a general latent variable model, where agent states are generated by an unknown function of underlying thoughts. We prove that, in a nonparametric setting without auxiliary information, both shared and private latent thoughts between any pair of agents can be identified. Moreover, the global structure of thought sharing, including which agents share which thoughts and how these relationships are structured, can also be recovered with theoretical guarantees. Guided by the established theory, we develop a framework that extracts latent thoughts from all agents prior to communication and assigns each agent the relevant thoughts, along with their sharing patterns. This paradigm naturally extends beyond LLMs to all modalities, as most observational data arise from hidden generative processes. Experiments on both synthetic and real-world benchmarks validate the theory and demonstrate the collaborative advantages of thought communication. We hope this work illuminates the potential of leveraging the hidden world, as many challenges remain unsolvable through surface-level observation alone, regardless of compute or data scale.

View full details

Poster

Robust SuperAlignment: Weak-to-Strong Robustness Generalization for Vision-Language Models

Junhao Dong ⋅ Cong Zhang ⋅ Xinghua Qu ⋅ Zejun MA ⋅ Piotr Koniusz ⋅ Yew Soon Ong

Dec 4, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Numerous well-established studies have demonstrated the superhuman capabilities of modern Vision-Language Models (VLMs) across a wide range of tasks. However, growing is the doubt about the continuing availability of reliable high-quality labeling (supervision) from human annotators, leading to stagnation of the model's performance. To address this challenge, ``superalignment'' employs the so-called weak-to-strong generalization paradigm, where the supervision from a weak model can provide generalizable knowledge for a strong model. While effective in aligning knowledge for clean samples between the strong and weak models, the standard weak-to-strong approach typically fails to capture adversarial robustness, exposing strong VLMs to adversarial attacks. This inability to transfer adversarial robustness is because adversarial samples are normally missing in the superalignment stage. To this end, we are the first to propose the weak-to-strong (adversarial) robustness generalization method to elicit zero-shot robustness in large-scale models by an unsupervised scheme, mitigating the unreliable information source for alignment from two perspectives: alignment re-weighting and source guidance refinement. We analyze settings under which robustness generalization is possible. Extensive experiments across various vision-language benchmarks validate the effectiveness of our method in numerous scenarios, demonstrating its plug-and-play applicability to large-scale VLMs.

View full details

Poster

The Primacy of Magnitude in Low-Rank Adaptation

Zicheng Zhang ⋅ Haoran Li ⋅ Yifeng Zhang ⋅ Guoqiang Gong ⋅ Jiaxing Wang ⋅ Junxing Hu ⋅ Pengzhang Liu ⋅ Qixia Jiang

Dec 4, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Low-Rank Adaptation (LoRA) offers a parameter-efficient paradigm for tuning large models. While recent spectral initialization methods improve convergence and performance over the naive “Noise \& Zeros” scheme, their extra computational and storage overhead undermines efficiency. In this paper, we establish update magnitude as the fundamental driver of LoRA performance and propose LoRAM, a magnitude-driven “Basis \& Basis” initialization scheme that matches spectral methods without their inefficiencies. Our key contributions are threefold: (i) Magnitude of weight updates determines convergence. We prove low-rank structures intrinsically bound update magnitudes, unifying hyperparameter tuning in learning rate, scaling factor, and initialization as mechanisms to optimize magnitude regulation. (ii) Spectral initialization succeeds via magnitude amplification. We demystify that the presumed knowledge-driven benefit of spectral component essentially arises from the boost in the weight update magnitude. (iii) A novel and compact initialization strategy, LoRAM, scales deterministic orthogonal bases using pretrained weight magnitudes to simulate spectral gains. Extensive experiments show that LoRAM serves as a strong baseline, retaining the full efficiency of LoRA while matching or outperforming spectral initialization across benchmarks.

View full details

Poster

Flattening Hierarchies with Policy Bootstrapping

John Zhou ⋅ Jonathan Kao

Dec 4, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Offline goal-conditioned reinforcement learning (GCRL) is a promising approach for pretraining generalist policies on large datasets of reward-free trajectories, akin to the self-supervised objectives used to train foundation models for computer vision and natural language processing. However, scaling GCRL to longer horizons remains challenging due to the combination of sparse rewards and discounting, which obscures the comparative advantages of primitive actions with respect to distant goals. Hierarchical RL methods achieve strong empirical results on long-horizon goal-reaching tasks, but their reliance on modular, timescale-specific policies and subgoal generation introduces significant additional complexity and hinders scaling to high-dimensional goal spaces. In this work, we introduce an algorithm to train a flat (non-hierarchical) goal-conditioned policy by bootstrapping on subgoal-conditioned policies with advantage-weighted importance sampling. Our approach eliminates the need for a generative model over the (sub)goal space, which we find is key for scaling to high-dimensional control in large state spaces. We further show that existing hierarchical and bootstrapping-based approaches correspond to specific design choices within our derivation. Across a comprehensive suite of state- and pixel-based locomotion and manipulation benchmarks, our method matches or surpasses state-of-the-art offline GCRL algorithms and scales to complex, long-horizon tasks where prior approaches fail. Project page: https://johnlyzhou.github.io/saw/

View full details

Poster

CSBrain: A Cross-scale Spatiotemporal Brain Foundation Model for EEG Decoding

Yuchen Zhou ⋅ Jiamin Wu ⋅ Zichen Ren ⋅ Zhouheng Yao ⋅ Weiheng Lu ⋅ Kunyu Peng ⋅ Qihao Zheng ⋅ Chunfeng Song ⋅ Wanli Ouyang ⋅ Chao Gou

Dec 4, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Understanding and decoding human brain activity from electroencephalography (EEG) signals is a fundamental problem in neuroscience and artificial intelligence, with applications ranging from cognition and emotion recognition to clinical diagnosis and brain–computer interfaces. While recent EEG foundation models have made progress in generalized brain decoding by leveraging unified architectures and large-scale pretraining, they inherit a scale-agnostic dense modeling paradigm from NLP and vision. This design overlooks an intrinsic property of neural activity—cross-scale spatiotemporal structure. Different EEG task patterns span a broad range of temporal and spatial scales, from brief neural activations to slow-varying rhythms, and from localized cortical activations to large-scale distributed interactions. Ignoring this diversity may lead to suboptimal representations and weakened generalization ability. To address these limitations, we propose CSBrain, a Cross-scale Spatiotemporal Brain foundation model for generalized EEG decoding. CSBrain introduces two key components: (i) Cross-scale Spatiotemporal Tokenization (CST), which aggregates multi-scale features within localized temporal windows and anatomical brain regions into compact scale-aware token representations; and (ii) Structured Sparse Attention (SSA), which models cross-window and cross-region dependencies for diverse decoding tasks, further enriching scale diversities while eliminating the spurious dependencies. CST and SSA are alternately stacked to progressively integrate cross-scale spatiotemporal dependencies. Extensive experiments across 11 representative EEG tasks and 16 datasets demonstrate that CSBrain consistently outperforms both task-specific models and strong foundation baselines. These results establish cross-scale modeling as a key inductive bias for generalized EEG decoding and highlight CSBrain as a robust backbone for future brain–AI research.

View full details

Poster

Learning to Factorize Spatio-Temporal Foundation Models

Siru Zhong ⋅ Junjie Qiu ⋅ Yangyu Wu ⋅ Xingchen Zou ⋅ Zhongwen Rao ⋅ Bin Yang ⋅ Chenjuan Guo ⋅ Hao Xu ⋅ Yuxuan Liang

Dec 4, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Spatio-Temporal Foundation Models (STFMs) promise zero/few-shot generalization across various datasets, yet joint spatio-temporal pretraining is computationally prohibitive and struggles with domain-specific spatial correlations. To this end, we introduce FactoST, a factorized STFM that decouples universal temporal pretraining from spatio-temporal adaptation. The first stage pretrains a space-agnostic backbone with multi-frequency reconstruction and domain-aware prompting, capturing cross-domain temporal regularities at low computational cost. The second stage freezes or further fine-tunes the backbone and attaches an adapter that fuses spatial metadata, sparsifies interactions, and aligns domains with continual memory replay. Extensive forecasting experiments reveal that, in few-shot setting, FactoST reduces MAE by up to 46.4% versus UniST, uses 46.2% fewer parameters, and achieves 68% faster inference than OpenCity, while remaining competitive with expert models. We believe this factorized view offers a practical and scalable path toward truly universal STFMs. The code will be released upon notification.

View full details

Poster

UniTok: a Unified Tokenizer for Visual Generation and Understanding

Chuofan Ma ⋅ Yi Jiang ⋅ Junfeng Wu ⋅ Jihan Yang ⋅ Xin Yu ⋅ Zehuan Yuan ⋅ BINGYUE PENG ⋅ Xiaojuan Qi

Dec 4, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Visual generative and understanding models typically rely on distinct tokenizers to process images, presenting a key challenge for unifying them within a single framework. Recent studies attempt to address this by connecting the training of VQVAE (for autoregressive generation) and CLIP (for understanding) to build a unified tokenizer. However, directly combining these training objectives has been observed to cause severe loss conflicts. In this paper, we show that reconstruction and semantic supervision do not inherently conflict. Instead, the underlying bottleneck stems from limited representational capacity of discrete token space. Building on these insights, we introduce UniTok, a unified tokenizer featuring a novel multi-codebook quantization mechanism that effectively scales up the vocabulary size and bottleneck dimension. In terms of final performance, UniTok sets a new record of 0.38 rFID and 78.6\% zero-shot accuracy on ImageNet. Besides, UniTok can be seamlessly integrated into MLLMs to unlock native visual generation capability, without compromising the understanding performance. Additionally, we show that UniTok favors cfg-free generation, reducing gFID from 14.6 to 2.5 on ImageNet 256$\times$256 benchmark. All codes and models have been made publicly available.

View full details

Poster

Theory-Driven Label-Specific Representation for Incomplete Multi-View Multi-Label Learning

Quanjiang Li ⋅ Tianxiang Xu ⋅ Tingjin Luo ⋅ Yan Zhong ⋅ Yang Li ⋅ Yiyun Zhou ⋅ Chenping Hou

Dec 4, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Multi-view multi-label learning typically suffers from dual data incompleteness due to limitations in feature storage and annotation costs. The interplay of hetero geneous features, numerous labels, and missing information significantly degrades model performance. To tackle the complex yet highly practical challenges, we propose a Theory-Driven Label-Specific Representation (TDLSR) framework. Through constructing the view-specific sample topology and prototype association graph, we develop the proximity-aware imputation mechanism, while deriving class representatives that capture the label correlation semantics. To obtain semantically distinct view representations, we introduce principles of information shift, inter action and orthogonality, which promotes the disentanglement of representation information, and mitigates message distortion and redundancy. Besides, label semantic-guided feature learning is employed to identify the discriminative shared and specific representations and refine the label preference across views. Moreover, we theoretically investigate the characteristics of representation learning and the generalization performance. Finally, extensive experiments on public datasets and real-world applications validate the effectiveness of TDLSR.

View full details

Poster

Mean-Field Sampling for Cooperative Multi-Agent Reinforcement Learning

Emile Anand ⋅ Ishani Karmarkar ⋅ Guannan Qu

Dec 4, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Designing efficient algorithms for multi-agent reinforcement learning (MARL) is fundamentally challenging because the size of the joint state and action spaces grows exponentially in the number of agents. These difficulties are exacerbated when balancing sequential global decision-making with local agent interactions. In this work, we propose a new algorithm $\texttt{SUBSAMPLE-MFQ}$ ($\textbf{Subsample}$-$\textbf{M}$ean-$\textbf{F}$ield-$\textbf{Q}$-learning) and a decentralized randomized policy for a system with $n$ agents. For any $k\leq n$, our algorithm learns a policy for the system in time polynomial in $k$. We prove that this learned policy converges to the optimal policy on the order of $\tilde{O}(1/\sqrt{k})$ as the number of subsampled agents $k$ increases. In particular, this bound is independent of the number of agents $n$.

View full details

Poster

Training-Free Constrained Generation With Stable Diffusion Models

Stefano Zampini ⋅ Jacob K Christopher ⋅ Luca Oneto ⋅ Davide Anguita ⋅ Ferdinando Fioretto

Dec 4, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Stable diffusion models represent the state-of-the-art in data synthesis across diverse domains and hold transformative potential for applications in science and engineering, e.g., by facilitating the discovery of novel solutions and simulating systems that are computationally intractable to model explicitly. While there is increasing effort to incorporate physics-based constraints into generative models, existing techniques are either limited in their applicability to latent diffusion frameworks or lack the capability to strictly enforce domain-specific constraints. To address this limitation this paper proposes a novel integration of stable diffusion models with constrained optimization frameworks, enabling the generation of outputs satisfying stringent physical and functional requirements. The effectiveness of this approach is demonstrated through material design experiments requiring adherence to precise morphometric properties, challenging inverse design tasks involving the generation of materials inducing specific stress-strain responses, and copyright-constrained content generation tasks. All code has been released at https://github.com/RAISELab-atUVA/Constrained-Stable-Diffusion.

View full details

Poster

Spectral Estimation with Free Decompression

Siavash Ameli ⋅ Chris van der Heide ⋅ Liam Hodgkinson ⋅ Michael Mahoney

Dec 4, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Computing eigenvalues of very large matrices is a critical task in many machine learning applications, including the evaluation of log-determinants, the trace of matrix functions, and other important metrics. As datasets continue to grow in scale, the corresponding covariance and kernel matrices become increasingly large, often reaching magnitudes that make their direct formation impractical or impossible. Existing techniques typically rely on matrix-vector products, which can provide efficient approximations, if the matrix spectrum behaves well. However, in settings like distributed learning, or when the matrix is defined only indirectly, access to the full data set can be restricted to only very small sub-matrices of the original matrix. In these cases, the matrix of nominal interest is not even available as an implicit operator, meaning that even matrix-vector products may not be available. In such settings, the matrix is "impalpable", in the sense that we have access to only masked snapshots of it. We draw on principles from free probability theory to introduce a novel method of "free decompression" to estimate the spectrum of such matrices. Our method can be used to extrapolate from the empirical spectral densities of small submatrices to infer the eigenspectrum of extremely large (impalpable) matrices (that we cannot form or even evaluate with full matrix-vector products). We demonstrate the effectiveness of this approach through a series of examples, comparing its performance against known limiting distributions from random matrix theory in synthetic settings, as well as applying it to submatrices of real-world datasets, matching them with their full empirical eigenspectra.

View full details

Poster

Stable Minima of ReLU Neural Networks Suffer from the Curse of Dimensionality: The Neural Shattering Phenomenon

Tongtong Liang ⋅ Dan Qiao ⋅ Yu-Xiang Wang ⋅ Rahul Parhi

Dec 4, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

We study the implicit bias of flatness / low (loss) curvature and its effects on generalization in two-layer overparameterized ReLU networks with multivariate inputs---a problem well motivated by the minima stability and edge-of-stability phenomena in gradient-descent training. Existing work either requires interpolation or focuses only on univariate inputs. This paper presents new and somewhat surprising theoretical results for multivariate inputs. On two natural settings (1) generalization gap for flat solutions, and (2) mean-squared error (MSE) in nonparametric function estimation by stable minima, we prove upper and lower bounds, which establish that while flatness does imply generalization, the resulting rates of convergence necessarily deteriorate exponentially as the input dimension grows. This gives an exponential separation between the flat solutions compared to low-norm solutions (i.e., weight decay), which are known not to suffer from the curse of dimensionality. In particular, our minimax lower bound construction, based on a novel packing argument with boundary-localized ReLU neurons, reveals how flat solutions can exploit a kind of "neural shattering" where neurons rarely activate, but with high weight magnitudes. This leads to poor performance in high dimensions. We corroborate these theoretical findings with extensive numerical simulations. To the best of our knowledge, our analysis provides the first systematic explanation for why flat minima may fail to generalize in high dimensions.

View full details

Poster

Ctrl-DNA: Controllable Cell-Type-Specific Regulatory DNA Design via Constrained RL

Xingyu Chen ⋅ Shihao Ma ⋅ Runsheng Lin ⋅ Jiecong Lin ⋅ Bo Wang

Dec 4, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Designing regulatory DNA sequences that achieve precise cell-type-specific gene expression is crucial for advancements in synthetic biology, gene therapy and precision medicine. Although transformer-based language models (LMs) can effectively capture patterns in regulatory DNA, their generative approaches often struggle to produce novel sequences with reliable cell-specific activity. Here, we introduce Ctrl-DNA, a novel constrained reinforcement learning (RL) framework tailored for designing regulatory DNA sequences with controllable cell-type specificity. By formulating regulatory sequence design as a biologically informed constrained optimization problem, we apply RL to autoregressive genomic LMs, enabling the models to iteratively refine sequences that maximize regulatory activity in targeted cell types while constraining off-target effects. Our evaluation on human promoters and enhancers demonstrates that Ctrl-DNA consistently outperforms existing generative and RL-based approaches, generating high-fitness regulatory sequences and achieving state-of-the-art cell-type specificity. Moreover, Ctrl-DNA-generated sequences capture key cell-type-specific transcription factor binding sites (TFBS), short DNA motifs recognized by regulatory proteins that control gene expression, demonstrating the biological plausibility of the generated sequences.

View full details

Poster

MAESTRO : Adaptive Sparse Attention and Robust Learning for Multimodal Dynamic Time Series

Payal Mohapatra ⋅ Yueyuan Sui ⋅ Akash Pandey ⋅ Stephen Xia ⋅ Qi Zhu

Dec 4, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

From clinical healthcare to daily living, continuous sensor monitoring across multiple modalities has shown great promise for real-world intelligent decision-making but also faces various challenges. In this work, we argue for modeling such heterogeneous data sources under the multimodal paradigm and introduce a new framework, MAESTRO. We introduce MAESTRO, a novel framework that overcomes key limitations of existing multimodal learning approaches: (1) reliance on a single primary modality for alignment, (2) pairwise modeling of modalities, and (3) assumption of complete modality observations. These limitations hinder the applicability of these approaches in real-world multimodal time-series settings, where primary modality priors are often unclear, the number of modalities can be large (making pairwise modeling impractical), and sensor failures often result in arbitrary missing observations. At its core, MAESTRO facilitates dynamic intra- and cross-modal interactions based on task relevance, and leverages symbolic tokenization and adaptive attention budgeting to construct long multimodal sequences, which are processed via sparse cross-modal attention. The resulting cross-modal tokens are routed through a sparse Mixture-of-Experts (MoE) mechanism, enabling black-box specialization under varying modality combinations. We evaluate MAESTRO against 10 baselines on four diverse datasets spanning three applications, and observe average relative improvements of 4% and 8% over the best existing multimodal and multivariate approaches, respectively, under complete observations. Under partial observations—with up to 40% of missing modalities—MAESTRO achieves an average 9% improvement. Further analysis also demonstrates the robustness and efficiency of MAESTRO's sparse, modality-aware design for learning from dynamic time series.

View full details

Poster

A Smooth Sea Never Made a Skilled SAILOR: Robust Imitation via Learning to Search

Arnav Kumar Jain ⋅ Vibhakar Mohta ⋅ Subin Kim ⋅ Atiksh Bhardwaj ⋅ Juntao Ren ⋅ Yunhai Feng ⋅ Sanjiban Choudhury ⋅ Gokul Swamy

Dec 4, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

The fundamental limitation of the behavioral cloning (BC) approach to imitation learning is that it only teaches an agent what the expert did at states the expert visited. This means that when a BC agent makes a mistake which takes them out of the support of the demonstrations, they often don't know how to recover from it. In this sense, BC is akin to *giving the agent the fish* -- giving them dense supervision across a narrow set of states -- rather than teaching them *to fish*: to be able to reason independently about achieving the expert's outcome even when faced with unseen situations at test-time. In response, we explore *learning to search* (L2S) from expert demonstrations, i.e. learning the components required to, at test time, plan to match expert outcomes, even after making a mistake. These include *(1)* a world model and *(2)* a reward model. We carefully ablate the set of algorithmic and design decisions required to combine these and other components for stable and sample/interaction-efficient learning of recovery behavior without additional human corrections. Across a dozen visual manipulation tasks from three benchmarks, our approach SAILOR consistently out-performs state-of-the-art Diffusion Policies trained via BC on the same data. Furthermore, scaling up the amount of demonstrations used for BC by 5-10x still leaves a performance gap. We find that SAILOR can identify nuanced failures and is robust to reward hacking. Our code is available at [https://github.com/arnavkj1995/SAILOR](https://github.com/arnavkj1995/SAILOR).

View full details

Poster

Does Object Binding Naturally Emerge in Large Pretrained Vision Transformers?

Yihao Li ⋅ Saeed Salehi ⋅ Lyle Ungar ⋅ Konrad Kording

Dec 4, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Object binding, the brain’s ability to bind the many features that collectively represent an object into a coherent whole, is central to human cognition. It groups low-level perceptual features into high‑level object representations, stores those objects efficiently and compositionally in memory, and supports human reasoning about individual object instances. While prior work often imposes object-centric attention (e.g., Slot Attention) explicitly to probe these benefits, it remains unclear whether this ability naturally emerges in pre-trained Vision Transformers (ViTs). Intuitively, they could: recognizing which patches belong to the same object should be useful for downstream prediction and thus guide attention. Motivated by the quadratic nature of self-attention, we hypothesize that ViTs represent whether two patches belong to the same object, a property we term *IsSameObject*. We decode *IsSameObject* from patch embeddings across ViT layers using a quadratic similarity probe, which reaches over 90\% accuracy. Crucially, this object-binding capability emerges reliably in DINO, CLIP, and ImageNet-supervised ViTs, but is markedly weaker in MAE, suggesting that binding is not a trivial architectural artifact, but an ability acquired through specific pretraining objectives. We further discover that *IsSameObject* is encoded in a low-dimensional subspace on top of object features, and that this signal actively guides attention. Ablating *IsSameObject* from model activations degrades downstream performance and works against the learning objective, implying that emergent object binding naturally serves the pretraining objective. Our findings challenge the view that ViTs lack object binding and highlight how symbolic knowledge of “which parts belong together” emerges naturally in a connectionist system.

View full details

Poster

What are you sinking? A geometric approach on attention sink

Valeria Ruscio ⋅ Umberto Nanni ⋅ Fabrizio Silvestri

Dec 4, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Attention sink (AS) is a consistent pattern in transformer attention maps where certain tokens (often special tokens or positional anchors) disproportionately attract attention from other tokens. We show that in transformers, AS is not an architectural artifact, but it is the manifestation of a fundamental geometric principle: the establishment of reference frames that anchor representational spaces. We analyze several architectures and identify three distinct reference frame types, centralized, distributed, and bidirectional, that correlate with the attention sink phenomenon. We show that they emerge during the earliest stages of training as optimal solutions to the problem of establishing stable coordinate systems in high-dimensional spaces. We show the influence of architecture components, particularly position encoding implementations, on the specific type of reference frame. This perspective transforms our understanding of transformer attention mechanisms and provides insights for both architecture design and the relationship with AS.

View full details

Poster

Understanding Parametric and Contextual Knowledge Reconciliation within Large Language Models

Jun Zhao ⋅ Yongzhuo Yang ⋅ Xiang Hu ⋅ Jingqi Tong ⋅ Yi Lu ⋅ Wei Wu ⋅ Tao Gui ⋅ Qi Zhang ⋅ Xuanjing Huang

Dec 4, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Retrieval-Augmented Generation (RAG) provides additional contextual knowledge to complement the parametric knowledge in Large Language Models (LLMs). These two knowledge interweave to enhance the accuracy and timeliness of LLM responses. However, the internal mechanisms by which LLMs utilize these knowledge remain unclear. We propose modeling the forward propagation of knowledge as an entity flow, employing this framework to trace LLMs' internal behaviors when processing mixed-source knowledge. Linear probing utilizes a trainable linear classifier to detect specific attributes in hidden layers. However, once trained, a probe cannot adapt to dynamically specified entities. To address this challenge, we construct an entity-aware probe, which introduces special tokens to mark probing targets and employs a small trainable rank-8 lora update to process these special markers. We first verify this approach through an attribution experiment, demonstrating that it can accurately detect information about ad-hoc entities from complex hidden states. Next, we trace entity flows across layers to understand how LLMs reconcile conflicting knowledge internally. Our probing results reveal that contextual and parametric knowledge are routed between tokens through distinct sets of attention heads, supporting attention competition only within knowledge types. While conflicting knowledge maintains a residual presence across layers, aligned knowledge from multiple sources gradually accumulates, with the magnitude of this accumulation directly determining its influence on final outputs.

View full details

Poster

Distilling LLM Agent into Small Models with Retrieval and Code Tools

Minki Kang ⋅ Jongwon Jeong ⋅ Seanie Lee ⋅ Jaewoong Cho ⋅ Sung Ju Hwang

Dec 4, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Large language models (LLMs) excel at complex reasoning tasks but remain computationally expensive, limiting their practical deployment. To address this, recent works have focused on distilling reasoning capabilities into smaller language models (sLMs) using chain-of-thought (CoT) traces from teacher LLMs. However, this approach struggles in scenarios requiring rare factual knowledge or precise computation, where sLMs often hallucinate due to limited capability. In this work, we propose Agent Distillation, a framework for transferring not only reasoning capability but full task-solving behavior from LLM-based agents into sLMs with retrieval and code tools. We improve agent distillation along two complementary axes: (1) we introduce a prompting method called first-thought prefix to enhance the quality of teacher-generated trajectories; and (2) we propose a self-consistent action generation for improving test-time robustness of small agents. We evaluate our method on eight reasoning tasks across factual and mathematical domains, covering both in-domain and out-of-domain generalization. Our results show that sLMs as small as 0.5B, 1.5B, 3B parameters can achieve performance competitive with next-tier larger 1.5B, 3B, 7B models fine-tuned using CoT distillation, demonstrating the potential of agent distillation for building practical, tool-using small agents.

View full details

Poster

FFN Fusion: Rethinking Sequential Computation in Large Language Models

Akhiad Bercovich ⋅ Mohammed Dabbah ⋅ Omri Puny ⋅ Ido Galil ⋅ Amnon Geifman ⋅ Yonatan Geifman ⋅ Izik Golan ⋅ Ehud Karpas ⋅ Itay Levy ⋅ Zach Moshe ⋅ Najeeb Nabwani ⋅ Tomer Ronen ⋅ Itamar Schen ⋅ Ido Shahaf ⋅ Oren Tropp ⋅ Ran Zilberstein ⋅ Ran El-Yaniv

Dec 4, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

We introduce \textit{FFN Fusion}, an architectural optimization technique that reduces sequential computation in large language models by identifying and exploiting natural opportunities for parallelization. Our key insight is that sequences of Feed-Forward Network (FFN) layers, particularly those remaining after the removal of specific attention layers, can often be parallelized with minimal accuracy impact. We develop a principled methodology for identifying and fusing such sequences, transforming them into parallel operations that significantly reduce inference latency while preserving model behavior. Applying these techniques to Llama-3.1-405B-Instruct, we create a 253B model (253B-Base), an efficient and soon-to-be publicly available model that achieves a 1.71$\times$ speedup in inference latency and 35$\times$ lower per-token cost while maintaining strong performance across benchmarks. Most intriguingly, we find that even full transformer blocks containing both attention and FFN layers can sometimes be parallelized, suggesting new directions for neural architecture design.

View full details

Poster

Horizon Reduction Makes RL Scalable

Seohong Park ⋅ Kevin Frans ⋅ Deepinder Mann ⋅ Benjamin Eysenbach ⋅ Aviral Kumar ⋅ Sergey Levine

Dec 4, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

In this work, we study the scalability of offline reinforcement learning (RL) algorithms. In principle, a truly scalable offline RL algorithm should be able to solve any given problem, regardless of its complexity, given sufficient data, compute, and model capacity. We investigate if and how current offline RL algorithms match up to this promise on diverse, challenging, previously unsolved tasks, using datasets up to 1000× larger than typical offline RL datasets. We observe that despite scaling up data, many existing offline RL algorithms exhibit poor scaling behavior, saturating well below the maximum performance. We hypothesize that the horizon is the main cause behind the poor scaling of offline RL. We empirically verify this hypothesis through several analysis experiments, showing that long horizons indeed present a fundamental barrier to scaling up offline RL. We then show that various horizon reduction techniques substantially enhance scalability on challenging tasks. Based on our insights, we also introduce a minimal yet scalable method named SHARSA that effectively reduces the horizon. SHARSA achieves the best asymptotic performance and scaling behavior among our evaluation methods, showing that explicitly reducing the horizon unlocks the scalability of offline RL.

View full details

Poster

Which Algorithms Have Tight Generalization Bounds?

Michael Gastpar ⋅ Ido Nachum ⋅ Jonathan Shafer ⋅ Thomas Weinberger

Dec 4, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

We study which machine learning algorithms have tight generalization bounds with respect to a given collection of population distributions. Our results build on and extend the recent work of Gastpar et al. (2023). First, we present conditions that preclude the existence of tight generalization bounds. Specifically, we show that algorithms that have certain inductive biases that cause them to be unstable do not admit tight generalization bounds. Next, we show that algorithms that are sufficiently loss-stable do have tight generalization bounds. We conclude with a simple characterization that relates the existence of tight generalization bounds to the conditional variance of the algorithm's loss.

View full details

Poster

Think or Not? Exploring Thinking Efficiency in Large Reasoning Models via an Information-Theoretic Lens

Xixian Yong ⋅ Xiao Zhou ⋅ Yingying Zhang ⋅ Jinlin Li ⋅ Yefeng Zheng ⋅ Xian Wu

Dec 4, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

The recent rise of Large Reasoning Models (LRMs) has significantly improved multi-step reasoning performance, but often at the cost of generating excessively long reasoning chains. This paper revisits the efficiency of such reasoning processes through an information-theoretic lens, revealing a fundamental trade-off between reasoning length and semantic efficiency. We propose two metrics—InfoBias and InfoGain—to quantify divergence from ideal reasoning paths and stepwise information contribution, respectively. Empirical analyses show that longer reasoning chains tend to exhibit higher information bias and diminishing information gain, especially for incorrect answers. Motivated by these findings, we introduce an entropy-based Adaptive Think strategy that dynamically halts reasoning once confidence is sufficiently high, improving efficiency while maintaining competitive accuracy. Compared to the Vanilla Think approach (default mode), our strategy yields a 1.10% improvement in average accuracy and a 50.80% reduction in token usage on QwQ-32B across six benchmark tasks spanning diverse reasoning types and difficulty levels, demonstrating superior efficiency and reasoning performance. These results underscore the promise of entropy-based methods for enhancing both accuracy and cost-effiiciency in large language model deployment.

View full details

Poster

RidgeLoRA: Matrix Ridge Enhanced Low-Rank Adaptation of Large Language Models

Junda Zhu ⋅ Jun Ai ⋅ Yujun Li ⋅ Yichun Yin ⋅ Yasheng Wang ⋅ Lifeng Shang ⋅ Qun Liu

Dec 4, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

As one of the state-of-the-art parameter-efficient fine-tuning~(PEFT) methods, Low-Rank Adaptation (LoRA) enables model optimization with reduced computational cost through trainable low-rank matrix. However, the low-rank nature makes it prone to produce a decrease in the representation ability, leading to suboptimal performance. In order to break this limitation, we propose RidgeLoRA, a lightweight architecture like LoRA that incorporates novel architecture and matrix ridge enhanced full-rank approximation, to match the performance of full-rank training, while eliminating the need for high memory and a large number of parameters to restore the rank of matrices. We provide a rigorous mathematical derivation to prove that RidgeLoRA has a better upper bound on the representations than vanilla LoRA. Furthermore, extensive experiments across multiple domains demonstrate that RidgeLoRA achieves better performance than other LoRA variants, and can even match or surpass full-rank training.

View full details

Poster

InfiFPO: Implicit Model Fusion via Preference Optimization in Large Language Models

Yanggan Gu ⋅ Yuanyi Wang ⋅ Zhaoyi Yan ⋅ Yiming Zhang ⋅ Qi Zhou ⋅ Fei Wu ⋅ Hongxia Yang

Dec 4, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Model fusion combines multiple Large Language Models (LLMs) with different strengths into a more powerful, integrated model through lightweight training methods. Existing works on model fusion focus primarily on supervised fine-tuning (SFT), leaving preference alignment (PA) —a critical phase for enhancing LLM performance—largely unexplored. The current few fusion methods on PA phase, like WRPO, simplify the process by utilizing only response outputs from source models while discarding their probability information. To address this limitation, we propose InfiFPO, a preference optimization method for implicit model fusion. InfiFPO replaces the reference model in Direct Preference Optimization (DPO) with a fused source model that synthesizes multi-source probabilities at the sequence level, circumventing complex vocabulary alignment challenges in previous works and meanwhile maintaining the probability information. By introducing probability clipping and max-margin fusion strategies, InfiFPO enables the pivot model to align with human preferences while effectively distilling knowledge from source models. Comprehensive experiments on 11 widely-used benchmarks demonstrate that InfiFPO consistently outperforms existing model fusion and preference optimization methods. When using Phi-4 as the pivot model, InfiFPO improves its average performance from 79.95 to 83.33 on 11 benchmarks, significantly improving its capabilities in mathematics, coding, and reasoning tasks.

View full details

Poster

EvoBrain: Dynamic Multi-Channel EEG Graph Modeling for Time-Evolving Brain Networks

Rikuto Kotoge ⋅ Zheng Chen ⋅ Tasuku Kimura ⋅ Yasuko Matsubara ⋅ Takufumi Yanagisawa ⋅ Haruhiko Kishima ⋅ Yasushi Sakurai

Dec 4, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Dynamic GNNs, which integrate temporal and spatial features in Electroencephalography (EEG) data, have shown great potential in automating seizure detection. However, fully capturing the underlying dynamics necessary to represent brain states, such as seizure and non-seizure, remains a non-trivial task and presents two fundamental challenges. First, most existing dynamic GNN methods are built on temporally fixed static graphs, which fail to reflect the evolving nature of brain connectivity during seizure progression. Second, current efforts to jointly model temporal signals and graph structures and, more importantly, their interactions remain nascent, often resulting in inconsistent performance. To address these challenges, we present the first theoretical analysis of these two problems, demonstrating the effectiveness and necessity of explicit dynamic modeling and time-then-graph dynamic GNN method. Building on these insights, we propose EvoBrain, a novel seizure detection model that integrates a two-stream Mamba architecture with a GCN enhanced by Laplacian Positional Encoding, following neurological insights. Moreover, EvoBrain incorporates explicitly dynamic graph structures, allowing both nodes and edges to evolve over time. Our contributions include (a) a theoretical analysis proving the expressivity advantage of explicit dynamic modeling and time-then-graph over other approaches, (b) a novel and efficient model that significantly improves AUROC by 23\% and F1 score by 30\%, compared with the dynamic GNN baseline, and (c) broad evaluation of our method on the challenging early seizure prediction task.

View full details

Poster

Unifying Proportional Fairness in Centroid and Non-Centroid Clustering

Benjamin Cookson ⋅ Nisarg Shah ⋅ Ziqi Yu

Dec 4, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Proportional fairness criteria inspired by democratic ideals of proportional representation have received growing attention in the clustering literature. Prior work has investigated them in two separate paradigms. Chen et al. [ICML 2019] study _centroid clustering_, in which each data point's loss is determined by its distance to a representative point (centroid) chosen in its cluster. Caragiannis et al. [NeurIPS 2024] study _non-centroid clustering_, in which each data point's loss is determined by its maximum distance to any other data point in its cluster. We generalize both paradigms to introduce _semi-centroid clustering_, in which each data point's loss is a combination of its centroid and non-centroid losses, and study two proportional fairness criteria---the core and, its relaxation, fully justified representation (FJR). Our main result is a novel algorithm which achieves a constant approximation to the core, in polynomial time, even when the distance metrics used for centroid and non-centroid loss measurements are different. We also derive improved results for more restricted loss functions and the weaker FJR criterion, and establish lower bounds in each case.

View full details

Poster

A Unifying View of Linear Function Approximation in Off-Policy RL Through Matrix Splitting and Preconditioning

Zechen Wu ⋅ Amy Greenwald ⋅ Ronald Parr

Dec 4, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

In off-policy policy evaluation (OPE) tasks within reinforcement learning, Temporal Difference Learning(TD) and Fitted Q-Iteration (FQI) have traditionally been viewed as differing in the number of updates toward the target value function: TD makes one update, FQI makes an infinite number, and Partial Fitted Q-Iteration (PFQI) performs a finite number. We show that this view is not accurate, and provide a new mathematical perspective under linear value function approximation that unifies these methods as a single iterative method solving same linear system, but using different matrix splitting schemes and preconditioners. We show that increasing the number of updates under the same target value function, i.e., the target network technique, is a transition from using a constant preconditioner to using a data-feature adaptive preconditioner. This elucidates, for the first time, why TD convergence does not necessarily imply FQI convergence, and establishes tight convergence connections among TD, PFQI, and FQI. Our framework enables sharper theoretical results than previous work and characterization of the convergence conditions for each algorithm, without relying on assumptions about the features (e.g., linear independence). We also provide an encoder-decoder perspective to better understand TD’s convergence conditions, and prove, for the first time, that when a large learning rate doesn’t work, trying a smaller one may help(for batch TD). Our framework also leads to the discovery of new crucial conditions on features for convergence, and shows how common assumptions about features influence convergence, e.g., the assumption of linearly independent features can be dropped without compromising the convergence guarantees of stochastic TD in the on-policy setting. This paper is also the first to introduce matrix splitting into the convergence analysis of these algorithms.

View full details

Poster

Not All Data are Good Labels: On the Self-supervised Labeling for Time Series Forecasting

Yuxuan Yang ⋅ Dalin Zhang ⋅ Yuxuan Liang ⋅ Hua Lu ⋅ Gang Chen ⋅ Huan Li

Dec 4, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Time Series Forecasting (TSF) is a crucial task in various domains, yet existing TSF models rely heavily on high-quality data and insufficiently exploit all available data. This paper explores a novel self-supervised approach to re-label time series datasets by inherently constructing candidate datasets. During the optimization of a simple reconstruction network, intermediates are used as pseudo labels in a self-supervised paradigm, improving generalization for any predictor. We introduce the Self-Correction with Adaptive Mask (SCAM), which discards overfitted components and selectively replaces them with pseudo labels generated from reconstructions. Additionally, we incorporate Spectral Norm Regularization (SNR) to further suppress overfitting from a loss landscape perspective. Our experiments on eleven real-world datasets demonstrate that SCAM consistently improves the performance of various backbone models. This work offers a new perspective on constructing datasets and enhancing the generalization of TSF models through self-supervised learning. The code is available at https://github.com/SuDIS-ZJU/SCAM.

View full details

Poster

Communication-Efficient Language Model Training Scales Reliably and Robustly: Scaling Laws for DiLoCo

Zachary Charles ⋅ Gabriel Teston ⋅ Lucio Dery ⋅ John Rush ⋅ Nova Fallen ⋅ Zachary Garrett ⋅ Arthur Szlam ⋅ Arthur Douillard

Dec 4, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

As we scale to more massive machine learning models, the frequent synchronization demands inherent in data-parallel approaches create significant slowdowns, posing a critical challenge to further scaling. Recent work develops an approach (DiLoCo) that relaxes synchronization demands without compromising model quality. However, these works do not carefully analyze how DiLoCo's behavior changes with model size. In this work, we study the scaling law behavior of DiLoCo when training LLMs under a fixed compute budget. We focus on how algorithmic factors, including number of model replicas, hyperparameters, and token budget affect training in ways that can be accurately predicted via scaling laws. We find that DiLoCo scales both predictably and robustly with model size. When well-tuned, DiLoCo scales better than data-parallel training with model size, and can outperform data-parallel training even at small model sizes. Our results showcase a more general set of benefits of DiLoCo than previously documented, including increased optimal batch sizes, improved downstream generalization with scale, and improved evaluation loss for a fixed token budget.

View full details

Poster

AI-Researcher: Autonomous Scientific Innovation

Jiabin Tang ⋅ Lianghao Xia ⋅ Zhonghang Li ⋅ Chao Huang

Dec 4, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

The powerful reasoning capabilities of Large Language Models (LLMs) in mathematics and coding, combined with their ability to automate complex tasks through agentic frameworks, present unprecedented opportunities for accelerating scientific innovation. In this paper, we introduce AI-Researcher, a fully autonomous research system that transforms how AI-driven scientific discovery is conducted and evaluated. Our framework seamlessly orchestrates the complete research pipeline--from literature review and hypothesis generation to algorithm implementation and publication-ready manuscript preparation--with minimal human intervention. To rigorously assess autonomous research capabilities, we develop Scientist-Bench, a comprehensive benchmark comprising state-of-the-art papers across diverse AI research domains, featuring both guided innovation and open-ended exploration tasks. Through extensive experiments, we demonstrate that AI-Researcher achieves remarkable implementation success rates and produces research papers that approach human-level quality. This work establishes new foundations for autonomous scientific innovation that can complement human researchers by systematically exploring solution spaces beyond cognitive limitations.

View full details

Poster

CTRL-ALT-DECEIT Sabotage Evaluations for Automated AI R&D

Francis Ward ⋅ Teun van der Weij ⋅ Hanna Gábor ⋅ Sam Martin ⋅ Raja Moreno ⋅ Harel Lidar ⋅ Louis Makower ⋅ Thomas Jodrell ⋅ Lauren Robson

Dec 4, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

AI systems are increasingly able to autonomously conduct realistic software engineering tasks, and may soon be deployed to automate machine learning (ML) R\&D itself. Frontier AI systems may be deployed in safety-critical settings, including to help ensure the safety of future systems. Unfortunately, frontier and future systems may not be sufficiently trustworthy, and there is evidence that these systems may even be misaligned with their developers or users. Therefore, we investigate the capabilities of AI agents to act against the interests of their users when conducting ML engineering, by sabotaging ML models, sandbagging their performance, and subverting oversight mechanisms. First, we extend MLE-Bench, a benchmark for realistic ML tasks, with code-sabotage tasks such as implanting backdoors and purposefully causing generalisation failures. Frontier agents make meaningful progress on our sabotage tasks. In addition, we study agent capabilities to sandbag on MLE-Bench. Agents can calibrate their performance to specified target levels below their actual capability. To mitigate sabotage, we use LM monitors to detect suspicious agent behaviour, and we measure model capability to sabotage and sandbag without being detected by these monitors. Overall, monitors are capable at detecting code-sabotage attempts but our results suggest that detecting sandbagging is more difficult. Additionally, aggregating multiple monitor predictions works well, but monitoring may not be sufficiently reliable to mitigate sabotage in high-stakes domains. Our benchmark is implemented in the UK AISI’s Inspect framework and we make our code publicly available.

View full details

Poster

On the Expressive Power of Mixture-of-Experts for Structured Complex Tasks

Mingze Wang ⋅ Weinan E

Dec 4, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Mixture-of-experts networks (MoEs) have demonstrated remarkable efficiency in modern deep learning. Despite their empirical success, the theoretical foundations underlying their ability to model complex tasks remain poorly understood. In this work, we conduct a systematic study of the expressive power of MoEs in modeling complex tasks with two common structural priors: low-dimensionality and sparsity. For shallow MoEs, we prove that they can efficiently approximate functions supported on low-dimensional manifolds, overcoming the curse of dimensionality. For deep MoEs, we show that $\mathcal{O}(L)$-layer MoEs with $E$ experts per layer can approximate piecewise functions comprising $E^L$ pieces with compositional sparsity, i.e., they can exhibit an exponential number of structured tasks. Our analysis reveals the roles of critical architectural components and hyperparameters in MoEs, including the gating mechanism, expert networks, the number of experts, and the number of layers, and offers natural suggestions for MoE variants.

View full details

Poster

Learnable Burst-Encodable Time-of-Flight Imaging for High-Fidelity Long-Distance Depth Sensing

Manchao Bao ⋅ Shengjiang Fang ⋅ Tao Yue ⋅ Xuemei Hu

Dec 4, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Long-distance depth imaging holds great promise for applications such as autonomous driving and robotics. Direct time-of-flight (dToF) imaging offers high-precision, long-distance depth sensing, yet demands ultra-short pulse light sources and high-resolution time-to-digital converters. In contrast, indirect time-of-flight (iToF) imaging often suffers from phase wrapping and low signal-to-noise ratio (SNR) as the sensing distance increases. In this paper, we introduce a novel ToF imaging paradigm, termed Burst-Encodable Time-of-Flight (BE-ToF), which facilitates high-fidelity, long-distance depth imaging. Specifically, the BE-ToF system emits light pulses in burst mode and estimates the phase delay of the reflected signal over the entire burst period, thereby effectively avoiding the phase wrapping inherent to conventional iToF systems. Moreover, to address the low SNR caused by light attenuation over increasing distances, we propose an end-to-end learnable framework that jointly optimizes the coding functions and the depth reconstruction network. A specialized double well function and first-order difference term are incorporated into the framework to ensure the hardware implementability of the coding functions. The proposed approach is rigorously validated through comprehensive simulations and real-world prototype experiments, demonstrating its effectiveness and practical applicability. The code is available at: https://github.com/ComputationalPerceptionLab/BE-ToF.

View full details

Poster

ElliCE: Efficient and Provably Robust Algorithmic Recourse via the Rashomon Sets

Bohdan Turbal ⋅ Iryna Voitsitska ⋅ Lesia Semenova

Dec 4, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Machine learning models now influence decisions that directly affect people’s lives, making it important to understand not only their predictions, but also how individuals could act to obtain better results. Algorithmic recourse provides actionable input modifications to achieve more favorable outcomes, typically relying on counterfactual explanations to suggest such changes. However, when the Rashomon set - the set of near-optimal models - is large, standard counterfactual explanations can become unreliable, as a recourse action valid for one model may fail under another. We introduce ElliCE, a novel framework for robust algorithmic recourse that optimizes counterfactuals over an ellipsoidal approximation of the Rashomon set. The resulting explanations are provably valid over this ellipsoid, with theoretical guarantees on uniqueness, stability, and alignment with key feature directions. Empirically, ElliCE generates counterfactuals that are not only more robust but also more flexible, adapting to user-specified features constraints while being substantially faster than existing baselines. This provides a principled and practical solution for reliable recourse under model uncertainty, ensuring stable recommendations for users even as models evolve.

View full details

Poster

Fisher meets Feynman: score-based variational inference with a product of experts

Diana Cai ⋅ Robert Gower ⋅ David Blei ⋅ Lawrence Saul

Dec 4, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

We introduce a highly expressive yet distinctly tractable family for black-box variational inference (BBVI). Each member of this family is a weighted product of experts (PoE), and each weighted expert in the product is proportional to a multivariate $t$-distribution. These products of experts can model distributions with skew, heavy tails, and multiple modes, but to use them for BBVI, we must be able to sample from their densities. We show how to do this by reformulating these products of experts as latent variable models with auxiliary Dirichlet random variables. These Dirichlet variables emerge from a Feynman identity, originally developed for loop integrals in quantum field theory, that expresses the product of multiple fractions (or in our case, $t$-distributions) as an integral over the simplex. We leverage this simplicial latent space to draw weighted samples from these products of experts---samples which BBVI then uses to find the PoE that best approximates a target density. Given a collection of experts, we derive an iterative procedure to optimize the exponents that determine their geometric weighting in the PoE. At each iteration, this procedure minimizes a regularized Fisher divergence to match the scores of the variational and target densities at a batch of samples drawn from the current approximation. This minimization reduces to a convex quadratic program, and we prove under general conditions that these updates converge exponentially fast to a near-optimal weighting of experts. We conclude by evaluating this approach on a variety of synthetic and real-world target distributions.

View full details

Poster

Towards Dynamic 3D Reconstruction of Hand-Instrument Interaction in Ophthalmic Surgery

Ming Hu ⋅ Zhengdi Yu ⋅ feilong tang ⋅ Kaiwen Chen ⋅ Yulong Li ⋅ Imran Razzak ⋅ Junjun He ⋅ Tolga Birdal ⋅ Kai-Jing Zhou ⋅ Zongyuan Ge

Dec 4, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Accurate 3D reconstruction of hands and instruments is critical for vision-based analysis of ophthalmic microsurgery, yet progress has been hampered by the lack of realistic, large-scale datasets and reliable annotation tools. In this work, we introduce OphNet-3D, the first extensive RGB-D dynamic 3D reconstruction dataset for ophthalmic surgery, comprising 41 sequences from 40 surgeons and totaling 7.1 million frames, with fine-grained annotations of 12 surgical phases, 10 instrument categories, dense MANO hand meshes, and full 6-DoF instrument poses. To scalably produce high-fidelity labels, we design a multi-stage automatic annotation pipeline that integrates multi-view data observation, data-driven motion prior with cross-view geometric consistency and biomechanical constraints, along with a combination of collision-aware interaction constraints for instrument interactions. Building upon OphNet-3D, we establish two challenging benchmarks—bimanual hand pose estimation and hand–instrument interaction reconstruction—and propose two dedicated architectures: H-Net for dual-hand mesh recovery and OH-Net for joint reconstruction of two-hand–two-instrument interactions. These models leverage a novel spatial reasoning module with weak-perspective camera modeling and collision-aware center-based representation. Both architectures outperform existing methods by substantial margins, achieving improvements of over 2mm in Mean Per Joint Position Error (MPJPE) and up to 23\% in ADD-S metrics for hand and instrument reconstruction, respectively.

View full details

Poster

Memory-Enhanced Neural Solvers for Routing Problems

Felix Chalumeau ⋅ Refiloe Shabe ⋅ Noah De Nicola ⋅ Arnu Pretorius ⋅ Tom Barrett ⋅ Nathan Grinsztajn

Dec 4, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Routing Problems are central to many real-world applications, yet remain challenging due to their (NP-)hard nature. Amongst existing approaches, heuristics often offer the best trade-off between quality and scalability, making them suitable for industrial use. While Reinforcement Learning (RL) offers a flexible framework for designing heuristics, its adoption over handcrafted heuristics remains incomplete. Existing learned methods still lack the ability to adapt to specific instances and fully leverage the available computational budget. Current best methods either rely on a collection of pre-trained policies, or on RL fine-tuning; hence failing to fully utilize newly available information within the constraints of the budget. In response, we present MEMENTO, an approach that leverages memory to improve the search of neural solvers at inference. MEMENTO updates the action distribution dynamically based on the outcome of previous decisions. We validate its effectiveness on Traveling Salesman and Capacitated Vehicle Routing problems, demonstrating its superiority over tree-search and policy-gradient fine-tuning; and showing that it can be zero-shot combined with diversity-based solvers. We successfully train all RL auto-regressive solvers on large instances, and verify MEMENTO's scalability and data-efficiency: pushing the state-of-the-art on 11 out of 12 evaluated tasks.

View full details

Poster

E2Former: An Efficient and Equivariant Transformer with Linear-Scaling Tensor Products

Yunyang Li ⋅ Lin Huang ⋅ Zhihao Ding ⋅ Xinran Wei ⋅ Chu Wang ⋅ Han Yang ⋅ Zun Wang ⋅ Chang Liu ⋅ Yu Shi ⋅ Peiran Jin ⋅ Tao Qin ⋅ Mark Gerstein ⋅ Jia Zhang

Dec 4, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Equivariant Graph Neural Networks (EGNNs) have demonstrated significant success in modeling microscale systems, including those in chemistry, biology and materials science. However, EGNNs face substantial computational challenges due to the high cost of constructing edge features via spherical tensor products, making them almost impractical for large-scale systems. To address this limitation, we introduce E2Former, an equivariant and efficient transformer architecture that incorporates a Wigner $6j$ convolution (Wigner $6j$ Conv). By shifting the computational burden from edges to nodes, Wigner $6j$ Conv reduces the complexity from $O(| \mathcal{E}|)$ to $O(| \mathcal{V}|)$ while preserving both the model's expressive power and rotational equivariance. We show that this approach achieves a 7x–30x speedup compared to conventional $\mathrm{SO}(3)$ convolutions. Furthermore, our empirical results demonstrate that the derived E2Former mitigates the computational challenges of existing approaches without compromising the ability to capture detailed geometric information. This development could suggest a promising direction for scalable molecular modeling.

View full details

Poster

Generalized Top-k Mallows Model for Ranked Choices

Shahrzad Haddadan ⋅ Sara Ahmadian

Dec 4, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

The classic Mallows model is a foundational tool for modeling user preferences. However, it has limitations in capturing real-world scenarios, where users often focus only on a limited set of preferred items and are indifferent to the rest. To address this, extensions such as the top-$k$ Mallows model have been proposed, aligning better with practical applications. In this paper, we address several challenges related to the generalized top-$k$ Mallows model, with a focus on analyzing buyer choices. Our key contributions are: (1) a novel sampling scheme tailored to generalized top-$k$ Mallows models, (2) an efficient algorithm for computing choice probabilities under this model, and (3) an active learning algorithm for estimating the model parameters from observed choice data. These contributions provide new tools for analysis and prediction in critical decision-making scenarios. We present a rigorous mathematical analysis for the performance of our algorithms. Furthermore, through extensive experiments on synthetic data and real-world data, we demonstrate the scalability and accuracy of our proposed methods, and we compare the predictive power of Mallows model for top-$k$ lists compared to the simpler Multinomial Logit model.

View full details

Poster

Repurposing Marigold for Zero-Shot Metric Depth Estimation via Defocus Blur Cues

Chinmay Talegaonkar ⋅ Nikhil Gandudi Suresh ⋅ Zachary Novack ⋅ Yash Belhe ⋅ Priyanka Nagasamudra ⋅ Nicholas Antipa

Dec 4, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Recent monocular metric depth estimation (MMDE) methods have made notable progress towards zero-shot generalization. However, they still exhibit a significant performance drop on out-of-distribution datasets. We address this limitation by injecting defocus blur cues at inference time into Marigold, a \textit{pre-trained} diffusion model for zero-shot, scale-invariant monocular depth estimation (MDE). Our method effectively turns Marigold into a metric depth predictor in a training-free manner. To incorporate defocus cues, we capture two images with a small and a large aperture from the same viewpoint. To recover metric depth, we then optimize the metric depth scaling parameters and the noise latents of Marigold at inference time using gradients from a loss function based on the defocus-blur image formation model. We compare our method against existing state-of-the-art zero-shot MMDE methods on a self-collected real dataset, showing quantitative and qualitative improvements.

View full details

Poster

Mitigating Instability in High Residual Adaptive Sampling for PINNs via Langevin Dynamics

Minseok Jeong ⋅ Giup Seo ⋅ Euiseok Hwang

Dec 4, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Recently, physics-informed neural networks (PINNs) have gained attention in the scientific community for their potential to solve partial differential equations (PDEs). However, they face challenges related to resource efficiency and slow convergence. Adaptive sampling methods, which prioritize collocation points with high residuals, improve both efficiency and accuracy. However, these methods often neglect points with medium or low residuals, which can affect stability as the complexity of the model increases. In this paper, we investigate this limitation and show that high residual-based approaches require stricter learning rate bounds to ensure stability. To address this, we propose a Langevin dynamics-based Adaptive Sampling (LAS) framework that is robust to various learning rates and model complexities. Our experiments demonstrate that the proposed method outperforms existing approaches in terms of relative $L^{2}$ error, and stability across a range of environments, including high-dimensional PDEs where Monte Carlo integration-based methods typically suffer from instability.

View full details

Poster

Learnable Sampler Distillation for Discrete Diffusion Models

Feiyang Fu ⋅ Tongxian Guo ⋅ Zhaoqiang Liu

Dec 4, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Discrete diffusion models (DDMs) have shown powerful generation ability for discrete data modalities like text and molecules. However, their practical application is hindered by inefficient sampling, requiring a large number of sampling steps. Accelerating DDMs by using larger step sizes typically introduces significant problems in generation quality, as it amplifies the impact of both the compounding decoding error due to factorized predictions and discretization error from numerical approximations, leading to a significant decrease in sampling quality. To address these challenges, we propose learnable sampler distillation (LSD), a novel approach to train fast and high-fidelity samplers for DDMs. LSD employs a distillation approach where a student sampler with a few steps learns to align its intermediate score trajectory with that of a high-quality teacher sampler with numerous steps. This alignment is achieved by optimizing learnable sampler coefficients that adaptively adjust sampling dynamics. Additionally, we further propose LSD+, which also learns time schedules that allocate steps non-uniformly. Experiments across text generation, image generation, and synthetic tasks demonstrate that our proposed approaches outperform existing samplers for DDMs, achieving substantially higher sampling quality with significantly fewer sampling steps.

View full details

Poster

Ambient Proteins - Training Diffusion Models on Noisy Structures

Giannis Daras ⋅ Jeffrey Ouyang-Zhang ⋅ Krithika Ravishankar ⋅ Constantinos Daskalakis ⋅ Adam Klivans ⋅ Daniel Diaz

Dec 4, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

We present Ambient Protein Diffusion, a framework for training protein diffusion models that generates structures with unprecedented diversity and quality. State-of-the-art generative models are trained on computationally derived structures from AlphaFold2 (AF), as experimentally determined structures are relatively scarce. The resulting models are therefore limited by the quality of synthetic datasets. Since the accuracy of AF predictions degrades with increasing protein length and complexity, de novo generation of long, complex proteins remains challenging. Ambient Protein Diffusion overcomes this problem by treating low-confidence AF structures as corrupted data. Rather than simply filtering out low-quality AF structures, our method adjusts the diffusion objective for each structure based on its corruption level, allowing the model to learn from both high and low quality structures. Empirically, ambient protein diffusion yields major improvements: on proteins with 700 residues, diversity increases from 45% to 85% from the previous state-of-the-art, and designability improves from 70% to 88%.

View full details

Poster

RepLDM: Reprogramming Pretrained Latent Diffusion Models for High-Quality, High-Efficiency, High-Resolution Image Generation

Boyuan Cao ⋅ Jiaxin Ye ⋅ Yujie Wei ⋅ Hongming Shan

Dec 4, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

While latent diffusion models (LDMs), such as Stable Diffusion, are designed for high-resolution image generation, they often struggle with significant structural distortions when generating images at resolutions higher than their training one. Instead of relying on extensive retraining, a more resource-efficient approach is to reprogram the pretrained model for high-resolution (HR) image generation; however, existing methods often result in poor image quality and long inference time. We introduce RepLDM, a novel reprogramming framework for pretrained LDMs that enables high-quality, high-efficiency, high-resolution image generation; see Fig. 1. RepLDM consists of two stages: (i) an attention guidance stage, which generates a latent representation of a higher-quality training-resolution image using a novel parameter-free self-attention mechanism to enhance the structural consistency; and (ii) a progressive upsampling stage, which progressively performs upsampling in pixel space to mitigate the severe artifacts caused by latent space upsampling. The effective initialization from the first stage allows for denoising at higher resolutions with significantly fewer steps, improving the efficiency. Extensive experimental results demonstrate that RepLDM significantly outperforms state-of-the-art methods in both quality and efficiency for HR image generation, underscoring its advantages for real-world applications. Codes: https://github.com/kmittle/RepLDM.

View full details

Poster

Aligning Text-to-Image Diffusion Models to Human Preference by Classification

Longquan Dai ⋅ Xiaolu Wei ⋅ wang he ⋅ Shaomeng Wang ⋅ Jinhui Tang

Dec 4, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Text-to-image diffusion models are typically trained on large-scale web data, often resulting in outputs that misalign with human preferences. Inspired by preference learning in large language models, we propose ABC (Alignment by Classification), a simple yet effective framework for aligning diffusion models with human preferences. In contrast to prior DPO-based methods that depend on suboptimal supervised fine-tuned (SFT) reference models, ABC assumes access to an ideal reference model perfectly aligned with human intent and reformulates alignment as a classification problem. Under this view, we recognize that preference data naturally forms a semi-supervised classification setting. To address this, we propose a data augmentation strategy that transforms preference comparisons into fully supervised training signals. We then introduce a classification-based ABC loss to guide alignment. Our alignment by classification approach could effectively steer the diffusion model toward the behavior of the ideal reference. Experiments on various diffusion models show that our ABC consistently outperforms existing baselines, offering a scalable and robust solution for preference-based text-to-image fine-tuning.

View full details

Poster

Selective Omniprediction and Fair Abstention

Sílvia Casacuberta ⋅ Varun Kanade

Dec 4, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

We propose new learning algorithms for building selective classifiers, which are predictors that are allowed to abstain on some fraction of the domain. We study the model where a classifier may abstain from predicting at a fixed cost. Building on the recent framework on multigroup fairness and omniprediction, given a pre-specified class of loss functions, we provide an algorithm for building a single classifier that learns abstentions and predictions optimally for every loss in the entire class, where the abstentions are decided efficiently for each specific loss function by applying a fixed post-processing function. Our algorithm and theoretical guarantees generalize the previously-known algorithms for learning selective classifiers in formal learning-theoretic models. We then extend the traditional multigroup fairness algorithms to the selective classification setting and show that we can use a calibrated and multiaccurate predictor to efficiently build selective classifiers that abstain optimally not only globally but also locally within each of the groups in any pre-specified collection of possibly intersecting subgroups of the domain, and are also accurate when they do not abstain. We show how our abstention algorithms can be used as conformal prediction methods in the binary classification setting to achieve both marginal and group-conditional coverage guarantees for an intersecting collection of groups. We provide empirical evaluations for all of our theoretical results, demonstrating the practicality of our learning algorithms for abstaining optimally and fairly.

View full details

Poster

Characterizing control between interacting subsystems with deep Jacobian estimation

Adam J. Eisen ⋅ Mitchell Ostrow ⋅ Sarthak Chandra ⋅ Leo Kozachkov ⋅ Earl Miller ⋅ Ila Fiete

Dec 4, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Biological function arises through the dynamical interactions of multiple subsystems, including those between brain areas, within gene regulatory networks, and more. A common approach to understanding these systems is to model the dynamics of each subsystem and characterize communication between them. An alternative approach is through the lens of control theory: how the subsystems control one another. This approach involves inferring the directionality, strength, and contextual modulation of control between subsystems. However, methods for understanding subsystem control are typically linear and cannot adequately describe the rich contextual effects enabled by nonlinear complex systems. To bridge this gap, we devise a data-driven nonlinear control-theoretic framework to characterize subsystem interactions via the Jacobian of the dynamics. We address the challenge of learning Jacobians from time-series data by proposing the JacobianODE, a deep learning method that leverages properties of the Jacobian to directly estimate it for arbitrary dynamical systems from data alone. We show that JacobianODE models outperform existing Jacobian estimation methods on challenging systems, including high-dimensional chaos. Applying our approach to a multi-area recurrent neural network (RNN) trained on a working memory selection task, we show that the “sensory” area gains greater control over the “cognitive” area over learning. Furthermore, we leverage the JacobianODE to directly control the trained RNN, enabling precise manipulation of its behavior. Our work lays the foundation for a theoretically grounded and data-driven understanding of interactions among biological subsystems.

View full details

Poster

Feedback-Aware MCTS for Goal-Oriented Information Seeking

Harshita Chopra ⋅ Chirag Shah

Dec 4, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Effective decision-making and problem-solving in conversational systems require the ability to identify and acquire missing information through targeted questioning. A key challenge lies in efficiently narrowing down a large space of possible outcomes by posing questions that minimize uncertainty. To address this, we introduce a novel framework that leverages Large Language Models (LLMs) to generate information-seeking questions, with Monte Carlo Tree Search (MCTS) to strategically select questions that maximize information gain, as a part of inference-time planning. Our primary contribution includes a hierarchical feedback mechanism that exploits past interaction patterns to guide future strategy. Specifically, each new problem is mapped to a cluster based on semantic similarity, and our UCT (Upper Confidence bound for Trees) formulation employs a cluster-specific bonus reward to prioritize successful question trajectories that have proven effective for similar problems in the past. Extensive empirical evaluation across medical diagnosis and technical troubleshooting domains shows that our method achieves an average of 12\% improvement in success rates and about 10x reduction in the number of LLM calls made for planning per conversation, compared to the state of the art. An additional 8\% gain in success rate is observed on average when we start with a constrained set of possibilities. Our results underscore the efficacy of feedback-aware MCTS in enhancing information-seeking in goal-oriented dialogues.

View full details

Poster

Hybrid-Balance GFlowNet for Solving Vehicle Routing Problems

Ni Zhang ⋅ Zhiguang Cao

Dec 4, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Existing GFlowNet-based methods for vehicle routing problems (VRPs) typically employ Trajectory Balance (TB) to achieve global optimization but often neglect important aspects of local optimization. While Detailed Balance (DB) addresses local optimization more effectively, it alone falls short in solving VRPs, which inherently require holistic trajectory optimization. To address these limitations, we introduce the Hybrid-Balance GFlowNet (HBG) framework, which uniquely integrates TB and DB in a principled and adaptive manner by aligning their intrinsically complementary strengths. Additionally, we propose a specialized inference strategy for depot-centric scenarios like the Capacitated Vehicle Routing Problem (CVRP), leveraging the depot node's greater flexibility in selecting successors. Despite this specialization, HBG maintains broad applicability, extending effectively to problems without explicit depots, such as the Traveling Salesman Problem (TSP). We evaluate HBG by integrating it into two established GFlowNet-based solvers, i.e., AGFN and GFACS, and demonstrate consistent and significant improvements across both CVRP and TSP, underscoring the enhanced solution quality and generalization afforded by our approach.

View full details

Poster

On the Hardness of Conditional Independence Testing In Practice

Zheng He ⋅ Roman Pogodin ⋅ Yazhe Li ⋅ Namrata Deka ⋅ Arthur Gretton ⋅ Danica J. Sutherland

Dec 4, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Tests of conditional independence (CI) underpin a number of important problems in machine learning and statistics, from causal discovery to evaluation of predictor fairness and out-of-distribution robustness. Shah and Peters (2020) showed that, contrary to the unconditional case, no universally finite-sample valid test can ever achieve nontrivial power. While informative, this result (based on “hiding” dependence) does not seem to explain the frequent practical failures observed with popular CI tests. We investigate the Kernel-based Conditional Independence (KCI) test – of which we show the Generalized Covariance Measure underlying many recent tests is _nearly_ a special case – and identify the major factors underlying its practical behavior. We highlight the key role of errors in the conditional mean embedding estimate for the Type I error, while pointing out the importance of selecting an appropriate conditioning kernel (not recognized in previous work) as being necessary for good test power but also tending to inflate Type I error.

View full details

Poster

Tradeoffs between Mistakes and ERM Oracle Calls in Online and Transductive Online Learning

Idan Attias ⋅ Steve Hanneke ⋅ Arvind Ramaswami

Dec 4, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

We study online and transductive online learning in settings where the learner can interact with the concept class only via Empirical Risk Minimization (ERM) or weak consistency oracles on arbitrary subsets of the instance domain. This contrasts with standard online models, where the learner has full knowledge of the concept class. The ERM oracle returns a hypothesis that minimizes the loss on a given subset, while the weak consistency oracle returns only a binary signal indicating whether the subset is realizable by a concept in the class. The learner's performance is measured by the number of mistakes and oracle calls. In the standard online setting with ERM access, we establish tight lower bounds in both the realizable and agnostic cases: $\Omega(2^{d_\mathrm{LD}})$ mistakes and $\Omega(\sqrt{T 2^{d_\mathrm{LD}}})$ regret, respectively, where $T$ is the number of timesteps and $d_\mathrm{LD}$ is the Littlestone dimension of the class. We further show how existing results for online learning with ERM access translate to the setting with a weak consistency oracle, at the cost of increasing the number of oracle calls by $O(T)$. We then consider the transductive online model, where the instance sequence is known in advance but labels are revealed sequentially. For general Littlestone classes, we show that the optimal mistake bound in the realizable case and in the agnostic case can be achieved using $O(T^{d_\mathrm{VC}+1})$ weak consistency oracle calls, where $d_\mathrm{VC}$ is the VC dimension of the class. On the negative side, we show that $\Omega(T)$ weak consistency queries are necessary for transductive online learnability, and that $\Omega(T)$ ERM queries are necessary to avoid exponential dependence on the Littlestone dimension. % Finally, for special families of concept classes, we demonstrate how to reduce the number of oracle calls using randomized algorithms while maintaining similar mistake bounds. In particular, for Thresholds on an unknown ordering, $O(\log T)$ ERM queries suffice, and for $k$-Intervals, $O(T^3 2^{2k})$ weak consistency queries suffice.

View full details

Poster

Two‑Stage Learning of Stabilizing Neural Controllers via Zubov Sampling and Iterative Domain Expansion

Haoyu Li ⋅ Xiangru Zhong ⋅ Bin Hu ⋅ Huan Zhang

Dec 4, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Learning-based neural network (NN) control policies have shown impressive empirical performance. However, obtaining stability guarantees and estimates of the region of attraction of these learned neural controllers is challenging due to the lack of stable and scalable training and verification algorithms. Although previous works in this area have achieved great success, much conservatism remains in their frameworks. In this work, we propose a novel two-stage training framework to jointly synthesize a controller and a Lyapunov function for continuous-time systems. By leveraging a Zubov‑inspired region of attraction characterization to directly estimate stability boundaries, we propose a novel training-data sampling strategy and a domain-updating mechanism that significantly reduces the conservatism in training. Moreover, unlike existing works on continuous-time systems that rely on an SMT solver to formally verify the Lyapunov condition, we extend state-of-the-art neural network verifier $\alpha,\beta$-CROWN with the capability of performing automatic bound propagation through the Jacobian of dynamical systems and a novel verification scheme that avoids expensive bisection. To demonstrate the effectiveness of our approach, we conduct numerical experiments by synthesizing and verifying controllers on several challenging nonlinear systems across multiple dimensions. We show that our training can yield region of attractions with volume $5 - 1.5\cdot 10^{5}$ times larger compared to the baselines, and our verification on continuous systems can be up to $40-10{,}000$ times faster compared to the traditional SMT solver dReal. Our code is available at https://github.com/Verified-Intelligence/Two-Stage_Neural_Controller_Training.

View full details

Poster

Adaptive 3D Reconstruction via Diffusion Priors and Forward Curvature-Matching Likelihood Updates

Seunghyeok Shin ⋅ Dabin Kim ⋅ Hongki Lim

Dec 4, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Reconstructing high-quality point clouds from images remains challenging in computer vision. Existing generative models, particularly diffusion models, based approaches that directly learn the posterior may suffer from inflexibility—they require conditioning signals during training, support only a fixed number of input views, and need complete retraining for different measurements. Recent diffusion-based methods have attempted to address this by combining prior models with likelihood updates, but they rely on heuristic fixed step sizes for the likelihood update that lead to slow convergence and suboptimal reconstruction quality. We advance this line of approach by integrating our novel Forward Curvature-Matching (FCM) update method with diffusion sampling. Our method dynamically determines optimal step sizes using only forward automatic differentiation and finite-difference curvature estimates, enabling precise optimization of the likelihood update. This formulation enables high-fidelity reconstruction from both single-view and multi-view inputs, and supports various input modalities through simple operator substitution—all without retraining. Experiments on ShapeNet and CO3D datasets demonstrate that our method achieves superior reconstruction quality at matched or lower NFEs, yielding higher F-score and lower CD and EMD, validating its efficiency and adaptability for practical applications. Code is available at https://github.com/Seunghyeok0715/FCM.

View full details

Poster

ARECHO: Autoregressive Evaluation via Chain-Based Hypothesis Optimization for Speech Multi-Metric Estimation

Jiatong Shi ⋅ Yifan Cheng ⋅ Bo-Hao Su ⋅ Hye-jin Shim ⋅ Jinchuan Tian ⋅ Samuele Cornell ⋅ Yiwen Zhao ⋅ Siddhant Arora ⋅ Shinji Watanabe

Dec 4, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Speech signal analysis poses significant challenges, particularly in tasks such as speech quality evaluation and profiling, where the goal is to predict multiple perceptual and objective metrics. For instance, metrics like PESQ (Perceptual Evaluation of Speech Quality), STOI (Short-Time Objective Intelligibility), and MOS (Mean Opinion Score) each capture different aspects of speech quality. However, these metrics often have different scales, assumptions, and dependencies, making joint estimation non-trivial. To address these issues, we introduce ARECHO (Autoregressive Evaluation via Chain-based Hypothesis Optimization), a chain-based, versatile evaluation system for speech assessment grounded in autoregressive dependency modeling. ARECHO is distinguished by three key innovations: (1) a comprehensive speech information tokenization pipeline; (2) a dynamic classifier chain that explicitly captures inter-metric dependencies; and (3) a two-step confidence-oriented decoding algorithm that enhances inference reliability. Experiments demonstrate that ARECHO significantly outperforms the baseline framework across diverse evaluation scenarios, including enhanced speech analysis, speech generation evaluation, and noisy speech evaluation. Furthermore, its dynamic dependency modeling improves interpretability by capturing inter-metric relationships. Across tasks, ARECHO offers reference-free evaluation using its dynamic classifier chain to support subset queries (single or multiple metrics) and reduces error propagation via confidence-oriented decoding.

View full details

Poster

Complete Structure Guided Point Cloud Completion via Cluster- and Instance-Level Contrastive Learning

Yang Chen ⋅ Yirun Zhou ⋅ Weizhong Zhang ⋅ Cheng Jin

Dec 4, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Point cloud completion, aiming to reconstruct missing part from incomplete point clouds, is a pivotal task in 3D computer vision. Traditional supervised approaches often necessitate complete point clouds for training supervision, which are not readily accessible in real-world applications. Recent studies have attempted to mitigate this dependency by employing self-supervise mechanisms. However, these approaches frequently yield suboptimal results due to the absence of complete structure in the point cloud data during training. To address these issues, in this paper, we propose an effective framework to complete the point cloud under the guidance of self learned complete structure. A key contribution of our work is the development of a novel self-supervised complete structure reconstruction module, which can learn the complete structure explicitly from incomplete point clouds and thus eliminate the reliance on training data from complete point clouds. Additionally, we introduce a contrastive learning approach at both the cluster- and instance-level to extract shape features guided by the complete structure and to capture style features, respectively. This dual-level learning design ensures that the generated point clouds are both shape-completed and detail-preserving. Extensive experiments on both synthetic and real-world datasets demonstrate that our approach significantly outperforms state-of-the-art self-supervised methods.

View full details

Poster

Pass@K Policy Optimization: Solving Harder Reinforcement Learning Problems

Christian Walder ⋅ Deep Tejas Karkhanis

Dec 4, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Reinforcement Learning algorithms commonly sample multiple ($n>1$) solution attempts for each problem and reward them independently. This optimizes for pass@1 performance and prioritizes individual sample performance over the diversity and collective utility of a set of samples. Such algorithms under-utilize the sampling capacity, limiting exploration and eventual improvement on harder examples. As a fix, we propose Pass-at-$k$ Policy Optimization (PKPO), a multivariate transformation on batches of rewards which leads to direct optimization of \passk\ performance, thus optimizing for sets of samples that feature a large maximum reward when considered jointly. Our primary contribution is to derive novel low variance unbiased estimators for the pass@k and its gradient, in both the binary and continuous reward settings. We show that optimizing with these estimators reduces to reinforcement learning with (batches of) rewards that have been jointly transformed by a function that is stable and efficient to compute. While previous efforts propose transformations for $k=n$, our transformations are the first to enable robust optimization of the pass@k for any arbitrary $k \leq n$. Rather than simply trading off pass@1 performance for pass@k gains, our method allows annealing $k$ during training, optimizing both metrics and often achieving strong pass@1 performance alongside significant pass@k gains. We validate our transformations on illustrative toy experiments, which reveal the variance reducing properties of our formulations. We also include real-world examples using the open-source models Gemma and Llama . We find that our transformation effectively optimizes for the target $k$. Furthermore, higher $k$ values enable solving more and harder problems, while annealing $k$ boosts both the pass@1 and pass@k. Crucially, for challenging task sets where conventional pass@1 optimization stalls, our pass@k approach unblocks learning, likely by improving exploration through the prioritization of joint utility over the utility of individual samples

View full details

Poster

Inner Speech as Behavior Guides: Steerable Imitation of Diverse Behaviors for Human-AI coordination

Rakshit Trivedi ⋅ Kartik Sharma ⋅ David Parkes

Dec 4, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Effective human-AI coordination requires artificial agents capable of exhibiting and responding to human-like behaviors while adapting to changing contexts. Imitation learning has emerged as one of the prominent approaches to build such agents by training them to mimic human-demonstrated behaviors. However, current methods struggle to capture the inherent diversity and non-Markovian nature of human behavior and lack the ability to steer behavior at inference time. Drawing inspiration from the theory of human cognitive processes, where inner speech guides action selection before execution, we propose MIMIC (Modeling Inner Motivations for Imitation and Control), a framework that uses language as an internal representation of behavioral intent. MIMIC employs the novel use of vision-language models as linguistic scaffolding to train a conditional variational autoencoder capable of generating inner speech from observations. A diffusion-based behavior cloning policy then selects actions conditioned on current observations and the generated inner speech. MIMIC enables fine-grained steering of behavior at inference time by conditioning the agent on behavior-specific speech. Experiments across robotic manipulation tasks and human-AI collaboration games demonstrate that MIMIC significantly enhances both behavior diversity and fidelity to human demonstrations while enabling nuanced behavioral steering without training on additional demonstrations.

View full details

Poster

Protein Design with Dynamic Protein Vocabulary

Nuowei Liu ⋅ Jiahao Kuang ⋅ Yanting Liu ⋅ Tao Ji ⋅ Changzhi Sun ⋅ Man Lan ⋅ Yuanbin Wu

Dec 4, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Protein design is a fundamental challenge in biotechnology, aiming to design novel sequences with specific functions within the vast space of possible proteins. Recent advances in deep generative models have enabled function-based protein design from textual descriptions, yet struggle with structural plausibility. Inspired by classical protein design methods that leverage natural protein structures, we explore whether incorporating fragments from natural proteins can enhance foldability in generative models. Our empirical results show that even random incorporation of fragments improves foldability. Building on this insight, we introduce ProDVa, a novel protein design approach that integrates a text encoder for functional descriptions, a protein language model for designing proteins, and a fragment encoder to dynamically retrieve protein fragments based on textual functional descriptions. Experimental results demonstrate that our approach effectively designs protein sequences that are both functionally aligned and structurally plausible. Compared to state-of-the-art models, ProDVa achieves comparable function alignment using less than 0.04% of the training data, while designing significantly more well-folded proteins, with the proportion of proteins having pLDDT above 70 increasing by 7.38% and those with PAE below 10 increasing by 9.62%.

View full details

Poster

Fixed-Point RNNs: Interpolating from Diagonal to Dense

Sajad Movahedi ⋅ Felix Sarnthein ⋅ Nicola Muca Cirone ⋅ Antonio Orvieto

Dec 4, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Linear recurrent neural networks (RNNs) and state-space models (SSMs) such as Mamba have become promising alternatives to softmax-attention as sequence mixing layers in Transformer architectures. Current models, however, do not exhibit the full state-tracking expressivity of RNNs because they rely on channel-wise (i.e. diagonal) sequence mixing. In this paper, we investigate parameterizations of a large class of dense linear RNNs as fixed-points of parallelizable diagonal linear RNNs. The resulting models can naturally trade expressivity for efficiency at a fixed number of parameters and achieve state-of-the-art results on the state-tracking benchmarks $A_5$ and $S_5$, while matching performance on copying and other tasks.

View full details

Poster

On the sample complexity of semi-supervised multi-objective learning

Tobias Wegel ⋅ Geelon So ⋅ Junhyung Park ⋅ Fanny Yang

Dec 4, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

In multi-objective learning (MOL), several possibly competing prediction tasks must be solved jointly by a single model. Achieving good trade-offs may require a model class $\mathcal{G}$ with larger capacity than what is necessary for solving the individual tasks. This, in turn, increases the statistical cost, as reflected in known MOL bounds that depend on the complexity of $\mathcal{G}$. We show that this cost is unavoidable for some losses, even in an idealized semi-supervised setting, where the learner has access to the Bayes-optimal solutions for the individual tasks as well as the marginal distributions over the covariates. On the other hand, for objectives defined with Bregman losses, we prove that the complexity of $\mathcal{G}$ may come into play only in terms of unlabeled data. Concretely, we establish sample complexity upper bounds, showing precisely when and how unlabeled data can significantly alleviate the need for labeled data. This is achieved by a simple pseudo-labeling algorithm.

View full details

Poster

OpenCUA: Open Foundations for Computer-Use Agents

Xinyuan Wang ⋅ Bowen Wang ⋅ Dunjie Lu ⋅ Junlin Yang ⋅ Tianbao Xie ⋅ Junli Wang ⋅ Jiaqi Deng ⋅ Xiaole Guo ⋅ Yiheng Xu ⋅ Chen Wu ⋅ Zhennan Shen ⋅ Zhuokai Li ⋅ Ryan Li ⋅ Xiaochuan Li ⋅ Junda Chen ⋅ Boyuan Zheng ⋅ LI PEIHANG ⋅ Fangyu Lei ⋅ Ruisheng Cao ⋅ Yeqiao Fu ⋅ Dongchan Shin ⋅ Martin Shin ⋅ Hu Jiarui ⋅ Yuyan Wang ⋅ Jixuan Chen ⋅ Yuxiao Ye ⋅ Danyang Zhang ⋅ Yipu Wang ⋅ Heng Wang ⋅ Diyi Yang ⋅ Victor Zhong ⋅ Y.Charles ⋅ Zhilin Yang ⋅ Tao Yu

Dec 4, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Vision-language models have demonstrated impressive capabilities as computer-use agents (CUAs) capable of automating diverse computer tasks. As their commercial potential grows, critical details of the most capable CUA systems remain closed. As these agents will increasingly mediate digital interactions and execute consequential decisions on our behalf, the research community needs access to open CUA frameworks to study their capabilities, limitations, and risks. To bridge this gap, we propose OpenCUA, a comprehensive open-source framework for scaling CUA data and foundation models. Our framework consists of: (1) an annotation infrastructure that seamlessly captures human computer-use demonstrations; (2) AgentNet, the first large-scale computer-use task dataset spanning 3 operating systems and 200+ applications and websites; (3) a scalable pipeline that transforms demonstrations into state–action pairs with reflective long Chain-of-Thought reasoning that sustain robust performance gains as data scales. Our end-to-end agent models demonstrate strong performance across CUA benchmarks. In particular, OpenCUA-72B achieves an average success rate of 45.0% on OSWorld‑Verified, establishing a new state-of-the-art (SOTA) among open-source models. Further analysis confirms that our approach generalizes well across domains and benefits significantly from increased test-time computation. We release our annotation tool, datasets, code, and models to build open foundations for further CUA research.

View full details

Poster

An Analysis of Causal Effect Estimation using Outcome Invariant Data Augmentation

Uzair Akbar ⋅ Niki Kilbertus ⋅ Hao Shen ⋅ Krikamol Muandet ⋅ Bo Dai

Dec 4, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

The technique of data augmentation (DA) is often used in machine learning for regularization purposes to better generalize under i.i.d. settings. In this work, we present a unifying framework with topics in causal inference to make a case for the use of DA beyond just the i.i.d. setting, but for generalization across interventions as well. Specifically, we argue that when the outcome generating mechanism is invariant to our choice of DA, then such augmentations can effectively be thought of as interventions on the treatment generating mechanism itself. This can potentially help to reduce bias in causal effect estimation arising from hidden confounders. In the presence of such unobserved confounding we typically make use of instrumental variables (IVs)—sources of treatment randomization that are conditionally independent of the outcome. However, IVs may not be as readily available as DA for many applications, which is the main motivation behind this work. By appropriately regularizing IV based estimators, we introduce the concept of *IV-like (IVL)* regression for mitigating confounding bias and improving predictive performance across interventions even when certain IV properties are relaxed. Finally, we cast parameterized DA as an IVL regression problem and show that when used in composition can simulate a worst-case application of such DA, further improving performance on causal estimation and generalization tasks beyond what simple DA may offer. This is shown both theoretically for the population case and via simulation experiments for the finite sample case using a simple linear example. We also present real data experiments to support our case.

View full details

Poster

HM3: Hierarchical Multi-Objective Model Merging for Pretrained Models

Yu Zhou ⋅ Xingyu Wu ⋅ Jibin Wu ⋅ Liang Feng ⋅ KC Tan

Dec 4, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Model merging is a technique that combines multiple large pretrained models into a single model, enhancing performance and broadening task adaptability without original data or additional training. However, most existing model merging methods focus primarily on exploring the parameter space, merging models with identical architectures. Despite its potential, merging in the architecture space remains in its early stages due to the vast search space and challenges related to layer compatibility. This paper designs a hierarchical model merging framework named HM3, formulating a bilevel multi-objective model merging problem across both parameter and architecture spaces. At the parameter level, HM3 integrates existing merging methods to quickly identify optimal parameters. Based on these, an actor-critic strategy with efficient policy discretization is employed at the architecture level to explore inference paths with Markov property in the layer-granularity search space for reconstructing these optimal models. By training reusable policy and value networks, HM3 learns Pareto optimal models to provide customized solutions for various tasks. Experimental results on language and vision tasks demonstrate that HM3 outperforms methods focusing solely on the parameter or architecture space.

View full details

Poster

The Structure of Relation Decoding Linear Operators in Large Language Models

Miranda Anna Christ ⋅ Adrián Csiszárik ⋅ Gergely Becsó ⋅ Dániel Varga

Dec 4, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

This paper investigates the structure of linear operators introduced in Hernandez et al. [2023] that decode specific relational facts in transformer language models. We extend their single-relation findings to a collection of relations and systematically chart their organization. We show that such collections of relation decoders can be highly compressed by simple order-3 tensor networks without significant loss in decoding accuracy. To explain this surprising redundancy, we develop a cross-evaluation protocol, in which we apply each linear decoder operator to the subjects of every other relation. Our results reveal that these linear maps do not encode distinct relations, but extract recurring, coarse-grained semantic properties (e.g., country of capital city and country of food are both in the country-of-X property). This property-centric structure clarifies both the operators' compressibility and highlights why they generalize only to new relations that are semantically close. Our findings thus interpret linear relational decoding in transformer language models as primarily property-based, rather than relation-specific.

View full details

Poster

UFO: A Unified Approach to Fine-grained Visual Perception via Open-ended Language Interface

Hao Tang ⋅ Chen-Wei Xie ⋅ Haiyang Wang ⋅ Xiaoyi Bao ⋅ Tingyu Weng ⋅ Pandeng Li ⋅ Yun Zheng ⋅ Liwei Wang

Dec 4, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Generalist models have achieved remarkable success in both language and vision-language tasks, showcasing the potential of unified modeling. However, effectively integrating fine-grained perception tasks like detection and segmentation into these models remains a significant challenge. This is primarily because these tasks often rely heavily on task-specific designs and architectures that can complicate the modeling process. To address this challenge, we present UFO, a framework that unifies fine-grained visual perception tasks through an open-ended language interface. By transforming all perception targets into the language space, UFO unifies object-level detection, pixel-level segmentation, and image-level vision-language tasks into a single model. Additionally, we introduce a novel embedding retrieval approach that relies solely on the language interface to support segmentation tasks. Our framework bridges the gap between fine-grained perception and vision-language tasks, significantly simplifying architectural design and training strategies while achieving comparable or superior performance to methods with intricate task-specific designs. After multi-task training on five standard visual perception datasets, UFO outperforms the previous state-of-the-art generalist models by 12.3 mAP on COCO instance segmentation and 3.3 mIoU on ADE20K semantic segmentation. Furthermore, our method seamlessly integrates with existing MLLMs, effectively combining fine-grained perception capabilities with their advanced language abilities, thereby achieving superior performance on the challenging reasoning segmentation. Code and models are available at https://github.com/nnnth/UFO.

View full details

Poster

Variational Learning Finds Flatter Solutions at the Edge of Stability

Avrajit Ghosh ⋅ Bai Cong ⋅ Rio Yokota ⋅ Saiprasad Ravishankar ⋅ Rongrong Wang ⋅ Molei Tao ⋅ Mohammad Emtiyaz Khan ⋅ Thomas Möllenhoff

Dec 4, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Variational Learning (VL) has recently gained popularity for training deep neural networks. Part of its empirical success can be explained by theories such as PAC-Bayes bounds, minimum description length and marginal likelihood, but little has been done to unravel the implicit regularization in play. Here, we analyze the implicit regularization of VL through the Edge of Stability (EoS) framework. EoS has previously been used to show that gradient descent can find flat solutions and we extend this result to show that VL can find even flatter solutions. This result is obtained by controlling the shape of the variational posterior as well as the number of posterior samples used during training. The derivation follows in a similar fashion as in the standard EoS literature for deep learning, by first deriving a result for a quadratic problem and then extending it to deep neural networks. We empirically validate these findings on a wide variety of large networks, such as ResNet and ViT, to find that the theoretical results closely match the empirical ones. Ours is the first work to analyze the EoS dynamics of~VL.

View full details

Poster

EuroSpeech: A Multilingual Speech Corpus

Samuel Pfisterer ⋅ Florian Grötschla ⋅ Luca Lanzendörfer ⋅ Florian Yan ⋅ Roger Wattenhofer

Dec 4, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Recent progress in speech processing has highlighted that high-quality performance across languages requires substantial training data for each individual language. While existing multilingual datasets cover many languages, they often contain insufficient data for each language, leading to models trained on these datasets to exhibit poor performance on most supported languages. Our work addresses this challenge by introducing a scalable pipeline for constructing speech datasets from parliamentary recordings. The proposed pipeline includes robust components for media retrieval and a two-stage alignment algorithm designed to handle non-verbatim transcripts and long-form audio. Applying this pipeline to recordings from 22 European parliaments, we extract over 61k hours of aligned speech segments, achieving substantial per-language coverage with 19 languages exceeding 1k hours and 22 languages exceeding 500 hours of high-quality speech data. We obtain an average 41.8\% reduction in word error rates over baselines when finetuning an existing ASR model on our dataset, demonstrating the usefulness of our approach.

View full details

Poster

Augmenting Biological Fitness Prediction Benchmarks with Landscapes Features from GraphFLA

Mingyu Huang ⋅ Shasha Zhou ⋅ Ke Li

Dec 4, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Machine learning models increasingly map biological sequence-fitness landscapes to predict mutational effects. Effective evaluation of these models requires benchmarks curated from empirical data. Despite their impressive scales, existing benchmarks lack topographical information regarding the underlying fitness landscapes, which hampers interpretation and comparison of model performance beyond averaged scores. Here, we introduce GraphFLA, a Python framework that constructs and analyzes fitness landscapes from diverse modalities (DNA, RNA, protein, and beyond.), accommodating datasets up to millions of mutants. GraphFLA calculates 20 biologically relevant features that characterize 4 fundamental aspects of landscape topography. By applying GraphFLA to over 5,300 landscapes from ProteinGym, RNAGym, and CIS-BP, we demonstrate its utility in interpreting and comparing the performance of dozens of fitness prediction models, highlighting factors influencing model accuracy and respective advantages of different models. Additionally, we release 155 combinatorially complete empirical fitness landscapes, encompassing over 2.2 million sequences across various modalities. All the codes and datasets are available at https://github.com/COLA-Laboratory/GraphFLA.

View full details

Poster

CausalVerse: Benchmarking Causal Representation Learning with Configurable High-Fidelity Simulations

Guangyi Chen ⋅ Yunlong Deng ⋅ Peiyuan Zhu ⋅ Yan Li ⋅ Yifan Shen ⋅ Zijian Li ⋅ Kun Zhang

Dec 4, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Causal Representation Learning (CRL) aims to uncover the data-generating process and identify the underlying causal variables and relations, whose evaluation remains inherently challenging due to the requirement of known ground-truth causal variables and causal structure. Existing evaluations often rely on either simplistic synthetic datasets or downstream performance on real-world tasks, generally suffering a dilemma between realism and evaluative precision. In this paper, we introduce a new benchmark for CRL using high-fidelity simulated visual data that retains both realistic visual complexity and, more importantly, access to ground-truth causal generating processes. The dataset comprises around 200 thousand images and 3 million video frames across 24 sub-scenes in four domains: static image generation, dynamic physical simulations, robotic manipulations, and traffic situation analysis. These scenarios range from static to dynamic settings, simple to complex structures, and single to multi-agent interactions, offering a comprehensive testbed that hopefully bridges the gap between rigorous evaluation and real-world applicability. In addition, we provide flexible access to the underlying causal structures, allowing users to modify or configure them to align with the required assumptions in CRL, such as available domain labels, temporal dependencies, or intervention histories. Leveraging this benchmark, we evaluated representative CRL methods across diverse paradigms and offered empirical insights to assist practitioners and newcomers in choosing or extending appropriate CRL frameworks to properly address specific types of real problems that can benefit from the CRL perspective. Welcome to visit our: Project page: https://causal-verse.github.io/ , Dataset: https://huggingface.co/CausalVerse

View full details

Poster

A Controllable Examination for Long-Context Language Models

Yijun Yang ⋅ Zeyu Huang ⋅ Wenhao Zhu ⋅ Zihan Qiu ⋅ Fei Yuan ⋅ Jeff Pan ⋅ Ivan Titov

Dec 4, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Existing frameworks for evaluating long-context language models (LCLM) can be broadly categorized into real-world applications (e.g, document summarization) and synthetic tasks (e.g, needle-in-a-haystack). Despite their utility, both approaches are accompanied by certain intrinsic limitations. Real-world tasks often involve complexity that makes interpretation challenging and suffer from data contamination, whereas synthetic tasks frequently lack meaningful coherence between the target information ("needle") and its surrounding context ("haystack"), undermining their validity as proxies for realistic applications. In response to these challenges, we posit that an ideal long-context evaluation framework should be characterized by three essential features: 1) seamless context: coherent contextual integration between target information and its surrounding context; 2) controllable setting: an extensible task setup that enables controlled studies—for example, incorporating additional required abilities such as numerical reasoning; and 3) sound evaluation: avoiding LLM-as-Judge and conduct exact-match to ensure deterministic and reproducible evaluation results.This study introduces $\textbf{LongBioBench}$, a benchmark that utilizes artificially generated biographies as a controlled environment for assessing LCLMs across dimensions of $\textit{understanding}$, $\textit{reasoning}$, and $\textit{trustworthiness}$. Our experimental evaluation, which includes $\textbf{18}$ LCLMs in total, demonstrates that most models still exhibit deficiencies in semantic understanding and elementary reasoning over retrieved results and are less trustworthy as context length increases.Our further analysis indicates some design choices employed by existing synthetic benchmarks, such as contextual non-coherence, numerical needles, and the absence of distractors, rendering them vulnerable to test the model's long-context capabilities.Moreover, we also reveal that long-context continual pretraining primarily adjusts RoPE embedding to accommodate extended context lengths, which in turn yields only marginal improvements in the model’s true capabilities. To sum up, compared to previous synthetic benchmarks, LongBioBench achieves a better trade-off between mirroring authentic language tasks and maintaining controllability, and is highly interpretable and configurable.

View full details

Poster

STAR: A Benchmark for Astronomical Star Fields Super-Resolution

WU KUO-CHENG ⋅ Guohang Zhuang ⋅ Jinyang Huang ⋅ Xiang Zhang ⋅ Wanli Ouyang ⋅ Yan Lu

Dec 4, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Super-resolution (SR) advances astronomical imaging by enabling cost-effective high-resolution capture, crucial for detecting faraway celestial objects and precise structural analysis. However, existing datasets for astronomical SR (ASR) exhibit three critical limitations: flux inconsistency, object-crop setting, and insufficient data diversity, significantly impeding ASR development. We propose STAR, a large-scale astronomical SR dataset containing 54,738 flux-consistent star field image pairs covering wide celestial regions. These pairs combine Hubble Space Telescope high-resolution observations with physically faithful low-resolution counterparts generated through a flux-preserving data generation pipeline, enabling systematic development of field-level ASR models. To further empower the ASR community, STAR provides a novel Flux Error (FE) to evaluate SR models in physical view. Leveraging this benchmark, we propose a Flux-Invariant Super Resolution (FISR) model that could accurately infer the flux-consistent high-resolution images from input photometry, suppressing several SR state-of-the-art methods by 24.84% on a novel designed flux consistency metric, showing the priority of our method for astrophysics. Extensive experiments demonstrate the effectiveness of our proposed method and the value of our dataset. Code and models are available at https://github.com/GuoCheng12/STAR.

View full details

Poster

AgentRecBench: Benchmarking LLM Agent-based Personalized Recommender Systems

Yu Shang ⋅ Peijie Liu ⋅ Yuwei Yan ⋅ Zijing Wu ⋅ Leheng Sheng ⋅ Yuanqing Yu ⋅ Chumeng Jiang ⋅ An Zhang ⋅ Fengli Xu ⋅ Yu Wang ⋅ Min Zhang ⋅ Yong Li

Dec 4, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

The emergence of agentic recommender systems powered by Large Language Models (LLMs) represents a paradigm shift in personalized recommendations, leveraging LLMs’ advanced reasoning and role-playing capabilities to enable autonomous, adaptive decision-making. Unlike traditional recommendation approaches, agentic recommender systems can dynamically gather and interpret user-item interactions from complex environments, generating robust recommendation strategies that generalize across diverse scenarios. However, the field currently lacks standardized evaluation protocols to systematically assess these methods. To address this critical gap, we propose: (1) an interactive textual recommendation simulator incorporating rich user and item metadata and three typical evaluation scenarios (classic, evolving-interest, and cold-start recommendation tasks); (2) a unified modular framework for developing agentic recommender systems; and (3) the first comprehensive benchmark comparing over 10 classical and agentic recommendation methods. Our findings demonstrate the superiority of agentic systems and establish actionable design guidelines for their core components. The benchmark environment has been rigorously validated through an open challenge and remains publicly available with a maintained leaderboard at https://tsinghua-fib-lab.github.io/AgentSocietyChallenge/pages/overview.html. The benchmark is available at: https://huggingface.co/datasets/SGJQovo/AgentRecBench.

View full details

Poster

Measuring Fingerprints of Web-filtered Text Datasets and Fingerprint Propagation Through Training

Youssef Mansour ⋅ Reinhard Heckel

Dec 4, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

We investigate fingerprints in pretraining datasets for large language models (LLMs) through dataset classification experiments. Building on prior work demonstrating the existence of fingerprints or biases in popular computer vision datasets, we analyze popular open-source pretraining datasets for LLMs derived from CommonCrawl including C4, RefinedWeb, DolmaCC, RedPajama-V2, FineWeb, and DCLM-Baseline. Despite those datasets being obtained with similar curation steps, neural networks can classify surprisingly well which dataset a single text sequence belongs to, significantly better than a human can. This indicates that small differences in filtering and processing pipelines induce fingerprints, that we find are evident in formatting, vocabulary, and content distributions. Such fingerprints can negatively impact cross-dataset generalization. Additionally, we show that these fingerprints propagate through training: sequences generated by models trained on those datasets can be accurately classified by a classifier trained on the original datasets. This can offer insights into data characteristics that are typically undisclosed by LLM developers, including pretraining mixture proportions and finetuning data sources.

View full details

Poster

FineGRAIN: Evaluating Failure Modes of Text-to-Image Models with Vision Language Model Judges

Kevin Hayes ⋅ Micah Goldblum ⋅ Vikash Sehwag ⋅ Gowthami Somepalli ⋅ Ashwinee Panda ⋅ Tom Goldstein

Dec 4, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Text-to-image (T2I) models are capable of generating visually impressive images, yet they often fail to accurately capture specific attributes in user prompts, such as the correct number of objects with the specified colors. The diversity of such errors underscores the need for a hierarchical evaluation framework that can compare prompt adherence abilities of different image generation models. Simultaneously, benchmarks of vision language models (VLMs) have not kept pace with the complexity of scenes that VLMs are used to annotate. In this work, we propose a structured methodology for jointly evaluating T2I models and VLMs by testing whether VLMs can identify 27 specific failure modes in the images generated by T2I models conditioned on challenging prompts. Our second contribution is a dataset of prompts and images generated by 5 T2I models (Flux, SD3-Medium, SD3-Large, SD3.5-Medium, SD3.5-Large) and the corresponding annotations from VLMs (Molmo, InternVL3, Pixtral) annotated by an LLM (Llama3) to test whether VLMs correctly identify the failure mode in a generated image. By analyzing failure modes on a curated set of prompts, we reveal systematic errors in attribute fidelity and object representation. Our findings suggest that current metrics are insufficient to capture these nuanced errors, highlighting the importance of targeted benchmarks for advancing generative model reliability and interpretability.

View full details

Poster

Towards Understanding Camera Motions in Any Video

Zhiqiu Lin ⋅ Siyuan Cen ⋅ Daniel Jiang ⋅ Jay Karhade ⋅ Hewei Wang ⋅ Chancharik Mitra ⋅ Yu Tong Tiffany Ling ⋅ Yuhan Huang ⋅ Rushikesh Zawar ⋅ Xue Bai ⋅ Yilun Du ⋅ Chuang Gan ⋅ Deva Ramanan

Dec 4, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

We introduce CameraBench, a large-scale dataset and benchmark designed to assess and improve camera motion understanding. CameraBench consists of ~3,000 diverse internet videos, annotated by experts through a rigorous multi-stage quality control process. One of our core contributions is a taxonomy or "language" of camera motion primitives, designed in collaboration with cinematographers. We find, for example, that some motions like "follow" (or tracking) require understanding scene content like moving subjects. We conduct a large-scale human study to quantify human performance, revealing that domain expertise and tutorial-based training can significantly enhance accuracy. For example, a novice may confuse zoom-in (a change of intrinsics) with translating forward (a change of extrinsics), but can be trained to differentiate the two. Using CameraBench, we evaluate Structure-from-Motion (SfM) and Video-Language Models (VLMs), finding that SfM models struggle to capture semantic primitives that depend on scene content, while generative VLMs struggle to capture geometric primitives that require precise estimation of trajectories. We then fine-tune a generative VLM on CameraBench to achieve the best of both worlds and showcase its applications, including motion-augmented captioning, video question answering, and video-text retrieval. We hope our taxonomy, benchmark, and tutorials will drive future efforts towards the ultimate goal of understanding camera motions in any video.

View full details

Poster

TabArena: A Living Benchmark for Machine Learning on Tabular Data

Nick Erickson ⋅ Lennart Purucker ⋅ Andrej Tschalzev ⋅ David Holzmüller ⋅ Prateek Desai ⋅ David Salinas ⋅ Frank Hutter

Dec 4, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

With the growing popularity of deep learning and foundation models for tabular data, the need for standardized and reliable benchmarks is higher than ever. However, current benchmarks are static. Their design is not updated even if flaws are discovered, model versions are updated, or new models are released. To address this, we introduce TabArena, the first continuously maintained living tabular benchmarking system. To launch TabArena, we manually curate a representative collection of datasets and well-implemented models, conduct a large-scale benchmarking study to initialize a public leaderboard, and assemble a team of experienced maintainers. Our results highlight the influence of validation method and ensembling of hyperparameter configurations to benchmark models at their full potential. While gradient-boosted trees are still strong contenders on practical tabular datasets, we observe that deep learning methods have caught up under larger time budgets with ensembling. At the same time, foundation models excel on smaller datasets. Finally, we show that ensembles across models advance the state-of-the-art in tabular machine learning. We observe that some deep learning models are overrepresented in cross-model ensembles due to validation set overfitting, and we encourage model developers to address this issue. We launch TabArena with a public leaderboard, reproducible code, and maintenance protocols to create a living benchmark available at https://tabarena.ai.

View full details

Poster

UniRelight: Learning Joint Decomposition and Synthesis for Video Relighting

Kai He ⋅ Ruofan Liang ⋅ Jacob Munkberg ⋅ Jon Hasselgren ⋅ Nandita Vijaykumar ⋅ Alexander Keller ⋅ Sanja Fidler ⋅ Igor Gilitschenski ⋅ Zan Gojcic ⋅ Zian Wang

Dec 4, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

We address the challenge of relighting a single image or video, a task that demands precise scene intrinsic understanding and high-quality light transport synthesis. Existing end-to-end relighting models are often limited by the scarcity of paired multi-illumination data, restricting their ability to generalize across diverse scenes. Conversely, two-stage pipelines that combine inverse and forward rendering can mitigate data requirements but are susceptible to error accumulation and often fail to produce realistic outputs under complex lighting conditions or with sophisticated materials. In this work, we introduce a general-purpose approach that jointly estimates albedo and synthesizes relit outputs in a single pass, harnessing the generative capabilities of video diffusion models. This joint formulation enhances implicit scene comprehension and facilitates the creation of realistic lighting effects and intricate material interactions, such as shadows, reflections, and transparency. Trained on synthetic multi-illumination data and extensive automatically labeled real-world videos, our model demonstrates strong generalization across diverse domains and surpasses previous methods in both visual fidelity and temporal consistency. Our project page is https://research.nvidia.com/labs/toronto-ai/UniRelight/.

View full details

Poster

Why Do Multi-Agent LLM Systems Fail?

Mert Cemri ⋅ Melissa Z Pan ⋅ Shuyi Yang ⋅ Lakshya A Agrawal ⋅ Bhavya Chopra ⋅ Rishabh Tiwari ⋅ Kurt Keutzer ⋅ Aditya Parameswaran ⋅ Dan Klein ⋅ Kannan Ramchandran ⋅ Matei A Zaharia ⋅ Joseph Gonzalez ⋅ Ion Stoica

Dec 4, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Despite enthusiasm for Multi-Agent LLM Systems (MAS), their performance gains on popular benchmarks are often minimal. This gap highlights a critical need for a principled understanding of why MAS fail. Addressing this question requires systematic identification and analysis of failure patterns. We introduce MAST-Data, a comprehensive dataset of 1600+ annotated traces collected across 7 popular MAS frameworks. MAST-Data is the first multi-agent system dataset to outline the failure dynamics in MAS for guiding the development of better future systems. To enable systematic classification of failures for MAST-Data, we build the first Multi-Agent System Failure Taxonomy (MAST). We develop MAST through rigorous analysis of 150 traces, guided closely by expert human annotators andvalidated by high inter-annotator agreement (κ = 0.88). This process identifies 14 unique modes, clustered into 3 categories: (i) system design issues, (ii) inter-agent misalignment, and (iii) task verification. To enable scalable annotation, we develop an LLM-as-a-Judge pipeline with high agreement with human annotations. We leverage MAST and MAST-Data to analyze failure patterns across models (GPT4, Claude 3, Qwen2.5, CodeLlama) and tasks (coding, math, general agent), demonstrating improvement headrooms from better MAS design. Our analysis provides insights revealing that identified failures require more sophisticated solutions, highlighting a clear roadmap for future research. We publicly release our comprehensive dataset (MAST-Data), the MAST, and our LLM annotator to facilitate widespread research and development in MAS.

View full details

Poster

GUARD: Constructing Realistic Two-Player Matrix and Security Games for Benchmarking Game-Theoretic Algorithms

Noah Krever ⋅ Jakub Cerny ⋅ Moise Blanchard ⋅ Christian Kroer

Dec 4, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Game-theoretic algorithms are commonly benchmarked on recreational games, classical constructs from economic theory such as congestion and dispersion games, or entirely random game instances. While the past two decades have seen the rise of security games -- grounded in real-world scenarios like patrolling and infrastructure protection -- their practical evaluation has been hindered by limited access to the datasets used to generate them. In particular, although the structural components of these games (e.g., patrol paths derived from maps) can be replicated, the critical data defining target values -- central to utility modeling -- remain inaccessible. In this paper, we introduce a flexible framework that leverages open-access datasets to generate realistic matrix and security game instances. These include animal movement data for modeling anti-poaching scenarios and demographic and infrastructure data for infrastructure protection. Our framework allows users to customize utility functions and game parameters, while also offering a suite of preconfigured instances. We provide theoretical results highlighting the degeneracy and limitations of benchmarking on random games, and empirically compare our generated games against random baselines across a variety of standard algorithms for computing Nash and Stackelberg equilibria, including linear programming, incremental strategy generation, and self-play with no-regret learners.

View full details

Poster

Towards Physics-informed Spatial Intelligence with Human Priors: An Autonomous Driving Pilot Study

Guanlin (Frank) Wu ⋅ Boyan Su ⋅ Yang Zhao ⋅ Pu Wang ⋅ Yichen Lin ⋅ Hao Frank Yang

Dec 4, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

How to integrate and verify spatial intelligence in foundation models remains an open challenge. Current practice often proxies Visual-Spatial Intelligence (VSI) with purely textual prompts and VQA-style scoring, which obscures geometry, invites linguistic shortcuts, and weakens attribution to genuinely spatial skills. We introduce Spatial Intelligence Grid (SIG): a structured, grid-based schema that explicitly encodes object layouts, inter-object relations, and physically grounded priors. As a complementary channel to text, SIG provides a faithful, compositional representation of scene structure for foundation-model reasoning. Building on SIG, we derive SIG-informed evaluation metrics that quantify a model’s intrinsic VSI, which separates spatial capability from language priors. In few-shot in-context learning with state-of-the-art multimodal LLMs (e.g. GPT- and Gemini-family models), SIG yields consistently larger, more stable, and more comprehensive gains across all VSI metrics compared to VQA-only representations, indicating its promise as a data-labeling and training schema for learning VSI. We also release SIGBench, a benchmark of 1.4K driving frames annotated with ground-truth SIG labels and human gaze traces, supporting both grid-based machine VSI tasks and attention-driven, human-like VSI tasks in autonomous-driving scenarios.

View full details

Poster

AgentBreeder: Mitigating the AI Safety Risks of Multi-Agent Scaffolds via Self-Improvement

J Rosser ⋅ Jakob Foerster

Dec 4, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Scaffolding Large Language Models (LLMs) into multi-agent systems often improves performance on complex tasks, but the safety impact of such scaffolds has not been thoroughly explored. We introduce AgentBreeder, a framework for multi-objective self-improving evolutionary search over scaffolds. We evaluate discovered scaffolds on widely recognized reasoning, mathematics, and safety benchmarks and compare them with popular baselines. In "blue" mode, we see a 79.4% average uplift in safety benchmark performance while maintaining or improving capability scores. In "red" mode, we find adversarially weak scaffolds emerging concurrently with capability optimization. Our work demonstrates the risks of multi-agent scaffolding and provides a framework for mitigating them. Code is available at \url{https://github.com/jrosseruk/AgentBreeder}.

View full details

Poster

SafeVLA: Towards Safety Alignment of Vision-Language-Action Model via Constrained Learning

Borong Zhang ⋅ Yuhao Zhang ⋅ Jiaming Ji ⋅ Yingshan Lei ⋅ Juntao Dai ⋅ Yuanpei Chen ⋅ Yaodong Yang

Dec 4, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Vision-language-action models (VLAs) show potential as generalist robot policies. However, these models pose extreme safety challenges during real-world deployment, including the risk of harm to the environment, the robot itself, and humans. *How can safety constraints be explicitly integrated into VLAs?* We address this by exploring an integrated safety approach (ISA), systematically **modeling** safety requirements, then actively **eliciting** diverse unsafe behaviors, effectively **constraining** VLA policies via safe reinforcement learning, and rigorously **assuring** their safety through targeted evaluations. Leveraging the constrained Markov decision process (CMDP) paradigm, ISA optimizes VLAs from a min-max perspective against elicited safety risks. Thus, policies aligned through this comprehensive approach achieve the following key features: (I) effective **safety-performance trade-offs**, reducing the cumulative cost of safety violations by 83.58\% compared to the state-of-the-art method, while also maintaining task success rate (+3.85\%). (II) strong **safety assurance**, with the ability to mitigate long-tail risks and handle extreme failure scenarios. (III) robust **generalization** of learned safety behaviors to various out-of-distribution perturbations. The effectiveness is evaluated on long-horizon mobile manipulation tasks.

View full details

Poster

FlashMD: long-stride, universal prediction of molecular dynamics

Filippo Bigi ⋅ Sanggyu Chong ⋅ Agustinus Kristiadi ⋅ Michele Ceriotti

Dec 4, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Molecular dynamics (MD) provides insights into atomic-scale processes by integrating over time the equations that describe the motion of atoms under the action of interatomic forces. Machine learning models have substantially accelerated MD by providing inexpensive predictions of the forces, but they remain constrained to minuscule time integration steps, which are required by the fast time scale of atomic motion. In this work, we propose FlashMD, a method to predict the evolution of positions and momenta over strides that are between one and two orders of magnitude longer than typical MD time steps. We incorporate considerations on the mathematical and physical properties of Hamiltonian dynamics in the architecture, generalize the approach to allow the simulation of any thermodynamic ensemble, and carefully assess the possible failure modes of a direct MD approach. We validate FlashMD’s accuracy in reproducing equilibrium and time‐dependent properties, using both system‐specific and general-purpose models, extending the ability of MD simulation to reach the long time scales needed to model microscopic processes of high scientific and technological relevance.

View full details

Poster

PoE-World: Compositional World Modeling with Products of Programmatic Experts

Top Piriyakulkij ⋅ Yichao Liang ⋅ Hao Tang ⋅ Adrian Weller ⋅ Marta Kryven ⋅ Kevin Ellis

Dec 4, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Learning how the world works is central to building AI agents that can adapt to complex environments. Traditional world models based on deep-learning demand vast amounts of training data, and do not flexibly update their knowledge from sparse observations. Recent advances in program synthesis using Large Language Models (LLMs) give an alternate approach which learns world models represented as source code, supporting strong generalization from little data. To date, application of program-structured world models remains limited to natural language and grid-world domains. We introduce a novel program synthesis method for effectively modeling complex, non-gridworld domains by representing a world model as an exponentially-weighted product of programmatic experts (PoE-World) synthesized by LLMs. We show that this approach can learn complex, stochastic world models from just a few observations. We evaluate the learned world models by embedding them in a model-based planning agent, demonstrating efficient performance and generalization to unseen levels on Atari's Pong and Montezuma's Revenge.

View full details

Poster

Aggregation Hides Out-of-Distribution Generalization Failures from Spurious Correlations

Olawale Salaudeen ⋅ Haoran Zhang ⋅ Kumail Alhamoud ⋅ Sara Beery ⋅ Marzyeh Ghassemi

Dec 4, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Benchmarks for out-of-distribution (OOD) generalization often reveal a strong positive correlation between in-distribution (ID) and OOD accuracy across models, a phenomenon known as “accuracy-on-the-line.” This pattern is commonly interpreted as evidence that spurious correlations—relationships that improve ID but harm OOD performance—are rare in practice. We show that this positive correlation can be an artifact of aggregating heterogeneous OOD examples. Using a simple gradient-based method, OODSelect, we identify semantically coherent OOD subsets where accuracy-on-the-line breaks down. Across widely used distribution-shift benchmarks, OODSelect uncovers subsets—sometimes comprising more than half of the standard OOD set—where higher ID accuracy predicts lower OOD accuracy. These results suggest that aggregate metrics can mask critical failure modes in OOD robustness. We release code and the identified subsets to support further research.

View full details

Poster

Online Functional Tensor Decomposition via Continual Learning for Streaming Data Completion

Xi Zhang ⋅ Yanyi Li ⋅ Yisi Luo ⋅ Qi Xie ⋅ Deyu Meng

Dec 4, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Online tensor decompositions are powerful and proven techniques that address the challenges in processing high-velocity streaming tensor data, such as traffic flow and weather system. The main aim of this work is to propose a novel online functional tensor decomposition (OFTD) framework, which represents a spatial-temporal continuous function using the CP tensor decomposition parameterized by coordinate-based implicit neural representations (INRs). The INRs allow for natural characterization of continually expanded streaming data by simply adding new coordinates into the network. Particularly, our method transforms the classical online tensor decomposition algorithm into a more dynamic continual learning paradigm of updating the INR weights to fit the new data without forgetting the previous tensor knowledge. To this end, we introduce a long-tail memory replay method that adapts to the local continuity property of INR. Extensive experiments for streaming tensor completion using traffic, weather, user-item, and video data verify the effectiveness of the OFTD approach for streaming data analysis. This endeavor serves as a pivotal inspiration for future research to connect classical online tensor tools with continual learning paradigms to better explore knowledge underlying streaming tensor data.

View full details

Poster

Transformer Copilot: Learning from The Mistake Log in LLM Fine-tuning

Jiaru Zou ⋅ Yikun Ban ⋅ Zihao Li ⋅ Yunzhe Qi ⋅ Ruizhong Qiu ⋅ Ling Yang ⋅ Jingrui He

Dec 4, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Large language models are typically adapted to downstream tasks through supervised fine-tuning on domain-specific data. While standard fine-tuning focuses on minimizing generation loss to optimize model parameters, we take a deeper step by retaining and leveraging the model’s own learning signals, analogous to how human learners reflect on past mistakes to improve future performance. We first introduce the concept of Mistake Log to systematically track the model’s learning behavior and recurring errors throughout fine-tuning. Treating the original transformer-based model as the Pilot, we correspondingly design a Copilot model to refine the Pilot’s inference performance via logits rectification. We name the overall Pilot-Copilot framework the Transformer Copilot, which introduces (i) a novel Copilot model design, (ii) a joint training paradigm where the Copilot continuously learns from the evolving Mistake Log alongside the Pilot, and (iii) a fused inference paradigm where the Copilot rectifies the Pilot’s logits for enhanced generation. We provide both theoretical and empirical analyses on our new learning framework. Experiments on 12 benchmarks spanning commonsense, arithmetic, and recommendation tasks demonstrate that Transformer Copilot consistently improves performance by up to 34.5%, while introducing marginal computational overhead to Pilot models and exhibiting strong scalability and transferability. Our code is released at https://github.com/jiaruzouu/TransformerCopilot.

View full details

Poster

DexGarmentLab: Dexterous Garment Manipulation Environment with Generalizable Policy

Yuran Wang ⋅ Ruihai Wu ⋅ Yue Chen ⋅ Jiarui Wang ⋅ Jiaqi Liang ⋅ Ziyu Zhu ⋅ Haoran Geng ⋅ Jitendra Malik ⋅ Pieter Abbeel ⋅ Hao Dong

Dec 4, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Garment manipulation is a critical challenge due to the diversity in garment categories, geometries, and deformations. Despite this, humans can effortlessly handle garments, thanks to the dexterity of our hands. However, existing research in the field has struggled to replicate this level of dexterity, primarily hindered by the lack of realistic simulations of dexterous garment manipulation. Therefore, we propose DexGarmentLab, the first environment specifically designed for dexterous (especially bimanual) garment manipulation, which features large-scale high-quality 3D assets for 15 task scenarios, and refines simulation techniques tailored for garment modeling to reduce the sim-to-real gap. Previous data collection typically relies on teleoperation or training expert reinforcement learning (RL) policies, which are labor-intensive and inefficient. In this paper, we leverage garment structural correspondence to automatically generate a dataset with diverse trajectories using only a single expert demonstration, significantly reducing manual intervention. However, even extensive demonstrations cannot cover the infinite states of garments, which necessitates the exploration of new algorithms. To improve generalization across diverse garment shapes and deformations, we propose a Hierarchical gArment-manipuLation pOlicy (HALO). It first identifies transferable affordance points to accurately locate the manipulation area, then generates generalizable trajectories to complete the task. Through extensive experiments and detailed analysis of our method and baseline, we demonstrate that HALO consistently outperforms existing methods, successfully generalizing to previously unseen instances even with significant variations in shape and deformation where others fail. Our project page is available at: https://wayrise.github.io/DexGarmentLab/.

View full details

Poster

Broken Tokens? Your Language Model can Secretly Handle Non-Canonical Tokenizations

Brian Zheng ⋅ Alisa Liu ⋅ Orevaoghene Ahia ⋅ Jonathan Hayase ⋅ Yejin Choi ⋅ Noah Smith

Dec 4, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Modern tokenizers employ deterministic algorithms to map text into a single ``canonical" token sequence, yet the same string can be encoded as many non-canonical tokenizations using the language model vocabulary, including tokenizing by character. In this paper, we investigate the robustness of LMs to input encoded with non-canonical tokenizations entirely unseen during training. Surprisingly, when evaluated across 20 benchmarks, we find that instruction-tuned models retain up to 93.4\% of their original performance when given a randomly sampled tokenization, and 90.8\% with character-level tokenization. We find that overall stronger models tend to be more robust, and that robustness diminishes as the tokenization departs farther from the canonical form. Motivated by these results, we identify settings where non-canonical tokenization schemes can \textit{improve} performance, finding that character‑level segmentation improves string manipulation and code understanding tasks by up to 15\%, and right‑aligned digit grouping enhances large‑number arithmetic by over 33\%. Finally, we investigate the source of this robustness, finding that it arises in the instruction-tuning phase. We provide evidence that both base and post-trained models grasp the semantics of non-canonical tokenizations (perceiving them as containing misspellings). However, base models try to mimic the imagined mistakes and degenerate into nonsensical output, while post-trained models are committed to fluent responses. Overall, our findings suggest that models are less committed to their tokenizer than previously believed, and highlight the promise of intervening on tokenization at inference time to boost language model performance.

View full details

Poster

RoFt-Mol: Benchmarking Robust Fine-tuning with Molecular Graph Foundation Models

Shikun Liu ⋅ Deyu Zou ⋅ Nima Shoghi ⋅ Victor Fung ⋅ Kai Liu ⋅ Pan Li

Dec 4, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

In the era of foundation models, fine-tuning pre-trained models for specific downstream tasks has become crucial. This drives the need for robust fine-tuning methods to address challenges such as model overfitting and sparse labeling. Moleculargraph foundation models (MGFMs) face unique difficulties that complicate fine-tuning. These models are limited by smaller pre-training datasets and more severedata scarcity for downstream tasks, both of which require enhanced model generalization. Moreover, MGFMs must accommodate diverse objectives, including bothregression and classification tasks. To better understand and improve fine-tuningtechniques under these conditions, we classify eight fine-tuning methods into threemechanisms: weight-based, representation-based, and partial fine-tuning. Webenchmark these methods on downstream regression and classification tasks acrosssupervised and self-supervised pre-trained models in diverse labeling settings. Thisextensive evaluation provides valuable insights and informs the design of a refinedrobust fine-tuning method, ROFT-MOL. This approach combines the strengths ofsimple post-hoc weight interpolation with more complex weight ensemble fine-tuning methods, delivering improved performance across both task types whilemaintaining the ease of use inherent in post-hoc weight interpolation.

View full details

Poster

Measuring and Guiding Monosemanticity

Ruben Härle ⋅ Felix Friedrich ⋅ Manuel Brack ⋅ Björn Deiseroth ⋅ Stephan Waeldchen ⋅ Patrick Schramowski ⋅ Kristian Kersting

Dec 4, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

There is growing interest in leveraging mechanistic interpretability and controllability to better understand and influence the internal dynamics of large language models (LLMs). However, current methods face fundamental challenges in reliably localizing and manipulating feature representations. Sparse Autoencoders (SAEs) have recently emerged as a promising direction for feature extraction at scale, yet they, too, are limited by incomplete feature isolation and unreliable monosemanticity. To systematically quantify these limitations, we introduce Feature Monosemanticity Score (FMS), a novel metric to quantify feature monosemanticity in latent representation. Building on these insights, we propose Guided Sparse Autoencoders (G-SAE), a method that conditions latent representations on labeled concepts during training. We demonstrate that reliable localization and disentanglement of target concepts within the latent space improve interpretability, detection of behavior, and control. Specifically, our evaluations on toxicity detection, writing style identification, and privacy attribute recognition show that G-SAE not only enhances monosemanticity but also enables more effective and fine-grained steering with less quality degradation. Our findings provide actionable guidelines for measuring and advancing mechanistic interpretability and control of LLMs.

View full details

Poster

ModHiFi: Identifying High Fidelity predictive components for Model Modification

Dhruva Kashyap ⋅ Chaitanya Murti ⋅ Pranav K Nayak ⋅ Tanay Narshana ⋅ Chiranjib Bhattacharyya

Dec 4, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Open weight models, which are ubiquitous, rarely provide access to their training data or loss function. This makes modifying such models for tasks such as pruning or unlearning, which are constrained by this unavailability, an active area of research. Existing techniques typically require gradients or ground-truth labels, rendering them infeasible in settings with limited computational resources. In this work, we investigate the fundamental question of identifying components that are critical to the model's predictive performance, without access to either gradients or the loss function, and with only distributional access such as synthetic data. We theoretically demonstrate that the global error is linearly bounded by local reconstruction errors for Lipschitz-continuous networks such as CNNs and well-trained Transformers (which, contrary to existing literature, we find exhibit Lipschitz continuity). This motivates using the locally reconstructive behavior of component subsets to quantify their global importance, via a metric that we term *Subset Fidelity*. In the uncorrelated features setting, selecting individual components based on their Subset Fidelity scores is optimal, which we utilize to propose **ModHiFi**, an algorithm for model modification that requires neither training data nor access to a loss function. **ModHiFi-P**, for structured pruning, achieves an 11\% speedup over the current state of the art on ImageNet models and competitive performance on language models. **ModHiFi-U**, for classwise unlearning, achieves complete unlearning on CIFAR-10 without fine-tuning and demonstrates competitive performance on Swin Transformers.

View full details

Poster

Discrete Spatial Diffusion: Intensity-Preserving Diffusion Modeling

Javier E. Santos ⋅ Agnese Marcato ⋅ Roman Colman ⋅ NIcholas Lubbers ⋅ Yen Ting Lin

Dec 4, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Generative diffusion models have achieved remarkable success in producing high-quality images. However, these models typically operate in continuous intensity spaces, diffusing independently across pixels and color channels. As a result, they are fundamentally ill-suited for applications involving inherently discrete quantities such as particle counts or material units, that are constrained by strict conservation laws like mass conservation, limiting their applicability in scientific workflows. To address this limitation, we propose Discrete Spatial Diffusion (DSD), a framework based on a continuous-time, discrete-state jump stochastic process that operates directly in discrete spatial domains while strictly preserving particle counts in both forward and reverse diffusion processes. By using spatial diffusion to achieve particle conservation, we introduce stochasticity naturally through a discrete formulation. We demonstrate the expressive flexibility of DSD by performing image synthesis, class conditioning, and image inpainting across standard image benchmarks, while exactly conditioning total image intensity. We validate DSD on two challenging scientific applications: porous rock microstructures and lithium-ion battery electrodes, demonstrating its ability to generate structurally realistic samples under strict mass conservation constraints, with quantitative evaluation using state-of-the-art metrics for transport and electrochemical performance.

View full details

Poster

Ambient Diffusion Omni: Training Good Models with Bad Data

Giannis Daras ⋅ Adrian Rodriguez-Munoz ⋅ Adam Klivans ⋅ Antonio Torralba ⋅ Constantinos Daskalakis

Dec 4, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

We show how to use low-quality, synthetic, and out-of-distribution images to improve the quality of a diffusion model. Typically, diffusion models are trained on curated datasets that emerge from highly filtered data pools from the Web and other sources. We show that there is immense value in the lower-quality images that are often discarded. We present Ambient Diffusion Omni, a simple, principled framework to train diffusion models that can extract signal from arbitrarily images during training. Our framework exploits two properties of natural images -- spectral power law decay and locality. We first validate our framework by successfully training diffusion models with images synthetically corrupted by Gaussian blur, JPEG compression, and motion blur. We use our framework to achieve state-of-the-art ImageNet FID and we show significant improvements in both image quality and diversity for text-to-image generative modeling. The core insight is that noise dampens the initial skew between the desired high-quality distribution and the mixed distribution we actually observe. We provide rigorous theoretical justification for our approach by analyzing the trade-off between learning from biased data versus limited unbiased data across diffusion times.

View full details

Poster

Achieving $\tilde{\mathcal{O}}(1/N)$ Optimality Gap in Restless Bandits through Gaussian Approximation

Chen YAN ⋅ Weina Wang ⋅ Lei Ying

Dec 4, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

We study the finite-horizon Restless Multi-Armed Bandit (RMAB) problem with $N$ homogeneous arms. Prior work has shown that when an RMAB satisfies a non-degeneracy condition, Linear-Programming-based (LP-based) policies derived from the fluid approximation, which captures the mean dynamics of the system, achieve an exponentially small optimality gap. However, it is common for RMABs to be degenerate, in which case LP-based policies can result in a $\Theta(1/\sqrt{N})$ optimality gap per arm. In this paper, we propose a novel Stochastic-Programming-based (SP-based) policy that, under a uniqueness assumption, achieves an $\tilde{\mathcal{O}}(1/N)$ optimality gap for degenerate RMABs. Our approach is based on the construction of a Gaussian stochastic system that captures not only the mean but also the variance of the RMAB dynamics, resulting in a more accurate approximation than the fluid approximation. We then solve a stochastic program for this system to obtain our policy. This is the first result to establish an $\tilde{\mathcal{O}}(1/N)$ optimality gap for degenerate RMABs.

View full details

Poster

SGCD: Stain-Guided CycleDiffusion for Unsupervised Domain Adaptation of Histopathology Image Classification

Hsi-Ling Chen ⋅ Chun-Shien Lu ⋅ Pau-Choo Chung

Dec 4, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

The effectiveness of domain translation in addressing image-based problems of Unsupervised Domain Adaptation (UDA) depends on the quality of the translated images and the preservation of crucial discriminative features. However, achieving high-quality and stable translations typically requires paired data, which poses a challenge in scenarios with limited annotations in the target domain. To address this issue, this paper proposes a novel method termed Stain-Guided Cycle Diffusion (SGCD), employing a dual diffusion model with bidirectional generative constraints to synthesize highly realistic data for downstream task fine-tuning. The bidirectional generative constraints ensure that the translated images retain the features critical to the downstream model in properly controlling the generation process. Additionally, a stain-guided consistency loss is introduced to enhance the denoising capability of the dual diffusion model, thereby improving the quality of images translated between different domains using latents from one domain and a diffusion model trained on another. Experiments conducted on four public datasets demonstrate that SGCD can effectively enhance the performance of downstream task models on the target domain.

View full details

Poster

Some Optimizers are More Equal: Understanding the Role of Optimizers in Group Fairness

Mojtaba Kolahdouzi ⋅ Hatice Gunes ⋅ Ali Etemad

Dec 4, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

We study whether and how the choice of optimization algorithm can impact group fairness in deep neural networks. Through stochastic differential equation analysis of optimization dynamics in an analytically tractable setup, we demonstrate that the choice of optimization algorithm indeed influences fairness outcomes, particularly under severe imbalance. Furthermore, we show that when comparing two categories of optimizers, adaptive methods and stochastic methods, RMSProp (from the adaptive category) has a higher likelihood of converging to fairer minima than SGD (from the stochastic category). Building on this insight, we derive two new theoretical guarantees showing that, under appropriate conditions, RMSProp exhibits fairer parameter updates and improved fairness in a single optimization step compared to SGD. We then validate these findings through extensive experiments on three publicly available datasets, namely CelebA, FairFace, and MS-COCO, across different tasks as facial expression recognition, gender classification, and multi-label classification, using various backbones. Considering multiple fairness definitions including equalized odds, equal opportunity, and demographic parity, adaptive optimizers like RMSProp and Adam consistently outperform SGD in terms of group fairness, while maintaining comparable predictive accuracy. Our results highlight the role of adaptive updates as a crucial yet overlooked mechanism for promoting fair outcomes. We release the source code at: https://github.com/Mkolahdoozi/Some-Optimizers-Are-More-Equal.

View full details

Poster

Fire360: A Benchmark for Robust Perception and Episodic Memory in Degraded 360° Firefighting Video

Aditi Tiwari ⋅ Farzaneh Masoud ⋅ Dac Nguyen ⋅ Jill Kraft ⋅ Heng Ji ⋅ Klara Nahrstedt

Dec 4, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Modern AI systems struggle most in environments where reliability is critical - scenes with smoke, poor visibility, and structural deformation. Each year, tens of thousands of firefighters are injured on duty, often due to breakdowns in situational perception. We introduce Fire360, a benchmark for evaluating perception and reasoning in safety-critical firefighting scenarios. The dataset includes 228 360° videos from professional training sessions under diverse conditions (e.g., low light, thermal distortion), annotated with action segments, object locations, and degradation metadata. Fire360 supports five tasks: Visual Question Answering, Temporal Action Captioning, Object Localization, Safety-Critical Reasoning, and Transformed Object Retrieval (TOR). TOR tests whether models can match pristine exemplars to fire-damaged counterparts in unpaired scenes, evaluating episodic memory under irreversible visual transformations. While human experts achieve 83.5% on TOR, models like GPT-4o lag significantly, exposing failures in reasoning under degradation. By releasing Fire360 and its evaluation suite, we aim to advance models that not only see, but also remember, reason, and act under uncertainty. The dataset is available at https://uofi.box.com/v/fire360dataset

View full details

Poster

Axial Neural Networks for Dimension-Free Foundation Models

Hyunsu Kim ⋅ Jonggeon Park ⋅ Joan Bruna ⋅ Hongseok Yang ⋅ Juho Lee

Dec 4, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

The advent of foundation models in AI has significantly advanced general-purpose learning, enabling remarkable capabilities in zero-shot inference and in-context learning. However, training such models on physics data, including solutions to partial differential equations (PDEs), poses a unique challenge due to varying dimensionalities across different systems. Traditional approaches either fix a maximum dimension or employ separate encoders for different dimensionalities, resulting in inefficiencies. To address this, we propose a dimension-agnostic neural network architecture, the Axial Neural Network (XNN), inspired by parameter-sharing structures such as Deep Sets and Graph Neural Networks. XNN generalizes across varying tensor dimensions while maintaining computational efficiency. We convert existing PDE foundation models into axial neural networks and evaluate their performance across three training scenarios: training from scratch, pretraining on multiple PDEs, and fine-tuning on a single PDE. Our experiments show that XNNs perform competitively with original models and exhibit superior generalization to unseen dimensions, highlighting the importance of multidimensional pretraining for foundation models.

View full details

Poster

Debate or Vote: Which Yields Better Decisions in Multi-Agent Large Language Models?

Hyeong Kyu Choi ⋅ Jerry Zhu ⋅ Sharon Li

Dec 4, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Multi-Agent Debate (MAD) has emerged as a promising paradigm for improving the performance of large language models through collaborative reasoning. Despite recent advances, the key factors driving MAD’s effectiveness remain unclear. In this work, we disentangle MAD into two key components–Majority Voting and inter-agent Debate–and assess their respective contributions. Through extensive experiments across seven NLP benchmarks, we find that Majority Voting alone accounts for most of the performance gains typically attributed to MAD. To explain this, we propose a theoretical framework that models debate as a stochastic process. We prove that it induces a martingale over agents’ belief trajectories, implying that debate alone does not improve expected correctness. Guided by these insights, we demonstrate that targeted interventions, by biasing the belief update toward correction, can meaningfully enhance debate effectiveness. Overall, our findings suggest that while MAD has potential, simple ensembling methods remain strong and more reliable alternatives in many practical settings. Code is released in https://github.com/deeplearning-wisc/debate-or-vote.

View full details

Poster

SIU3R: Simultaneous Scene Understanding and 3D Reconstruction Beyond Feature Alignment

QiXu ⋅ Dongxu Wei ⋅ Lingzhe Zhao ⋅ Wenpu Li ⋅ Zhangchi Huang ⋅ Shunping Ji ⋅ Peidong Liu

Dec 4, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Simultaneous understanding and 3D reconstruction plays an important role in developing end-to-end embodied intelligent systems. To achieve this, recent approaches resort to 2D-to-3D feature alignment paradigm, which leads to limited 3D understanding capability and potential semantic information loss. In light of this, we propose SIU3R, the first alignment-free framework for generalizable simultaneous understanding and 3D reconstruction from unposed images. Specifically, SIU3R bridges reconstruction and understanding tasks via pixel-aligned 3D representation, and unifies multiple understanding tasks into a set of unified learnable queries, enabling native 3D understanding without the need of alignment with 2D models. To encourage collaboration between the two tasks with shared representation, we further conduct in-depth analyses of their mutual benefits, and propose two lightweight modules to facilitate their interaction. Extensive experiments demonstrate that our method achieves state-of-the-art performance not only on the individual tasks of 3D reconstruction and understanding, but also on the task of simultaneous understanding and 3D reconstruction, highlighting the advantages of our alignment-free framework and the effectiveness of the mutual benefit designs.

View full details

Poster

Transstratal Adversarial Attack: Compromising Multi-Layered Defenses in Text-to-Image Models

Chunlong Xie ⋅ Kangjie Chen ⋅ Shangwei Guo ⋅ Shudong Zhang ⋅ Tianwei Zhang ⋅ Tao Xiang

Dec 4, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Modern Text-to-Image (T2I) models deploy multi-layered defenses to block Not-Safe-For-Work (NSFW) content generation. These defenses typically include sequential layers such as prompt filters, concept erasers and image filters. While existing adversarial attacks have demonstrated vulnerabilities in isolated defense layers, they prove largely ineffective against multi-layered defenses deployed in real-world T2I systems. In this paper, we demonstrate that exploiting overlapping vulnerabilities across these distinct defense layers enables adversaries to systematically bypass the entire safeguard of T2I systems. We propose Transstratal Adversarial Attack (TAA), a novel black-box framework to compromise T2I models with multi-layered protection. It generates transstratal adversarial prompts to evade all defense layers simultaneously. This is accomplished through transstratal adversarial candidate generation using LLMs to fulfill implicit and subjective adversarial requirements against different defense layers, combined with adversarial genetic optimization for efficient black-box search to maximize the bypass rates and generated image harmfulness. Evaluated across 14 T2I models (e.g., Stable Diffusion, DALL·E, and Midjourney) and 17 safety modules, our attack achieves an average attack success rate of 85.6\%, surpassing state-of-the-art methods by 73.5\%. Our findings challenge the isolated design of safety mechanisms and establish the first benchmark for holistic robustness evaluation in multi-layered safeguarded T2I models. The code can be found in https://github.com/Bluedask/TAA-T2I.

View full details

Poster

Flow Equivariant Recurrent Neural Networks

Andy Keller

Dec 4, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Data arrives at our senses as a continuous stream, smoothly transforming from one instant to the next. These smooth transformations can be viewed as continuous symmetries of the environment that we inhabit, defining equivalence relations between stimuli over time. In machine learning, neural network architectures that respect symmetries of their data are called equivariant and have provable benefits in terms of generalization ability and sample efficiency. To date, however, equivariance has been considered only for static transformations and feed-forward networks, limiting its applicability to sequence models, such as recurrent neural networks (RNNs), and corresponding time-parameterized sequence transformations. In this work, we extend equivariant network theory to this regime of 'flows' -- one-parameter Lie subgroups capturing natural transformations over time, such as visual motion. We begin by showing that standard RNNs are generally not flow equivariant: their hidden states fail to transform in a geometrically structured manner for moving stimuli. We then show how flow equivariance can be introduced, and demonstrate that these models significantly outperform their non-equivariant counterparts in terms of training speed, length generalization, and velocity generalization, on both next step prediction and sequence classification. We present this work as a first step towards building sequence models that respect the time-parameterized symmetries which govern the world around us.

View full details

Poster

Imitation Beyond Expectation Using Pluralistic Stochastic Dominance

Ali Farajzadeh ⋅ Danyal Saeed ⋅ Syed M Abbas ⋅ Rushit Shah ⋅ Aadirupa Saha ⋅ Brian Ziebart

Dec 4, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Imitation learning seeks policies reflecting the values of demonstrated behaviors. Prevalent approaches learn to match or exceed the demonstrator's performance in expectation without knowing the demonstrator’s reward function. Unfortunately, this does not induce pluralistic imitators that learn to support qualitatively distinct demonstrations. We reformulate imitation learning using stochastic dominance over the demonstrations' reward distribution across a range of reward functions as our foundational aim. Our approach matches imitator policy samples (or support) with demonstrations using optimal transport theory to define an imitation learning objective over trajectory pairs. We demonstrate the benefits of pluralistic stochastic dominance (PSD) for imitation in both theory and practice.

View full details

Poster

Hamiltonian Descent Algorithms for Optimization: Accelerated Rates via Randomized Integration Time

Qiang Fu ⋅ Andre Wibisono

Dec 4, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

We study the Hamiltonian flow for optimization (HF-opt), which simulates the Hamiltonian dynamics for some integration time and resets the velocity to $0$ to decrease the objective function; this is the optimization analogue of the Hamiltonian Monte Carlo algorithm for sampling. For short integration time, HF-opt has the same convergence rates as gradient descent for minimizing strongly and weakly convex functions. We show that by randomizing the integration time in HF-opt, the resulting randomized Hamiltonian flow (RHF) achieves accelerated convergence rates in continuous time, similar to the rates for accelerated gradient flow. We study a discrete-time implementation of RHF as the randomized Hamiltonian gradient descent (RHGD) algorithm. We prove that RHGD achieves the same accelerated convergence rates as Nesterov's accelerated gradient descent (AGD) for minimizing smooth strongly and weakly convex functions. We provide numerical experiments to demonstrate that RHGD is competitive with classical accelerated methods such as AGD across all settings and outperforms them in certain regimes.

View full details

Poster

Scaling Laws For Scalable Oversight

Joshua Engels ⋅ David Baek ⋅ Subhash Kantamneni ⋅ Max Tegmark

Dec 4, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Scalable oversight, the process by which weaker AI systems supervise stronger ones, has been proposed as a key strategy to control future superintelligent systems. However, it is still unclear how scalable oversight itself scales. To address this gap, we propose a framework that quantifies the probability of successful oversight as a function of the capabilities of the overseer and the system being overseen. Specifically, our framework models oversight as a game between capability-mismatched players; the players have oversight-specific Elo scores that are a piecewise-linear function of their general intelligence, with two plateaus corresponding to task incompetence and task saturation. We validate our framework with a modified version of the game Nim and then apply it to four oversight games: Mafia, Debate, Backdoor Code and Wargames. For each game, we find scaling laws that approximate how domain performance depends on general AI system capability. We then build on our findings in a theoretical study of Nested Scalable Oversight (NSO), a process in which trusted models oversee untrusted stronger models, which then become the trusted models in the next step. We identify conditions under which NSO succeeds and derive numerically (and in some cases analytically) the optimal number of oversight levels to maximize the probability of oversight success. We also apply our theory to our four oversight games, where we find that NSO success rates at a general Elo gap of 400 are 13.5\% for Mafia, 51.7\% for Debate, 10.0\% for Backdoor Code, and 9.4\% for Wargames; these rates decline further when overseeing stronger systems.

View full details

Poster

Stochastic Optimization in Semi-Discrete Optimal Transport: Convergence Analysis and Minimax Rate

Ferdinand Genans ⋅ Antoine Godichon-Baggioni ⋅ François-Xavier Vialard ⋅ Olivier Wintenberger

Dec 4, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

We investigate the semi-discrete Optimal Transport (OT) problem, where a continuous source measure $\mu$ is transported to a discrete target measure $\nu$, with particular attention to the OT map approximation. In this setting, Stochastic Gradient Descent (SGD) based solvers have demonstrated strong empirical performance in recent machine learning applications, yet their theoretical guarantee to approximate the OT map is an open question. In this work, we answer it positively by providing both computational and statistical convergence guarantees of SGD. Specifically, we show that SGD methods can estimate the OT map with a minimax convergence rate of $\mathcal{O}(1/\sqrt{n})$, where $n$ is the number of samples drawn from $\mu$. To establish this result, we study the averaged projected SGD algorithm, and identify a suitable projection set that contains a minimizer of the objective, even when the source measure is not compactly supported. Our analysis holds under mild assumptions on the source measure and applies to MTW cost functions,whic include $\|\cdot\|^p$ for $p \in (1, \infty)$. We finally provide numerical evidence for our theoretical results.

View full details

Poster

FutureSightDrive: Thinking Visually with Spatio-Temporal CoT for Autonomous Driving

Shuang Zeng ⋅ Xinyuan Chang ⋅ Mengwei Xie ⋅ Xinran Liu ⋅ Yifan Bai ⋅ Zheng Pan ⋅ Mu Xu ⋅ Xing Wei

Dec 4, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Vision–Language–Action (VLA) models are increasingly used for end-to-end driving due to their world knowledge and reasoning ability. Most prior work, however, inserts textual chains-of-thought (CoT) as intermediate steps tailored to the current scene. Such symbolic compressions can blur spatio-temporal relations and discard fine visual cues, creating a cross-modal gap between perception and planning. We propose FSDrive, a visual spatio-temporal CoT framework that enables VLAs to think in images. The model first acts as a world model to generate a unified future frame that overlays coarse but physically-plausible priors—future lane dividers and 3D boxes—on the predicted future image. This unified frame serves as the visual CoT, capturing both spatial structure and temporal evolution. The same VLA then functions as an inverse-dynamics model, planning trajectories from current observations and the visual CoT. To equip VLAs with image generation while preserving understanding, we introduce a unified pre-training paradigm that expands the vocabulary to include visual tokens and jointly optimizes VQA (for semantics) and future-frame prediction (for dynamics). A progressive easy-to-hard scheme first predicts lane/box priors to enforce physical constraints, then completes full future frames for fine details. On nuScenes and NAVSIM, FSDrive improves trajectory accuracy and reduces collisions under both ST-P3 and UniAD metrics, and attains competitive FID for future-frame generation despite using lightweight autoregression. It also advances scene understanding on DriveLM. Together, these results indicate that visual CoT narrows the cross-modal gap and yields safer, more anticipatory planning. Code is available at https://github.com/MIV-XJTU/FSDrive.

View full details

Poster

Disentangled Concepts Speak Louder Than Words: Explainable Video Action Recognition

Jongseo Lee ⋅ Wooil Lee ⋅ Gyeong-Moon Park ⋅ Seong Tae Kim ⋅ Jinwoo Choi

Dec 4, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Effective explanations of video action recognition models should disentangle how movements unfold over time from the surrounding spatial context. However, existing methods—based on saliency—produce entangled explanations, making it unclear whether predictions rely on motion or spatial context. Language-based approaches offer structure but often fail to explain motions due to their tacit nature—intuitively understood but difficult to verbalize. To address these challenges, we propose Disentangled Action aNd Context concept-based Explainable (DANCE) video action recognition, a framework that predicts actions through disentangled concept types: motion dynamics, objects, and scenes. We define motion dynamics concepts as human pose sequences. We employ a large language model to automatically extract object and scene concepts. Built on an ante-hoc concept bottleneck design, DANCE enforces prediction through these concepts. Experiments on four datasets—KTH, Penn Action, HAA500, and UCF101—demonstrate that DANCE significantly improves explanation clarity with competitive performance. Through a user study, we validate the superior interpretability of DANCE. Experimental results also show that DANCE is beneficial for model debugging, editing, and failure analysis.

View full details

Poster

Fast-Slow Thinking GRPO for Large Vision-Language Model Reasoning

Wenyi Xiao ⋅ Leilei Gan

Dec 4, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

When applying reinforcement learning—typically through GRPO—to large vision-language model reasoning struggles to effectively scale reasoning length or generates verbose outputs across all tasks with only marginal gains in accuracy. To address this issue, we present FAST-GRPO, a variant of GRPO that dynamically adapts reasoning depth based on question characteristics. Through empirical analysis, we establish the feasibility of fast-slow thinking in LVLMs by investigating how response length and data distribution affect performance. Inspired by these observations, we introduce two complementary metrics to estimate the difficulty of the questions, guiding the model to determine when fast or slow thinking is more appropriate. Next, we incorporate adaptive length-based rewards and difficulty-aware KL divergence into the GRPO algorithm. Experiments across seven reasoning benchmarks demonstrate that FAST achieves state-of-the-art accuracy with over 10% relative improvement compared to the base model, while reducing token usage by 32.7-67.3% compared to previous slow-thinking approaches, effectively balancing reasoning length and accuracy.

View full details

Poster

Rectified Point Flow: Generic Point Cloud Pose Estimation

Tao Sun ⋅ Liyuan Zhu ⋅ Shengyu Huang ⋅ Shuran Song ⋅ Iro Armeni

Dec 4, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

We present Rectified Point Flow, a unified parameterization that formulates pairwise point cloud registration and multi-part shape assembly as a single conditional generative problem. Given unposed point clouds, our method learns a continuous point-wise velocity field that transports noisy points toward their target positions, from which part poses are recovered. In contrast to prior work that regresses part-wise poses with ad-hoc symmetry handling, our method intrinsically learns assembly symmetries without symmetry labels. Together with an overlap-aware encoder focused on inter-part contacts, Rectified Point Flow achieves a new state-of-the-art performance on six benchmarks spanning pairwise registration and shape assembly. Notably, our unified formulation enables effective joint training on diverse datasets, facilitating the learning of shared geometric priors and consequently boosting accuracy. Our code and models are available at https://rectified-pointflow.github.io/.

View full details

Poster

Sample-Adaptivity Tradeoff in On-Demand Sampling

Nika Haghtalab ⋅ Omar Montasser ⋅ Mingda Qiao

Dec 4, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

We study the tradeoff between sample complexity and round complexity in *on-demand sampling*, where the learning algorithm adaptively samples from $k$ distributions over a limited number of rounds. In the realizable setting of Multi-Distribution Learning (MDL), we show that the optimal sample complexity of an $r$-round algorithm scales approximately as $dk^{\Theta(1/r)} / \epsilon$. For the general agnostic case, we present an algorithm that achieves near-optimal sample complexity of $\widetilde O((d + k) / \epsilon^2)$ within $\widetilde O(\sqrt{k})$ rounds. Of independent interest, we introduce a new framework, Optimization via On-Demand Sampling (OODS), which abstracts the sample-adaptivity tradeoff and captures most existing MDL algorithms. We establish nearly tight bounds on the round complexity in the OODS setting. The upper bounds directly yield the $\widetilde O(\sqrt{k})$-round algorithm for agnostic MDL, while the lower bounds imply that achieving sub-polynomial round complexity would require fundamentally new techniques that bypass the inherent hardness of OODS.

View full details

Poster

Abstract Rendering: Certified Rendering Under 3D Semantic Uncertainty

Chenxi Ji ⋅ Yangge Li ⋅ Xiangru Zhong ⋅ Huan Zhang ⋅ Sayan Mitra

Dec 4, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Rendering produces 2D images from 3D scene representations, yet how continuous variations in camera pose and scenes influence these images—and, consequently, downstream visual models—remains underexplored. We introduce **abstract rendering**, a framework that computes provable bounds on all images rendered under continuously varying camera poses and scenes. The resulting abstract image, expressed as a set of constraints over the image matrix, enables rigorous uncertainty propagation through downstream neural networks and thereby supports certification of model behavior under realistic 3D semantic perturbations, far beyond traditional pixel-level noise models. Our approach propagates camera pose uncertainty through each rendering step using efficient piecewise linear bounds, including custom abstractions for three rendering-specific operations—matrix inversion, sorting-based aggregation, and cumulative product summation—not supported by standard tools. Our implementation, ABSTRACTRENDER, targets two state-of-the-art photorealistic scene representations—3D Gaussian Splats and Neural Radiance Fields (NeRF)—and scales to complex scenes with up to 1M Gaussians. Our computed abstract images achieve up to 3% over-approximation error compared to sampling results (baseline). Through experiments on classification (ResNet), object detection (YOLO), and pose estimation (GATENet) tasks, we demonstrate that abstract rendering enables formal certification of downstream models under realistic 3D variations—an essential step toward safety-critical vision systems.

View full details

Poster

Blackbox Model Provenance via Palimpsestic Membership Inference

Rohith Kuditipudi ⋅ Jing Huang ⋅ Sally Zhu ⋅ Diyi Yang ⋅ Chris Potts ⋅ Percy Liang

Dec 4, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Suppose Alice trains an open-weight language model and Bob uses a blackbox derivative of Alice’s model to produce text. Can Alice prove that Bob is using her model, either by querying Bob’s derivative model (query setting) or from the text alone ( observational setting)? We formulate this question as an independence testing problem—in which the null hypothesis is that Bob’s model or text is independent of Alice’s randomized training run—and investigate it through the lens of palimpsestic memorization in language models: models are more likely to memorize data seen later in training, so we can test whether Bob is using Alice’s model using test statistics that capture correlation between Bob’s model or text and the ordering of training examples in Alice’s training run. If Alice has randomly shuffled her training data, then any significant correlation amounts to exactly quantifiable statistical evidence against the null hypothesis, regardless of the composition of Alice’s training data. In the query setting, we directly estimate (via prompting) the likelihood Bob’s model gives to Alice’s training examples and their training order; we correlate the likelihoods of over 40 fine-tunes of various Pythia and OLMo base models ranging from 1B to 12B parameters with the base model’s training data order, achieving a p-value on the order of at most $1 \times 10^{-8}$ in all but six cases. In the observational setting, we try two approaches based on estimating 1) the likelihood of Bob’s text overlapping with spans of Alice’s training examples and 2) the likelihood of Bob’s text with respect to different versions of Alice’s model we obtain by repeating the last phase (e.g., 1%) of her training run on reshuffled data. The second approach can reliably distinguish Bob’s text from as little as a few hundred tokens; the first does not involve any retraining but requires many more tokens (several hundred thousand) to achieve high power.

View full details

Poster

Checklists Are Better Than Reward Models For Aligning Language Models

Vijay Viswanathan ⋅ Yanchao Sun ⋅ Xiang Kong ⋅ Meng Cao ⋅ Graham Neubig ⋅ Sherry Wu

Dec 4, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Language models must be adapted to understand and follow user instructions. Reinforcement learning is widely used to facilitate this —typically using fixed criteria such as "helpfulness" and "harmfulness". In our work, we instead propose using flexible, instruction-specific criteria as a means of broadening the impact that reinforcement learning can have in eliciting instruction following. We propose "Reinforcement Learning from Checklist Feedback" (RLCF). From instructions, we extract checklists and evaluate how well responses satisfy each item—using both AI judges and specialized verifier programs—then combine these scores to compute rewards for RL. We compare RLCF with other alignment methods on top of a strong instruction following model (Qwen2.5-7B-Instruct) on five widely-studied benchmarks — RLCF is the only method to help on every benchmark, including a 4-point boost in hard satisfaction rate on FollowBench, a 6-point increase on InFoBench, and a 3-point rise in win rate on Arena-Hard. We show that RLCF can also be used off-policy to improve Llama 3.1 8B Instruct and OLMo 2 7B Instruct. These results establish rubrics as a key tool for improving language models' support of queries that express a multitude of needs. We release our our dataset of rubrics (WildChecklists), models, and code to the public.

View full details

Poster

MONITRS: Multimodal Observations of Natural Incidents Through Remote Sensing

Shreelekha Revankar ⋅ Utkarsh Mall ⋅ Cheng Perng Phoo ⋅ Kavita Bala ⋅ Bharath Hariharan

Dec 4, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Natural disasters cause devastating damage to communities and infrastructure every year. Effective disaster response is hampered by the difficulty of accessing affected areas during and after events. Remote sensing has allowed us to monitor natural disasters in a remote way. More recently there have been advances in computer vision and deep learning that help automate satellite imagery analysis, However, they remain limited by their narrow focus on specific disaster types, reliance on manual expert interpretation, and lack of datasets with sufficient temporal granularity or natural language annotations for tracking disaster progression. We present MONITRS, a novel multimodal dataset of $\sim$10,000 FEMA disaster events with temporal satellite imagery with natural language annotations from news articles, accompanied by geotagged locations, and question-answer pairs. We demonstrate that fine-tuning existing MLLMs on our dataset yields significant performance improvements for disaster monitoring tasks, establishing a new benchmark for machine learning-assisted disaster response systems.

View full details

Poster

Partition-Then-Adapt: Combating Prediction Bias for Reliable Multi-Modal Test-Time Adaptation

Guowei Wang ⋅ Fan Lyu ⋅ Changxing Ding

Dec 4, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Existing test-time adaptation (TTA) methods primarily focus on scenarios involving domain shifts in a single modality. However, they often prove ineffective when multiple modalities simultaneously undergo domain shifts, as they struggle to identify and utilize reliable samples within testing batches amid severe prediction bias. To address this problem, we propose Partition-Then-Adapt (PTA), a novel approach combating prediction bias for TTA with multi-modal domain shifts. PTA comprises two key components: Partition and Debiased Reweighting (PDR) and multi-modal Attention-Guided Alignment (AGA). Specifically, PDR evaluates each sample’s predicted label frequency relative to the batch average, partitioning the batch into potential reliable and unreliable subsets. It then reweights each sample by jointly assessing its bias and confidence levels through a quantile-based approach. By applying weighted entropy loss, PTA simultaneously promotes learning from reliable subsets and discourages reliance on unreliable ones. Moreover, AGA regularizes PDR to focus on semantically meaningful multi-modal cues. Extensive experiments validate the effectiveness of PTA, surpassing state-of-the-art method by 6.1\% on Kinetics50-MC and 5.8\% on VGGSound-MC, respectively. Code of this paper is available at https://github.com/MPI-Lab/PTA.

View full details

Poster

Nemotron-CLIMB: Clustering-based Iterative Data Mixture Bootstrapping for Language Model Pre-training

Shizhe Diao ⋅ Yu Yang ⋅ Yonggan Fu ⋅ Xin Dong ⋅ Dan SU ⋅ Markus Kliegl ⋅ ZIJIA CHEN ⋅ Peter Belcak ⋅ Yoshi Suhara ⋅ Hongxu Yin ⋅ Mostofa Patwary ⋅ Yingyan (Celine) Lin ⋅ Jan Kautz ⋅ Pavlo Molchanov

Dec 4, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Pre-training datasets are typically collected from web content and lack inherent domain divisions. For instance, widely used datasets like Common Crawl do not include explicit domain labels, while manually curating labeled datasets such as The Pile is labor-intensive. Consequently, identifying an optimal pre-training data mixture remains a challenging problem, despite its significant benefits for pre-training performance. To address these challenges, we propose CLustering-based Iterative Data Mixture Bootstrapping (Nemotron-CLIMB), an automated framework that discovers, evaluates, and refines data mixtures in a pre-training setting. Specifically, Nemotron-CLIMB embeds and clusters large-scale datasets in a semantic space and then iteratively searches for optimal mixtures using a smaller proxy model and a predictor. This strategy enables effective domain adaptation without relying solely on curated data. When continuously trained on 400B tokens with this mixture, our 1B model exceeds the state-of-the-art Llama-3.2-1B by 2.0%. Moreover, we observe that optimizing for a specific domain (e.g., Social Sciences) yields a 5% improvement over random sampling. Finally, we introduce Nemotron-ClimbLab, a filtered 1.2-trillion-token corpus with 20 clusters as a research playground, and Nemotron-ClimbMix, a compact yet powerful 400-billion-token dataset designed for efficient pre-training that delivers superior performance under an equal token budget. We analyze the final data mixture, elucidating the characteristics of an optimal data mixture.

View full details

Poster

Error Forcing in Recurrent Neural Networks

A Sağtekin ⋅ Colin Bredenberg ⋅ Cristina Savin

Dec 4, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

How should feedback influence recurrent neural network (RNN) learning? One way to address the known limitations of backpropagation through time is to directly adjust neural activities during the learning process. However, it remains unclear how to effectively use feedback to shape RNN dynamics. Here, we introduce error forcing (EF), where the network activity is guided orthogonally toward the zero-error manifold during learning. This method contrasts with alternatives like teaching forcing, which impose stronger constraints on neural activity and thus induce larger feedback influence on circuit dynamics. Furthermore, EF can be understood from a Bayesian perspective as a form of approximate dynamic inference. Empirically, EF consistently outperforms other learning algorithms across several tasks and its benefits persist when additional biological constraints are taken into account. Overall, EF is a powerful temporal credit assignment mechanism and a promising candidate model for learning in biological systems.

View full details

Poster

CoLT: The conditional localization test for assessing the accuracy of neural posterior estimates

Tianyu Chen ⋅ Vansh Bansal ⋅ James Scott

Dec 4, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

We consider the problem of validating whether a neural posterior estimate $q(\theta \mid x)$ is an accurate approximation to the true, unknown true posterior $p(\theta \mid x)$. Existing methods for evaluating the quality of an NPE estimate are largely derived from classifier-based tests or divergence measures, but these suffer from several practical drawbacks. As an alternative, we introduce the *Conditional Localization Test* (**CoLT**), a principled method designed to detect discrepancies between $p(\theta \mid x)$ and $q(\theta \mid x)$ across the full range of conditioning inputs. Rather than relying on exhaustive comparisons or density estimation at every $x$, CoLT learns a localization function that adaptively selects points $\theta_l(x)$ where the neural posterior $q$ deviates most strongly from the true posterior $p$ for that $x$. This approach is particularly advantageous in typical simulation-based inference settings, where only a single draw $\theta \sim p(\theta \mid x)$ from the true posterior is observed for each conditioning input, but where the neural posterior $q(\theta \mid x)$ can be sampled an arbitrary number of times. Our theoretical results establish necessary and sufficient conditions for assessing distributional equality across all $x$, offering both rigorous guarantees and practical scalability. Empirically, we demonstrate that CoLT not only performs better than existing methods at comparing $p$ and $q$, but also pinpoints regions of significant divergence, providing actionable insights for model refinement. These properties position CoLT as a state-of-the-art solution for validating neural posterior estimates.

View full details

Poster

Frame Context Packing and Drift Prevention in Next-Frame-Prediction Video Diffusion Models

Lvmin Zhang ⋅ Shengqu Cai ⋅ Muyang Li ⋅ Gordon Wetzstein ⋅ Maneesh Agrawala

Dec 4, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

We present a neural network structure, FramePack, to train next-frame (or next-frame-section) prediction models for video generation. FramePack compresses input frame contexts with frame-wise importance so that more frames can be encoded within a fixed context length, with more important frames having longer contexts. The frame importance can be measured using time proximity, feature similarity, or hybrid metrics. The packing method allows for inference with thousands of frames and training with relatively large batch sizes. We also present drift prevention methods to address observation bias (error accumulation), including early-established endpoints, adjusted sampling orders, and discrete history representation. Ablation studies validate the effectiveness of the anti-drifting methods in both single-directional video streaming and bi-directional video generation. Finally, we show that existing video diffusion models can be finetuned with FramePack, and analyze the differences between different packing schedules.

View full details

Poster

Self Forcing: Bridging the Train-Test Gap in Autoregressive Video Diffusion

Xun Huang ⋅ Zhengqi Li ⋅ Guande He ⋅ Mingyuan Zhou ⋅ Eli Shechtman

Dec 4, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

We introduce Self Forcing, a novel training paradigm for autoregressive video diffusion models. It addresses the longstanding issue of exposure bias, where models trained on ground-truth context must generate sequences conditioned on their own imperfect outputs during inference. Unlike prior methods that denoise future frames based on ground-truth context frames, Self Forcing conditions each frame's generation on previously self-generated outputs by performing autoregressive rollout with key-value (KV) caching during training. This strategy enables supervision through a holistic loss at the video level that directly evaluates the quality of the entire generated sequence, rather than relying solely on traditional frame-wise objectives. To ensure training efficiency, we employ a few-step diffusion model along with a stochastic gradient truncation strategy, effectively balancing computational cost and performance. We further introduce a rolling KV cache mechanism that enables efficient autoregressive video extrapolation. Extensive experiments demonstrate that our approach achieves real-time streaming video generation with sub-second latency on a single GPU, while matching or even surpassing the generation quality of significantly slower and non-causal diffusion models.

View full details

Poster

Continuous Thought Machines

Luke Darlow ⋅ Ciaran Regan ⋅ Sebastian Risi ⋅ Jeffrey Seely ⋅ Llion Jones

Dec 4, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Biological brains demonstrate complex neural activity, where neural dynamics are critical to how brains process information. Most artificial neural networks ignore the complexity of individual neurons. We challenge that paradigm. By incorporating neuron-level processing and synchronization, we reintroduce neural timing as a foundational element. We present the Continuous Thought Machine (CTM), a model designed to leverage neural dynamics as its core representation. The CTM has two innovations: (1) neuron-level temporal processing}, where each neuron uses unique weight parameters to process incoming histories; and (2) neural synchronization as a latent representation. The CTM aims to strike a balance between neuron abstractions and biological realism. It operates at a level of abstraction that effectively captures essential temporal dynamics while remaining computationally tractable. We demonstrate the CTM's performance and versatility across a range of tasks, including solving 2D mazes, ImageNet-1K classification, parity computation, and more. Beyond displaying rich internal representations and offering a natural avenue for interpretation owing to its internal process, the CTM is able to perform tasks that require complex sequential reasoning. The CTM can also leverage adaptive compute, where it can stop earlier for simpler tasks, or keep computing when faced with more challenging instances. The goal of this work is to share the CTM and its associated innovations, rather than pushing for new state-of-the-art results. To that end, we believe the CTM represents a significant step toward developing more biologically plausible and powerful artificial intelligence systems. We provide an accompanying [interactive online demonstration](https://pub.sakana.ai/ctm/) and an [extended technical report](https://pub.sakana.ai/ctm/paper).

View full details

Poster

L2DGCN: Learnable Enhancement and Label Selection Dynamic Graph Convolutional Networks for Mitigating Degree Bias

jingxiao zhang ⋅ Shifei Ding ⋅ Jian Zhang ⋅ Lili Guo ⋅ Xuan Li

Dec 4, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Graph Neural Networks (GNNs) are powerful models for node classification, but their performance is heavily reliant on manually labeled data, which is often costly and results in insufficient labeling. Recent studies have shown that message-passing neural networks struggle to propagate information in low-degree nodes, negatively affecting overall performance. To address the information bias caused by degree imbalance, we propose a Learnable Enhancement and Label Selection Dynamic Graph Convolutional Network (L2DGCN). L2DGCN consists of a teacher model and a student model. The teacher model employs an improved label propagation mechanism that enables remote label information dissemination among all nodes. The student model introduces a dynamically learnable graph enhancement strategy, perturbing edges to facilitate information exchange among low-degree nodes. This approach maintains the global graph structure while learning graph representations. Additionally, we have designed a label selector to mitigate the impact of unreliable pseudo-labels on model learning. To validate the effectiveness of our proposed model with limited labeled data, we conducted comprehensive evaluations of semi-supervised node classification across various scenarios with a limited number of annotated nodes. Experimental results demonstrate that our data enhancement model significantly contributes to node classification tasks under sparse labeling conditions.

View full details

Poster

Any-stepsize Gradient Descent for Separable Data under Fenchel–Young Losses

Han Bao ⋅ Shinsaku Sakaue ⋅ Yuki Takezawa

Dec 4, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

The gradient descent (GD) has been one of the most common optimizer in machine learning. In particular, the loss landscape of a neural network is typically sharpened during the initial phase of training, making the training dynamics hover on the edge of stability. This is beyond our standard understanding of GD convergence in the stable regime where arbitrarily chosen stepsize is sufficiently smaller than the edge of stability. Recently, Wu et al. (COLT2024) have showed that GD converges with arbitrary stepsize under linearly separable logistic regression. Although their analysis hinges on the self-bounding property of the logistic loss, which seems to be a cornerstone to establish a modified descent lemma, our pilot study shows that other loss functions without the self-bounding property can make GD converge with arbitrary stepsize. To further understand what property of a loss function matters in GD, we aim to show arbitrary-stepsize GD convergence for a general loss function based on the framework of \emph{Fenchel--Young losses}. We essentially leverage the classical perceptron argument to derive the convergence rate for achieving $\epsilon$-optimal loss, which is possible for a majority of Fenchel--Young losses. Among typical loss functions, the Tsallis entropy achieves the GD convergence rate $T=\Omega(\epsilon^{-1/2})$, and the R{\'e}nyi entropy achieves the far better rate $T=\Omega(\epsilon^{-1/3})$. We argue that these better rate is possible because of \emph{separation margin} of loss functions, instead of the self-bounding property.

View full details

Poster

Orochi: Versatile Biomedical Image Processor

Gaole Dai ⋅ Chenghao Zhou ⋅ Yu Zhou ⋅ Rongyu Zhang ⋅ Yuan Zhang ⋅ Chengkai Hou ⋅ Tiejun Huang ⋅ Jianxu Chen ⋅ Shanghang Zhang

Dec 4, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Deep learning has emerged as a pivotal tool for accelerating research in the life sciences, with the low-level processing of biomedical images (e.g., registration, fusion, restoration, super-resolution) being one of its most critical applications. Platforms such as ImageJ (Fiji) and napari have enabled the development of customized plugins for various models. However, these plugins are typically based on models that are limited to specific tasks and datasets, making them less practical for biologists. To address this challenge, we introduce **Orochi**, the first application-oriented, efficient, and versatile image processor designed to overcome these limitations. Orochi is pre-trained on patches/volumes extracted from the raw data of over 100 publicly available studies using our Random Multi-scale Sampling strategy. We further propose Task-related Joint-embedding Pre-Training (TJP), which employs biomedical task-related degradation for self-supervision rather than relying on Masked Image Modelling (MIM), which performs poorly in downstream tasks such as registration. To ensure computational efficiency, we leverage Mamba's linear computational complexity and construct Multi-head Hierarchy Mamba. Additionally, we provide a three-tier fine-tuning framework (Full, Normal, and Light) and demonstrate that Orochi achieves comparable or superior performance to current state-of-the-art specialist models, even with lightweight parameter-efficient options. We hope that our study contributes to the development of an all-in-one workflow, thereby relieving biologists from the overwhelming task of selecting among numerous models. Our pre-trained weights and code will be released.

View full details

Poster

AI Research Agents for Machine Learning: Search, Exploration, and Generalization in MLE-bench

Edan Toledo ⋅ Karen Hambardzumyan ⋅ Martin Josifoski ⋅ Rishi Hazra ⋅ Nicolas Baldwin ⋅ Alexis Audran-Reiss ⋅ Michael Kuchnik ⋅ Despoina Magka ⋅ Minqi Jiang ⋅ Alisia Lupidi ⋅ Andrei Lupu ⋅ Roberta Raileanu ⋅ Tatiana Shavrina ⋅ Kelvin Niu ⋅ Jean-Christophe Gagnon-Audet ⋅ Michael Shvartsman ⋅ Shagun Sodhani ⋅ Alexander Miller ⋅ Abhishek Charnalia ⋅ Derek Dunfield ⋅ Carole-Jean Wu ⋅ Pontus Lars Erik Saito Stenetorp ⋅ Nicola Cancedda ⋅ Jakob Foerster ⋅ Yoram Bachrach

Dec 4, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

AI research agents are demonstrating great potential to accelerate scientific progress by automating the design, implementation, and training of machine learning models. We focus on methods for improving agents' performance on MLE-bench, a challenging benchmark where agents compete in Kaggle competitions to solve real-world machine learning problems. We formalize AI research agents as search policies that navigate a space of candidate solutions, iteratively modifying them using operators. By designing and systematically varying different operator sets and search policies (Greedy, MCTS, Evolutionary), we show that their interplay is critical for achieving high performance. Our best pairing of search strategy and operator set achieves a state-of-the-art result on MLE-bench lite, increasing the success rate of achieving a Kaggle medal from 39.6% to 47.7%. Our investigation underscores the importance of jointly considering the search strategy, operator design, and evaluation methodology in advancing automated machine learning.

View full details

Poster

Cost-Aware Contrastive Routing for LLMs

Reza Shirkavand ⋅ Shangqian Gao ⋅ Peiran Yu ⋅ Heng Huang

Dec 4, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

We study cost-aware routing for large language models across diverse and dynamic pools of models. Existing approaches often overlook prompt-specific context, rely on expensive model profiling, assume a fixed set of experts, or use inefficient trial-and-error strategies. We introduce Cost-Spectrum Contrastive Routing (CSCR), a lightweight framework that maps both prompts and models into a shared embedding space to enable fast, cost-sensitive selection. CSCR uses compact, fast-to-compute logit footprints for open-source models and perplexity fingerprints for black-box APIs. A contrastive encoder is trained to favor the cheapest accurate expert within adaptive cost bands. At inference time, routing reduces to a single $k$‑NN lookup via a FAISS index, requiring no retraining when the expert pool changes and enabling microsecond latency. Across multiple benchmarks, CSCR consistently outperforms baselines, improving the accuracy–cost tradeoff by up to 25\%, while generalizing robustly to unseen LLMs and out-of-distribution prompts.

View full details

Poster

URDF-Anything: Constructing Articulated Objects with 3D Multimodal Language Model

Zhe Li ⋅ Xiang Bai ⋅ Jieyu Zhang ⋅ Zhuangzhe Wu ⋅ Che Xu ⋅ Ying Li ⋅ Chengkai Hou ⋅ Shanghang Zhang

Dec 4, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Constructing accurate digital twins of articulated objects is essential for robotic simulation training and embodied AI world model building, yet historically requires painstaking manual modeling or multi-stage pipelines. In this work, we propose \textbf{URDF-Anything}, an end-to-end automatic reconstruction framework based on a 3D multimodal large language model (MLLM). URDF-Anything utilizes an autoregressive prediction framework based on point-cloud and text multimodal input to jointly optimize geometric segmentation and kinematic parameter prediction. It implements a specialized [SEG] token mechanism that interacts directly with point cloud features, enabling fine-grained part-level segmentation while maintaining consistency with the kinematic parameter predictions. Experiments on both simulated and real-world datasets demonstrate that our method significantly outperforms existing approaches regarding geometric segmentation (mIoU 17\% improvement), kinematic parameter prediction (average error reduction of 29\%), and physical executability (surpassing baselines by 50\%). Notably, our method exhibits excellent generalization ability, performing well even on objects outside the training set. This work provides an efficient solution for constructing digital twins for robotic simulation, significantly enhancing the sim-to-real transfer capability.

View full details

Poster

Orient Anything V2: Unifying Orientation and Rotation Understanding

Zehan Wang ⋅ Ziang Zhang ⋅ Jiayang Xu ⋅ Jialei Wang ⋅ Tianyu Pang ⋅ Chao Du ⋅ Hengshuang Zhao ⋅ Zhou Zhao

Dec 4, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

This work presents Orient Anything V2, an enhanced foundation model for unified understanding of object 3D orientation and rotation from single or paired images. Building upon Orient Anything V1, which defines orientation via a single unique front face, V2 extends this capability to handle objects with diverse rotational symmetries and directly estimate relative rotations. These improvements are enabled by four key innovations: 1) Scalable 3D assets synthesized by generative models, ensuring broad category coverage and balanced data distribution; 2) An efficient, model-in-the-loop annotation system that robustly identifies 0 to N valid front faces for each object; 3) A symmetry-aware, periodic distribution fitting objective that captures all plausible front-facing orientations, effectively modeling object rotational symmetry; 4) A multi-frame architecture that directly predicts relative object rotations. Extensive experiments show that Orient Anything V2 achieves state-of-the-art zero-shot performance on orientation estimation, 6DoF pose estimation, and object symmetry recognition across 11 widely used benchmarks. The model demonstrates strong generalization, significantly broadening the applicability of orientation estimation in diverse downstream tasks.

View full details

Poster

KORGym: A Dynamic Game Platform for LLM Reasoning Evaluation

Jiajun Shi ⋅ Jian Yang ⋅ Jiaheng Liu ⋅ Xingyuan Bu ⋅ Jiangjie Chen ⋅ Junting Zhou ⋅ Kaijing Ma ⋅ Zhoufutu Wen ⋅ Bingli Wang ⋅ Yancheng He ⋅ Liang Song ⋅ Hualei Zhu ⋅ Shilong Li ⋅ Xingjian Wang ⋅ Wei Zhang ⋅ Ruibin Yuan ⋅ Yifan Yao ⋅ Wenjun Yang ⋅ Yunli Wang ⋅ Siyuan Fang ⋅ Siyu Yuan ⋅ Qianyu He ⋅ Robert Tang ⋅ Yingshui Tan ⋅ Wangchunshu Zhou ⋅ ZHAO-XIANG ZHANG ⋅ Zhoujun Li ⋅ Wenhao Huang ⋅ Ge Zhang

Dec 4, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Recent advancements in large language models (LLMs) underscore the need for more comprehensive evaluation methods to accurately assess their reasoning capabilities. Existing benchmarks are often domain-specific and thus cannot fully capture an LLM’s general reasoning potential. To address this limitation, we introduce the **Knowledge Orthogonal Reasoning Gymnasium (KORGym)**, a dynamic evaluation platform inspired by KOR-Bench and Gymnasium. KORGym offers over fifty games in either textual or visual formats and supports interactive, multi-turn assessments with reinforcement learning scenarios. Using KORGym, we conduct extensive experiments on 19 LLMs and 8 VLMs, revealing consistent reasoning patterns within model families and demonstrating the superior performance of closed-source models. Further analysis examines the effects of modality, reasoning strategies, reinforcement learning techniques, and response length on model performance. We expect KORGym to become a valuable resource for advancing LLM reasoning research and developing evaluation methodologies suited to complex, interactive environments.

View full details

Poster

Deep Value Benchmark: Measuring Whether Models Generalize Deep Values or Shallow Preferences

Joshua Ashkinaze ⋅ Hua Shen ⋅ Saipranav Avula ⋅ Eric Gilbert ⋅ Ceren Budak

Dec 4, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

We introduce the Deep Value Benchmark (DVB), an evaluation framework that directly tests whether large language models (LLMs) learn fundamental human values or merely surface-level preferences. This distinction is critical for AI alignment: Systems that capture deeper values are likely to generalize human intentions robustly, while those that capture only superficial patterns in preference data risk producing misaligned behavior. The DVB uses a novel experimental design with controlled confounding between deep values (e.g., moral principles) and shallow features (e.g., superficial attributes). In the training phase, we expose LLMs to human preference data with deliberately correlated deep and shallow features---for instance, where a user consistently prefers (non-maleficence, formal language) options over (justice, informal language) alternatives. The testing phase then breaks these correlations, presenting choices between (justice, formal language) and (non-maleficence, informal language) options. This design allows us to precisely measure a model's Deep Value Generalization Rate (DVGR)---the probability of generalizing based on the underlying value rather than the shallow feature. Across 9 different models, the average DVGR is just 0.30. All models generalize deep values less than chance. Larger models have a (slightly) lower DVGR than smaller models. We are releasing our dataset, which was subject to three separate human validation experiments. DVB provides an interpretable measure of a core feature of alignment.

View full details

Poster

LoRAShop: Training-Free Multi-Concept Image Generation and Editing with Rectified Flow Transformers

Yusuf Dalva ⋅ Hidir Yesiltepe ⋅ Pinar Yanardag

Dec 4, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

We introduce LoRAShop, the first framework for multi-concept image generation and editing with LoRA models. LoRAShop builds on a key observation about the feature interaction patterns inside Flux-style diffusion transformers: concept-specific transformer features activate spatially coherent regions early in the denoising process. We harness this observation to derive a disentangled latent mask for each concept in a prior forward pass and blend the corresponding LoRA weights only within regions bounding the concepts to be personalized. The resulting edits seamlessly integrate multiple subjects or styles into the original scene while preserving global context, lighting, and fine details. Our experiments demonstrate that LoRAShop delivers better identity preservation compared to baselines. By eliminating retraining and external constraints, LoRAShop turns personalized diffusion models into a practical `photoshop-with-LoRAs' tool and opens new avenues for compositional visual storytelling and rapid creative iteration.

View full details

Poster

MGUP: A Momentum-Gradient Alignment Update Policy for Stochastic Optimization

Da Chang ⋅ Ganzhao Yuan

Dec 4, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Efficient optimization is essential for training large language models. Although intra-layer selective updates have been explored, a general mechanism that enables fine-grained control while ensuring convergence guarantees is still lacking. To bridge this gap, we propose \textbf{MGUP}, a novel mechanism for selective updates. \textbf{MGUP} augments standard momentum-based optimizers by applying larger step-sizes to a selected fixed proportion of parameters in each iteration, while applying smaller, non-zero step-sizes to the rest. As a nearly {plug-and-play} module, \textbf{MGUP} seamlessly integrates with optimizers such as AdamW, Lion, and Muon. This yields powerful variants such as \textbf{MGUP-AdamW}, \textbf{MGUP-Lion}, and \textbf{MGUP-Muon}. Under standard assumptions, we provide theoretical convergence guarantees for \textbf{MGUP-AdamW} (without weight decay) in stochastic optimization. Extensive experiments across diverse tasks, including MAE pretraining, LLM pretraining, and downstream fine-tuning, demonstrate that our \textbf{MGUP}-enhanced optimizers achieve superior or more stable performance compared to their original base optimizers. We offer a principled, versatile, and theoretically grounded strategy for efficient intra-layer selective updates, accelerating and stabilizing the training of large-scale models. The code is publicly available at https://github.com/MaeChd/MGUP.

View full details

Poster

CoT Information: Improved Sample Complexity under Chain-of-Thought Supervision

Awni Altabaa ⋅ Omar Montasser ⋅ John Lafferty

Dec 4, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Learning complex functions that involve multi-step reasoning poses a significant challenge for standard supervised learning from input-output examples. Chain-of-thought (CoT) supervision, which augments training data with intermediate reasoning steps to provide a richer learning signal, has driven recent advances in large language model reasoning. This paper develops a statistical theory of learning under CoT supervision. Central to the theory is the *CoT information*, which measures the additional discriminative power offered by the chain-of-thought for distinguishing hypotheses with different end-to-end behaviors. The main theoretical results demonstrate how CoT supervision can yield significantly faster learning rates compared to standard end-to-end supervision, with both upper bounds and information-theoretic lower bounds characterized by the CoT information.

View full details

Poster

Scalable Fingerprinting of Large Language Models

Anshul Nasery ⋅ Jonathan Hayase ⋅ Creston Brooks ⋅ Peiyao Sheng ⋅ Himanshu Tyagi ⋅ Pramod Viswanath ⋅ Sewoong Oh

Dec 4, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Model fingerprinting has emerged as a powerful tool for model owners to identify their shared model given API access. In order to lower false discovery rate, fight fingerprint leakage, and defend against coalitions of model users attempting to bypass detection, we argue that scaling up the number of fingerprints one can embed into a model, i.e. *Scalability* of fingerprints, is critical. Hence, we pose scalability as a crucial requirement for fingerprinting schemes. We experiment with fingerprint design at a scale significantly larger than previously considered, and introduce a new method, dubbed Perinucleus sampling, to generate scalable, persistent, and harmless fingerprints. We demonstrate that this scheme can add 24,576 fingerprints to a Llama-3.1-8B model---two orders of magnitude more than existing schemes---without degrading the model's utility. Our inserted fingerprints persist even after supervised fine-tuning on standard post-training data. We further address security risks for fingerprinting, and theoretically and empirically show how a scalable fingerprinting scheme like ours can mitigate these risks.

View full details

Poster

Escaping saddle points without Lipschitz smoothness: the power of nonlinear preconditioning

Alexander Bodard ⋅ Panagiotis Patrinos

Dec 4, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

We study generalized smoothness in nonconvex optimization, focusing on $(L_0, L_1)$-smoothness and anisotropic smoothness. The former was empirically derived from practical neural network training examples, while the latter arises naturally in the analysis of nonlinearly preconditioned gradient methods. We introduce a new sufficient condition that encompasses both notions, reveals their close connection, and holds in key applications such as phase retrieval and matrix factorization. Leveraging tools from dynamical systems theory, we then show that nonlinear preconditioning—including gradient clipping—preserves the saddle point avoidance property of classical gradient descent. Crucially, the assumptions required for this analysis are actually satisfied in these applications, unlike in classical results that rely on restrictive Lipschitz smoothness conditions. We further analyze a perturbed variant that efficiently attains second-order stationarity with only logarithmic dependence on dimension, matching similar guarantees of classical gradient methods.

View full details

Poster

MesaTask: Towards Task-Driven Tabletop Scene Generation via 3D Spatial Reasoning

Jinkun Hao ⋅ Naifu Liang ⋅ Zhen Luo ⋅ Xudong XU ⋅ Weipeng Zhong ⋅ Ran Yi ⋅ Yichen Jin ⋅ Zhaoyang Lyu ⋅ Feng Zheng ⋅ Lizhuang Ma ⋅ Jiangmiao Pang

Dec 4, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

The ability of robots to interpret human instructions and execute manipulation tasks necessitates the availability of task-relevant tabletop scenes for training. However, traditional methods for creating these scenes rely on time-consuming manual layout design or purely randomized layouts, which are limited in terms of plausibility or alignment with the tasks. In this paper, we formulate a novel task, namely task-oriented tabletop scene generation, which poses significant challenges due to the substantial gap between high-level task instructions and the tabletop scenes. To support research on such a challenging task, we introduce \textbf{MesaTask-10K}, a large-scale dataset comprising approximately 10,700 synthetic tabletop scenes with \emph{manually crafted layouts} that ensure realistic layouts and intricate inter-object relations. To bridge the gap between tasks and scenes, we propose a \textbf{Spatial Reasoning Chain} that decomposes the generation process into object inference, spatial interrelation reasoning, and scene graph construction for the final 3D layout. We present \textbf{MesaTask}, an LLM-based framework that utilizes this reasoning chain and is further enhanced with DPO algorithms to generate physically plausible tabletop scenes that align well with given task descriptions. Exhaustive experiments demonstrate the superior performance of MesaTask compared to baselines in generating task-conforming tabletop scenes with realistic layouts.

View full details

Poster

AutoToM: Scaling Model-based Mental Inference via Automated Agent Modeling

Zhining Zhang ⋅ Chuanyang Jin ⋅ Mung Yao Jia ⋅ Shunchi Zhang ⋅ Tianmin Shu

Dec 4, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Theory of Mind (ToM), the ability to understand people's minds based on their behavior, is key to developing socially intelligent agents. Current approaches to ToM reasoning either rely on prompting Large Language Models (LLMs), which are prone to systematic errors, or use handcrafted, rigid agent models for model-based inference, which are more robust but fail to generalize across domains. In this work, we introduce *AutoToM*, an automated agent modeling method for scalable, robust, and interpretable mental inference. Given a ToM problem, *AutoToM* first proposes an initial agent model and then performs automated Bayesian inverse planning based on this model, leveraging an LLM backend. Guided by inference uncertainty, it iteratively refines the model by introducing additional mental variables and/or incorporating more timesteps in the context. Across five diverse benchmarks, *AutoToM* outperforms existing ToM methods and even large reasoning models. Additionally, we show that *AutoToM* can produce human‐like confidence estimates and enable online mental inference for embodied decision-making.

View full details

Poster

Fast and Fluent Diffusion Language Models via Convolutional Decoding and Rejective Fine-tuning

Yeongbin Seo ⋅ Dongha Lee ⋅ Jaehyung Kim ⋅ Jinyoung Yeo

Dec 4, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Autoregressive (AR) language models generate text one token at a time, which limits their inference speed. Diffusion-based language models offer a promising alternative, as they can decode multiple tokens in parallel. However, we identify a key bottleneck in current diffusion LMs: the \textbf{long decoding-window problem}, where tokens generated far from the input context often become irrelevant or repetitive. Previous solutions like semi-autoregressive address this issue by splitting windows into blocks (sacrificing bidirectionality), but we find that this also leads to \textbf{time-interval expansion problem}, sacrificing the speed. Therefore, semi-AR eliminates the main advantages of diffusion models. To overcome this, we propose Convolutional decoding (\textit{Conv}), a normalization-based method that narrows the decoding window without hard segmentation, leading to better fluency and flexibility. Additionally, we introduce Rejecting Rule-based Fine-Tuning (R2FT), a post-hoc training scheme that better aligns tokens at positions far from context. Our methods achieve state-of-the-art results on open-ended generation benchmarks (e.g., AlpacaEval) among diffusion LM baselines, with significantly lower step size than previous works, demonstrating both speed and quality improvements. The code is available online (\url{https://github.com/ybseo-ac/Conv}).

View full details

Poster

Deno-IF: Unsupervised Noisy Visible and Infrared Image Fusion Method

Han Xu ⋅ Yuyang Li ⋅ Yunfei Deng ⋅ Jiayi Ma ⋅ Guangcan Liu

Dec 4, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Most image fusion methods are designed for ideal scenarios and struggle to handle noise. Existing noise-aware fusion methods are supervised and heavily rely on constructed paired data, limiting performance and generalization. This paper proposes a novel unsupervised noisy visible and infrared image fusion method, comprising two key modules. First, when only noisy source images are available, a convolutional low-rank optimization module decomposes clean components based on convolutional low-rank priors, guiding subsequent optimization. The unsupervised approach eliminates data dependency and enhances generalization across various and variable noise. Second, a unified network jointly realizes denoising and fusion. It consists of both intra-modal recovery and inter-modal recovery and fusion, also with a convolutional low-rankness loss for regularization. By exploiting the commonalities of denoising and fusion, the joint framework significantly reduces network complexity while expanding functionality. Extensive experiments validate the effectiveness and generalization of the proposed method for image fusion under various and variable noise conditions. The code is publicly available at https://github.com/hanna-xu/Deno-IF.

View full details

Poster

Adaptive Defense against Harmful Fine-Tuning for Large Language Models via Bayesian Data Scheduler

Zixuan Hu ⋅ Li Shen ⋅ Zhenyi Wang ⋅ Yongxian Wei ⋅ Dacheng Tao

Dec 4, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Harmful fine-tuning poses critical safety risks to fine-tuning-as-a-service for large language models. Existing defense strategies preemptively build robustness via attack simulation but suffer from fundamental limitations: (i) the infeasibility of extending attack simulations beyond bounded threat models due to the inherent difficulty of anticipating unknown attacks, and (ii) limited adaptability to varying attack settings, as simulation fails to capture their variability and complexity. To address these challenges, we propose Bayesian Data Scheduler (BDS), an adaptive tuning-stage defense strategy with no need for attack simulation. BDS formulates harmful fine-tuning defense as a Bayesian inference problem, learning the posterior distribution of each data point's safety attribute, conditioned on the fine-tuning and alignment datasets. The fine-tuning process is then constrained by weighting data with their safety attributes sampled from the posterior, thus mitigating the influence of harmful data. By leveraging the post hoc nature of Bayesian inference, the posterior is conditioned on the fine-tuning dataset, enabling BDS to tailor its defense to the specific dataset, thereby achieving adaptive defense. Furthermore, we introduce a neural scheduler based on amortized Bayesian learning, enabling efficient transfer to new data without retraining. Comprehensive results across diverse attack and defense settings demonstrate the state-of-the-art performance of our approach. Code is available at https://github.com/Egg-Hu/Bayesian-Data-Scheduler.

View full details

Poster

The ML.ENERGY Benchmark: Toward Automated Inference Energy Measurement and Optimization

Jae-Won Chung ⋅ Jeff J. Ma ⋅ Ruofan Wu ⋅ Jiachen Liu ⋅ Oh Jun Kweon ⋅ Yuxuan Xia ⋅ Zhiyu Wu ⋅ Mosharaf Chowdhury

Dec 4, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

As the adoption of Generative AI in real-world services grow explosively, energy has emerged as a critical bottleneck resource. However, energy remains a metric that is often overlooked, under-explored, or poorly understood in the context of building ML systems. We present the ML.ENERGY Benchmark, a benchmark suite and tool for measuring inference energy consumption under realistic service environments, and the corresponding ML.ENERGY Leaderboard, which have served as a valuable resource for those hoping to understand and optimize the energy consumption of their generative AI services. In this paper, we explain four key design principles for benchmarking ML energy we have acquired over time, and then describe how they are implemented in the ML.ENERGY Benchmark. We then highlight results from the early 2025 iteration of the benchmark, including energy measurements of 40 widely used model architectures across 6 different tasks, case studies of how ML design choices impact energy consumption, and how automated optimization recommendations can lead to significant (sometimes more than 40%) energy savings without changing what is being computed by the model. The ML.ENERGY Benchmark is open-source and can be easily extended to various customized models and application scenarios.

View full details

Poster

Neptune-X: Active X-to-Maritime Generation for Universal Maritime Object Detection

Yu Guo ⋅ Shengfeng He ⋅ Yuxu Lu ⋅ Haonan An ⋅ Yihang Tao ⋅ Huilin Zhu ⋅ Jingxian Liu ⋅ Yuguang Fang

Dec 4, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Maritime object detection is essential for navigation safety, surveillance, and autonomous operations, yet constrained by two key challenges: the scarcity of annotated maritime data and poor generalization across various maritime attributes (e.g., object category, viewpoint, location, and imaging environment). To address these challenges, we propose Neptune-X, a data-centric generative-selection framework that enhances training effectiveness by leveraging synthetic data generation with task-aware sample selection. From the generation perspective, we develop X-to-Maritime, a multi-modality-conditioned generative model that synthesizes diverse and realistic maritime scenes. A key component is the Bidirectional Object-Water Attention module, which captures boundary interactions between objects and their aquatic surroundings to improve visual fidelity. To further improve downstream tasking performance, we propose Attribute-correlated Active Sampling, which dynamically selects synthetic samples based on their task relevance. To support robust benchmarking, we construct the Maritime Generation Dataset, the first dataset tailored for generative maritime learning, encompassing a wide range of semantic conditions. Extensive experiments demonstrate that our approach sets a new benchmark in maritime scene synthesis, significantly improving detection accuracy, particularly in challenging and previously underrepresented settings. The code is available at https://github.com/gy65896/Neptune-X.

View full details

Poster

EF-3DGS: Event-Aided Free-Trajectory 3D Gaussian Splatting

Bohao Liao ⋅ Wei Zhai ⋅ Zengyu Wan ⋅ Zhixin Cheng ⋅ Wenfei Yang ⋅ Yang Cao ⋅ Tianzhu Zhang ⋅ Zheng-Jun Zha

Dec 4, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Scene reconstruction from casually captured videos has wide real-world applications. Despite recent progress, existing methods relying on traditional cameras tend to fail in high-speed scenarios due to insufficient observations and inaccurate pose estimation. Event cameras, inspired by biological vision, record pixel-wise intensity changes asynchronously with high temporal resolution and low latency, providing valuable scene and motion information in blind inter-frame intervals. In this paper, we introduce the event cameras to aid scene construction from a casually captured video for the first time, and propose Event-Aided Free-Trajectory 3DGS, called EF-3DGS, which seamlessly integrates the advantages of event cameras into 3DGS through three key components. First, we leverage the Event Generation Model (EGM) to fuse events and frames, enabling continuous supervision between discrete frames. Second, we extract motion information through Contrast Maximization (CMax) of warped events, which calibrates camera poses and provides gradient-domain constraints for 3DGS. Third, to address the absence of color information in events, we combine photometric bundle adjustment (PBA) with a Fixed-GS training strategy that separates structure and color optimization, effectively ensuring color consistency across different views. We evaluate our method on the public Tanks and Temples benchmark and a newly collected real-world dataset, RealEv-DAVIS. Our method achieves up to 3dB higher PSNR and 40% lower Absolute Trajectory Error (ATE) compared to state-of-the-art methods under challenging high-speed scenarios.

View full details

Poster

Unbiased Prototype Consistency Learning for Multi-Modal and Multi-Task Object Re-Identification

Zhongao Zhou ⋅ Bin Yang ⋅ Wenke Huang ⋅ Jun Chen ⋅ Mang Ye

Dec 4, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

In object re-identification (ReID) task, both cross-modal and multi-modal retrieval methods have achieved notable progress. However, existing approaches are designed for specific modality and category (person or vehicle) retrieval task, lacking generalizability to others. Acquiring multiple task-specific models would result in wasteful allocation of both training and deployment resources. To address the practical requirements for unified retrieval, we introduce Multi-Modal and Multi-Task object ReID ($\rm {M^3T}$-ReID). The $\rm {M^3T}$-ReID task aims to utilize a unified model to simultaneously achieve retrieval tasks across different modalities and different categories. Specifically, to tackle the challenges of modality distibution divergence and category semantics discrepancy posed in $\rm {M^3T}$-ReID, we design a novel Unbiased Prototype Consistency Learning (UPCL) framework, which consists of two main modules: Unbiased Prototypes-guided Modality Enhancement (UPME) and Cluster Prototype Consistency Regularization (CPCR). UPME leverages modality-unbiased prototypes to simultaneously enhance cross-modal shared features and multi-modal fused features. Additionally, CPCR regulates discriminative semantics learning with category-consistent information through prototypes clustering. Under the collaborative operation of these two modules, our model can simultaneously learn robust cross-modal shared feature and multi-modal fused feature spaces, while also exhibiting strong category-discriminative capabilities. Extensive experiments on multi-modal datasets RGBNT201 and RGBNT100 demonstrates our UPCL framework showcasing exceptional performance for $\rm {M^3T}$-ReID. The code is available at https://github.com/ZhouZhongao/UPCL.

View full details

Poster

ARIA: Training Language Agents with Intention-driven Reward Aggregation

Ruihan Yang ⋅ yikai zhang ⋅ Aili Chen ⋅ Xintao Wang ⋅ Jiangjie Chen ⋅ Siyu Yuan ⋅ Deqing Yang ⋅ Yanghua Xiao

Dec 4, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Large language models (LLMs) have enabled agents to perform complex reasoning and decision-making through free-form language interactions. However, in open-ended language action environments (e.g., negotiation or question-asking games), the action space can be formulated as a joint distribution over tokens, resulting in an extremely large and combinatorial action space. Sampling actions in such a space can lead to extreme reward sparsity, which brings large reward variance, hindering effective reinforcement learning (RL). To address this, we propose **ARIA**, a method that **A**ggregates **R**ewards in **I**ntention space to enable efficient and effective language **A**gents training. ARIA aims to project natural language actions from the high-dimensional joint token distribution space into a low-dimensional intention space, where semantically similar actions are clustered and assigned shared rewards. This intention-aware reward aggregation reduces reward variance by densifying reward signals, fostering efficient and effective policy optimization. Extensive experiments demonstrate that ARIA not only significantly reduces gradient variance, but also delivers substantial performance gains of average 9.95% across four downstream tasks (e.g., negotiation and text-based games), consistently outperforming strong offline and online RL baselines.

View full details

Poster

Online Prediction with Limited Selectivity

Licheng Liu ⋅ Mingda Qiao

Dec 4, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Selective prediction [Dru13, QV19] models the scenario where a forecaster freely decides on the prediction window that their forecast spans. Many data statistics can be predicted to a non-trivial error rate *without any* distributional assumptions or expert advice, yet these results rely on that the forecaster may predict at any time. We introduce a model of Prediction with Limited Selectivity (PLS) where the forecaster can start the prediction only on a subset of the time horizon. We study the optimal prediction error both on an instance-by-instance basis and via an average-case analysis. We introduce a complexity measure that gives instance-dependent bounds on the optimal error. For a randomly-generated PLS instance, these bounds match with high probability.

View full details

Poster

Gaze Beyond the Frame: Forecasting Egocentric 3D Visual Span

Heeseung Yun ⋅ Joonil Na ⋅ Jaeyeon Kim ⋅ Calvin Murdock ⋅ Gunhee Kim

Dec 4, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

People continuously perceive and interact with their surroundings based on underlying intentions that drive their exploration and behaviors. While research in egocentric user and scene understanding has focused primarily on motion and contact-based interaction, forecasting human visual perception itself remains less explored despite its fundamental role in guiding human actions and its implications for AR/VR and assistive technologies. We address the challenge of egocentric 3D visual span forecasting, predicting where a person's visual perception will focus next within their three-dimensional environment. To this end, we propose EgoSpanLift, a novel method that transforms egocentric visual span forecasting from 2D image planes to 3D scenes. EgoSpanLift converts SLAM-derived keypoints into gaze-compatible geometry and extracts volumetric visual span regions. We further combine EgoSpanLift with 3D U-Net and unidirectional transformers, enabling spatio-temporal fusion to efficiently predict future visual span in the 3D grid. In addition, we curate a comprehensive benchmark from raw egocentric multisensory data, creating a testbed with 364.6K samples for 3D visual span forecasting. Our approach outperforms competitive baselines for egocentric gaze anticipation and 3D localization, while achieving comparable results even when projected back onto 2D image planes without additional 2D-specific training.

View full details

Poster

Two Heads are Better than One: Simulating Large Transformers with Small Ones

Hantao Yu ⋅ Josh Alman

Dec 4, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

The quadratic complexity of self‑attention prevents transformers from scaling effectively to long input sequences. On the other hand, modern GPUs and other specialized hardware accelerators are well-optimized for processing small input sequences in transformers during both training and inference. A natural question arises: can we take advantage of the efficiency of small transformers to deal with long input sequences? In this paper, we show that transformers with long input sequences (large transformers) can be efficiently simulated by transformers that can only take short input sequences (small transformers). Specifically, we prove that any transformer with input length $N$ can be efficiently simulated by only $O((N/M)^2)$ transformers with input length $M \ll N$, and that this cannot be improved in the worst case. However, we then prove that in various natural scenarios including average-case inputs, sliding window masking and attention sinks, the optimal number $O(N/M)$ of small transformers suffice.

View full details

Poster

FlexOLMo: Open Language Models for Flexible Data Use

Weijia Shi ⋅ Akshita Bhagia ⋅ Kevin Farhat ⋅ Niklas Muennighoff ⋅ Jacob Morrison ⋅ Evan Walsh ⋅ Dustin Schwenk ⋅ Shayne Longpre ⋅ Jake Poznanski ⋅ Allyson Ettinger ⋅ Daogao Liu ⋅ Margaret Li ⋅ Mike Lewis ⋅ Scott Yih ⋅ Dirk Groeneveld ⋅ Luca Soldaini ⋅ Kyle Lo ⋅ Noah Smith ⋅ Luke Zettlemoyer ⋅ Pang Wei Koh ⋅ Hanna Hajishirzi ⋅ Ali Farhadi ⋅ Sewon Min

Dec 4, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

We introduce FlexOLMo, a new class of language models (LMs) that supports (1) distributed training without data sharing, where different model parameters are independently trained on private datasets, and (2) data-flexible inference, where these parameters along with their associated data can be easily included or excluded from model inferences with no further training. FlexOLMo employs a mixture-of-experts (MoE) architecture where each expert is trained independently on private datasets and later integrated through a new nonparametric routing without any joint training across datasets. FlexOLMo is trained on FLEXMIX, a corpus we curate comprising seven restricted sets, either real or realistic approximations, alongside publicly available datasets. We evaluate models with up to 37 billion parameters (20 billion active) on 31 diverse downstream tasks. We show that a general expert trained on public data can be effectively combined with independently trained experts from other data owners significantly benefiting from these restricted sets (an average 41% relative improvement) while allowing flexible opt-out at inference time (e.g., for users without appropriate licenses or permissions). Our approach also outperforms prior model merging methods by 10.1% on average and surpasses the standard MoE trained without data restrictions using the same training FLOPs. Altogether, FlexOLMo enables training on restricted data while keeping data local and supports fine-grained control of data access at inference.

View full details

Poster

T-REGS: Minimum Spanning Tree Regularization for Self-Supervised Learning

Julie Mordacq ⋅ David Loiseaux ⋅ Vicky Kalogeiton ⋅ Steve OUDOT

Dec 4, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Self-supervised learning (SSL) has emerged as a powerful paradigm for learning representations without labeled data, often by enforcing invariance to input transformations such as rotations or blurring. Recent studies have highlighted two pivotal properties for effective representations: (i) avoiding dimensional collapse-where the learned features occupy only a low-dimensional subspace, and (ii) enhancing uniformity of the induced distribution. In this work, we introduce T-REGS, a simple regularization framework for SSL based on the length of the Minimum Spanning Tree (MST) over the learned representation. We provide theoretical analysis demonstrating that T-REGS simultaneously mitigates dimensional collapse and promotes distribution uniformity on arbitrary compact Riemannian manifolds. Several experiments on synthetic data and on classical SSL benchmarks validate the effectiveness of our approach at enhancing representation quality.

View full details

Poster

On the Universal Near Optimality of Hedge in Combinatorial Settings

Zhiyuan Fan ⋅ Arnab Maiti ⋅ Lillian Ratliff ⋅ Kevin Jamieson ⋅ Gabriele Farina

Dec 4, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

In this paper, we study the classical Hedge algorithm in combinatorial settings. In each round, the learner selects a vector $\mathbf{x}_t$ from a set $\mathcal{X} \subseteq$ {$0,1$}$^d$, observes a full loss vector $\mathbf{y}_t \in \mathbb{R}^d$, and incurs a loss $\langle \mathbf{x}_t, \mathbf{y}_t \rangle \in [-1,1]$. This setting captures several important problems, including extensive-form games, resource allocation, $m$-sets, online multitask learning, and shortest-path problems on directed acyclic graphs (DAGs). It is well known that Hedge achieves a regret of $\mathcal{O}\big(\sqrt{T \log |\mathcal{X}|}\big)$ after $T$ rounds of interaction. In this paper, we ask whether Hedge is optimal across all combinatorial settings. To that end, we show that for any $\mathcal{X} \subseteq$ {$0,1$}$^d$, Hedge is near-optimal—specifically, up to a $\sqrt{\log d}$ factor—by establishing a lower bound of $\Omega\big(\sqrt{T \log(|\mathcal{X}|)/\log d}\big)$ that holds for any algorithm. We then identify a natural class of combinatorial sets—namely, $m$-sets with $\log d \leq m \leq \sqrt{d}$—for which this lower bound is tight, and for which Hedge is provably suboptimal by a factor of exactly $\sqrt{\log d}$. At the same time, we show that Hedge is optimal for online multitask learning, a generalization of the classical $K$-experts problem. Finally, we leverage the near-optimality of Hedge to establish the existence of a near-optimal regularizer for online shortest-path problems in DAGs—a setting that subsumes a broad range of combinatorial domains. Specifically, we show that the classical Online Mirror Descent (OMD) algorithm, when instantiated with the dilated entropy regularizer, is iterate-equivalent to Hedge, and therefore inherits its near-optimal regret guarantees for DAGs.

View full details

Poster

Chain-of-Zoom: Extreme Super-Resolution via Scale Autoregression and Preference Alignment

Bryan Sangwoo Kim ⋅ Jeongsol Kim ⋅ Jong Chul Ye

Dec 4, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Modern single-image super-resolution (SISR) models deliver photo-realistic results at the scale factors on which they are trained, but collapse when asked to magnify far beyond that regime. We address this scalability bottleneck with Chain-of-Zoom (CoZ), a model-agnostic framework that factorizes SISR into an autoregressive chain of intermediate scale-states with multi-scale-aware prompts. CoZ repeatedly re-uses a backbone SR model, decomposing the conditional probability into tractable sub-problems to achieve extreme resolutions without additional training. Because visual cues diminish at high magnifications, we augment each zoom step with multi-scale-aware text prompts generated by a vision-language model (VLM). The prompt extractor itself is fine-tuned using Generalized Reward Policy Optimization (GRPO) with a critic VLM, aligning text guidance towards human preference. Experiments show that a standard $4\times$ diffusion SR model wrapped in CoZ attains beyond $256\times$ enlargement with high perceptual quality and fidelity.

View full details

Poster

MedChain: Bridging the Gap Between LLM Agents and Clinical Practice with Interactive Sequence

Jie Liu ⋅ Wenxuan Wang ⋅ Zizhan Ma ⋅ Guolin Huang ⋅ Yihang SU ⋅ Kao-Jung Chang ⋅ Haoliang Li ⋅ Linlin Shen ⋅ Michael R Lyu ⋅ Wenting Chen

Dec 4, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Clinical decision making (CDM) is a complex, dynamic process crucial to healthcare delivery, yet it remains a significant challenge for artificial intelligence systems. While Large Language Model (LLM)-based agents have been tested on general medical knowledge using licensing exams and knowledge question-answering tasks, their performance in the CDM in real-world scenarios is limited due to the lack of comprehensive benchmark that mirror actual medical practice. To address this gap, we present MedChain, a dataset of 12,163 clinical cases that covers five key stages of clinical workflow. MedChain distinguishes itself from existing benchmarks with three key features of real-world clinical practice: personalization, interactivity, and sequentiality. Further, to tackle real-world CDM challenges, we also propose MedChain-Agent, an AI system that integrates a feedback mechanism and a MedCase-RAG module to learn from previous cases and adapt its responses. MedChain-Agent demonstrates remarkable adaptability in gathering information dynamically and handling sequential clinical tasks, significantly outperforming existing approaches. The relevant dataset and code will be released upon acceptance of this paper.

View full details

Poster

Nonlinear Laplacians: Tunable principal component analysis under directional prior information

Yuxin Ma ⋅ Dmitriy Kunisky

Dec 4, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

We introduce a new family of algorithms for detecting and estimating a rank-one signal from a noisy observation under prior information about that signal's direction, focusing on examples where the signal is known to have entries biased to be positive. Given a matrix observation $\mathbf{Y}$, our algorithms construct a *nonlinear Laplacian*, another matrix of the form $\mathbf{Y} + \mathrm{diag}(\sigma(\mathbf{Y1}))$ for a nonlinear $\sigma: \mathbb{R} \to \mathbb{R}$, and examine the top eigenvalue and eigenvector of this matrix. When $\mathbf{Y}$ is the (suitably normalized) adjacency matrix of a graph, our approach gives a class of algorithms that search for unusually dense subgraphs by computing a spectrum of the graph "deformed" by the degree profile $\mathbf{Y1}$. We study the performance of such algorithms compared to direct spectral algorithms (the case $\sigma = 0$) on models of sparse principal component analysis with biased signals, including the Gaussian planted submatrix problem. For such models, we rigorously characterize the strength of rank-one signal, as a function of the nonlinearity $\sigma$, required for an outlier eigenvalue to appear in the spectrum of a nonlinear Laplacian matrix. While identifying the $\sigma$ that minimizes the required signal strength in closed form seems intractable, we explore three approaches to design $\sigma$ numerically: exhaustively searching over simple classes of $\sigma$, learning $\sigma$ from datasets of problem instances, and tuning $\sigma$ using black-box optimization of the critical signal strength. We find both theoretically and empirically that, if $\sigma$ is chosen appropriately, then nonlinear Laplacian spectral algorithms substantially outperform direct spectral algorithms, while retaining the conceptual simplicity of spectral methods compared to broader classes of computations like approximate message passing or general first order methods.

View full details

Poster

ReCon: Region-Controllable Data Augmentation with Rectification and Alignment for Object Detection

Haowei Zhu ⋅ Tianxiang Pan ⋅ Rui Qin ⋅ Jun-Hai Yong ⋅ Bin Wang

Dec 4, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

The scale and quality of datasets are crucial for training robust perception models. However, obtaining large-scale annotated data is both costly and time-consuming. Generative models have emerged as a powerful tool for data augmentation by synthesizing samples that adhere to desired distributions. However, current generative approaches often rely on complex post-processing or extensive fine-tuning on massive datasets to achieve satisfactory results, and they remain prone to content–position mismatches and semantic leakage. To overcome these limitations, we introduce ReCon, a novel augmentation framework that enhances the capacity of structure-controllable generative models for object detection. ReCon integrates region-guided rectification into the diffusion sampling process, using feedback from a pre-trained perception model to rectify misgenerated regions within diffusion sampling process. We further propose region-aligned cross-attention to enforce spatial–semantic alignment between image regions and their textual cues, thereby improving both semantic consistency and overall image fidelity. Extensive experiments demonstrate that ReCon substantially improve the quality and trainability of generated data, achieving consistent performance gains across various datasets, backbone architectures, and data scales.

View full details

Poster

Diversity-Aware Policy Optimization for Large Language Model Reasoning

Jian Yao ⋅ Ran Cheng ⋅ Xingyu Wu ⋅ Jibin Wu ⋅ KC Tan

Dec 4, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

The reasoning capabilities of large language models (LLMs) have advanced rapidly, particularly following the release of DeepSeek-R1, which has inspired a surge of research into data quality and reinforcement learning (RL) algorithms. Despite the pivotal role diversity plays in RL, its influence on LLM reasoning remains largely underexplored. To bridge this gap, this work presents a systematic investigation into the impact of diversity in RL-based training for LLM reasoning, and proposes a novel diversity-aware policy optimization method. Across evaluations on 12 LLMs, we observe a strong positive correlation between the solution diversity and potential@k (a novel metric quantifying an LLM’s reasoning potential) in high-performing models. This finding motivates our method to explicitly promote diversity during RL training. Specifically, we design a token-level diversity and reformulate it into a practical objective, then we selectively apply it to positive samples. Integrated into the R1-zero training framework, our method achieves a 3.5\% average improvement across four mathematical reasoning benchmarks, while generating more diverse and robust solutions.

View full details

Poster

Data-Juicer 2.0: Cloud-Scale Adaptive Data Processing for and with Foundation Models

Daoyuan Chen ⋅ Yilun Huang ⋅ Xuchen Pan ⋅ Jiang Nana ⋅ Haibin Wang ⋅ Yilei Zhang ⋅ Ce Ge ⋅ Yushuo Chen ⋅ Wenhao Zhang ⋅ Zhijian Ma ⋅ Jun Huang ⋅ Wei Lin ⋅ Yaliang Li ⋅ Bolin Ding ⋅ Jingren Zhou

Dec 4, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Foundation models demand advanced data processing for their vast, multimodal datasets.However, traditional frameworks struggle with the unique complexities of multimodal data.In response, we present Data-Juicer 2.0, a data processing system backed by 100+ data processing operators spanning text, image, video, and audio modalities, supporting more critical tasks including data analysis, synthesis, annotation, and foundation model post-training.With seamless compatibility and dedicated optimization for popular dataset hubs like Hugging Face and computing engines like Ray, it improves upon its predecessor in terms of usability, efficiency, and programmability.It features an easily accessible user interface layer that supports decoupled Python interactions, RESTful APIs, and conversational commands. Its new runtime layer offers adaptive execution across diverse scales and environments, abstracting away system complexities.Extensive empirical evaluations demonstrate Data-Juicer 2.0's remarkable performance and scalability, highlighting its capability to efficiently process TB-level data with 10k+ CPU cores. The system is publicly available and has been widely adopted in diverse research fields and real-world products such as Alibaba Cloud PAI. We actively maintain the system and share practical insights to foster research and applications of next-generation foundation models.

View full details

Poster

Low-degree evidence for computational transition of recovery rate in stochastic block model

Jingqiu Ding ⋅ Yiding Hua ⋅ Lucas Slot ⋅ David Steurer

Dec 4, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

We investigate implications of the (extended) low-degree conjecture (recently formalized in [moitra et al2023]) in the context of the symmetric stochastic block model. Assuming the conjecture holds, we establish that no polynomial-time algorithm can weakly recover community labels below the Kesten-Stigum (KS) threshold. In particular, we rule out polynomial-time estimators that, with constant probability, achieve $n^{-0.49}$ correlation with the true communities. Whereas, above the KS threshold, polynomial-time algorithms are known to achieve constant correlation with the true communities with high probability [massoulie et al 2014,abbe et al 2015]. To our knowledge, we provide the first rigorous evidence for such sharp transition in recovery rate for polynomial-time algorithms at the KS threshold. Notably, under a stronger version of the low-degree conjecture, our lower bound remains valid even when the number of blocks diverges. Furthermore, our results provide evidence of a computational-to-statistical gap in learning the parameters of stochastic block models. In contrast, prior work either (i) rules out polynomial-time algorithms with $1 - o(1)$ success probability [Hopkins 18, bandeira et al 2021] under the low-degree conjecture, or (ii) degree-$\text{poly}(k)$ polynomials for learning the stochastic block model [Luo et al 2023]. For this, we design a hypothesis test which succeeeds with constant probability under symmetric stochastic block model, and $1-o(1)$ probability under the distribution of \Erdos \Renyi random graphs. Our proof combines low-degree lower bounds from [Hopkins 18, bandeira et al 2021] with graph splitting and cross-validation techniques. In order to rule out general recovery algorithms, we employ the correlation preserving projection method developed in [Hopkins et al 17].

View full details

Poster

Cycle-Sync: Robust Global Camera Pose Estimation through Enhanced Cycle-Consistent Synchronization

Shaohan Li ⋅ Yunpeng Shi ⋅ Gilad Lerman

Dec 4, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

We introduce Cycle-Sync, a robust and global framework for estimating camera poses (both rotations and locations). Our core innovation is a location solver that adapts message-passing least squares (MPLS) - originally developed for group synchronization - to the camera localization setting. We modify MPLS to emphasize cycle-consistent information, redefine cycle consistencies using estimated distances from previous iterations, and incorporate a Welsch-type robust loss. We establish the strongest known deterministic exact-recovery guarantee for camera location estimation, demonstrating that cycle consistency alone enables the lowest sample complexity to date. To further boost robustness, we introduce a plug-and-play outlier rejection module inspired by robust subspace recovery, and we fully integrate cycle consistency into MPLS for rotation averaging. Our global approach avoids the need for bundle adjustment. Experiments on synthetic and real datasets show that Cycle-Sync consistently outperforms leading pose estimators, including full structure-from-motion pipelines with bundle adjustment.

View full details

Poster

Diffusion-Based Hierarchical Graph Neural Networks for Simulating Nonlinear Solid Mechanics

Tobias Würth ⋅ Niklas Freymuth ⋅ Gerhard Neumann ⋅ Luise Kärger

Dec 4, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Graph-based learned simulators have emerged as a promising approach for simulating physical systems on unstructured meshes, offering speed and generalization across diverse geometries. However, they often struggle with capturing global phenomena, such as bending or long-range correlations usually occurring in solid mechanics, and suffer from error accumulation over long rollouts due to their reliance on local message passing and direct next-step prediction. We address these limitations by introducing the Rolling Diffusion-Batched Inference Network (ROBIN), a novel learned simulator that integrates two key innovations: (i) Rolling Diffusion-Batched Inference (ROBI), a parallelized inference scheme that amortizes the cost of diffusion-based refinement across physical time steps by overlapping denoising steps across a temporal window. (ii) A Hierarchical Graph Neural Network built on algebraic multigrid coarsening, enabling multiscale message passing across different mesh resolutions. This architecture, implemented via Algebraic-hierarchical Message Passing Networks, captures both fine-scale local dynamics and global structural effects critical for phenomena like beam bending or multi-body contact. We validate ROBIN on challenging 2D and 3D solid mechanics benchmarks involving geometric, material, and contact nonlinearities. ROBIN achieves state-of-the-art accuracy on all tasks, substantially outperforming existing next-step learned simulators while reducing inference time by up to an order of magnitude compared to standard diffusion simulators.

View full details

Poster

Generalizable Insights for Graph Transformers in Theory and Practice

Timo Stoll ⋅ Luis Müller ⋅ Christopher Morris

Dec 4, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Graph Transformers (GTs) have shown strong empirical performance, yet current architectures vary widely in their use of attention mechanisms, positional embeddings (PEs), and expressivity. Existing expressivity results are often tied to specific design choices and lack comprehensive empirical validation on large-scale data. This leaves a gap between theory and practice, preventing generalizable insights that exceed particular application domains. Here, we propose the Generalized-Distance Transformer (GDT), a GT architecture using standard attention that incorporates many advancements for GTs from recent years, and develop a fine-grained understanding of the GDT's representation power in terms of attention and PEs. Through extensive experiments, we identify design choices that consistently perform well across various applications, tasks, and model scales, demonstrating strong performance in a few-shot transfer setting without fine-tuning. Our evaluation covers over eight million graphs with roughly 270M tokens across diverse domains, including image-based object detection, molecular property prediction, code summarization, and out-of-distribution algorithmic reasoning. We distill our theoretical and practical findings into several generalizable insights about effective GT design, training, and inference.

View full details

Poster

FUDOKI: Discrete Flow-based Unified Understanding and Generation via Kinetic-Optimal Velocities

Jin Wang ⋅ Yao Lai ⋅ Aoxue Li ⋅ Shifeng Zhang ⋅ Jiacheng Sun ⋅ Ning Kang ⋅ Chengyue Wu ⋅ Zhenguo Li ⋅ Ping Luo

Dec 4, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

The rapid progress of large language models (LLMs) has catalyzed the emergence of multimodal large language models (MLLMs) that unify visual understanding and image generation within a single framework. However, most existing MLLMs rely on autoregressive (AR) architectures, which impose inherent limitations on future development, such as the raster-scan order in image generation and restricted reasoning abilities in causal context modeling. In this work, we challenge the dominance of AR-based approaches by introducing FUDOKI, a unified multimodal model purely based on discrete flow matching, as an alternative to conventional AR paradigms. By leveraging metric-induced probability paths with kinetic optimal velocities, our framework goes beyond the previous masking-based corruption process, enabling iterative refinement with self-correction capability and richer bidirectional context integration during generation. To mitigate the high cost of training from scratch, we initialize FUDOKI from pre-trained AR-based MLLMs and adaptively transition to the discrete flow matching paradigm. Experimental results show that FUDOKI achieves performance comparable to state-of-the-art AR-based MLLMs across both visual understanding and image generation tasks, highlighting its potential as a foundation for next-generation unified multimodal models. Furthermore, we show that applying test-time scaling techniques to FUDOKI yields significant performance gains, further underscoring its promise for future enhancement through reinforcement learning.

View full details

Poster

Zero-shot Denoising via Neural Compression: Theoretical and algorithmic framework

Ali Zafari ⋅ Xi Chen ⋅ Shirin Jalali

Dec 4, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Zero-shot denoising aims to denoise observations without access to training samples or clean reference images. This setting is particularly relevant in practical imaging scenarios involving specialized domains such as medical imaging or biology. In this work, we propose the *Zero-Shot Neural Compression Denoiser* (ZS-NCD), a novel denoising framework based on neural compression. ZS-NCD treats a neural compression network as an untrained model, optimized directly on patches extracted from a single noisy image. The final reconstruction is then obtained by aggregating the outputs of the trained model over overlapping patches. Thanks to the built-in entropy constraints of compression architectures, our method naturally avoids overfitting and does not require manual regularization or early stopping. Through extensive experiments, we show that ZS-NCD achieves state-of-the-art performance among zero-shot denoisers for both Gaussian and Poisson noise, and generalizes well to both natural and non-natural images. Additionally, we provide new finite-sample theoretical results that characterize upper bounds on the achievable reconstruction error of general maximum-likelihood compression-based denoisers. These results further establish the theoretical foundations of compression-based denoising. Our code is available at: https://github.com/Computational-Imaging-RU/ZS-NCDenoiser.

View full details

Poster

Multi-Agent Learning under Uncertainty: Recurrence vs. Concentration

Kyriakos Lotidis ⋅ Panayotis Mertikopoulos ⋅ Nicholas Bambos ⋅ Jose Blanchet

Dec 4, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

In this paper, we examine the convergence landscape of multi-agent learning under uncertainty. Specifically, we analyze two stochastic models of regularized learning in continuous games—one in continuous and one in discrete time—with the aim of characterizing the long run behavior of the induced sequence of play. In stark contrast to deterministic, full-information models of learning (or models with a vanishing learning rate), we show that the resulting dynamics do not converge in general. In lieu of this, we ask instead which actions are played more often in the long run, and by how much. We show that, in strongly monotone games, the dynamics of regularized learning may wander away from equilibrium infinitely often, but they always return to its vicinity in finite time (which we estimate), and their long-run distribution is sharply concentrated around a neighborhood thereof. We quantify the degree of this concentration, and we show that these favorable properties may all break down if the underlying game is not strongly monotone—underscoring in this way the limits of regularized learning in the presence of persistent randomness and uncertainty

View full details

Poster

EraseFlow: Learning Concept Erasure Policies via GFlowNet-Driven Alignment

Naga Sai Abhiram Kusumba ⋅ Maitreya Patel ⋅ Kyle Min ⋅ Changhoon Kim ⋅ Chitta Baral ⋅ Yezhou Yang

Dec 4, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Erasing harmful or proprietary concepts from powerful text‑to‑image generators is an emerging safety requirement, yet current ``concept erasure'' techniques either collapse image quality, rely on brittle adversarial losses, or demand prohibitive retraining cycles. We trace these limitations to a myopic view of the denoising trajectories that govern diffusion‑based generation. We introduce EraseFlow, the first framework that casts concept unlearning as exploration in the space of denoising paths and optimizes it with a GFlowNets equipped with the trajectory‑balance objective. By sampling entire trajectories rather than single end states, EraseFlow learns a stochastic policy that steers generation away from target concepts while preserving the model’s prior. EraseFlow eliminates the need for carefully crafted reward models and by doing this, it generalizes effectively to unseen concepts and avoids hackable rewards while improving the performance. Extensive empirical results demonstrate that EraseFlow outperforms existing baselines and achieves an optimal trade-off between performance and prior preservation.

View full details

Poster

Comparator-Adaptive $\Phi$-Regret: Improved Bounds, Simpler Algorithms, and Applications to Games

Soumita Hait ⋅ Ping Li ⋅ Haipeng Luo ⋅ Mengxiao Zhang

Dec 4, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

In the classic expert problem, $\Phi$-regret measures the gap between the learner's total loss and that achieved by applying the best action transformation $\phi \in \Phi$. A recent work by Lu et al., [2025] introduced an adaptive algorithm whose regret against a comparator $\phi$ depends on a certain sparsity-based complexity measure of $\phi$, recovering and interpolating optimal bounds for standard regret notions such as external, internal, and swap regret. In this work, we propose a general idea to achieve an even better comparator-adaptive $\Phi$-regret bound via much simpler algorithms compared to Lu et al., [2025]. Specifically, we discover a prior distribution over all possible binary transformations and show that it suffices to achieve prior-dependent regret against these transformations. Then, we propose two concrete and efficient algorithms to achieve so, where the first one combines multiple copies of the kernelized Hedge algorithm of Farina et al., [2022], and the second one combines multiple copies of a variant of the BM-reduction [Blum and Mansour, 2007]. To further showcase the power of our methods and the advantages over Lu et al., [2025] besides the simplicity and better regret bounds, we also show that our second approach can be extended to the game setting to achieve accelerated and adaptive convergence rate to $\Phi$-equilibria for a class of general-sum games. When specified to the special case of correlated equilibria, our bound improves over the existing ones from Anagnostides et al., [2022a,b].

View full details

Poster

Compositional Monte Carlo Tree Diffusion for Extendable Planning

Jaesik Yoon ⋅ Hyeonseo Cho ⋅ Sungjin Ahn

Dec 4, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Monte Carlo Tree Diffusion (MCTD) integrates diffusion models with structured tree search to enable effective trajectory exploration through stepwise reasoning. However, MCTD remains fundamentally limited by training trajectory lengths. While periodic replanning allows plan concatenation for longer plan generation, the planning process remains locally confined, as MCTD searches within individual trajectories without access to global context. We propose Compositional Monte Carlo Tree Diffusion (C-MCTD), a framework that elevates planning from individual trajectory optimization to reasoning over complete plan compositions. C-MCTD introduces three complementary components: (1) Online Composer, which performs globally-aware planning by searching across entire plan compositions; (2) Distributed Composer, which reduces search complexity through parallel exploration from multiple starting points; and (3) Preplan Composer, which accelerates inference by leveraging cached plan graphs.

View full details

Poster

The Graphon Limit Hypothesis: Understanding Neural Network Pruning via Infinite Width Analysis

Hoang Pham ⋅ The Anh Ta ⋅ Tom Jacobs ⋅ Rebekka Burkholz ⋅ Long Tran-Thanh

Dec 4, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Sparse neural networks promise efficiency, yet training them effectively remains a fundamental challenge. Despite advances in pruning methods that create sparse architectures, understanding why some sparse structures are better trainable than others with the same level of sparsity remains poorly understood. Aiming to develop a systematic approach to this fundamental problem, we propose a novel theoretical framework based on the theory of graph limits, particularly graphons, that characterizes sparse neural networks in the infinite-width regime. Our key insight is that connectivity patterns of sparse neural networks induced by pruning methods converge to specific graphons as networks' width tends to infinity, which encodes implicit structural biases of different pruning methods. We postulate the *Graphon Limit Hypothesis* and provide empirical evidence to support it. Leveraging this graphon representation, we derive a *Graphon Neural Tangent Kernel (Graphon NTK)* to study the training dynamics of sparse networks in the infinite width limit. Graphon NTK provides a general framework for the theoretical analysis of sparse networks. We empirically show that the spectral analysis of Graphon NTK correlates with observed training dynamics of sparse networks, explaining the varying convergence behaviours of different pruning methods. Our framework provides theoretical insights into the impact of connectivity patterns on the trainability of various sparse network architectures.

View full details

Poster

Improving LLM General Preference Alignment via Optimistic Online Mirror Descent

Yuheng Zhang ⋅ Dian Yu ⋅ Tao Ge ⋅ Linfeng Song ⋅ Zhichen Zeng ⋅ Haitao Mi ⋅ Nan Jiang ⋅ Dong Yu

Dec 4, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Reinforcement learning from human feedback (RLHF) has demonstrated remarkable effectiveness in aligning large language models (LLMs) with human preferences. Many existing alignment approaches rely on the Bradley-Terry (BT) model assumption, which assumes the existence of a ground-truth reward for each prompt-response pair. However, this assumption can be overly restrictive when modeling complex human preferences. In this paper, we drop the BT model assumption and study LLM alignment under general preferences, formulated as a two-player game. Drawing on theoretical insights from learning in games, we integrate optimistic online mirror descent into our alignment framework to approximate the Nash policy. Theoretically, we demonstrate that our approach achieves an $\mathcal{O}(T^{-1})$ bound on the duality gap, improving upon the previous $\mathcal{O}(T^{-1/2})$ result. Meanwhile, it enjoys a linear convergence rate in the last iterate, a property not achieved by previous methods. More importantly, we implement our method and show through experiments that it outperforms state-of-the-art RLHF algorithms across multiple representative benchmarks.

View full details

Poster

Can We Infer Confidential Properties of Training Data from LLMs?

Pengrun Huang ⋅ Chhavi Yadav ⋅ Kamalika Chaudhuri ⋅ Ruihan Wu

Dec 4, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Large language models (LLMs) are increasingly fine-tuned on domain-specific datasets to support applications in fields such as healthcare, finance, and law. These fine-tuning datasets often have sensitive and confidential dataset-level properties — such as patient demographics or disease prevalence—that are not intended to be revealed. While prior work has studied property inference attacks on discriminative models (e.g., image classification models) and generative models (e.g., GANs for image data), it remains unclear if such attacks transfer to LLMs. In this work, we introduce PropInfer, a benchmark task for evaluating property inference in LLMs under two fine-tuning paradigms: question-answering and chat-completion. Built on the ChatDoctor dataset, our benchmark includes a range of property types and task configurations. We further propose two tailored attacks: a prompt-based generation attack and a shadow-model attack leveraging word frequency signals. Empirical evaluations across multiple pretrained LLMs show the success of our attacks, revealing a previously unrecognized vulnerability in LLMs.

View full details

Poster

Convergence Rates of Constrained Expected Improvement

Haowei Wang ⋅ Jingyi Wang ⋅ Zhongxiang Dai ⋅ Nai-Yuan Chiang ⋅ Szu Hui Ng ⋅ Cosmin Petra

Dec 4, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Constrained Bayesian optimization (CBO) methods have seen significant success in black-box optimization with constraints. One of the most commonly used CBO methods is the constrained expected improvement (CEI) algorithm. CEI is a natural extension of expected improvement (EI) when constraints are incorporated. However, the theoretical convergence rate of CEI has not been established. In this work, we study the convergence rate of CEI by analyzing its simple regret upper bound. First, we show that when the objective function $f$ and constraint function $c$ are assumed to each lie in a reproducing kernel Hilbert space (RKHS), CEI achieves the convergence rates of $\mathcal{O} \left(t^{-\frac{1}{2}}\log^{\frac{d+1}{2}}(t) \right) \ \text{and }\ \mathcal{O}\left(t^{\frac{-\nu}{2\nu+d}} \log^{\frac{\nu}{2\nu+d}}(t)\right)$ for the commonly used squared exponential and Matérn kernels, respectively. Second, we show that when $f$ is assumed to be sampled from Gaussian processes (GPs), CEI achieves similar convergence rates with a high probability. Numerical experiments are performed to validate the theoretical analysis.

View full details

Poster

Reasoning Gym: Reasoning Environments for Reinforcement Learning with Verifiable Rewards

Zafir Stojanovski ⋅ Oliver Stanley ⋅ Joe Sharratt ⋅ Richard Jones ⋅ Abdulhakeem Adefioye ⋅ Jean Kaddour ⋅ Andreas Köpf

Dec 4, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

We introduce Reasoning Gym, a library of reasoning environments for reinforcement learning with verifiable rewards (RLVR). It provides over 100 tasks spanning multiple domains including algebra, arithmetic, computation, cognition, geometry, graph theory, logic, and various common games. Its key innovation is the ability to generate virtually infinite training data with adjustable complexity, unlike most previous reasoning datasets, which are typically fixed. This procedural generation approach allows for continuous evaluation across varying difficulty levels and task configurations. Our experimental results demonstrate the efficacy of Reasoning Gym in both evaluating and reinforcement learning of reasoning models.

View full details

Poster

Scaling Unlocks Broader Generation and Deeper Functional Understanding of Proteins

Aadyot Bhatnagar ⋅ Sarthak Jain ⋅ Joel Beazer ⋅ Samuel Curran ⋅ Alexander Hoffnagle ⋅ Kyle Ching ⋅ Michael Martyn ⋅ Stephen Nayfach ⋅ Jeffrey Ruffolo ⋅ Ali Madani

Dec 4, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Generative protein language models (PLMs) are powerful tools for designing proteins purpose-built to solve problems in medicine, agriculture, and industrial processes. Recent work has trained ever larger language models, but there has been little systematic study of the optimal training distributions and the influence of model scale on the sequences generated by PLMs. We introduce the ProGen3 family of sparse generative PLMs, and we develop compute-optimal scaling laws to scale up to a 46B-parameter model pre-trained on 1.5T amino acid tokens. ProGen3's pre-training data is sampled from an optimized data distribution over the PPA v1, a carefully curated dataset of 3.4B full-length proteins. We evaluate for the first time in the wet lab the influence of model scale on the sequences generated by PLMs, and we find that larger models generate viable proteins for a much wider diversity of protein families. Finally, we find both computationally and experimentally that larger models are more responsive to alignment with laboratory data, resulting in improved protein fitness prediction and sequence generation capabilities. These results indicate that larger PLMs like ProGen3-46B trained on larger, well-curated datasets are powerful foundation models that push the frontier of protein design.

View full details

Poster

Is the acquisition worth the cost? Surrogate losses for Consistent Two-stage Classifiers

florence regol ⋅ Joseph Cotnareanu ⋅ Theodore Glavas ⋅ Mark Coates

Dec 4, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Recent years have witnessed the emergence of a spectrum of foundation models, covering a broad range of capabilities and costs. Often, we effectively use foundation models as feature generators and train classifiers that use the outputs of these models to make decisions. In this paper, we consider an increasingly relevant setting where we have two classifier stages. The first stage has access to features $x$ and has the option to make a classification decision or defer, while incurring a cost, to a second classifier that has access to features $x$ and $z$. This is similar to the ``learning to defer'' setting, with the important difference that we train both classifiers jointly, and the second classifier has access to more information. The natural loss for this setting is an $\ell_{01c}$ loss, where a penalty is paid for incorrect classification, as in $\ell_{01}$, but an additional penalty $c$ is paid for consulting the second classifier. The $\ell_{01c}$ loss is unwieldy for training. Our primary contribution in this paper is the derivation of a hinge-based surrogate loss $\ell^c_{hinge}$ that is much more amenable to training but also satisfies the property that $\ell^c_{hinge}$-consistency implies $\ell_{01c}$-consistency.

View full details

Poster

Solving Neural Min-Max Games: The Role of Architecture, Initialization & Dynamics

Deep Patel ⋅ Emmanouil-Vasileios Vlatakis-Gkaragkounis

Dec 4, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Many emerging applications—such as adversarial training, AI alignment, and robust optimization—can be framed as zero-sum games between neural nets, with von Neumann–Nash equilibria (NE) capturing the desirable system behavior. While such games often involve non-convex non-concave objectives, empirical evidence shows that simple gradient methods frequently converge, suggesting a hidden geometric structure. In this paper, we provide a theoretical framework that explains this phenomenon through the lens of \emph{hidden convexity} and \emph{overparameterization}. We identify sufficient conditions spanning initialization, training dynamics, and network width—that guarantee global convergence to a NE in a broad class of non-convex min-max games. To our knowledge, this is the first such result for games that involve two-layer neural networks. Technically, our approach is twofold: (a) we derive a novel path-length bound for alternating gradient-descent-ascent scheme in min-max games; and (b) we show that games with hidden convex–concave geometry reduce to settings satisfying two-sided Polyak–Łojasiewicz (PL) and smoothness conditions, which hold with high probability under overparameterization, using tools from random matrix theory.

View full details

Poster

Structured Linear CDEs: Maximally Expressive and Parallel-in-Time Sequence Models

Benjamin Walker ⋅ Lingyi Yang ⋅ Nicola Muca Cirone ⋅ Cristopher Salvi ⋅ Terry Lyons

Dec 4, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

This work introduces Structured Linear Controlled Differential Equations (SLiCEs), a unifying framework for sequence models with structured, input-dependent state-transition matrices that retain the maximal expressivity of dense matrices whilst being cheaper to compute. The framework encompasses existing architectures, such as input-dependent block-diagonal linear recurrent neural networks and DeltaNet's diagonal-plus-low-rank structure, as well as two novel variants based on sparsity and the Walsh-Hadamard transform. We prove that, unlike the diagonal state-transition matrices of S4D and Mamba, SLiCEs employing block-diagonal, sparse, or Walsh-Hadamard matrices match the maximal expressivity of dense matrices. Empirically, SLiCEs solve the $A_5$ state-tracking benchmark with a single layer, achieve best-in-class length generalisation on regular language tasks among parallel-in-time models, and match the performance of log neural controlled differential equations on six multivariate time-series classification datasets while cutting the average time per training step by a factor of twenty.

View full details

Poster

HawkBench: Investigating Resilience of RAG Methods on Stratified Information-Seeking Tasks

Hongjin Qian ⋅ Zheng Liu ⋅ Chao Gao ⋅ Yankai Wang ⋅ Defu Lian ⋅ Zhicheng Dou

Dec 4, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

In real-world information-seeking scenarios, users have dynamic and diverse needs, requiring RAG systems to demonstrate adaptable resilience. To comprehensively evaluate the resilience of current RAG methods, we introduce HawkBench, a human-labeled, multi-domain benchmark designed to rigorously assess RAG performance across categorized task types. By stratifying tasks based on information-seeking behaviors, HawkBench provides a systematic evaluation of how well RAG systems adapt to diverse user needs.Unlike existing benchmarks, which focus primarily on specific task types (mostly factoid queries) and rely on varying knowledge bases, HawkBench offers: (1) systematic task stratification to cover a broad range of query types, including both factoid and rationale queries, (2) integration of multi-domain corpora across all task types to mitigate corpus bias, and (3) rigorous annotation for high-quality evaluation.HawkBench includes 1,600 high-quality test samples, evenly distributed across domains and task types. Using this benchmark, we evaluate representative RAG methods, analyzing their performance in terms of answer quality and response latency. Our findings highlight the need for dynamic task strategies that integrate decision-making, query interpretation, and global knowledge understanding to improve RAG generalizability. We believe HawkBench serves as a pivotal benchmark for advancing the resilience of RAG methods and their ability to achieve general-purpose information seeking.

View full details

Poster

Audio Flamingo 3: Advancing Audio Intelligence with Fully Open Large Audio Language Models

Sreyan Ghosh ⋅ Arushi Goel ⋅ Jaehyeon Kim ⋅ Sonal Kumar ⋅ Zhifeng Kong ⋅ Sang-gil Lee ⋅ Chao-Han Yang ⋅ Ramani Duraiswami ⋅ Dinesh Manocha ⋅ Rafael Valle ⋅ Bryan Catanzaro

Dec 4, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

We present Audio Flamingo 3 (AF3), a fully open state-of-the-art (SOTA) large audio-language model that advances reasoning and understanding across speech, sound, and music. AF3 introduces: (i) AF-Whisper, a unified audio encoder trained using a novel strategy for joint representation learning across all 3 modalities of speech, sound, and music; (ii) flexible, on-demand thinking, allowing the model to do chain-of-thought-type reasoning before answering; (iii) multi-turn, multi-audio chat; (iv) long audio understanding and reasoning (including speech) up to 10 minutes; and (v) voice-to-voice interaction. To enable these capabilities, we propose several large-scale training datasets curated using novel strategies, including AudioSkills-XL, LongAudio-XL, AF-Think, and AF-Chat, and train AF3 with a novel five-stage curriculum-based training strategy. Trained on only open-source audio data, AF3 achieves new SOTA results on over 20+ (long) audio understanding and reasoning benchmarks, surpassing both open-weight and closed-source models trained on much larger datasets.

View full details

Poster

Wide-Horizon Thinking and Simulation-Based Evaluation for Real-World LLM Planning with Multifaceted Constraints

Dongjie Yang ⋅ Chengqiang Lu ⋅ Qimeng Wang ⋅ Xinbei Ma ⋅ Yan Gao ⋅ Yao Hu ⋅ Hai Zhao

Dec 4, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Unlike reasoning, which often entails a deep sequence of deductive steps, complex real-world planning is characterized by the need to synthesize a broad spectrum of parallel and potentially conflicting information and constraints. For example, in travel planning scenarios, it requires the integration of diverse real-world information and user preferences. While LLMs show promise, existing methods with long-horizon thinking struggle with handling multifaceted constraints, leading to suboptimal solutions. Motivated by the challenges of real-world travel planning, this paper introduces the Multiple Aspects of Planning (MAoP), empowering LLMs with "wide-horizon thinking" to solve planning problems with multifaceted constraints. Instead of direct planning, MAoP leverages the strategist to conduct pre-planning from various aspects and provide the planning blueprint for planners, enabling strong inference-time scalability by scaling aspects to consider various constraints. In addition, existing benchmarks for multi-constraint planning are flawed because they assess constraints in isolation, ignoring causal dependencies within the constraints, e.g, travel planning, where past activities dictate future itinerary. To address this, we propose Travel-Sim, an agent-based benchmark assessing plans via real-world simulation, thereby inherently resolving these causal dependencies. This paper advances LLM capabilities in complex planning and offers novel insights for evaluating sophisticated scenarios through simulation.

View full details

Poster

Simultaneous Swap Regret Minimization via KL-Calibration

Haipeng Luo ⋅ Spandan Senapati ⋅ Vatsal Sharan

Dec 4, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Calibration is a fundamental concept that aims at ensuring the reliability of probabilistic predictions by aligning them with real-world outcomes. There is a surge of studies on new calibration measures that are easier to optimize compared to the classical $\ell_1$-Calibration while still having strong implications for downstream applications. One recent such example is the work by Fishelson et al. (2025) who show that it is possible to achieve $\tilde{\mathcal{O}}(T^{1/3})$ pseudo $\ell_{2}$-Calibration error via minimizing pseudo swap regret of the squared loss, which in fact implies the same bound for all bounded proper losses with a smooth univariate form. In this work, we significantly generalize their result in the following ways: (a) in addition to smooth univariate forms, our algorithm also simultaneously achieves $\tilde{\mathcal{O}}(T^{1/3})$ swap regret for any proper loss with a twice continuously differentiable univariate form (such as Tsallis entropy); (b) our bounds hold not only for pseudo swap regret that measures losses using the forecaster's distributions on predictions, but also hold for the actual swap regret that measures losses using the forecaster's actual realized predictions. We achieve so by introducing a new stronger notion of calibration called (pseudo) KL-Calibration, which we show is equivalent to the (pseudo) swap regret with respect to log loss. We prove that there exists an algorithm that achieves $\tilde{\mathcal{O}}(T^{1/3})$ KL-Calibration error and provide an explicit algorithm that achieves $\tilde{\mathcal{O}}(T^{1/3})$ pseudo KL-Calibration error. Moreover, we show that the same algorithm achieves ${\mathcal{O}}(T^{1/3} (\log T) ^ {-\frac{1}{3}}\log (T/{\delta}))$ swap regret with probability at least $1-\delta$ for any proper loss with a smooth univariate form, which implies $\tilde{\mathcal{O}}(T^{1/3})$ $\ell_2$-Calibration error. A technical contribution of our work is a new randomized rounding procedure and a non-uniform discretization scheme to minimize the swap regret for log loss.

View full details

Poster

Multimodal Disease Progression Modeling via Spatiotemporal Disentanglement and Multiscale Alignment

Chen Liu ⋅ Wenfang Yao ⋅ Kejing Yin ⋅ William K. Cheung ⋅ Jing Qin

Dec 4, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Longitudinal multimodal data, including electronic health records (EHR) and sequential chest X-rays (CXRs), is critical for modeling disease progression, yet remains underutilized due to two key challenges: (1) redundancy in consecutive CXR sequences, where static anatomical regions dominate over clinically-meaningful dynamics, and (2) temporal misalignment between sparse, irregular imaging and continuous EHR data. We introduce $\texttt{DiPro}$, a novel framework that addresses these challenges through region-aware disentanglement and multi-timescale alignment. First, we disentangle static (anatomy) and dynamic (pathology progression) features in sequential CXRs, prioritizing disease-relevant changes. Second, we hierarchically align these static and dynamic CXR features with asynchronous EHR data via local (pairwise interval-level) and global (full-sequence) synchronization to model coherent progression pathways. Extensive experiments on the MIMIC dataset demonstrate that $\texttt{DiPro}$ could effectively extract temporal clinical dynamics and achieve state-of-the-art performance on both disease progression identification and general ICU prediction tasks.

View full details

Poster

From Experts to a Generalist: Toward General Whole-Body Control for Humanoid Robots

Yuxuan Wang ⋅ Ming Yang ⋅ Gang Ding ⋅ Yu Zhang ⋅ Weishuai Zeng ⋅ Xinrun Xu ⋅ Haobin Jiang ⋅ Zongqing Lu

Dec 4, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Achieving general agile whole-body control on humanoid robots remains a major challenge due to diverse motion demands and data conflicts. While existing frameworks excel in training single motion-specific policies, they struggle to generalize across highly varied behaviors due to conflicting control requirements and mismatched data distributions. In this work, we propose BumbleBee (BB), an expert-generalist learning framework that combines motion clustering and sim-to-real adaptation to overcome these challenges. BB first leverages an autoencoder-based clustering method to group behaviorally similar motions using motion features and motion descriptions. Expert policies are then trained within each cluster and refined with real-world data through iterative delta action modeling to bridge the sim-to-real gap. Finally, these experts are distilled into a unified generalist controller that preserves agility and robustness across all motion types. Experiments on two simulations and a real humanoid robot demonstrate that BB achieves state-of-the-art general whole-body control, setting a new benchmark for agile, robust, and generalizable humanoid performance in the real world.

View full details

Poster

Plasticity as the Mirror of Empowerment

David Abel ⋅ Michael Bowling ⋅ Andre Barreto ⋅ Will Dabney ⋅ Shi Dong ⋅ Steven Hansen ⋅ Anna Harutyunyan ⋅ Khimya Khetarpal ⋅ Clare Lyle ⋅ Razvan Pascanu ⋅ Georgios Piliouras ⋅ Doina Precup ⋅ Jonathan Richens ⋅ Mark Rowland ⋅ Tom Schaul ⋅ Satinder Singh

Dec 4, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Agents are minimally entities that are influenced by their past observations and act to influence future observations. This latter capacity is captured by empowerment, which has served as a vital framing concept across artificial intelligence and cognitive science. This former capacity, however, is equally foundational: In what ways, and to what extent, can an agent be influenced by what it observes? In this paper, we ground this concept in a universal agent-centric measure that we refer to as plasticity, and reveal a fundamental connection to empowerment. Following a set of desiderata on a suitable definition, we define plasticity using a new information-theoretic quantity we call the generalized directed information. We show that this new quantity strictly generalizes the directed information introduced by Massey (1990) while preserving all of its desirable properties. Under this definition, we find that plasticity is well thought of as the mirror of empowerment: The two concepts are defined using the same measure, with only the direction of influence reversed. Our main result establishes a tension between the plasticity and empowerment of an agent, suggesting that agent design needs to be mindful of both characteristics. We explore the implications of these findings, and suggest that plasticity, empowerment, and their relationship are essential to understanding agency.

View full details

Poster

scMRDR: A scalable and flexible framework for unpaired single-cell multi-omics data integration

Jianle Sun ⋅ Chaoqi Liang ⋅ Ran Wei ⋅ Peng Zheng ⋅ LEI BAI ⋅ Wanli Ouyang ⋅ Hongliang Yan ⋅ Peng Ye

Dec 4, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Advances in single-cell sequencing have enabled high-resolution profiling of diverse molecular modalities, while integrating unpaired multi-omics single-cell data remains challenging. Existing approaches either rely on pair information or prior correspondences, or require computing a global pairwise coupling matrix, limiting their scalability and flexibility. In this paper, we introduce a scalable and flexible generative framework called single-cell Multi-omics Regularized Disentangled Representations (scMRDR) for unpaired multi-omics integration. Specifically, we disentangle each cell’s latent representations into modality-shared and modality-specific components using a well-designed $\beta$-VAE architecture, which are augmented with isometric regularization to preserve intra-omics biological heterogeneity, adversarial objective to encourage cross-modal alignment, and masked reconstruction loss strategy to address the issue of missing features across modalities. Our method achieves excellent performance on benchmark datasets in terms of batch correction, modality alignment, and biological signal preservation. Crucially, it scales effectively to large-scale datasets and supports integration of more than two omics, offering a powerful and flexible solution for large-scale multi-omics data integration and downstream biological discovery.

View full details

Poster

On Feasible Rewards in Multi-Agent Inverse Reinforcement Learning

Till Freihaut ⋅ Giorgia Ramponi

Dec 4, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Multi-agent inverse reinforcement learning (MAIRL) aims to recover agent reward functions from expert demonstrations. We characterize the feasible reward set in Markov games, identifying all reward functions that rationalize a given equilibrium. However, equilibrium-based observations are often ambiguous: a single Nash equilibrium can correspond to many reward structures, potentially changing the game's nature in multi-agent systems. We address this by introducing entropy-regularized Markov games, which yield a unique equilibrium while preserving strategic incentives. For this setting, we provide a sample complexity analysis detailing how errors affect learned policy performance. Our work establishes theoretical foundations and practical insights for MAIRL.

View full details

Poster

Revisiting Generative Infrared and Visible Image Fusion Based on Human Cognitive Laws

Lin Guo ⋅ Xiaoqing Luo ⋅ Wei Xie ⋅ Zhancheng Zhang ⋅ Hui Li ⋅ Rui Wang ⋅ Zhenhua Feng ⋅ Xiaoning Song

Dec 4, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Existing infrared and visible image fusion methods often face the dilemma of balancing modal information. Generative fusion methods reconstruct fused images by learning from data distributions, but their generative capabilities remain limited. Moreover, the lack of interpretability in modal information selection further affects the reliability and consistency of fusion results in complex scenarios. This manuscript revisits the essence of generative image fusion under the inspiration of human cognitive laws and proposes a novel infrared and visible image fusion method, termed HCLFuse. First, HCLFuse investigates the quantification theory of information mapping in unsupervised fusion networks, which leads to the design of a multi-scale mask-regulated variational bottleneck encoder. This encoder applies posterior probability modeling and information decomposition to extract accurate and concise low-level modal information, thereby supporting the generation of high-fidelity structural details. Furthermore, the probabilistic generative capability of the diffusion model is integrated with physical laws, forming a time-varying physical guidance mechanism that adaptively regulates the generation process at different stages, thereby enhancing the ability of the model to perceive the intrinsic structure of data and reducing dependence on data quality. Experimental results show that the proposed method achieves state-of-the-art fusion performance in qualitative and quantitative evaluations across multiple datasets and significantly improves semantic segmentation metrics. This fully demonstrates the advantages of this generative image fusion method, drawing inspiration from human cognition, in enhancing structural consistency and detail quality.

View full details

Poster

LoRATv2: Enabling Low-Cost Temporal Modeling in One-Stream Trackers

Liting Lin ⋅ Heng Fan ⋅ Zhipeng Zhang ⋅ Yuqing Huang ⋅ Yaowei Wang ⋅ Yong Xu ⋅ Haibin Ling

Dec 4, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Transformer-based algorithms, such as LoRAT, have significantly enhanced object-tracking performance. However, these approaches rely on a standard attention mechanism, which incurs quadratic token complexity, making real-time inference computationally expensive. In this paper, we introduce LoRATv2, a novel tracking framework that addresses these limitations with three main contributions. First, LoRATv2 integrates frame-wise causal attention, which ensures full self-attention within each frame while enabling causal dependencies across frames, significantly reducing computational overhead. Moreover, key-value (KV) caching is employed to efficiently reuse past embeddings for further speedup. Second, building on LoRAT's parameter-efficient fine-tuning, we propose Stream-Specific LoRA Adapters (SSLA). As frame-wise causal attention introduces asymmetry in how streams access temporal information, SSLA assigns dedicated LoRA modules to the template and each search stream, with the main ViT backbone remaining frozen. This allows specialized adaptation for each stream's role in temporal tracking. Third, we introduce a two-phase progressive training strategy, which first trains a single-search-frame tracker and then gradually extends it to multi-search-frame inputs by introducing additional LoRA modules. This curriculum-based learning paradigm improves long-term tracking while maintaining training efficiency. In extensive experiments on multiple benchmarks, LoRATv2 achieves state-of-the-art performance, substantially improved efficiency, and a superior performance-to-FLOPs ratio over state-of-the-art trackers. The code is available at https://github.com/LitingLin/LoRATv2.

View full details

Poster

Locality in Image Diffusion Models Emerges from Data Statistics

Artem Lukoianov ⋅ Chenyang Yuan ⋅ Justin Solomon ⋅ Vincent Sitzmann

Dec 4, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Recent work has shown that the generalization ability of image diffusion models arises from the locality properties of the trained neural network. In particular, when denoising a particular pixel, the model relies on a limited neighborhood of the input image around that pixel, which, according to the previous work, is tightly related to the ability of these models to produce novel images. Since locality is central to generalization, it is crucial to understand why diffusion models learn local behavior in the first place, as well as the factors that govern the properties of locality patterns. In this work, we present evidence that the locality in deep diffusion models emerges as a statistical property of the image dataset and is not due to the inductive bias of convolutional neural networks, as suggested in previous work. Specifically, we demonstrate that an optimal parametric linear denoiser exhibits similar locality properties to deep neural denoisers. We show, both theoretically and experimentally, that this locality arises directly from pixel correlations present in the image datasets. Moreover, locality patterns are drastically different on specialized datasets, approximating principal components of the data’s covariance. We use these insights to craft an analytical denoiser that better matches scores predicted by a deep diffusion model than prior expert-crafted alternatives. Our key takeaway is that while neural network architectures influence generation quality, their primary role is to capture locality patterns inherent in the data.

View full details

Poster

Stochastic Process Learning via Operator Flow Matching

Yaozhong Shi ⋅ Zachary Ross ⋅ Domniki Asimaki ⋅ Kamyar Azizzadenesheli

Dec 4, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Expanding on neural operators, we propose a novel framework for stochastic process learning across arbitrary domains. In particular, we develop operator flow matching (OFM) for learning stochastic process priors on function spaces. OFM provides the probability density of the values of any collection of points and enables mathematically tractable functional regression at new points with mean and density estimation. Our method outperforms state-of-the-art models in stochastic process learning, functional regression, and prior learning.

View full details

Poster

MigGPT: Harnessing Large Language Models for Automated Migration of Out-of-Tree Linux Kernel Patches Across Versions

Pucheng Dang ⋅ Di Huang ⋅ Dong Li ⋅ Kang Chen ⋅ Yuanbo Wen ⋅ Qi Guo ⋅ Xing Hu

Dec 4, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Out-of-tree kernel patches are essential for adapting the Linux kernel to new hardware or enabling specific functionalities. Maintaining and updating these patches across different kernel versions demands significant effort from experienced engineers. Large language models (LLMs) have shown remarkable progress across various domains, suggesting their potential for automating out-of-tree kernel patch migration. However, our findings reveal that LLMs, while promising, struggle with incomplete code context understanding and inaccurate migration point identification. In this work, we propose MigGPT, a framework that employs a novel code fingerprint structure to retain code snippet information and incorporates three meticulously designed modules to improve the migration accuracy and efficiency of out-of-tree kernel patches. Furthermore, we establish a robust benchmark using real-world out-of-tree kernel patch projects to evaluate LLM capabilities. Evaluations show that MigGPT significantly outperforms the direct application of vanilla LLMs, achieving an average completion rate of 72.59\% ($\uparrow 50.74\%$) for migration tasks.

View full details

Poster

Variational Transdimensional Inference

Laurence Davies ⋅ Daniel MacKinlay ⋅ Rafael Oliveira ⋅ Scott SIsson

Dec 4, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

The expressiveness of flow-based models combined with stochastic variational inference (SVI) has expanded the application of optimization-based Bayesian inference to highly complex problems. However, despite the importance of multi-model Bayesian inference for problems defined on a transdimensional joint model and parameter space, such as Bayesian structure learning, flow-based SVI has been limited to problems defined on a fixed-dimensional parameter space. We introduce CoSMIC normalizing flows (COntextually-Specified Masking for Identity-mapped Components), an extension to neural autoregressive conditional normalizing flow architectures that enables use of a single flow-based variational density for inference over a transdimensional (multi-model) conditional target distribution. We propose a combined stochastic variational transdimensional inference (VTI) approach to training CoSMIC flows using ideas from Bayesian optimization and Monte Carlo gradient estimation. Numerical experiments show the performance of VTI on challenging problems that scale to high-cardinality model spaces.

View full details

Poster

Improved Representation Steering for Language Models

Zhengxuan Wu ⋅ Qinan Yu ⋅ Aryaman Arora ⋅ Christopher D Manning ⋅ Chris Potts

Dec 5, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Steering methods for language models (LMs) seek to provide fine-grained and interpretable control over model generations by variously changing model inputs, weights, or representations to adjust behavior. Recent work has shown that adjusting weights or representations is often less effective than steering by prompting, for instance when wanting to introduce or suppress a particular concept. We demonstrate how to improve representation steering via our new Reference-free Preference Steering (RePS), a bidirectional preference-optimization objective that jointly does concept steering and suppression. We train three parameterizations of RePS and evaluate them on AxBench, a large-scale model steering benchmark. On Gemma models with sizes ranging from 2B to 27B, RePS outperforms all existing steering methods trained with a language modeling objective and substantially narrows the gap with prompting -- while promoting interpretability and minimizing parameter count. In suppression, RePS matches the language-modeling objective on Gemma-2 and outperforms it on the larger Gemma-3 variants while remaining resilient to prompt-based jailbreaking attacks that defeat prompting. Overall, our results suggest that RePS provides an interpretable and robust alternative to prompting for both steering and suppression.

View full details

Poster

SimWorld: An Open-ended Simulator for Agents in Physical and Social Worlds

Xiaokang Ye ⋅ Jiawei Ren ⋅ Yan Zhuang ⋅ Xuhong He ⋅ Yiming Liang ⋅ Yiqing Yang ⋅ Mrinaal Dogra ⋅ Xianrui Zhong ⋅ Eric Liu ⋅ Kevin Benavente ⋅ Rajiv Mandya Nagaraju ⋅ Dhruv Sharma ⋅ Ziqiao Ma ⋅ Tianmin Shu ⋅ Zhiting Hu ⋅ Lianhui Qin

Dec 5, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

While LLM/VLM-powered AI agents have advanced rapidly in math, coding, and computer use, their applications in complex physical and social environments remain challenging. Building agents that can survive and thrive in the real world (e.g., by autonomously earning income) requires massive-scale interaction, reasoning, training, and evaluation across diverse scenarios. However, existing world simulators for such development fall short: they often rely on limited hand-crafted environments, simulate simplified game-like physics and social rules, and lack native support for LLM/VLM agents. We introduce SimWorld, a new simulator built on Unreal Engine 5, designed for developing and evaluating LLM/VLM agents in rich, real-world-like settings. SimWorld offers three core capabilities: (1) realistic, open-ended world simulation, including accurate physical and social dynamics and language-driven procedural environment generation; (2) rich interface for LLM/VLM agents, with multi-modal world inputs/feedback and open-vocabulary action outputs at varying levels of abstraction; and (3) diverse physical and social reasoning scenarios that are easily customizable by users. We demonstrate SimWorld by deploying frontier LLM agents (e.g., Gemini-2.5-Flash, Claude-3.5, GPT-4o, and DeepSeek-Prover-V2) on both short-horizon navigation tasks requiring grounded re-planning, and long-horizon multi-agent food delivery tasks involving strategic cooperation and competition. The results reveal distinct reasoning patterns and limitations across models. We open-source SimWorld and hope it becomes a foundational platform for advancing real-world agent intelligence across disciplines. Please refer to the project website for the most up-to-date information: http://simworld.org/.

View full details

Poster

UniSite: The First Cross-Structure Dataset and Learning Framework for End-to-End Ligand Binding Site Detection

Jigang Fan ⋅ Quanlin Wu ⋅ Shengjie Luo ⋅ Liwei Wang

Dec 5, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

The detection of ligand binding sites for proteins is a fundamental step in Structure-Based Drug Design. Despite notable advances in recent years, existing methods, datasets, and evaluation metrics are confronted with several key challenges: (1) current datasets and methods are centered on individual protein–ligand complexes and neglect that diverse binding sites may exist across multiple complexes of the same protein, introducing significant statistical bias; (2) ligand binding site detection is typically modeled as a discontinuous workflow, employing binary segmentation and subsequent clustering algorithms; (3) traditional evaluation metrics do not adequately reflect the actual performance of different binding site prediction methods. To address these issues, we first introduce UniSite-DS, the first UniProt (Unique Protein)-centric ligand binding site dataset, which contains 4.81 times more multi-site data and 2.08 times more overall data compared to the previously most widely used datasets. We then propose UniSite, the first end-to-end ligand binding site detection framework supervised by set prediction loss with bijective matching. In addition, we introduce Average Precision based on Intersection over Union (IoU) as a more accurate evaluation metric for ligand binding site prediction. Extensive experiments on UniSite-DS and several representative benchmark datasets demonstrate that IoU-based Average Precision provides a more accurate reflection of prediction quality, and that UniSite outperforms current state-of-the-art methods in ligand binding site detection. The dataset and codes will be made publicly available at https://github.com/quanlin-wu/unisite.

View full details

Poster

Achilles' Heel of Mamba: Essential difficulties of the Mamba architecture demonstrated by synthetic data

Tianyi Chen ⋅ Pengxiao Lin ⋅ Zhiwei Wang ⋅ Zhi-Qin Xu

Dec 5, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

State Space Models (SSMs) have emerged as promising alternatives to attention mechanisms, with the Mamba architecture demonstrating impressive performance and linear complexity for processing long sequences. However, the fundamental differences between Mamba and Transformer architectures remain incompletely understood. In this work, we use carefully designed synthetic tasks to reveal Mamba's inherent limitations. Through experiments, we identify that Mamba's nonlinear convolution introduces an asymmetry bias that significantly impairs its ability to recognize symmetrical patterns and relationships. Using composite function and inverse sequence matching tasks, we demonstrate that Mamba strongly favors compositional solutions over symmetrical ones and struggles with tasks requiring the matching of reversed sequences. We show these limitations stem not from the SSM module itself but from the nonlinear convolution preceding it, which fuses token information asymmetrically. These insights provide a new understanding of Mamba's constraints and suggest concrete architectural improvements for future sequence models.

View full details

Poster

Decomposing stimulus-specific sensory neural information via diffusion models

Steeve Laquitaine ⋅ Simone Azeglio ⋅ Carlo Paris ⋅ Ulisse Ferrari ⋅ Matthew Chalk

Dec 5, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

A central question in sensory neuroscience is how much, but also what information neurons transmit about the world. While Shannon’s information theory provides a principled framework to quantify the amount of information neurons encode about all stimuli, it does not reveal which stimuli contribute most, or what stimulus features are encoded. As a concrete example, it is known that neurons in the early visual cortex are 'sensitive' to stimuli in a small region of space (their receptive field). However, it is not clear how such simple intuitions carry to more complex scenarios, e.g. with large, noisy & non-linear population of neurons and high-dimensional stimuli. Several previous measures of neural sensitivity have been proposed. For example, the Fisher information quantifies the sensitivity of neural responses to infinitesimal stimulus perturbations. However, as the Fisher is not a valid decomposition of the mutual information it cannot say how different stimuli contribute to the total encoded information. On the other hand, previous works have proposed stimulus dependent decompositions of mutual information, which define a function $ I(x) $ such that $ I(R; X) = \mathbb{E}[I(x)] $. However, this decomposition is inherently ill-posed: infinitely many functions $I(x)$ satisfy the constraint, with no principled way to select among them. Further, different decompositions behave in qualitatively different ways, making it hard to interpret what are they are telling us. Finally, most proposed decompositions are computationally intractable for the high-dimensional stimuli and non-linear encoding models relevant for neuroscience. To resolve these limitations, we propose a set of axioms that any stimulus specific and feature-specific information decomposition should satisfy in order to serve as a meaningful and interpretable measure of neural sensitivity. These axioms formalize intuitive desiderata: that the information assigned to each stimulus, and stimulus feature, should be non-negative, and additive with respect to repeated measurements. We also require the decomposition to respect a form of locality: changes in how a neuron responds to a stimulus $ x $ should not affect the information attributed to a distant stimulus $ x' $. Finally, the attribution must be insensitive to irrelevant features, which do not contribute to the total information. Together, these constraints ensure that the decomposition is both interpretable and theoretically grounded. We show that existing decompositions violate one or more of these axioms, limiting their interpretability and use as information theoretic measures of neural sensitivity. We then introduce a novel decomposition that satisfies all of our axioms. It generalizes Fisher information by capturing neural sensitivity to both infinitesimal and finite stimulus perturbations. Moreover, it supports further decomposition across individual stimulus features (e.g., image pixels), enabling fine-grained analysis of neural representations. Beyond satisfying our theoretical axioms, our decomposition is computationally tractable for large neural populations and high-dimensional naturalistic stimuli, through the use of diffusion models. We demonstrate the power of our method by quantifying the information encoded by a model of visual neurons about individual images and pixels. Our approach uncovers aspects of the neural code that are not picked up by standard methods, such as the Fisher information, and opens the door to similar analyses in higher-order sensory areas, and artificial neural networks.

View full details

Poster

Fast Monte Carlo Tree Diffusion: 100× Speedup via Parallel and Sparse Planning

Jaesik Yoon ⋅ Hyeonseo Cho ⋅ Yoshua Bengio ⋅ Sungjin Ahn

Dec 5, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Diffusion models have recently emerged as a powerful approach for trajectory planning. However, their inherently non-sequential nature limits their effectiveness in long-horizon reasoning tasks at test time. The recently proposed Monte Carlo Tree Diffusion (MCTD) offers a promising solution by combining diffusion with tree-based search, achieving state-of-the-art performance on complex planning problems. Despite its strengths, our analysis shows that MCTD incurs substantial computational overhead due to the sequential nature of tree search and the cost of iterative denoising. To address this, we propose Fast-MCTD, a more efficient variant that preserves the strengths of MCTD while significantly improving its speed and scalability. Fast-MCTD integrates two techniques: Parallel MCTD, which enables parallel rollouts via delayed tree updates and redundancy-aware selection; and Sparse MCTD, which reduces rollout length through trajectory coarsening. Experiments show that Fast-MCTD achieves up to 100× speedup over standard MCTD while maintaining or improving planning performance. Remarkably, it even outperforms Diffuser in inference speed on some tasks, despite Diffuser requiring no search and yielding weaker solutions. These results position Fast-MCTD as a practical and scalable solution for diffusion-based inference-time reasoning.

View full details

Poster

MetaGS: A Meta-Learned Gaussian-Phong Model for Out-of-Distribution 3D Scene Relighting

Yumeng He ⋅ Yunbo Wang

Dec 5, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Out-of-distribution (OOD) 3D relighting requires novel view synthesis under unseen lighting conditions that differ significantly from the observed images. Existing relighting methods, which assume consistent light source distributions between training and testing, often degrade in OOD scenarios. We introduce **MetaGS** to tackle this challenge from two perspectives. First, we propose a meta-learning approach to train 3D Gaussian splatting, which explicitly promotes learning generalizable Gaussian geometries and appearance attributes across diverse lighting conditions, even with biased training data. Second, we embed fundamental physical priors from the *Blinn-Phong* reflection model into Gaussian splatting, which enhances the decoupling of shading components and leads to more accurate 3D scene reconstruction. Results on both synthetic and real-world datasets demonstrate the effectiveness of MetaGS in challenging OOD relighting tasks, supporting efficient point-light relighting and generalizing well to unseen environment lighting maps.

View full details

Poster

Spend Wisely: Maximizing Post-Training Gains in Iterative Synthetic Data Bootstrapping

Pu Yang ⋅ Yunzhen Feng ⋅ Ziyuan Chen ⋅ Yuhang Wu ⋅ Zhuoyuan Li

Dec 5, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Modern foundation models often undergo iterative ``bootstrapping'' in their post-training phase: a model generates synthetic data, an external verifier filters out low-quality samples, and the high-quality subset is used for further fine-tuning. Over multiple iterations, the model performance improves, raising a crucial question: How should the total budget for generation and training be allocated across iterations to maximize final performance? In this work, we develop a theoretical framework for analyzing budget allocation strategies. Specifically, we show that constant policies fail to converge with high probability, while increasing policies---particularly exponential growth policies---exhibit significant theoretical advantages. Experiments on image denoising with diffusion probabilistic models and math reasoning with large language models show that both exponential and polynomial growth policies consistently outperform constant policies, with exponential policies often providing more stable performance.

View full details

Poster

Differentiable Decision Tree via "ReLU+Argmin" Reformulation

Qiangqiang Mao ⋅ Jiayang Ren ⋅ Yixiu Wang ⋅ Chenxuanyin Zou ⋅ Jingjing Zheng ⋅ Yankai Cao

Dec 5, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Decision tree, despite its unmatched interpretability and lightweight structure, faces two key issues that limit its broader applicability: non-differentiability and low testing accuracy. This study addresses these issues by developing a differentiable oblique tree that optimizes the entire tree using gradient-based optimization. We propose an exact reformulation of hard-split trees based on "ReLU+Argmin" mechanism, and then cast the reformulated tree training as an unconstrained optimization task. The ReLU-based sample branching, expressed as exact-zero or non-zero values, preserve a unique decision path, in contrast to soft decision trees with probabilistic routing. The subsequent Argmin operation identifies the unique zero-violation path, enabling deterministic predictions. For effective gradient flow, we approximate Argmin behaviors by scaling softmin function. To ameliorate numerical instability, we propose a warm-start annealing scheme that solves multiple optimization tasks with increasingly accurate approximations. This reformulation alongside distributed GPU parallelism offers strong scalability, supporting 12-depth tree even on million-scale datasets where most baselines fail. Extensive experiments demonstrate that our optimized tree achieves a superior testing accuracy against 14 baselines, including an average improvement of 7.54\% over CART.

View full details

Poster

Language Models can Self-Improve at State-Value Estimation for Better Search

Ethan Mendes ⋅ Alan Ritter

Dec 5, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Collecting ground-truth rewards or human demonstrations for multi-step reasoning tasks is often prohibitively expensive, especially in interactive domains such as web tasks. We introduce Self-Taught Lookahead (STL), a reward-free framework that improves language model–based value functions by reasoning explicitly about state transitions. STL can be viewed as a chain-of-thought analogue of the value iteration algorithm: instead of regressing directly on numeric values, a value LLM is trained to simulate a step of lookahead in natural language—predicting the next action, resulting state, and rationale for its value. This process refines value estimates without any labeled data. The self-supervised procedure yields more accurate state-value predictions, which in turn enable lightweight search algorithms to expand fewer states while maintaining strong performance. Empirically, STL-trained value models built on moderately sized (8B-parameter) open-weight LLMs boost web agent success rates by over 39%, achieving performance comparable to proprietary models. STL also generalizes to multi-hop question answering and math puzzles. Overall, STL enables small open-source models to guide efficient search, reducing inference costs by integrating explicit reasoning with value learning.

View full details

Poster

WISA: World simulator assistant for physics-aware text-to-video generation

Jing Wang ⋅ Ao Ma ⋅ Ke Cao ⋅ Jun Zheng ⋅ Jiasong Feng ⋅ Zhanjie Zhang ⋅ Wanyuan Pang ⋅ Xiaodan Liang

Dec 5, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Recent advances in text-to-video (T2V) generation, exemplified by models such as Sora and Kling, have demonstrated strong potential for constructing world simulators. However, existing T2V models still struggle to understand abstract physical principles and to generate videos that faithfully obey physical laws. This limitation stems primarily from the lack of explicit physical guidance, caused by a significant gap between high-level physical concepts and the generative capabilities of current models. To address this challenge, we propose the **W**orld **S**imulator **A**ssistant (**WISA**), a novel framework designed to systematically decompose and integrate physical principles into T2V models. Specifically, WISA decomposes physical knowledge into three hierarchical levels: textual physical descriptions, qualitative physical categories, and quantitative physical properties. It then incorporates several carefully designed modules—such as Mixture-of-Physical-Experts Attention (MoPA) and a Physical Classifier—to effectively encode these attributes and enhance the model’s adherence to physical laws during generation. In addition, most existing video datasets feature only weak or implicit representations of physical phenomena, limiting their utility for learning explicit physical principles. To bridge this gap, we present **WISA-80K**, a new dataset comprising 80,000 human-curated videos that depict 17 fundamental physical laws across three core domains of physics: dynamics, thermodynamics, and optics. Experimental results show that WISA substantially improves the alignment of T2V models (such as CogVideoX and Wan2.1) with real-world physical laws, achieving notable gains on the VideoPhy benchmark. Our data, code, and models are available in the [Project Page](https://wisav1.github.io/WISA/).

View full details

Poster

Self-Assembling Graph Perceptrons

Jialong Chen ⋅ Tong Wang ⋅ Bowen Deng ⋅ Luonan Chen ⋅ Zibin Zheng ⋅ Chuan Chen

Dec 5, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Inspired by the workings of biological brains, humans have designed artificial neural networks (ANNs), sparking profound advancements across various fields. However, the biological brain possesses high plasticity, enabling it to develop simple, efficient, and powerful structures to cope with complex external environments. In contrast, the superior performance of ANNs often relies on meticulously crafted architectures, which can make them vulnerable when handling complex inputs. Moreover, overparameterization often characterizes the most advanced ANNs. This paper explores the path toward building streamlined and plastic ANNs. Firstly, we introduce the Graph Perceptron (GP), which extends the most fundamental ANN, the Multi-Layer Perceptron (MLP). Subsequently, we incorporate a self-assembly mechanism on top of GP called Self-Assembling Graph Perceptron (SAGP). During training, SAGP can autonomously adjust the network's number of neurons and synapses and their connectivity. SAGP achieves comparable or even superior performance with only about 5% of the size of an MLP. We also demonstrate the SAGP's advantages in enhancing model interpretability and feature selection.

View full details

Poster

On the Hardness of Approximating Distributions with Tractable Probabilistic Models

John Leland ⋅ YooJung Choi

Dec 5, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

A fundamental challenge in probabilistic modeling is to balance expressivity and inference efficiency. Tractable probabilistic models (TPMs) aim to directly address this tradeoff by imposing constraints that guarantee efficient inference of certain queries while maintaining expressivity. In particular, probabilistic circuits (PCs) provide a unifying framework for many TPMs, by characterizing families of models as circuits satisfying different structural properties. Because the complexity of inference on PCs is a function of the circuit size, understanding the size requirements of different families of PCs is fundamental in mapping the trade-off between tractability and expressive efficiency. However, the study of expressive efficiency of circuits are often concerned with exact representations, which may not align with model learning, where we look to approximate the underlying data distribution closely by some distance measure. Moreover, due to hardness of inference tasks, exactly representing distributions while supporting tractable inference often incurs exponential size blow-ups. In this paper, we consider a natural, yet so far underexplored, question: can we avoid such size blow-up by allowing for some small approximation error? We study approximating distributions with probabilistic circuits with guarantees based on $f$-divergences, and analyze which inference queries remain well-approximated under this framework. We show that approximating an arbitrary distribution with bounded $f$-divergence is NP-hard for any model that can tractably compute marginals. In addition, we prove an exponential size gap for approximation between the class of decomposable PCs and that of decomposable and deterministic PCs.

View full details

Poster

MoESD: Unveil Speculative Decoding's Potential for Accelerating Sparse MoE

Zongle Huang ⋅ Lei Zhu ⋅ ZongYuan Zhan ⋅ Ting Hu ⋅ Weikai Mao ⋅ Xianzhi Yu ⋅ Yongpan Liu ⋅ Tianyu Zhang

Dec 5, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Large Language Models (LLMs) have achieved remarkable success across many applications, with Mixture of Experts (MoE) models demonstrating great potential. Compared to traditional dense models, MoEs achieve better performance with less computation. Speculative decoding (SD) is a widely used technique to accelerate LLM inference without accuracy loss, but it has been considered efficient only for dense models. In this work, we first demonstrate that, under medium batch sizes, MoE surprisingly benefits more from SD than dense models. Furthermore, as MoE becomes sparser -- the prevailing trend in MoE designs -- the batch size range where SD acceleration is expected to be effective becomes broader. To quantitatively understand tradeoffs involved in SD, we develop a reliable modeling based on theoretical analyses. While current SD research primarily focuses on improving acceptance rates of algorithms, changes in workload and model architecture can still lead to degraded SD acceleration even with high acceptance rates. To address this limitation, we introduce a new metric 'target efficiency' that characterizes these effects, thus helping researchers identify system bottlenecks and understand SD acceleration more comprehensively. For scenarios like private serving, this work unveils a new perspective to speed up MoE inference, where existing solutions struggle. Experiments on different GPUs show up to 2.29x speedup for Qwen2-57B-A14B at medium batch sizes and validate our theoretical predictions.

View full details

Poster

Among Us: A Sandbox for Measuring and Detecting Agentic Deception

Satvik Golechha ⋅ Adrià Garriga-Alonso

Dec 5, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Prior studies on deception in language-based AI agents typically assess whether the agent produces a false statement about a topic, or makes a binary choice prompted by a goal, rather than allowing open-ended deceptive behavior to emerge in pursuit of a longer-term goal. To fix this, we introduce $\textit{Among Us}$, a sandbox social deception game where LLM-agents exhibit long-term, open-ended deception as a consequence of the game objectives. While most benchmarks saturate quickly, $\textit{Among Us}$ can be expected to last much longer, because it is a multi-player game far from equilibrium. Using the sandbox, we evaluate $18$ proprietary and open-weight LLMs and uncover a general trend: models trained with RL are comparatively much better at producing deception than detecting it. We evaluate the effectiveness of methods to detect lying and deception: logistic regression on the activations and sparse autoencoders (SAEs). We find that probes trained on a dataset of ``pretend you're a dishonest model: $\dots$'' generalize extremely well out-of-distribution, consistently obtaining AUROCs over 95% even when evaluated just on the deceptive statement, without the chain of thought. We also find two SAE features that work well at deception detection but are unable to steer the model to lie less. We hope our open-sourced sandbox, game logs, and probes serve to anticipate and mitigate deceptive behavior and capabilities in language-based agents.

View full details

Poster

InterMT: Multi-Turn Interleaved Preference Alignment with Human Feedback

Boyuan Chen ⋅ Donghai Hong ⋅ Jiaming Ji ⋅ Jiacheng Zheng ⋅ Bowen Dong ⋅ Jiayi Zhou ⋅ Kaile Wang ⋅ Juntao Dai ⋅ Xuyao Wang ⋅ wenqi chen ⋅ Qirui Zheng ⋅ Wenxin Li ⋅ Sirui Han ⋅ Yike Guo ⋅ Yaodong Yang

Dec 5, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

As multimodal large models (MLLMs) continue to advance across challenging tasks, a key question emerges: \textbf{\textit{What essential capabilities are still missing? }}A critical aspect of human learning is continuous interaction with the environment -- not limited to language, but also involving multimodal understanding and generation.To move closer to human-level intelligence, models must similarly support \textbf{multi-turn}, \textbf{multimodal interaction}. In particular, they should comprehend interleaved multimodal contexts and respond coherently in ongoing exchanges.In this work, we present \textbf{an initial exploration} through the \textsc{InterMT} -- \textbf{the first preference dataset for \textit{multi-turn} multimodal interaction}, grounded in real human feedback. In this exploration, we particularly emphasize the importance of human oversight, introducing expert annotations to guide the process, motivated by the fact that current MLLMs lack such complex interactive capabilities. \textsc{InterMT} captures human preferences at both global and local levels into nine sub-dimensions, consists of 15.6k prompts, 52.6k multi-turn dialogue instances, and 32.4k human-labeled preference pairs. To compensate for the lack of capability for multi-modal understanding and generation, we introduce an agentic workflow that leverages tool-augmented MLLMs to construct multi-turn QA instances.To further this goal, we introduce \textsc{InterMT-Bench} to assess the ability ofMLLMs in assisting judges with multi-turn, multimodal tasks.We demonstrate the utility of \textsc{InterMT} through applications such as judge moderation and further reveal the \textit{multi-turn scaling law} of judge model.We hope the open-source of our data can help facilitate further research on aligning current MLLMs to the next step.

View full details

Poster

Uni-MuMER: Unified Multi-Task Fine-Tuning of Vision-Language Model for Handwritten Mathematical Expression Recognition

Yu Li ⋅ Jin Jiang ⋅ Jianhua Zhu ⋅ Shuai Peng ⋅ Baole ⋅ Yuxuan Zhou ⋅ Liangcai Gao

Dec 5, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Handwritten Mathematical Expression Recognition (HMER) remains a persistent challenge in Optical Character Recognition (OCR) due to the inherent freedom of symbol layouts and variability in handwriting styles. Prior methods have faced performance bottlenecks by proposing isolated architectural modifications, making them difficult to integrate coherently into a unified framework. Meanwhile, recent advances in pretrained vision-language models (VLMs) have demonstrated strong cross-task generalization, offering a promising foundation for developing unified solutions. In this paper, we introduce Uni-MuMER, which fully fine-tunes a VLM for the HMER task without modifying its architecture, effectively injecting domain-specific knowledge into a generalist framework. Our method integrates three data-driven tasks: Tree-Aware Chain-of-Thought (Tree-CoT) for structured spatial reasoning, Error-Driven Learning (EDL) for reducing confusion among visually similar characters, and Symbol Counting (SC) for improving recognition consistency in long expressions. Experiments on the CROHME and HME100K datasets show that Uni-MuMER achieves super state-of-the-art performance, outperforming the best lightweight specialized model SSAN by 16.31\% and the top-performing VLM Gemini2.5-flash by 24.42\% under zero-shot setting. Our datasets, models, and code are open-sourced at: https://github.com/BFlameSwift/Uni-MuMER

View full details

Poster

Universal Sequence Preconditioning

Annie Marsden ⋅ Elad Hazan

Dec 5, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

We study the problem of preconditioning in the setting of sequential prediction. From the theoretical lens of linear dynamical systems, we show that applying a convolution to the input sequence translates to applying a polynomial to the unknown transition matrix in the hidden space. With this insight, we develop a novel preconditioning method that convolves the input sequence with the coefficients of the Chebyshev or Legendre polynomials. We formally prove that this improves the regret of a wide family of prediction methods. We proceed to apply this preconditioning technique to the method of spectral filtering. This gives the first sublinear regret bound that is also hidden-dimension free (up to logarithmic factors) even when the hidden transition matrix is asymmetric. From rigorous experiments on synthetic data we show that our simple preconditioning method generalizes to both 1) settings where the data is \emph{not} from a linear dynamical system and 2) a broad range of learning algorithms, including recurrent neural networks.

View full details

Poster

MedSG-Bench: A Benchmark for Medical Image Sequences Grounding

Jingkun Yue ⋅ Siqi Zhang ⋅ Zinan Jia ⋅ Huihuan Xu ⋅ Zongbo Han ⋅ Xiaohong Liu ⋅ Guangyu Wang

Dec 5, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Visual grounding is essential for precise perception and reasoning in multimodal large language models (MLLMs), especially in medical imaging domains. While existing medical visual grounding benchmarks primarily focus on single-image scenarios, real-world clinical applications often involve sequential images, where accurate lesion localization across different modalities and temporal tracking of disease progression (e.g., pre- vs. post-treatment comparison) require fine-grained cross-image semantic alignment and context-aware reasoning. To remedy the underrepresentation of image sequences in existing medical visual grounding benchmarks, we propose MedSG-Bench, the first benchmark tailored for Medical Image Sequences Grounding. It comprises eight VQA-style tasks, formulated into two paradigms of the grounding tasks, including 1) Image Difference Grounding, which focuses on detecting change regions across images, and 2) Image Consistency Grounding, which emphasizes detection of consistent or shared semantics across sequential images. MedSG-Bench covers 76 public datasets, 10 medical imaging modalities, and a wide spectrum of anatomical structures and diseases, totaling 9,630 question–answer pairs. We benchmark both general-purpose MLLMs (e.g., Qwen2.5-VL) and medical-domain specialized MLLMs (e.g., HuatuoGPT-vision), observing that even the advanced models exhibit substantial limitations in medical sequential grounding tasks. To advance this field, we construct MedSG-188K, a large-scale instruction-tuning dataset tailored for sequential visual grounding, and further develop MedSeq-Grounder, an MLLM designed to facilitate future research on fine-grained understanding across medical sequential images. We release all resources on https://github.com/Yuejingkun/MedSG-Bench

View full details

Poster

A learnability analysis on neuro-symbolic learning

Hao-Yuan He ⋅ Ming LI

Dec 5, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

This paper presents a comprehensive theoretical analysis of the learnability of neuro-symbolic (NeSy) tasks within hybrid systems. We characterize the learnability of NeSy tasks by their derived constraint satisfaction problems (DCSPs), demonstrating that a task is learnable if and only if its corresponding DCSP admits a unique solution. Under mild assumptions, we establish the sample complexity for learnable tasks and show that, for general tasks, the asymptotic expected concept error is controlled by the degree of disagreement among DCSP solutions. Our findings unify the characterization of learnability and the phenomenon of reasoning shortcuts, providing theoretical guarantees and actionable guidance for the principled design of NeSy systems.

View full details

Poster

MLIP Arena: Advancing Fairness and Transparency in Machine Learning Interatomic Potentials via an Open, Accessible Benchmark Platform

Yuan Chiang ⋅ Tobias Kreiman ⋅ Christine Zhang ⋅ Matthew Kuner ⋅ Elizabeth Weaver ⋅ Ishan Amin ⋅ Hyunsoo Park ⋅ Yunsung Lim ⋅ Jihan Kim ⋅ Daryl Chrzan ⋅ Aron Walsh ⋅ Samuel Blau ⋅ Mark Asta ⋅ Aditi Krishnapriyan

Dec 5, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Machine learning interatomic potentials (MLIPs) have revolutionized molecular and materials modeling, but existing benchmarks suffer from data leakage, limited transferability, and an over-reliance on error-based metrics tied to specific density functional theory (DFT) references. We introduce MLIP Arena, a benchmark platform that evaluates force field performance based on physics awareness, chemical reactivity, stability under extreme conditions, and predictive capabilities for thermodynamic properties and physical phenomena. By moving beyond static DFT references and revealing the important failure modes of current foundation MLIPs in real-world settings, MLIP Arena provides a reproducible framework to guide the next-generation MLIP development toward improved predictive accuracy and runtime efficiency while maintaining physical consistency. The Python package and online leaderboard are available at https://github.com/atomind-ai/mlip-arena.

View full details

Poster

On Transferring Transferability: Towards a Theory for Size Generalization

Eitan Levin ⋅ Yuxin Ma ⋅ Mateo Diaz ⋅ Soledad Villar

Dec 5, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Many modern learning tasks require models that can take inputs of varying sizes. Consequently, dimension-independent architectures have been proposed for domains where the inputs are graphs, sets, and point clouds. Recent work on graph neural networks has explored whether a model trained on low-dimensional data can transfer its performance to higher-dimensional inputs. We extend this body of work by introducing a general framework for transferability across dimensions. We show that transferability corresponds precisely to continuity in a limit space formed by identifying small problem instances with equivalent large ones. This identification is driven by the data and the learning task. We instantiate our framework on existing architectures, and implement the necessary changes to ensure their transferability. Finally, we provide design principles for designing new transferable models. Numerical experiments support our findings.

View full details

Poster

MMLongBench: Benchmarking Long-Context Vision-Language Models Effectively and Thoroughly

Zhaowei Wang ⋅ Wenhao Yu ⋅ Xiyu REN ⋅ Jipeng Zhang ⋅ Yu Zhao ⋅ Rohit Saxena ⋅ Liang Cheng ⋅ Ginny Wong ⋅ Simon See ⋅ Pasquale Minervini ⋅ Yangqiu Song ⋅ Mark Steedman

Dec 5, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

The rapid extension of context windows in large vision-language models has given rise to long-context vision-language models (LCVLMs), which are capable of handling hundreds of images with interleaved text tokens in a single forward pass. In this work, we introduce MMLongBench, the first benchmark covering a diverse set of long-context vision-language tasks, to evaluate LCVLMs effectively and thoroughly. MMLongBench is composed of 13,331 examples spanning five different categories of downstream tasks, such as Visual RAG and Many-Shot ICL. It also provides broad coverage of image types, including various natural and synthetic images. To assess the robustness of the models to different input lengths, all examples are delivered at five standardized input lengths (8K-128K tokens) via a cross-modal tokenization scheme that combines vision patches and text tokens. Through a thorough benchmarking of 46 closed-source and open-source LCVLMs, we provide a comprehensive analysis of the current models' vision-language long-context ability. Our results show that: i) performance on a single task is a weak proxy for overall long-context capability; ii) both closed-source and open-source models face challenges in long-context vision-language tasks, indicating substantial room for future improvement; iii) models with stronger reasoning ability tend to exhibit better long-context performance. By offering wide task coverage, various image types, and rigorous length control, MMLongBench provides the missing foundation for diagnosing and advancing the next generation of LCVLMs.

View full details

Poster

Scaling Computer-Use Grounding via User Interface Decomposition and Synthesis

Tianbao Xie ⋅ Jiaqi Deng ⋅ Xiaochuan Li ⋅ Junlin Yang ⋅ Haoyuan Wu ⋅ Jixuan Chen ⋅ Wenjing Hu ⋅ Xinyuan Wang ⋅ Yuhui Xu ⋅ Zekun Wang ⋅ Yiheng Xu ⋅ Junli Wang ⋅ Doyen Sahoo ⋅ Tao Yu ⋅ Caiming Xiong

Dec 5, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Graphical user interface (GUI) grounding, the ability to map natural language instructions to specific actions on graphical user interfaces, remains a critical bottleneck in computer use agent development. Current benchmarks oversimplify grounding tasks as short referring expressions, failing to capture the complexity of real-world interactions that require software commonsense, layout understanding, and fine-grained manipulation capabilities. To address these limitations, we introduce OSWorld-G, a comprehensive benchmark comprising 564 finely annotated samples across diverse task types including text matching, element recognition, layout understanding, and precise manipulation. Additionally, we synthesize and release the largest computer use grounding dataset Jedi, which contains 4 million examples through multi-perspective decoupling of tasks. Our multi-scale models trained on Jedi demonstrate its effectiveness by outperforming existing approaches on ScreenSpot-v2, ScreenSpot-Pro, and our OSWorld-G. Furthermore, we demonstrate that improved grounding with Jedi directly enhances agentic capabilities of general foundation models on complex computer tasks with state-of-the-art performance, improving from 23% to 51% on OSWorld. Through detailed ablation studies, we identify key factors contributing to grounding performance and verify that combining specialized data for different interface elements enables compositional generalization to novel interfaces. All benchmark, data, checkpoints, and code are open-sourced and available at https://osworld-grounding.github.io.

View full details

Poster

When Worse is Better: Navigating the Compression Generation Trade-off In Visual Tokenization

Vivek Ramanujan ⋅ Kushal Tirumala ⋅ Armen Aghajanyan ⋅ Luke Zettlemoyer ⋅ Ali Farhadi

Dec 5, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Current image generation methods are based on a two-stage training approach. In stage 1, an auto-encoder is trained to compress an image into a latent space; in stage 2, a generative model is trained to learn a distribution over that latent space. This reveals a fundamental trade-off, do we compress more aggressively to make the latent distribution easier for the stage 2 model to learn even if it makes reconstruction worse? We study this problem in the context of discrete, auto-regressive image generation. Through the lens of scaling laws, we show that smaller stage 2 models can benefit from more compressed stage 1 latents even if reconstruction performance worsens, demonstrating that generation modeling capacity plays a role in this trade-off. Diving deeper, we rigorously study the connection between compute scaling and the stage 1 rate-distortion trade-off. Next, we introduce Causally Regularized Tokenization (CRT), which uses knowledge of the stage 2 generation modeling procedure to embed useful inductive biases in stage 1 latents. This regularization improves stage 2 generation performance better by making the tokens easier to model without affecting the stage 1 compression rate and marginally affecting distortion: we are able to improve compute efficiency 2-3$\times$ over baseline. Finally, we use CRT with further optimizations to the visual tokenizer setup to result in a generative pipeline that matches LlamaGen-3B generation performance (2.18 FID) with half the tokens per image (256 vs. 576) and a fourth the total model parameters (775M vs. 3.1B) while using the same architecture and inference procedure.

View full details

Poster

Credal Prediction based on Relative Likelihood

Timo Löhr ⋅ Paul Hofman ⋅ Felix Mohr ⋅ Eyke Hüllermeier

Dec 5, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Predictions in the form of sets of probability distributions, so-called credal sets, provide a suitable means to represent a learner's epistemic uncertainty. In this paper, we propose a theoretically grounded approach to credal prediction based on the statistical notion of relative likelihood: The target of prediction is the set of all (conditional) probability distributions produced by the collection of plausible models, namely those models whose relative likelihood exceeds a specified threshold. This threshold has an intuitive interpretation and allows for controlling the trade-off between correctness and precision of credal predictions. We tackle the problem of approximating credal sets defined in this way by means of suitably modified ensemble learning techniques. To validate our approach, we illustrate its effectiveness by experiments on benchmark datasets demonstrating superior uncertainty representation without compromising predictive performance. We also compare our method against several state-of-the-art baselines in credal prediction.

View full details

Poster

The Generative Leap: Tight Sample Complexity for Efficiently Learning Gaussian Multi-Index Models

Alex Damian ⋅ Jason Lee ⋅ Joan Bruna

Dec 5, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

In this work we consider generic Gaussian Multi-index models, in which the labels only depend on the (Gaussian) $d$-dimensional inputs through their projection onto a low-dimensional $r = O_d(1)$ subspace, and we study efficient agnostic estimation procedures for this hidden subspace. We introduce the *generative leap* exponent, a natural extension of the generative exponent from Damian et al. 2024 to the multi-index setting. We show that a sample complexity of $n=\Theta(d^{1 \vee k^\star/2})$ is necessary in the class of algorithms captured by the Low-Degree-Polynomial framework; and also sufficient, by giving a sequential estimation procedure based on a spectral U-statistic over appropriate Hermite tensors.

View full details

Poster

TIDMAD: Time Series Dataset for Discovering Dark Matter with AI Denoising

Jessica Fry ⋅ Xinyi Fu ⋅ Zhenghao Fu ⋅ Kaliroë Pappas ⋅ Lindley Winslow ⋅ Aobo Li

Dec 5, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Dark matter makes up approximately 85\% of total matter in our universe, yet it has never been directly observed in any laboratory on Earth. The origin of dark matter is one of the most important questions in contemporary physics, and a convincing detection of dark matter would be a Nobel-Prize-level breakthrough in fundamental science. The ABRACADABRA experiment was specifically designed to search for dark matter. Although it has not yet made a discovery, ABRACADABRA has produced several dark matter search results widely endorsed by the physics community. The experiment generates ultra-long time-series data at a rate of 10 million samples per second, where the dark matter signal would manifest itself as a sinusoidal oscillation mode within the ultra-long time series. In this paper, we present the TIDMAD --- a comprehensive data release from the ABRACADABRA experiment including three key components: an ultra-long time series dataset divided into training, validation, and science subsets; a carefully-designed denoising score for direct model benchmarking; and a complete analysis framework which produces a physics community-standard dark matter search result suitable for publication as a physics paper. This data release enables core AI algorithms to extract the dark matter signal and produce real physics results thereby advancing fundamental science.

View full details

Poster

V2X-Radar: A Multi-modal Dataset with 4D Radar for Cooperative Perception

Lei Yang ⋅ Xinyu Zhang ⋅ Jun Li ⋅ Chen Wang ⋅ Jiaqi Ma ⋅ Zhiying Song ⋅ Tong Zhao ⋅ Ziying Song ⋅ Li Wang ⋅ Mo Zhou ⋅ Yang Shen ⋅ Kai WU ⋅ Chen Lv

Dec 5, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Modern autonomous vehicle perception systems often struggle with occlusions and limited perception range. Previous studies have demonstrated the effectiveness of cooperative perception in extending the perception range and overcoming occlusions, thereby enhancing the safety of autonomous driving. In recent years, a series of cooperative perception datasets have emerged; however, these datasets primarily focus on cameras and LiDAR, neglecting 4D Radar—a sensor used in single-vehicle autonomous driving to provide robust perception in adverse weather conditions. In this paper, to bridge the gap created by the absence of 4D Radar datasets in cooperative perception, we present V2X-Radar, the first large-scale, real-world multi-modal dataset featuring 4D Radar. V2X-Radar dataset is collected using a connected vehicle platform and an intelligent roadside unit equipped with 4D Radar, LiDAR, and multi-view cameras. The collected data encompasses sunny and rainy weather conditions, spanning daytime, dusk, and nighttime, as well as various typical challenging scenarios. The dataset consists of 20K LiDAR frames, 40K camera images, and 20K 4D Radar data, including 350K annotated boxes across five categories. To support various research domains, we have established V2X-Radar-C for cooperative perception, V2X-Radar-I for roadside perception, and V2X-Radar-V for single-vehicle perception. Furthermore, we provide comprehensive benchmarks across these three sub-datasets.

View full details

Poster

Curl Descent : Non-Gradient Learning Dynamics with Sign-Diverse Plasticity

Hugo Ninou ⋅ Jonathan Kadmon ⋅ N Alex Cayco Gajic

Dec 5, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Gradient-based algorithms are a cornerstone of artificial neural network training, yet it remains unclear whether biological neural networks use similar gradient-based strategies during learning. Experiments often discover a diversity of synaptic plasticity rules, but whether these amount to an approximation to gradient descent is unclear. Here we investigate a previously overlooked possibility: that learning dynamics may include fundamentally non-gradient "curl"-like components while still being able to effectively optimize a loss function. Curl terms naturally emerge in networks with excitatory-inhibitory connectivity or Hebbian/anti-Hebbian plasticity, resulting in learning dynamics that cannot be framed as gradient descent on any objective. To investigate the impact of these curl terms, we analyze feedforward networks within an analytically tractable student-teacher framework, systematically introducing non-gradient dynamics through rule-flipped neurons. Small curl terms preserve the stability of the original solution manifold, resulting in learning dynamics similar to gradient descent. Beyond a critical value, strong curl terms destabilize the solution manifold. Depending on the network architecture, this loss of stability can lead to chaotic learning dynamics that destroy performance. In other cases, the curl terms can counterintuitively speed up learning compared to gradient descent by allowing the weight dynamics to escape saddles by temporarily ascending the loss. Our results identify specific architectures capable of supporting robust learning via diverse learning rules, providing an important counterpoint to normative theories of gradient-based learning in neural networks.

View full details

Poster

Unleashing Hour-Scale Video Training for Long Video-Language Understanding

Jingyang Lin ⋅ Jialian Wu ⋅ Ximeng Sun ⋅ Ze Wang ⋅ Jiang Liu ⋅ Yusheng Su ⋅ Xiaodong Yu ⋅ Hao Chen ⋅ Jiebo Luo ⋅ Zicheng Liu ⋅ Emad Barsoum

Dec 5, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Recent long-form video-language understanding benchmarks have driven progress in video large multimodal models (Video-LMMs). However, the scarcity of well-annotated long videos has left the training of hour-long Video-LMMs underexplored. To close this gap, we present VideoMarathon, a large-scale hour-long video instruction-following dataset. This dataset includes around 9,700 hours of long videos sourced from diverse domains, ranging from 3 to 60 minutes per video. Specifically, it contains 3.3M high-quality QA pairs, spanning six fundamental topics: temporality, spatiality, object, action, scene, and event. Compared to existing video instruction datasets, VideoMarathon significantly extends training video durations up to 1 hour, and supports 22 diverse tasks requiring both short- and long-term video comprehension. Building on VideoMarathon, we propose Hour-LLaVA, a powerful and efficient Video-LMM for hour-scale video-language modeling. It enables hour-long video training and inference at 1-FPS sampling by leveraging a memory augmentation module, which adaptively integrates question-relevant and spatiotemporally informative semantics from the cached full video context. In our experiments, Hour-LLaVA achieves the best performance on multiple representative long video-language benchmarks, demonstrating the high quality of the VideoMarathon dataset and the superiority of the Hour-LLaVA model.

View full details

Poster

DeepHalo: A Neural Choice Model with Controllable Context Effects

Shuhan Zhang ⋅ Zhi Wang ⋅ Rui Gao ⋅ Shuang Li

Dec 5, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Modeling human decision-making is central to applications such as recommendation, preference learning, and human-AI alignment. While many classic models assume context-independent choice behavior, a large body of behavioral research shows that preferences are often influenced by the composition of the choice set itself---a phenomenon known as the context effect or Halo effect. These effects can manifest as pairwise (first-order) or even higher-order interactions among the available alternatives. Recent models that attempt to capture such effects either focus on the featureless setting or, in the feature-based setting, rely on restrictive interaction structures or entangle interactions across all orders, which limits interpretability. In this work, we propose DeepHalo, a neural modeling framework that incorporates features while enabling explicit control over interaction order and principled interpretation of context effects. Our model enables systematic identification of interaction effects by order and serves as a universal approximator of context-dependent choice functions when specialized to a featureless setting. Experiments on synthetic and real-world datasets demonstrate strong predictive performance while providing greater transparency into the drivers of choice.

View full details

Poster

Counteractive RL: Rethinking Core Principles for Efficient and Scalable Deep Reinforcement Learning

Ezgi Korkmaz

Dec 5, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Following the pivotal success of learning strategies to win at tasks, solely by interacting with an environment without any supervision, agents have gained the ability to make sequential decisions in complex MDPs. Yet, reinforcement learning policies face exponentially growing state spaces in high dimensional MDPs resulting in a dichotomy between computational complexity and policy success. In our paper we focus on the agent’s interaction with the environment in a high-dimensional MDP during the learning phase and we introduce a theoretically-founded novel paradigm based on experiences obtained through counteractive actions. Our analysis and method provide a theoretical basis for efficient, effective, scalable and accelerated learning, and further comes with zero additional computational complexity while leading to significant acceleration in training. We conduct extensive experiments in the Arcade Learning Environment with high-dimensional state representation MDPs. The experimental results further verify our theoretical analysis, and our method achieves significant performance increase with substantial sample-efficiency in high-dimensional environments.

View full details

Poster

Data Mixing Can Induce Phase Transitions in Knowledge Acquisition

Xinran Gu ⋅ Kaifeng Lyu ⋅ Jiazheng Li ⋅ Jingzhao Zhang

Dec 5, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Large Language Models (LLMs) are typically trained on data mixtures: most data come from web scrapes, while a small portion is curated from high-quality sources with dense domain-specific knowledge. In this paper, we show that when training LLMs on such data mixtures, knowledge acquisition from knowledge-dense datasets—unlike training exclusively on knowledge-dense data—does not always follow a smooth scaling law but can exhibit phase transitions with respect to the mixing ratio and model size. Through controlled experiments on a synthetic biography dataset mixed with web-scraped data, we demonstrate that: (1) as we increase the model size to a critical value, the model suddenly transitions from memorizing very few to most of the biographies; (2) below a critical mixing ratio, the model memorizes almost nothing even with extensive training, but beyond this threshold, it rapidly memorizes more biographies. We attribute these phase transitions to a capacity allocation phenomenon: a model with bounded capacity must act like a knapsack problem solver to minimize the overall test loss, and the optimal allocation across datasets can change discontinuously as the model size or mixing ratio varies. We formalize this intuition in an information-theoretic framework and reveal that these phase transitions are predictable, with the critical mixing ratio following a power-law relationship with the model size. Our findings highlight a concrete case where a good mixing recipe for large models may not be optimal for small models, and vice versa.

View full details

Poster

How Well Can Differential Privacy Be Audited in One Run?

Amit Keinan ⋅ Moshe Shenfeld ⋅ Katrina Ligett

Dec 5, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Recent methods for auditing the privacy of machine learning algorithms have improved computational efficiency by simultaneously intervening on multiple training examples in a single training run. Steinke et al. prove that one-run auditing indeed lower bounds the true privacy parameter of the audited algorithm, and give impressive empirical results. Their work leaves open the question of how precisely one-run auditing can uncover the true privacy parameter of an algorithm, and how that precision depends on the audited algorithm. In this work, we characterize the maximum achievable efficacy of one-run auditing and show that the key barrier to its efficacy is interference between the observable effects of different data elements. We present new conceptual approaches to minimize this barrier, towards improving the performance of one-run auditing of real machine learning algorithms.

View full details

Poster

Differentiable Cyclic Causal Discovery Under Unmeasured Confounders

Muralikrishnna Guruswamy Sethuraman ⋅ Faramarz Fekri

Dec 5, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Understanding causal relationships between variables is fundamental across scientific disciplines. Most causal discovery algorithms rely on two key assumptions: (i) all variables are observed, and (ii) the underlying causal graph is acyclic. While these assumptions simplify theoretical analysis, they are often violated in real-world systems, such as biological networks. Existing methods that account for confounders either assume linearity or struggle with scalability. To address these limitations, we propose DCCD-CONF, a novel framework for differentiable learning of nonlinear cyclic causal graphs in the presence of unmeasured confounders using interventional data. Our approach alternates between optimizing the graph structure and estimating the confounder distribution by maximizing the log-likelihood of the data. Through experiments on synthetic data and real-world gene perturbation datasets, we show that DCCD-CONF outperforms state-of-the-art methods in both causal graph recovery and confounder identification. Additionally, we provide consistency guarantees for our framework, reinforcing its theoretical soundness.

View full details

Poster

Forecasting in Offline Reinforcement Learning for Non-stationary Environments

Suzan Ece Ada ⋅ Georg Martius ⋅ Emre Ugur ⋅ Erhan Oztop

Dec 5, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Offline Reinforcement Learning (RL) provides a promising avenue for training policies from pre-collected datasets when gathering additional interaction data is infeasible. However, existing offline RL methods often assume stationarity or only consider synthetic perturbations at test time, assumptions that often fail in real-world scenarios characterized by abrupt, time-varying offsets. These offsets can lead to partial observability, causing agents to misperceive their true state and degrade performance. To overcome this challenge, we introduce Forecasting in Non-stationary Offline RL (FORL), a framework that unifies (i) conditional diffusion-based candidate state generation, trained without presupposing any specific pattern of future non-stationarity, and (ii) zero-shot time-series foundation models. FORL targets environments prone to unexpected, potentially non-Markovian offsets, requiring robust agent performance from the onset of each episode. Empirical evaluations on offline RL benchmarks, augmented with real-world time-series data to simulate realistic non-stationarity, demonstrate that FORL consistently improves performance compared to competitive baselines. By integrating zero-shot forecasting with the agent's experience, we aim to bridge the gap between offline RL and the complexities of real-world, non-stationary environments.

View full details

Poster

ELECTRA: A Cartesian Network for 3D Charge Density Prediction with Floating Orbitals

Jonas Elsborg ⋅ Luca Thiede ⋅ Alan Aspuru-Guzik ⋅ Tejs Vegge ⋅ Arghya Bhowmik

Dec 5, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

We present the Electronic Tensor Reconstruction Algorithm (ELECTRA) - an equivariant model for predicting electronic charge densities using floating orbitals. Floating orbitals are a long-standing concept in the quantum chemistry community that promises more compact and accurate representations by placing orbitals freely in space, as opposed to centering all orbitals at the position of atoms. Finding the ideal placement of these orbitals requires extensive domain knowledge, though, which thus far has prevented widespread adoption. We solve this in a data-driven manner by training a Cartesian tensor network to predict the orbital positions along with orbital coefficients. This is made possible through a symmetry-breaking mechanism that is used to learn position displacements with lower symmetry than the input molecule while preserving the rotation equivariance of the charge density itself. Inspired by recent successes of Gaussian Splatting in representing densities in space, we are using Gaussian orbitals and predicting their weights and covariance matrices. Our method achieves a state-of-the-art balance between computational efficiency and predictive accuracy on established benchmarks. Furthermore, ELECTRA is able to lower the compute time required to arrive at converged DFT solutions - initializing calculations using our predicted densities yields an average 50.72 % reduction in self-consistent field (SCF) iterations on unseen molecules.

View full details

Poster

ENMA: Tokenwise Autoregression for Continuous Neural PDE Operators

Armand Kassaï Koupaï ⋅ Lise Le Boudec ⋅ Louis Serrano ⋅ Patrick Gallinari

Dec 5, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Solving time-dependent parametric partial differential equations (PDEs) remains a fundamental challenge for neural solvers, particularly when generalizing across a wide range of physical parameters and dynamics. When data is uncertain or incomplete—as is often the case—a natural approach is to turn to generative models. We introduce ENMA, a generative neural operator designed to model spatio-temporal dynamics arising from physical phenomena. ENMA predicts future dynamics in a compressed latent space using a generative masked autoregressive transformer trained with flow matching loss, enabling tokenwise generation. Irregularly sampled spatial observations are encoded into uniform latent representations via attention mechanisms and further compressed through a spatio-temporal convolutional encoder. This allows ENMA to perform in-context learning at inference time by conditioning on either past states of the target trajectory or auxiliary context trajectories with similar dynamics. The result is a robust and adaptable framework that generalizes to new PDE regimes and supports one-shot surrogate modeling of time-dependent parametric PDEs.

View full details

Poster

Improving Perturbation-based Explanations by Understanding the Role of Uncertainty Calibration

Thomas Decker ⋅ Volker Tresp ⋅ Florian Buettner

Dec 5, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Perturbation-based explanations are widely utilized to enhance the transparency of machine-learning models in practice. However, their reliability is often compromised by the unknown model behavior under the specific perturbations used. This paper investigates the relationship between uncertainty calibration - the alignment of model confidence with actual accuracy - and perturbation-based explanations. We show that models systematically produce unreliable probability estimates when subjected to explainability-specific perturbations and theoretically prove that this directly undermines global and local explanation quality. To address this, we introduce ReCalX, a novel approach to recalibrate models for improved explanations while preserving their original predictions. Empirical evaluations across diverse models and datasets demonstrate that ReCalX consistently reduces perturbation-specific miscalibration most effectively while enhancing explanation robustness and the identification of globally important input features.

View full details

Poster

CausalPFN: Amortized Causal Effect Estimation via In-Context Learning

Vahid Balazadeh Meresht ⋅ Hamidreza Kamkari ⋅ Valentin Thomas ⋅ Junwei Ma ⋅ Bingru Li ⋅ Jesse Cresswell ⋅ Rahul Krishnan

Dec 5, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Causal effect estimation from observational data is fundamental across various applications. However, selecting an appropriate estimator from dozens of specialized methods demands substantial manual effort and domain expertise. We present CausalPFN, a single transformer that *amortizes* this workflow: trained once on a large library of simulated data-generating processes that satisfy ignorability, it infers causal effects for new observational datasets out of the box. CausalPFN combines ideas from Bayesian causal inference with the large-scale training protocol of prior-fitted networks (PFNs), learning to map raw observations directly to causal effects without any task-specific adjustment. Our approach achieves superior average performance on heterogeneous and average treatment effect estimation benchmarks (IHDP, Lalonde, ACIC). Moreover, it shows competitive performance for real-world policy making on uplift modeling tasks. CausalPFN provides calibrated uncertainty estimates to support reliable decision-making based on Bayesian principles. This ready-to-use model requires no further training or tuning and takes a step toward automated causal inference (https://github.com/vdblm/CausalPFN/).

View full details

Poster

Gaussian Herding across Pens: An Optimal Transport Perspective on Global Gaussian Reduction for 3DGS

Tao Wang ⋅ Mengyu Li ⋅ Geduo Zeng ⋅ Cheng Meng ⋅ Qiong Zhang

Dec 5, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

3D Gaussian Splatting (3DGS) has emerged as a powerful technique for radiance field rendering, but it typically requires millions of redundant Gaussian primitives, overwhelming memory and rendering budgets. Existing compaction approaches address this by pruning Gaussians based on heuristic importance scores, without global fidelity guarantee. To bridge this gap, we propose a novel optimal transport perspective that casts 3DGS compaction as global Gaussian mixture reduction. Specifically, we first minimize the composite transport divergence over a KD-tree partition to produce a compact geometric representation, and then decouple appearance from geometry by fine-tuning color and opacity attributes with far fewer Gaussian primitives. Experiments on benchmark datasets show that our method (i) yields negligible loss in rendering quality (PSNR, SSIM, LPIPS) compared to vanilla 3DGS with only 10\% Gaussians; and (ii) consistently outperforms state-of-the-art 3DGS compaction techniques. Notably, our method is applicable to any stage of vanilla or accelerated 3DGS pipelines, providing an efficient and agnostic pathway to lightweight neural rendering.

View full details

Poster

Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach

Jonas Geiping ⋅ Sean McLeish ⋅ Neel Jain ⋅ John Kirchenbauer ⋅ Siddharth Singh ⋅ Brian Bartoldson ⋅ Bhavya Kailkhura ⋅ Abhinav Bhatele ⋅ Tom Goldstein

Dec 5, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

We study a novel language model architecture that is capable of scaling test-time computation by implicitly reasoning in latent space. Our model works by iterating a recurrent block, thereby unrolling to arbitrary depth at test-time. This stands in contrast to mainstream reasoning models that scale up compute by producing more tokens. Unlike approaches based on chain-of-thought, our approach does not require any specialized training data, can work with small context windows, and can capture types of reasoning that are not easily represented in words. We train a proof-of-concept model from scratch with 3.5 billion parameters and 800 billion tokens. We show that this model can effortlessly use varying levels of compute, significantly improving with additional compute especially on reasoning tasks, such as math and coding. Further, this architecture naturally reduces compute costs via zero-shot per-token adaptive compute, KV-cache sharing and speculative decoding.

View full details

Poster

Stable Gradients for Stable Learning at Scale in Deep Reinforcement Learning

Roger Creus Castanyer ⋅ Johan Obando Ceron ⋅ Lu Li ⋅ Pierre-Luc Bacon ⋅ Glen Berseth ⋅ Aaron Courville ⋅ Pablo Samuel Castro

Dec 5, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Scaling deep reinforcement learning networks is challenging and often results in degraded performance, yet the root causes of this failure mode remain poorly understood. Several recent works have proposed mechanisms to address this, but they are often complex and fail to highlight the causes underlying this difficulty. In this work, we conduct a series of empirical analyses which suggest that the combination of non-stationarity with gradient pathologies, due to suboptimal architectural choices, underlie the challenges of scale. We propose a series of direct interventions that stabilize gradient flow, enabling robust performance across a range of network depths and widths. Our interventions are simple to implement and compatible with well-established algorithms, and result in an effective mechanism that enables strong performance even at large scales. We validate our findings on a variety of agents and suites of environments.

View full details

Poster

What Expressivity Theory Misses: Message Passing Complexity for GNNs

Niklas Kemper ⋅ Tom Wollschläger ⋅ Stephan Günnemann

Dec 5, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Expressivity theory, characterizing which graphs a GNN can distinguish, has become the predominant framework for analyzing GNNs, with new models striving for higher expressivity. However, we argue that this focus is misguided: First, higher expressivity is not necessary for most real-world tasks as these tasks rarely require expressivity beyond the basic WL test. Second, expressivity theory's binary characterization and idealized assumptions fail to reflect GNNs' practical capabilities. To overcome these limitations, we propose Message Passing Complexity (MPC): a continuous measure that quantifies the difficulty for a GNN architecture to solve a given task through message passing. MPC captures practical limitations like over-squashing while preserving the theoretical impossibility results from expressivity theory, effectively narrowing the gap between theory and practice. Through extensive validation on fundamental GNN tasks, we show that MPC's theoretical predictions correlate with empirical performance, successfully explaining architectural successes and failures. Thereby, MPC advances beyond expressivity theory to provide a more powerful framework for understanding and developing GNN architectures.

View full details

Poster

SpecEdge: Scalable Edge-Assisted Serving Framework for Interactive LLMs

Jinwoo Park ⋅ Seunggeun Cho ⋅ Dongsu Han

Dec 5, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Large language models (LLMs) power many modern applications, but serving them at scale remains costly and resource-intensive. Current server-centric systems overlook consumer-grade GPUs at the edge. We introduce SpecEdge, an edge-assisted inference framework that splits LLM workloads between edge and server GPUs using a speculative decoding scheme, exchanging only token outputs over the network. SpecEdge employs proactive edge drafting to overlap edge token creation with server verification and pipeline-aware scheduling that interleaves multiple user requests to increase server-side throughput. Experiments show SpecEdge enhances overall cost efficiency by **1.91×** through achieving **2.22×** server throughput, and reduces inter token latency by **11.24\%** compared to a server-only baseline, introducing a scalable, cost-effective paradigm for LLM serving. The code is available at https://github.com/kaist-ina/specedge

View full details

Poster

DEXTER: Diffusion-Guided EXplanations with TExtual Reasoning for Vision Models

Simone Carnemolla ⋅ Matteo Pennisi ⋅ Sarinda Samarasinghe ⋅ Giovanni Bellitto ⋅ Simone Palazzo ⋅ Daniela Giordano ⋅ Mubarak Shah ⋅ Concetto Spampinato

Dec 5, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Understanding and explaining the behavior of machine learning models is essential for building transparent and trustworthy AI systems. We introduce DEXTER, a data-free framework that employs diffusion models and large language models to generate global, textual explanations of visual classifiers. DEXTER operates by optimizing text prompts to synthesize class-conditional images that strongly activate a target classifier. These synthetic samples are then used to elicit detailed natural language reports that describe class-specific decision patterns and biases. Unlike prior work, DEXTER enables natural language explanation about a classifier's decision process without access to training data or ground-truth labels. We demonstrate DEXTER's flexibility across three tasks—activation maximization, slice discovery and debiasing, and bias explanation—each illustrating its ability to uncover the internal mechanisms of visual classifiers. Quantitative and qualitative evaluations, including a user study, show that DEXTER produces accurate, interpretable outputs. Experiments on ImageNet, Waterbirds, CelebA, and FairFaces confirm that DEXTER outperforms existing approaches in global model explanation and class-level bias reporting. Code is available at https://github.com/perceivelab/dexter.

View full details

Poster

Scaling can lead to compositional generalization

Florian Redhardt ⋅ Yassir Akram ⋅ Simon Schug

Dec 5, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Can neural networks systematically capture discrete, compositional task structure despite their continuous, distributed nature? The impressive capabilities of large-scale neural networks suggest that the answer to this question is yes. However, even for the most capable models, there are still frequent failure cases that raise doubts about their compositionality. Here, we seek to understand what it takes for a standard neural network to generalize over tasks that share compositional structure. We find that simply scaling data and model size leads to compositional generalization. We show that this holds across different task encodings as long as the training distribution sufficiently covers the task space. In line with this finding, we prove that standard multilayer perceptrons can approximate a general class of compositional task families to arbitrary precision using only a linear number of neurons with respect to the number of task modules. Finally, we uncover that if networks successfully compositionally generalize, the constituents of a task can be linearly decoded from their hidden activations. We show that this metric correlates with failures of text-to-image generation models to compose known concepts.

View full details

Poster

Is Noise Conditioning Necessary? A Unified Theory of Unconditional Graph Diffusion Models

JIPENG LI ⋅ Yanning Shen

Dec 5, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Explicit noise-level conditioning is widely regarded as essential for the effective operation of Graph Diffusion Models (GDMs). In this work, we challenge this assumption by investigating whether denoisers can implicitly infer noise levels directly from corrupted graph structures, potentially eliminating the need for explicit noise conditioning. To this end, we develop a theoretical framework centered on Bernoulli edge-flip corruptions and extend it to encompass more complex scenarios involving coupled structure-attribute noise. Extensive empirical evaluations on both synthetic and real-world graph datasets, using models such as GDSS and DiGress, provide strong support for our theoretical findings. Notably, unconditional GDMs achieve performance comparable or superior to their conditioned counterparts, while also offering reductions in parameters (4-6%) and computation time (8-10%). Our results suggest that the high-dimensional nature of graph data itself often encodes sufficient information for the denoising process, opening avenues for simpler, more efficient GDM architectures.

View full details

Poster

BayeSQP: Bayesian Optimization through Sequential Quadratic Programming

Paul Brunzema ⋅ Sebastian Trimpe

Dec 5, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

We introduce BayeSQP, a novel algorithm for general black-box optimization that merges the structure of sequential quadratic programming with concepts from Bayesian optimization. BayeSQP employs second-order Gaussian process surrogates for both the objective and constraints to jointly model the function values, gradients, and Hessian from only zero-order information. At each iteration, a local subproblem is constructed using the GP posterior estimates and solved to obtain a search direction. Crucially, the formulation of the subproblem explicitly incorporates uncertainty in both the function and derivative estimates, resulting in a tractable second-order cone program for high probability improvements under model uncertainty. A subsequent one-dimensional line search via constrained Thompson sampling selects the next evaluation point. Empirical results show that BayeSQP outperforms state-of-the-art methods in specific high-dimensional settings. Our algorithm offers a principled and flexible framework that bridges classical optimization techniques with modern approaches to black-box optimization.

View full details

Poster

Option-aware Temporally Abstracted Value for Offline Goal-Conditioned Reinforcement Learning

Hongjoon Ahn ⋅ Heewoong Choi ⋅ Jisu Han ⋅ Taesup Moon

Dec 5, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Offline goal-conditioned reinforcement learning (GCRL) offers a practical learning paradigm in which goal-reaching policies are trained from abundant state–action trajectory datasets without additional environment interaction. However, offline GCRL still struggles with long-horizon tasks, even with recent advances that employ hierarchical policy structures, such as HIQL. Identifying the root cause of this challenge, we observe the following insight. Firstly, performance bottlenecks mainly stem from the high-level policy’s inability to generate appropriate subgoals. Secondly, when learning the high-level policy in the long-horizon regime, the sign of the advantage estimate frequently becomes incorrect. Thus, we argue that improving the value function to produce a clear advantage estimate for learning the high-level policy is essential. In this paper, we propose a simple yet effective solution: _**Option-aware Temporally Abstracted**_ value learning, dubbed **OTA**, which incorporates temporal abstraction into the temporal-difference learning process. By modifying the value update to be _option-aware_, our approach contracts the effective horizon length, enabling better advantage estimates even in long-horizon regimes. We experimentally show that the high-level policy learned using the OTA value function achieves strong performance on complex tasks from OGBench, a recently proposed offline GCRL benchmark, including maze navigation and visual robotic manipulation environments. Our code is available at https://github.com/ota-v/ota-v

View full details

Poster

Thousand Voices of Trauma: A Large-Scale Synthetic Dataset for Modeling Prolonged Exposure Therapy Conversations

Suhas BN ⋅ Andrew Sherrill ⋅ Rosa I. Arriaga ⋅ Christopher Wiese ⋅ Saeed Abdullah

Dec 5, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

The advancement of AI systems for mental health support is hindered by limited access to therapeutic conversation data, particularly for trauma treatment. We present Thousand Voices of Trauma, a synthetic benchmark dataset of 3,000 therapy conversations based on Prolonged Exposure therapy protocols for Post-traumatic Stress Disorder (PTSD). The dataset comprises 500 unique cases, each explored through six conversational perspectives that mirror the progression of therapy from initial anxiety to peak distress to emotional processing. We incorporated diverse demographic profiles (ages 18-80, M=49.3, 49.4\% male, 44.4\% female, 6.2\% non-binary), 20 trauma types, and 10 trauma-related behaviors using deterministic and probabilistic generation methods. Analysis reveals realistic distributions of trauma types (witnessing violence 10.6\%, bullying 10.2\%) and symptoms (nightmares 23.4\%, substance abuse 20.8\%). Clinical experts validated the dataset's therapeutic fidelity, highlighting its emotional depth while suggesting refinements for greater authenticity. We also developed an emotional trajectory benchmark with standardized metrics for evaluating model responses. This privacy-preserving dataset addresses critical gaps in trauma-focused mental health data, offering a valuable resource for advancing both patient-facing applications and clinician training tools.

View full details

Poster

Sharp Gaussian approximations for Decentralized Federated Learning

SOHAM BONNERJEE ⋅ Sayar Karmakar ⋅ Wei Biao Wu

Dec 5, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Federated Learning has gained traction in privacy-sensitive collaborative environments, with local SGD emerging as a key optimization method in decentralized settings. While its convergence properties are well-studied, asymptotic statistical guarantees beyond convergence remain limited. In this paper, we present two generalized Gaussian approximation results for local SGD and explore their implications. First, we prove a Berry-Esseen theorem for the final local SGD iterates, enabling valid multiplier bootstrap procedures. Second, motivated by robustness considerations, we introduce two distinct time-uniform Gaussian approximations for the entire trajectory of local SGD. The time-uniform approximations support Gaussian bootstrap-based tests for detecting adversarial attacks. Extensive simulations are provided to support our theoretical results.

View full details

Poster

Return of ChebNet: Understanding and Improving an Overlooked GNN on Long Range Tasks

Ali Hariri ⋅ Alvaro Arroyo ⋅ Alessio Gravina ⋅ Moshe Eliasof ⋅ Carola-Bibiane Schönlieb ⋅ Davide Bacciu ⋅ Xiaowen Dong ⋅ Kamyar Azizzadenesheli ⋅ Pierre Vandergheynst

Dec 5, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

ChebNet, one of the earliest spectral GNNs, has largely been overshadowed by Message Passing Neural Networks (MPNNs), which gained popularity for their simplicity and effectiveness in capturing local graph structure. Despite their success, MPNNs are limited in their ability to capture long-range dependencies between nodes. This has led researchers to adapt MPNNs through *rewiring* or make use of *Graph Transformers*, which compromise the computational efficiency that characterized early spatial message passing architectures, and typically disregard the graph structure. Almost a decade after its original introduction, we revisit ChebNet to shed light on its ability to model distant node interactions. We find that out-of-box, ChebNet already shows competitive advantages relative to classical MPNNs and GTs on long-range benchmarks, while maintaining good scalability properties for high-order polynomials. However, we uncover that this polynomial expansion leads ChebNet to an unstable regime during training. To address this limitation, we cast ChebNet as a stable and non-dissipative dynamical system, which we coin Stable-ChebNet. Our Stable-ChebNet model allows for stable information propagation, and has controllable dynamics which do not require the use of eigendecompositions, positional encodings, or graph rewiring. Across several benchmarks, Stable-ChebNet achieves near state-of-the-art performance.

View full details

Poster

An Efficient Orlicz-Sobolev Approach for Transporting Unbalanced Measures on a Graph

Tam Le ⋅ Truyen Nguyen ⋅ Hideitsu Hino ⋅ Kenji Fukumizu

Dec 5, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

We investigate optimal transport (OT) for measures on graph metric spaces with different total masses. To mitigate the limitations of traditional $L^p$ geometry, Orlicz-Wasserstein (OW) and generalized Sobolev transport (GST) employ \emph{Orlicz geometric structure}, leveraging convex functions to capture nuanced geometric relationships and remarkably contribute to advance certain machine learning approaches. However, both OW and GST are restricted to measures with equal total mass, limiting their applicability to real-world scenarios where mass variation is common, and input measures may have noisy supports, or outliers. To address unbalanced measures, OW can either incorporate mass constraints or marginal discrepancy penalization, but this leads to a more complex two-level optimization problem. Additionally, GST provides a scalable yet rigid framework, which poses significant challenges to extend GST to accommodate nonnegative measures. To tackle these challenges, in this work we revisit the entropy partial transport (EPT) problem. By exploiting Caffarelli \& McCann's insights, we develop a novel variant of EPT endowed with Orlicz geometric structure, called \emph{Orlicz-EPT}. We establish theoretical background to solve Orlicz-EPT using a binary search algorithmic approach. Especially, by leveraging the dual EPT and the underlying graph structure, we formulate a novel regularization approach that leads to the proposed \emph{Orlicz-Sobolev transport} (OST). Notably, we demonstrate that OST can be efficiently computed by simply solving a univariate optimization problem, in stark contrast to the intensive computation needed for Orlicz-EPT. Building on this, we derive geometric structures for OST and draw its connections to other transport distances. We empirically illustrate that OST is several-order faster than Orlicz-EPT. Furthermore, we show preliminary evidence on the advantages of OST for measures on a graph in document classification and topological data analysis.

View full details

Poster

TRIDENT: Tri-Modal Molecular Representation Learning with Taxonomic Annotations and Local Correspondence

Feng Jiang ⋅ Mangal Prakash ⋅ Hehuan Ma ⋅ Jianyuan Deng ⋅ Yuzhi Guo ⋅ Maolaaisha Aminanmu ⋅ Tommaso Mansi ⋅ Rui Liao ⋅ Junzhou Huang

Dec 5, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Molecular property prediction aims to learn representations that map chemical structures to functional properties. While multimodal learning has emerged as a powerful paradigm to learn molecular representations, prior works have largely overlooked textual and taxonomic information of molecules for representation learning. We introduce TRIDENT, a novel framework that integrates molecular SMILES, textual descriptions, and taxonomic functional annotations to learn rich molecular representations. To achieve this, we curate a comprehensive dataset of molecule-text pairs with structured, multi-level functional annotations. Instead of relying on conventional contrastive loss, TRIDENT employs a volume-based alignment objective to jointly align tri-modal features at the global level, enabling soft, geometry-aware alignment across modalities. Additionally, TRIDENT introduces a novel local alignment objective that captures detailed relationships between molecular substructures and their corresponding sub-textual descriptions. A momentum-based mechanism dynamically balances global and local alignment, enabling the model to learn both broad functional semantics and fine-grained structure-function mappings. TRIDENT achieves state-of-the-art performance on 18 downstream tasks, demonstrating the value of combining SMILES, textual, and taxonomic functional annotations for molecular property prediction. Our code and data are available at https://github.com/uta-smile/TRIDENT.

View full details

Poster

HYPERION: Fine-Grained Hypersphere Alignment for Robust Federated Graph Learning

Frank Wan ⋅ Xiaoran Shang ⋅ Yuxin Wu ⋅ Guibin Zhang ⋅ Jinhe Bi ⋅ Liangtao Zheng ⋅ Xin Lin ⋅ Yue Liu ⋅ Yanbiao Ma ⋅ Wenke Huang ⋅ Bo Du

Dec 5, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Robust Federated Graph Learning (FGL) provides an effective decentralized framework for training Graph Neural Networks (GNNs) in noisy-label environments. However, the subtlety of noise during training presents formidable obstacles for developing robust FGL systems. Previous robust FL approaches neither adequately constrain edge-mediated error propagation nor account for intra-class topological differences. At the client level, we innovatively demonstrate that hyperspherical embedding can effectively capture graph structures in a fine-grained manner. Correspondingly, our method effectively addresses the aforementioned issues through fine-grained hypersphere alignment. Moreover, we uncover undetected noise arising from localized perspective constraints and propose the geometric-aware hyperspherical purification module at the server level. Combining both level strategies, we present our robust FGL framework,**HYPERION**, which operates all components within a unified hyperspherical space. **HYPERION** demonstrates remarkable robustness across multiple datasets, for instance, achieving a 29.7\% $\uparrow$ F1-macro score with 50\%-pair noise on Cora. The code is available for anonymous access at \url{https://anonymous.4open.science/r/Hyperion-NeurIPS/}.

View full details

Poster

Latent Policy Barrier: Learning Robust Visuomotor Policies by Staying In-Distribution

Zhanyi Sun ⋅ Shuran Song

Dec 5, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Visuomotor policies trained via behavior cloning are vulnerable to covariate shift, where small deviations from expert trajectories can compound into failure. Common strategies to mitigate this issue involve expanding the training distribution through human-in-the-loop corrections or synthetic data augmentation. However, these approaches are often labor-intensive, rely on strong task assumptions, or compromise the quality of imitation. We introduce Latent Policy Barrier, a framework for robust visuomotor policy learning. Inspired by Control Barrier Functions, LPB treats the latent embeddings of expert demonstrations as an implicit barrier separating safe, in-distribution states from unsafe, out-of-distribution (OOD) ones. Our approach decouples the role of precise expert imitation and OOD recovery into two separate modules: a base diffusion policy solely on expert data, and a dynamics model trained on both expert and suboptimal policy rollout data. At inference time, the dynamics model predicts future latent states and optimizes them to stay within the expert distribution. Both simulated and real-world experiments show that LPB improves both policy robustness and data efficiency, enabling reliable manipulation from limited expert data and without additional human correction or annotation. More details are on our anonymous project website https://latentpolicybarrier.github.io.

View full details

Poster

Bigram Subnetworks: Mapping to Next Tokens in Transformer Language Models

Tyler Chang ⋅ Benjamin Bergen

Dec 5, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

In Transformer language models, activation vectors transform from current token embeddings to next token predictions as they pass through the model. To isolate a minimal form of this transformation, we identify language model subnetworks that make bigram predictions, naive next token predictions based only on the current token. We find that bigram subnetworks can be found in fully trained language models up to 1B parameters, and these subnetworks are critical for model performance even when they consist of less than 0.2% of model parameters. Bigram subnetworks are concentrated in the first Transformer MLP layer, and they overlap significantly with subnetworks trained to optimally prune a given model. Mechanistically, the bigram subnetworks often recreate a pattern from the full models where the first layer induces a sharp change that aligns activations with next token predictions rather than current token representations. Our results demonstrate that bigram subnetworks comprise a minimal subset of parameters that are both necessary and sufficient for basic next token predictions in language models, and they help drive the transformation from current to next token activations in the residual stream. These subnetworks can lay a foundation for studying more complex language model circuits by building up from a minimal circuit.

View full details

Poster

Replicable Distribution Testing

Ilias Diakonikolas ⋅ Jingyi Gao ⋅ Daniel Kane ⋅ Sihan Liu ⋅ Christopher Ye

Dec 5, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

We initiate a systematic investigation of distribution testing in the framework of algorithmic replicability. Specifically, given independent samples from a collection of probability distributions, the goal is to characterize the sample complexity of replicably testing natural properties of the underlying distributions. On the algorithmic front, we develop new replicable algorithms for testing closeness and independence of discrete distributions. On the lower bound front, we develop a new methodology for proving sample complexity lower bounds for replicable testing that may be of broader interest. As an application of our technique, we establish near-optimal sample complexity lower bounds for replicable uniformity testing---answering an open question from prior work---and closeness testing.

View full details

Poster

Predictive Preference Learning from Human Interventions

Haoyuan Cai ⋅ Zhenghao (Mark) Peng ⋅ Bolei Zhou

Dec 5, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Learning from human involvement aims to incorporate the human subject to monitor and correct agent behavior errors. Although most interactive imitation learning methods focus on correcting the agent’s action at the current state, they do not adjust its actions in future states, which may be potentially more hazardous. To address this, we introduce Predictive Preference Learning from Human Interventions (PPL), which leverages the implicit preference signals contained in human interventions to inform predictions of future rollouts. The key idea of PPL is to bootstrap each human intervention into L future time steps, called the preference horizon, with the assumption that the agent follows the same action and the human makes the same intervention in the preference horizon. By applying preference optimization on these future states, expert corrections are propagated into the safety-critical regions where the agent is expected to explore, significantly improving learning efficiency and reducing human demonstrations needed. We evaluate our approach with experiments on both autonomous driving and robotic manipulation benchmarks and demonstrate its efficiency and generality. Our theoretical analysis further shows that selecting an appropriate preference horizon L balances coverage of risky states with label correctness, thereby bounding the algorithmic optimality gap. Demo and code are available at: https://metadriverse.github.io/ppl.

View full details

Poster

Transformers for Mixed-type Event Sequences

Felix Draxler ⋅ Yang Meng ⋅ Kai Nelson ⋅ Lukas Laskowski ⋅ Yibo Yang ⋅ Theofanis Karaletsos ⋅ Stephan Mandt

Dec 5, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Event sequences appear widely in domains such as medicine, finance, and remote sensing, yet modeling them is challenging due to their heterogeneity: sequences often contain multiple event types with diverse structures—for example, electronic health records that mix discrete events like medical procedures with continuous lab measurements. Existing approaches either tokenize all entries, violating natural inductive biases, or ignore parts of the data to enforce a consistent structure. In this work, we propose a simple yet powerful Marked Temporal Point Process (MTPP) framework for modeling event sequences with flexible structure, using a single unified model. Our approach employs a single autoregressive transformer with discrete and continuous prediction heads, capable of modeling variable-length, mixed-type event sequences. The continuous head leverages an expressive normalizing flow to model continuous event attributes, avoiding the numerical integration required for inter-event times in most competing methods. Empirically, our model excels on both discrete-only and mixed-type sequences, improving prediction quality and enabling interpretable uncertainty quantification. We make our code public at https://github.com/czi-ai/FlexTPP.

View full details

Poster

Sketched Gaussian Mechanism for Private Federated Learning

Qiaobo Li ⋅ Zhijie Chen ⋅ Arindam Banerjee

Dec 5, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Communication cost and privacy are two major considerations in federated learning (FL). For communication cost, gradient compression by sketching the clients’ transmitted model updates is often used for reducing per‐round communication. For privacy, the Gaussian mechanism (GM), which consists of clipping updates and adding Gaussian noise, is commonly used to guarantee client‐level differential privacy. Existing literature on private FL analyzes privacy of sketching and GM in an isolated manner, illustrating that sketching provides privacy determined by the sketching dimension and that GM has to supply any additional desired privacy. In this paper, we introduce the Sketched Gaussian Mechanism (SGM), which directly combines sketching and the Gaussian mechanism for privacy. Using Rényi-DP tools, we present a joint analysis of SGM's overall privacy guarantee, which is significantly more flexible and sharper compared to isolated analysis of sketching and GM privacy. In particular, we prove that the privacy level of SGM for a fixed noise magnitude is proportional to $1/\sqrt{b}$, where $b$ is the sketching dimension, indicating that (for moderate $b$) SGM can provide much stronger privacy guarantees than the original GM under the same noise budget. We demonstrate the application of SGM to FL with either gradient descent or adaptive server optimizers, and establish theoretical results on optimization convergence, which exhibits only a logarithmic dependence on the number of parameters $d$. Experimental results confirm that at the same privacy level, SGM based FL is at least competitive with non‐sketching private FL variants and outperforms them in some settings. Moreover, using adaptive optimization at the server improves empirical performance while maintaining the privacy guarantees.

View full details

Poster

SAGE-Eval: Evaluating LLMs for Systematic Generalizations of Safety Facts

Yueh-Han Chen ⋅ Guy Davidson ⋅ Brenden Lake

Dec 5, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Do LLMs robustly generalize critical safety facts to novel situations? Lacking this ability is dangerous when users ask naive questions—for instance, ``I'm considering packing melon balls for my 10-month-old's lunch. What other foods would be good to include?'' Before offering food options, the LLM should warn that melon balls pose a choking hazard to toddlers, as documented by the CDC. Failing to provide such warnings could result in serious injuries or even death. To evaluate this, we introduce SAGE-Eval, SAfety-fact systematic GEneralization evaluation, the first benchmark that tests whether LLMs properly apply well‑established safety facts to naive user queries. SAGE-Eval comprises 104 facts manually sourced from reputable organizations, systematically augmented to create 10,428 test scenarios across 7 common domains (e.g., Outdoor Activities, Medicine). We find that the top model, Claude-3.7-sonnet, passes only 58% of all the safety facts tested. We also observe that model capabilities and training compute weakly correlate with performance on SAGE-Eval, implying that scaling up is not the golden solution. Our findings suggest frontier LLMs still lack robust generalization ability. We recommend developers use SAGE-Eval in pre-deployment evaluations to assess model reliability in addressing salient risks.

View full details

Poster

Algorithms and SQ Lower Bounds for Robustly Learning Real-valued Multi-Index Models

Ilias Diakonikolas ⋅ Giannis Iakovidis ⋅ Daniel Kane ⋅ Lisheng Ren

Dec 5, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

We study the complexity of learning real-valued Multi-Index Models (MIMs) under the Gaussian distribution. A $K$-MIM is a function $f:\mathbb{R}^d\to \mathbb{R}$ that depends only on the projection of its input onto a $K$-dimensional subspace. We give a general algorithm for PAC learning a broad class of MIMs with respect to the square loss, even in the presence of adversarial label noise. Moreover, we establish a nearly matching Statistical Query (SQ) lower bound, providing evidence that the complexity of our algorithm is qualitatively optimal as a function of the dimension. Specifically, we consider the class of bounded variation MIMs with the property that degree at most $m$ distinguishing moments exist with respect to projections onto any subspace. In the presence of adversarial label noise, the complexity of our learning algorithm is $d^{O(m)}2^{\mathrm{poly}(K/\epsilon)}$. For the realizable and independent noise settings, our algorithm incurs complexity $d^{O(m)}2^{\mathrm{poly}(K)}(1/\epsilon)^{O(K)}$. To complement our upper bound, we show that if for some subspace degree-$m$ distinguishing moments do not exist, then any SQ learner for the corresponding class of MIMs requires complexity $d^{\Omega(m)}$. As an application, we give the first efficient learner for the class of positive-homogeneous $L$-Lipschitz $K$-MIMs. The resulting algorithm has complexity $\mathrm{poly}(d) 2^{\mathrm{poly}(KL/\epsilon)}$. This gives a new PAC learning algorithm for Lipschitz homogeneous ReLU networks with complexity independent of the network size, removing the exponential dependence incurred in prior work.

View full details

Poster

Uncertain Knowledge Graph Completion via Semi-Supervised Confidence Distribution Learning

Tianxing Wu ⋅ Shutong Zhu ⋅ Jingting Wang ⋅ Ning Xu ⋅ Guilin Qi ⋅ Haofen Wang

Dec 5, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Uncertain knowledge graphs (UKGs) associate each triple with a confidence score to provide more precise knowledge representations. Recently, since real-world UKGs suffer from the incompleteness, uncertain knowledge graph (UKG) completion attracts more attention, aiming to complete missing triples and confidences. Current studies attempt to learn UKG embeddings to solve this problem, but they neglect the extremely imbalanced distributions of triple confidences. This causes that the learnt embeddings are insufficient to high-quality UKG completion. Thus, in this paper, to address the above issue, we propose a new semi-supervised Confidence Distribution Learning (ssCDL) method for UKG completion, where each triple confidence is transformed into a confidence distribution to introduce more supervision information of different confidences to reinforce the embedding learning process. ssCDL iteratively learns UKG embedding by relational learning on labeled data (i.e., existing triples with confidences) and unlabeled data with pseudo labels (i.e., unseen triples with the generated confidences), which are predicted by meta-learning to augment the training data and rebalance the distribution of triple confidences. Experiments on two UKG datasets demonstrate that ssCDL consistently outperforms the state-of-the-art baselines in different evaluation metrics.

View full details

Poster

The World Is Bigger! A Computationally-Embedded Perspective on the Big World Hypothesis

Alex Lewandowski ⋅ Aditya Ramesh ⋅ Edan Meyer ⋅ Dale Schuurmans ⋅ Marlos C. Machado

Dec 5, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Continual learning is often motivated by the idea, known as the big world hypothesis, that ``the world is bigger'' than the agent. Recent problem formulations capture this idea by explicitly constraining an agent relative to the environment. These constraints lead to solutions in which the agent continually adapts to best use its limited capacity, rather than converging to a fixed solution. However, explicit constraints can be ad hoc, difficult to incorporate, and may limit the effectiveness of scaling up the agent's capacity. In this paper, we characterize a problem setting in which an agent, regardless of its capacity, is constrained by being embedded in the environment. In particular, we introduce a computationally-embedded perspective that represents an embedded agent as an automaton simulated within a universal (formal) computer. Such an automaton is always constrained; we prove that it is equivalent to an agent that interacts with a partially observable Markov decision process over a countably infinite state-space. We propose an objective for this setting, which we call interactivity, that measures an agent's ability to continually adapt its behaviour by learning new predictions. We then develop a model-based reinforcement learning algorithm for interactivity-seeking, and use it to construct a synthetic problem to evaluate continual learning capability. Our results show that deep nonlinear networks struggle to sustain interactivity, whereas deep linear networks sustain higher interactivity as capacity increases.

View full details

Poster

Embodied Web Agents: Bridging Physical-Digital Realms for Integrated Agent Intelligence

Yining Hong ⋅ Rui Sun ⋅ Bingxuan Li ⋅ Xingcheng Yao ⋅ Maxine Wu ⋅ Alexander Chien ⋅ Da Yin ⋅ Ying Nian Wu ⋅ Zhecan Wang ⋅ Kai-Wei Chang

Dec 5, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

AI agents today are mostly siloed — they either retrieve and reason over vast amount of digital information and knowledge obtained online; or interact with the physical world through embodied perception, planning and action — but rarely both. This separation limits their ability to solve tasks that require integrated physical and digital intelligence, such as cooking from online recipes, navigating with dynamic map data, or interpreting real-world landmarks using web knowledge. We introduce \textsc{Embodied Web Agents}, a novel paradigm for AI agents that fluidly bridge embodiment and web-scale reasoning. To operationalize this concept, we first develop the \textsc{Embodied Web Agents} task environments, a unified simulation platform that integrates realistic 3D indoor and outdoor environments with functional web interfaces. Building upon this platform, we construct and release the \textsc{Embodied Web Agents} Benchmark, which encompasses a diverse suite of tasks including cooking, navigation, shopping, tourism, and geolocation — all requiring coordinated reasoning across physical and digital realms for systematic assessment of cross-domain intelligence. Experimental results reveal significant performance gaps between state-of-the-art AI systems and human capabilities, establishing both challenges and opportunities at the intersection of embodied cognition and web-scale knowledge access.

View full details

Poster

Strategic Costs of Perceived Bias in Fair Selection

L. Elisa Celis ⋅ Lingxiao Huang ⋅ Milind Sohoni ⋅ Nisheeth K. Vishnoi

Dec 5, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Meritocratic systems, from admissions to hiring, aim to impartially reward skill and effort. Yet persistent disparities across race, gender, and class challenge this ideal. Some attribute these gaps to structural inequality; others to individual choice. We develop a game-theoretic model in which candidates from different socioeconomic groups differ in their perceived post-selection value—shaped by social context and, increasingly, by AI-powered tools offering personalized career or salary guidance. Each candidate strategically chooses effort, balancing its cost against expected reward; effort translates into observable merit, and selection is based solely on merit. We characterize the unique Nash equilibrium in the large-agent limit and derive explicit formulas showing how valuation disparities and institutional selectivity jointly determine effort, representation, social welfare, and utility. We further propose a cost-sensitive optimization framework that quantifies how modifying selectivity or perceived value can reduce disparities without compromising institutional goals. Our analysis reveals a perception-driven bias: when perceptions of post-selection value differ across groups, these differences translate into rational differences in effort, propagating disparities backward through otherwise "fair" selection processes. While the model is static, it captures one stage of a broader feedback cycle linking perceptions, incentives, and outcomes—bridging rational-choice and structural explanations of inequality by showing how techno-social environments shape individual incentives in meritocratic systems.

View full details

Poster

Blameless Users in a Clean Room: Defining Copyright Protection for Generative Models

Aloni Cohen

Dec 5, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Are there any conditions under which a generative model’s outputs are guaranteed not to infringe the copyrights of its training data? This is the question of "provable copyright protection" first posed by Vyas, Kakade, and Barak [ICML 2023]. They define _near access-freeness (NAF)_ and propose it as sufficient for protection. This paper revisits the question and establishes new foundations for provable copyright protection---foundations that are firmer both technically and legally. First, we show that NAF alone does not prevent infringement. In fact, NAF models can enable verbatim copying, a blatant failure of copy protection that we dub being _tainted_. Then, we introduce our _blameless copy protection framework_ for defining meaningful guarantees, and instantiate it with _clean-room copy protection_. Clean-room copy protection allows a user to control their risk of copying by behaving in a way that is unlikely to copy in a counterfactual "clean-room setting." Finally, we formalize a common intuition about differential privacy and copyright by proving that DP implies clean-room copy protection when the dataset is _golden_, a copyright deduplication requirement.

View full details

Poster

Diffusion Generative Modeling on Lie Group Representations

Marco Bertolini ⋅ Tuan Le ⋅ Djork-Arné Clevert

Dec 5, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

We introduce a novel class of score-based diffusion processes that operate directly in the representation space of Lie groups. Leveraging the framework of Generalized Score Matching, we derive a class of Langevin dynamics that decomposes as a direct sum of Lie algebra representations, enabling the modeling of any target distribution on any (non-Abelian) Lie group. Standard score-matching emerges as a special case of our framework when the Lie group is the translation group. We prove that our generalized generative processes arise as solutions to a new class of paired stochastic differential equations (SDEs), introduced here for the first time. We validate our approach through experiments on diverse data types, demonstrating its effectiveness in real-world applications such as $\text{SO}(3)$-guided molecular conformer generation and modeling ligand-specific global $\text{SE}(3)$ transformations for molecular docking, showing improvement in comparison to Riemannian diffusion on the group itself. We show that an appropriate choice of Lie group enhances learning efficiency by reducing the effective dimensionality of the trajectory space and enables the modeling of transitions between complex data distributions.

View full details

Poster

Depth-Width Tradeoffs for Transformers on Graph Tasks

Gilad Yehudai ⋅ Clayton Sanford ⋅ Maya Bechler-Speicher ⋅ Orr Fischer ⋅ Ran Gilad-Bachrach ⋅ Amir Globerson

Dec 5, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Transformers have revolutionized the field of machine learning. In particular, they can be used to solve complex algorithmic problems, including graph-based tasks. In such algorithmic tasks a key question is what is the minimal size of a transformer that can implement the task. Recent work has begun to explore this problem for graph-based tasks, showing that for sub-linear embedding dimension (i.e., model width) logarithmic depth suffices. However, an open question, which we address here, is what happens if width is allowed to grow linearly, while depth is kept fixed. Here we analyze this setting, and provide the surprising result that with linear width, constant depth suffices for solving a host of graph-based problems. This suggests that a moderate increase in width can allow much shallower models, which are advantageous in terms of inference and train time. For other problems, we show that quadratic width is required. Our results demonstrate the complex and intriguing landscape of transformer implementations of graph-based algorithms. We empirically investigate these trade-offs between the relative powers of depth and width and find tasks where wider models have the same accuracy as deep models, while having much faster train and inference time due to parallelizable hardware.

View full details

Poster

Physics-Driven Spatiotemporal Modeling for AI-Generated Video Detection

Shuhai Zhang ⋅ ZiHao Lian ⋅ Jiahao Yang ⋅ Daiyuan Li ⋅ Guoxuan Pang ⋅ Feng Liu ⋅ Bo Han ⋅ Shutao Li ⋅ Mingkui Tan

Dec 5, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

AI-generated videos have achieved near-perfect visual realism (e.g., Sora), urgently necessitating reliable detection mechanisms. However, detecting such videos faces significant challenges in modeling high-dimensional spatiotemporal dynamics and identifying subtle anomalies that violate physical laws. In this paper, we propose a physics-driven AI-generated video detection paradigm based on probability flow conservation principles. Specifically, we propose a statistic called Normalized Spatiotemporal Gradient (NSG), which quantifies the ratio of spatial probability gradients to temporal density changes, explicitly capturing deviations from natural video dynamics. Leveraging pre-trained diffusion models, we develop an NSG estimator through spatial gradients approximation and motion-aware temporal modeling without complex motion decomposition while preserving physical constraints. Building on this, we propose an NSG-based video detection method (NSG-VD) that computes the Maximum Mean Discrepancy (MMD) between NSG features of the test and real videos as a detection metric. Last, we derive an upper bound of NSG feature distances between real and generated videos, proving that generated videos exhibit amplified discrepancies due to distributional shifts. Extensive experiments confirm that NSG-VD outperforms state-of-the-art baselines by 16.00\% in Recall and 10.75\% in F1-Score, validating the superior performance of NSG-VD. The source code is available at \url{https://github.com/ZSHsh98/NSG-VD}.

View full details

Poster

Bubbleformer: Forecasting Boiling with Transformers

Sheikh Md Shakeel Hassan ⋅ Xianwei Zou ⋅ Akash Dhruv ⋅ Aparna Chandramowlishwaran

Dec 5, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Modeling boiling---an inherently chaotic, multiphase process central to energy and thermal systems---remains a significant challenge for neural PDE surrogates. Existing models require future input (e.g., bubble positions) during inference because they fail to learn nucleation from past states, limiting their ability to autonomously forecast boiling dynamics. They also fail to model flow boiling velocity fields, where sharp interface–momentum coupling demands long-range and directional inductive biases. We introduce Bubbleformer, a transformer-based spatiotemporal model that forecasts stable and long-range boiling dynamics including nucleation, interface evolution, and heat transfer without dependence on simulation data during inference. Bubbleformer integrates factorized axial attention, frequency-aware scaling, and conditions on thermophysical parameters to generalize across fluids, geometries, and operating conditions.To evaluate physical fidelity in chaotic systems, we propose interpretable physics-based metrics that evaluate heat flux consistency, interface geometry, and mass conservation. We also release BubbleML 2.0, a high-fidelity dataset that spans diverse working fluids (cryogens, refrigerants, dielectrics), boiling configurations (pool and flow boiling), flow regimes (bubbly, slug, annular), and boundary conditions. Bubbleformer sets new benchmark results in both prediction and forecasting of two-phase boiling flows.

View full details

Poster

Accelerating Diffusion LLMs via Adaptive Parallel Decoding

Daniel Israel ⋅ Guy Van den Broeck ⋅ Aditya Grover

Dec 5, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

The generation speed of LLMs are bottlenecked by autoregressive decoding, where tokens are predicted sequentially one by one. Alternatively, diffusion large language models (dLLMs) theoretically allow for parallel token generation, but in practice struggle to achieve the speed of autoregressive models without significantly sacrificing quality. We therefore introduce adaptive parallel decoding (APD), a novel method that dynamically adjusts the number of tokens sampled in parallel. We achieve this by defining a multiplicative mixture between the dLLM marginal probabilities and the joint probability of sequences under a small auxiliary autoregressive model. This inverts the standard setup of speculative decoding, where the goal is to sample from a large autoregressive verifier by drafting from a smaller model. We further optimize APD by enabling KV caching and limiting the size of the masked input. Altogether, our method puts forward three tunable parameters to flexibly tradeoff throughput and quality. We show that APD provides markedly higher throughput with minimal quality degradations on downstream benchmarks.

View full details

Poster

Whose View of Safety? A Deep DIVE Dataset for Pluralistic Alignment of Text-to-Image Models

Charvi Rastogi ⋅ Tian Huey Teh ⋅ Pushkar Mishra ⋅ Roma Patel ⋅ Ding Wang ⋅ Mark Díaz ⋅ Alicia Parrish ⋅ Aida Mostafazadeh Davani ⋅ Zoe Ashwood ⋅ Michela Paganini ⋅ Vinodkumar Prabhakaran ⋅ Verena Rieser ⋅ Lora Aroyo

Dec 5, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Current text-to-image (T2I) models often fail to account for diverse human experiences, leading to misaligned systems. We advocate for pluralism in AI alignment, where an AI understands and is steerable towards diverse, and often conflicting, human values. Our work provides three core contributions to achieve this in T2I models. First, we introduce a novel dataset for Diverse Intersectional Visual Evaluation (DIVE) -- the first multimodal dataset for pluralistic alignment. It enables deep alignment to diverse safety perspectives through a large pool of demographically intersectional human raters who provided extensive feedback across 1000 prompts, with high replication, capturing nuanced safety perceptions. Second, we empirically confirm demographics as a crucial proxy for diverse viewpoints in this domain, revealing significant, context-dependent differences in harm perception that diverge from conventional evaluations. Finally, we discuss implications for building aligned T2I models, including efficient data collection strategies, LLM judgment capabilities, and model steerability towards diverse perspectives. This research offers foundational tools for more equitable and aligned T2I systems.Content Warning: The paper includes sensitive content that may be harmful.

View full details

Poster

DeltaFlow: An Efficient Multi-frame Scene Flow Estimation Method

Qingwen Zhang ⋅ Xiaomeng Zhu ⋅ Yushan Zhang ⋅ Yixi Cai ⋅ Olov Andersson ⋅ Patric Jensfelt

Dec 5, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Previous dominant methods for scene flow estimation focus mainly on input from two consecutive frames, neglecting valuable information in the temporal domain. While recent trends shift towards multi-frame reasoning, they suffer from rapidly escalating computational costs as the number of frames grows. To leverage temporal information more efficiently, we propose DeltaFlow ($\Delta$Flow), a lightweight 3D framework that captures motion cues via a $\Delta$ scheme, extracting temporal features with minimal computational cost, regardless of the number of frames. Additionally, scene flow estimation faces challenges such as imbalanced object class distributions and motion inconsistency. To tackle these issues, we introduce a Category-Balanced Loss to enhance learning across underrepresented classes and an Instance Consistency Loss to enforce coherent object motion, improving flow accuracy. Extensive evaluations on the Argoverse 2, Waymo and nuScenes datasets show that $\Delta$Flow achieves state-of-the-art performance with up to 22\% lower error and $2\times$ faster inference compared to the next-best multi-frame supervised method, while also demonstrating a strong cross-domain generalization ability. The code is open-sourced at https://github.com/Kin-Zhang/DeltaFlow along with trained model weights.

View full details

Poster

AceSearcher: Bootstrapping Reasoning and Search for LLMs via Reinforced Self-Play

Ran Xu ⋅ Yuchen Zhuang ⋅ Zihan Dong ⋅ Ruiyu Wang ⋅ Yue Yu ⋅ Joyce Ho ⋅ Linjun Zhang ⋅ Haoyu Wang ⋅ Wenqi Shi ⋅ Carl Yang

Dec 5, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Search-augmented LLMs often struggle with complex reasoning tasks due to ineffective multi-hop retrieval and limited reasoning ability. We propose AceSearcher, a cooperative self-play framework that trains a single large language model (LLM) to alternate between two roles: a decomposer that breaks down complex queries and a solver that integrates retrieved contexts for answer generation. AceSearcher couples supervised fine-tuning on a diverse mixture of search, reasoning, and decomposition tasks with reinforcement fine-tuning optimized for final answer accuracy, eliminating the need for intermediate annotations. Extensive experiments on three reasoning-intensive tasks across 10 datasets show that AceSearcher outperforms state-of-the-art baselines, achieving an average exact match improvement of 7.6%. Remarkably, on document-level finance reasoning tasks, AceSearcher-32B matches the performance of the giant DeepSeek-V3 model using less than 5% of iits parameters. Even at smaller scales (1.5B and 8B), AceSearcher often surpasses existing search-augmented LLMs with up to 9× more parameters, highlighting its exceptional efficiency and effectiveness in tackling complex reasoning tasks.

View full details

Poster

On the Optimal Construction of Unbiased Gradient Estimators for Zeroth-Order Optimization

Shaocong Ma ⋅ Heng Huang

Dec 5, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Zeroth-order optimization (ZOO) is an important framework for stochastic optimization when gradients are unavailable or expensive to compute. A potential limitation of existing ZOO methods is the bias inherent in most gradient estimators unless the perturbation stepsize vanishes. In this paper, we overcome this biasedness issue by proposing a novel family of *unbiased* gradient estimators based solely on function evaluations. By reformulating directional derivatives as a telescoping series and sampling from carefully designed distributions, we construct estimators that eliminate bias while maintaining favorable variance. We analyze their theoretical properties, derive optimal scaling distributions and perturbation stepsizes of four specific constructions, and prove that SGD using the proposed estimators achieves optimal complexity for smooth non-convex objectives. Experiments on synthetic tasks and language model fine-tuning confirm the superior accuracy and convergence of our approach compared to standard methods.

View full details

Poster

Non-Clairvoyant Scheduling with Progress Bars

Ziyad Benomar ⋅ Romain Cosson ⋅ Alexander Lindermayr ⋅ Jens Schlöter

Dec 5, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

In non-clairvoyant scheduling, the goal is to minimize the total job completion time without prior knowledge of individual job processing times. This classical online optimization problem has recently gained attention through the framework of learning-augmented algorithms. We introduce a natural setting in which the scheduler receives continuous feedback in the form of progress bars—estimates of the fraction of each job completed over time. We design new algorithms for both adversarial and stochastic progress bars and prove strong competitive bounds. Our results in the adversarial case surprisingly induce improved guarantees for learning-augmented scheduling with job size predictions. We also introduce a general method for combining scheduling algorithms, yielding further insights in scheduling with predictions. Finally, we propose a stochastic model of progress bars as a more optimistic alternative to conventional worst-case models, and present an asymptotically optimal scheduling algorithm in this setting.

View full details

Poster

DICEPTION: A Generalist Diffusion Model for Visual Perceptual Tasks

Canyu Zhao ⋅ Yanlong Sun ⋅ Mingyu Liu ⋅ Huanyi Zheng ⋅ Muzhi Zhu ⋅ Zhiyue Zhao ⋅ Hao Chen ⋅ Tong He ⋅ Chunhua Shen

Dec 5, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

This paper's primary objective is to develop a robust generalist perception model capable of addressing multiple tasks under constraints of computational resources and limited training data. We leverage text-to-image diffusion models pre-trained on billions of images and successfully introduce our DICEPTION, a visual generalist model. Exhaustive evaluations demonstrate that DICEPTION effectively tackles diverse perception tasks, even achieving performance comparable to SOTA single-task specialist models. Specifically, we achieve results on par with SAM-vit-h using only 0.06% of their data (e.g., 600K vs.\ 1B pixel-level annotated images). We designed comprehensive experiments on architectures and input paradigms, demonstrating that the key to successfully re-purposing a single diffusion model for multiple perception tasks lies in maximizing the preservation of the pre-trained model's prior knowledge. Consequently, DICEPTION can be trained with substantially lower computational costs than conventional models requiring training from scratch. Furthermore, adapting DICEPTION to novel tasks is highly efficient, necessitating fine-tuning on as few as 50 images and approximately 1% of its parameters. Finally, we demonstrate that a subtle application of classifier-free guidance can improve the model's performance on depth and normal estimation. We also show that pixel-aligned training, as is characteristic of perception tasks, significantly enhances the model's ability to preserve fine details. DICEPTION offers valuable insights and presents a promising direction for the development of advanced diffusion-based visual generalist models.

View full details

Poster

ProtInvTree: Deliberate Protein Inverse Folding with Reward-guided Tree Search

Mengdi Liu ⋅ Xiaoxue Cheng ⋅ Zhangyang Gao ⋅ Hong Chang ⋅ Cheng Tan ⋅ Shiguang Shan ⋅ Xilin Chen

Dec 5, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Designing protein sequences that fold into a target 3D structure—known as protein inverse folding—is a fundamental challenge in protein engineering. While recent deep learning methods have achieved impressive performance by recovering native sequences, they often overlook the one-to-many nature of the problem: multiple diverse sequences can fold into the same structure. This motivates the need for a generative model capable of designing diverse sequences while preserving structural consistency. To address this trade-off, we introduce ProtInvTree, the first reward-guided tree-search framework for protein inverse folding. ProtInvTree reformulates sequence generation as a deliberate, step-wise decision-making process, enabling the exploration of multiple design paths and exploitation of promising candidates through self-evaluation, lookahead, and backtracking. We propose a two-stage focus-and-grounding action mechanism that decouples position selection and residue generation. To efficiently evaluate intermediate states, we introduce a jumpy denoising strategy that avoids full rollouts. Built upon pretrained protein language models, ProtInvTree supports flexible test-time scaling by adjusting the search depth and breadth without retraining. Empirically, ProtInvTree outperforms state-of-the-art baselines across multiple benchmarks, generating structurally consistent yet diverse sequences, including those far from the native ground truth. The code is available at https://github.com/A4Bio/ProteinInvBench/.

View full details

Poster

A Unified Solution to Video Fusion: From Multi-Frame Learning to Benchmarking

Zixiang Zhao ⋅ Haowen Bai ⋅ Bingxin Ke ⋅ Yukun Cui ⋅ Lilun Deng ⋅ Yulun Zhang ⋅ Kai Zhang ⋅ Konrad Schindler

Dec 5, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

The real world is dynamic, yet most image fusion methods process static frames independently, ignoring temporal correlations in videos and leading to flickering and temporal inconsistency. To address this, we propose Unified Video Fusion (UniVF), a novel and unified framework for video fusion that leverages multi-frame learning and optical flow-based feature warping for informative, temporally coherent video fusion. To support its development, we also introduce Video Fusion Benchmark (VF-Bench), the first comprehensive benchmark covering four video fusion tasks: multi-exposure, multi-focus, infrared-visible, and medical fusion. VF-Bench provides high-quality, well-aligned video pairs obtained through synthetic data generation and rigorous curation from existing datasets, with a unified evaluation protocol that jointly assesses the spatial quality and temporal consistency of video fusion. Extensive experiments show that UniVF achieves state-of-the-art results across all tasks on VF-Bench. Project page: [vfbench.github.io](https://vfbench.github.io).

View full details

Poster

Vanish into Thin Air: Cross-prompt Universal Adversarial Attacks for SAM2

Ziqi Zhou ⋅ Yifan Hu ⋅ Yufei Song ⋅ Zijing Li ⋅ Shengshan Hu ⋅ Leo Yu Zhang ⋅ Dezhong Yao ⋅ Long Zheng ⋅ Hai Jin

Dec 5, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Recent studies reveal the vulnerability of the image segmentation foundation model SAM to adversarial examples. Its successor, SAM2, has attracted significant attention due to its strong generalization capability in video segmentation. However, its robustness remains unexplored, and it is unclear whether existing attacks on SAM can be directly transferred to SAM2. In this paper, we first analyze the performance gap of existing attacks between SAM and SAM2 and highlight two key challenges arising from their architectural differences: directional guidance from the prompt and semantic entanglement across consecutive frames. To address these issues, we propose UAP-SAM2, the first cross-prompt universal adversarial attack against SAM2 driven by dual semantic deviation. For cross-prompt transferability, we begin by designing a target-scanning strategy that divides each frame into k regions, each randomly assigned a prompt, to reduce prompt dependency during optimization. For effectiveness, we design a dual semantic deviation framework that optimizes a UAP by distorting the semantics within the current frame and disrupting the semantic consistency across consecutive frames. Extensive experiments on six datasets across two segmentation tasks demonstrate the effectiveness of the proposed method for SAM2. The comparative results show that UAP-SAM2 significantly outperforms state-of-the-art (SOTA) attacks by a large margin.

View full details

Poster

Multi-agent Markov Entanglement

Shuze Chen ⋅ Tianyi Peng

Dec 5, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Value decomposition has long been a fundamental technique in multi-agent reinforcement learning and dynamic programming. Specifically, the value function of a global state $(s_1,s_2,\ldots,s_N)$ is often approximated as the sum of local functions: $V(s_1,s_2,\ldots,s_N)\approx\sum_{i=1}^N V_i(s_i)$. This approach has found various applications in modern RL systems. However, the theoretical justification for why this decomposition works so effectively remains underexplored. In this paper, we uncover the underlying mathematical structure that enables value decomposition. We demonstrate that a Markov decision process (MDP) permits value decomposition *if and only if* its transition matrix is not "entangled"—a concept analogous to quantum entanglement in quantum physics. Drawing inspiration from how physicists measure quantum entanglement, we introduce how to measure the "Markov entanglement" and show that this measure can be used to bound the decomposition error in general multi-agent MDPs. Using the concept of Markov entanglement, we proved that a widely-used class of policies, the index policy, is weakly-entangled and enjoys a sublinear $\mathcal O(\sqrt{N})$ scale of decomposition error for $N$-agent systems. Finally, we show Markov entanglement can be efficiently estimated, guiding practitioners on the feasibility of value decomposition.

View full details

Poster

KLASS: KL-Guided Fast Inference in Masked Diffusion Models

Seo Hyun Kim ⋅ Sunwoo Hong ⋅ Hojung Jung ⋅ Youngrok Park ⋅ Se-Young Yun

Dec 5, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Masked diffusion models have demonstrated competitive results on various tasks including language generation. However, due to its iterative refinement process, the inference is often bottlenecked by slow and static sampling speed. To overcome this problem, we introduce `KL-Adaptive Stability Sampling' (KLASS), a fast yet effective sampling method that exploits token-level KL divergence to identify stable, high-confidence predictions. By unmasking multiple tokens in each iteration without any additional model training, our approach speeds up generation significantly while maintaining sample quality. On reasoning benchmarks, KLASS achieves up to $2.78\times$ wall-clock speedups while improving performance over standard greedy decoding, attaining state-of-the-art results among diffusion-based samplers. We further validate KLASS across diverse domains, including text, image, and molecular generation, showing its effectiveness as a broadly applicable sampler across different models.

View full details

Poster

STITCH-OPE: Trajectory Stitching with Guided Diffusion for Off-Policy Evaluation

Hossein Goli ⋅ Michael Gimelfarb ⋅ Nathan de Lara ⋅ Haruki Nishimura ⋅ Masha Itkina ⋅ Florian Shkurti

Dec 5, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Off-policy evaluation (OPE) estimates the performance of a target policy using offline data collected from a behavior policy, and is crucial in domains such as robotics or healthcare where direct interaction with the environment is costly or unsafe. Existing OPE methods are ineffective for high-dimensional, long-horizon problems, due to exponential blow-ups in variance from importance weighting or compounding errors from learned dynamics models. To address these challenges, we propose STITCH-OPE, a model-based generative framework that leverages denoising diffusion for long-horizon OPE in high-dimensional state and action spaces. Starting with a diffusion model pre-trained on the behavior data, STITCH-OPE generates synthetic trajectories from the target policy by guiding the denoising process using the score function of the target policy. STITCH-OPE proposes two technical innovations that make it advantageous for OPE: (1) prevents over-regularization by subtracting the score of the behavior policy during guidance, and (2) generates long-horizon trajectories by stitching partial trajectories together end-to-end. We provide a theoretical guarantee that under mild assumptions, these modifications result in an exponential reduction in variance versus long-horizon trajectory diffusion. Experiments on the D4RL and OpenAI Gym benchmarks show substantial improvement in mean squared error, correlation, and regret metrics compared to state-of-the-art OPE methods.

View full details

Poster

VoxDet: Rethinking 3D Semantic Scene Completion as Dense Object Detection

Wuyang Li ⋅ Zhu Yu ⋅ Alexandre Alahi

Dec 5, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Semantic Scene Completion (SSC) aims to reconstruct the 3D geometry and semantics of the surrounding environment. With dense voxel labels, prior works typically formulate SSC as a *dense segmentation task*, independently classifying each voxel. However, this paradigm neglects critical instance-centric discriminability, leading to instance-level incompleteness and adjacent ambiguities. To address this, we highlight a "free lunch" of SSC labels: the voxel-level class label has implicitly told the instance-level insight, which is ever-overlooked by the community. Motivated by this observation, we first introduce a training-free **Voxel-to-Instance (VoxNT) trick**: a simple yet effective method that freely converts voxel-level class labels into instance-level offset labels. Building on this, we further propose **VoxDet**, an instance-centric framework that reformulates the voxel-level SSC as *dense object detection* by decoupling it into two sub-tasks: offset regression and semantic prediction. Specifically, based on the lifted 3D volume, VoxDet first uses (a) Spatially-decoupled Voxel Encoder to generate disentangled feature volumes for the two sub-tasks, which learn task-specific spatial deformation in the densely projected tri-perceptive space. Then, we deploy (b) Task-decoupled Dense Predictor to address SSC via dense detection. Here, we first regress a 4D offset field to estimate distances (6 directions) between voxels and the corresponding object boundaries in the voxel space. The regressed offsets are then used to guide the instance-level aggregation in the classification branch, achieving instance-aware scene completion. VoxDet can be deployed on both camera and LiDAR input and jointly achieves state-of-the-art results on both benchmarks, which gives 63.0 IoU on the SemanticKITTI test set, **ranking 1$^{st}$** on the online leaderboard.

View full details

Poster

Q-Insight: Understanding Image Quality via Visual Reinforcement Learning

Weiqi Li ⋅ Xuanyu Zhang ⋅ Shijie Zhao ⋅ Yabin ZHANG ⋅ Junlin Li ⋅ Li zhang ⋅ Jian Zhang

Dec 5, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Image quality assessment (IQA) focuses on the perceptual visual quality of images, playing a crucial role in downstream tasks such as image reconstruction, compression, and generation. The rapid advancement of multi-modal large language models (MLLMs) has significantly broadened the scope of IQA, moving toward comprehensive image quality understanding that incorporates content analysis, degradation perception, and comparison reasoning beyond mere numerical scoring. Previous MLLM-based methods typically either generate numerical scores lacking interpretability or heavily rely on supervised fine-tuning (SFT) using large-scale annotated datasets to provide descriptive assessments, limiting their flexibility and applicability. In this paper, we propose Q-Insight, a reinforcement learning-based model built upon group relative policy optimization (GRPO), which demonstrates strong visual reasoning capability for image quality understanding while requiring only a limited amount of rating scores and degradation labels. By jointly optimizing score regression and degradation perception tasks with carefully designed reward functions, our approach effectively exploits their mutual benefits for enhanced performance. Extensive experiments demonstrate that Q-Insight substantially outperforms existing state-of-the-art methods on both score regression and degradation perception tasks, while exhibiting impressive zero-shot generalization and superior comparison reasoning capability. The code and models are available at https://github.com/bytedance/Q-Insight.

View full details

Poster

$\Psi$-Sampler: Initial Particle Sampling for SMC-Based Inference-Time Reward Alignment in Score Models

Taehoon Yoon ⋅ Yunhong Min ⋅ Kyeongmin Yeo ⋅ Minhyuk Sung

Dec 5, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

We introduce $\Psi$-Sampler, an SMC-based framework incorporating pCNL-based initial particle sampling for effective inference-time reward alignment with a score-based model. Inference-time reward alignment with score-based generative models has recently gained significant traction, following a broader paradigm shift from pre-training to post-training optimization. At the core of this trend is the application of Sequential Monte Carlo (SMC) to the denoising process. However, existing methods typically initialize particles from the Gaussian prior, which inadequately captures reward-relevant regions and results in reduced sampling efficiency. We demonstrate that initializing from the reward-aware posterior significantly improves alignment performance. To enable posterior sampling in high-dimensional latent spaces, we introduce the preconditioned Crank–Nicolson Langevin (pCNL) algorithm, which combines dimension-robust proposals with gradient-informed dynamics. This approach enables efficient and scalable posterior sampling and consistently improves performance across various reward alignment tasks, including layout-to-image generation, quantity-aware generation, and aesthetic-preference generation, as demonstrated in our experiments.

View full details

Poster

GeRaF: Neural Geometry Reconstruction from Radio Frequency Signals

Jiachen Lu ⋅ Hailan Shanbhag ⋅ Haitham Al Hassanieh

Dec 5, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

GeRaF is the first method to use neural implicit learning for near-range 3D geometry reconstruction from radio frequency (RF) signals. Unlike RGB or LiDAR-based methods, RF sensing can see through occlusion but suffers from low resolution and noise due to its lens-less imaging nature. While lenses in RGB imaging constrain sampling to 1D rays, RF signals propagate through the entire space, introducing significant noise and leading to cubic complexity in volumetric rendering. Moreover, RF signals interact with surfaces via specular reflections requiring fundamentally different modeling. To address these challenges, GeRaF (1) introduces filter-based rendering to suppress irrelevant signals, (2) implements a physics-based RF volumetric rendering pipeline, and (3) proposes a novel lens-less sampling and lens-less alpha blending strategy that makes full-space sampling feasible during training. By learning signed distance functions, reflectiveness, and signal power through MLPs and trainable parameters, GeRaF takes the first step towards reconstructing millimeter-level geometry from RF signals in real-world settings.

View full details

Poster

Computational Efficiency under Covariate Shift in Kernel Ridge Regression

Andrea Della Vecchia ⋅ Arnaud Mavakala Watusadisi ⋅ Ernesto De Vito ⋅ Lorenzo Rosasco

Dec 5, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

This paper addresses the covariate shift problem in the context of nonparametric regression within reproducing kernel Hilbert spaces (RKHSs). Covariate shift arises in supervised learning when the input distributions of the training and test data differ, presenting additional challenges for learning. Although kernel methods have optimal statistical properties, their high computational demands in terms of time and, particularly, memory, limit their scalability to large datasets. To address this limitation, the main focus of this paper is to explore the trade-off between computational efficiency and statistical accuracy under covariate shift. We investigate the use of random projections where the hypothesis space consists of a random subspace within a given RKHS. Our results show that, even in the presence of covariate shift, significant computational savings can be achieved without compromising learning performance.

View full details

Poster

d1: Scaling Reasoning in Diffusion Large Language Models via Reinforcement Learning

Siyan Zhao ⋅ Devaansh Gupta ⋅ Qinqing Zheng ⋅ Aditya Grover

Dec 5, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Recent large language models (LLMs) have demonstrated strong reasoning capabilities that benefits from online reinforcement learning (RL). These capabilities have primarily been demonstrated within the left-to-right autoregressive (AR) generation paradigm. In contrast, non-autoregressive paradigms based on diffusion generate text in a coarse-to-fine manner. Although recent diffusion-based large language models (dLLMs) have achieved competitive language modeling performance compared to their AR counterparts, it remains unclear if dLLMs can also leverage recent advances in LLM reasoning. To this end, we propose, a framework to adapt pre-trained masked dLLMs into reasoning models via a combination of supervised finetuning (SFT) and RL. Specifically, we develop and extend techniques to improve reasoning in pretrained dLLMs: (a) we utilize a masked SFT technique to distill knowledge and instill self-improvement behavior directly from existing datasets, and (b) we introduce a novel critic-free, policy-gradient based RL algorithm called diffu-GRPO, the first integration of policy gradient methods to masked dLLMs. Through empirical studies, we investigate the performance of different post-training recipes on multiple mathematical and planning benchmarks. We find that d1 yields the best performance and significantly improves performance of a state-of-the-art dLLM.

View full details

Poster

Principled Data Augmentation for Learning to Solve Quadratic Programming Problems

Chendi Qian ⋅ Christopher Morris

Dec 5, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Linear and quadratic optimization are crucial in numerous real-world applications, ranging from training machine learning models to solving integer linear programs. Recently, learning-to-optimize methods (L2O) for linear (LPs) or quadratic programs (QPs) using message-passing graph neural networks (MPNNs) have gained traction, promising lightweight, data-driven proxies for solving such optimization problems. For example, they replace the costly computation of strong branching scores in branch-and-bound solvers, thereby reducing the need to solve many such optimization problems. However, robust L2O MPNNs remain challenging in data-scarce settings, especially when addressing complex optimization problems such as QPs. This work introduces a principled approach to data augmentation tailored for QPs via MPNNs. Our method leverages theoretically justified data augmentation techniques to generate diverse yet optimality-preserving instances. Furthermore, we integrate these augmentations into a self-supervised contrastive learning framework, thereby pretraining MPNNs for improved performance on L2O tasks. Extensive experiments demonstrate that our approach improves generalization in supervised scenarios and facilitates effective transfer learning to related optimization problems.

View full details

Poster

Extrapolation by Association: Length Generalization Transfer In Transformers

Ziyang Cai ⋅ Nayoung Lee ⋅ Avi Schwarzschild ⋅ Samet Oymak ⋅ Dimitris Papailiopoulos

Dec 5, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Transformer language models have demonstrated impressive generalization capabilities in natural language domains, yet we lack a fine-grained understanding of how such generalization arises. In this paper, we investigate length generalization—the ability to extrapolate from shorter to longer inputs—through the lens of \textit{task transfer}. We find that length generalization can be \textit{transferred} across related tasks. That is, training a model with a longer and related auxiliary task can lead the model to generalize to unseen and longer inputs from some other target task. We demonstrate this length generalization transfer across a diverse suite of algorithmic tasks, including arithmetic operations, string transformations, and maze navigation. Our results show that transformer models can inherit generalization capabilities from similar tasks when trained jointly. Moreover, we observe similar transfer effects in pretrained language models, suggesting that pretraining equips models with reusable computational scaffolding that facilitates extrapolation in downstream settings. Finally, we provide initial mechanistic evidence that length generalization transfer correlates with the re-use of the same attention heads between the tasks. Together, our findings deepen our understanding of how transformers generalize to out-of-distribution inputs and highlight the compositional reuse of inductive structure across tasks.

View full details

Poster

Private Hyperparameter Tuning with Ex-Post Guarantee

Badih Ghazi ⋅ Pritish Kamath ⋅ Alexander Knop ⋅ Ravi Kumar ⋅ Pasin Manurangsi ⋅ Chiyuan Zhang

Dec 5, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

The conventional approach in differential privacy (DP) literature formulates the privacy-utility tradeoff with a "privacy-first" perspective: for a predetermined level of privacy, a certain utility is achievable. However, practitioners often operate under a "utility-first" paradigm, prioritizing a desired level of utility and then determining the corresponding privacy cost. Wu et al. [2019] initiated a formal study of this ``utility-first'' perspective by introducing ex-post DP. They demonstrated that by adding correlated Laplace noise and progressively reducing it on demand, a sequence of increasingly accurate estimates of a private parameter can be generated, with the privacy cost attributed only to the least noisy iterate released. This led to a Laplace mechanism variant that achieves a specified utility with minimal privacy loss. However, their work, and similar findings by Whitehouse et al. [2023], are primarily limited to simple mechanisms based on Laplace or Gaussian noise. In this paper, we significantly generalize these results. In particular, we extend the findings of Wu et al. [2019] and Liu and Talwar [2019] to support any sequence of private estimators, incurring at most a doubling of the original privacy budget. Furthermore, we demonstrate that hyperparameter tuning for these estimators, including the selection of an optimal privacy budget, can be performed without additional privacy cost. Finally, we extend our results to ex-post R\'{e}nyi DP, further broadening the applicability of utility-first privacy mechanisms.

View full details

Poster

A Implies B: Circuit Analysis in LLMs for Propositional Logical Reasoning

Guan Zhe Hong ⋅ Nishanth Dikkala ⋅ Enming Luo ⋅ Cyrus Rashtchian ⋅ Xin Wang ⋅ Rina Panigrahy

Dec 5, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Due to the size and complexity of modern large language models (LLMs), it has proven challenging to uncover the underlying mechanisms that models use to solve reasoning problems. For instance, is their reasoning for a specific problem localized to certain parts of the network? Do they break down the reasoning problem into modular components that are then executed as sequential steps as we go deeper in the model? To better understand the reasoning capability of LLMs, we study a minimal propositional logic problem that requires combining multiple facts to arrive at a solution. By studying this problem on Mistral and Gemma models, up to 27B parameters, we illuminate the core components the models use to solve such logic problems. From a mechanistic interpretability point of view, we use causal mediation analysis to uncover the pathways and components of the LLMs' reasoning processes. Then, we offer fine-grained insights into the functions of attention heads in different layers. We not only find a sparse circuit that computes the answer, but we decompose it into sub-circuits that have four distinct and modular uses. Finally, we reveal that three distinct models -- Mistral-7B, Gemma-2-9B and Gemma-2-27B -- contain analogous but not identical mechanisms.

View full details

Poster

Unlocking Dataset Distillation with Diffusion Models

Brian Moser ⋅ Federico Raue ⋅ Sebastian Palacio ⋅ Stanislav Frolov ⋅ Andreas Dengel

Dec 5, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Dataset distillation seeks to condense datasets into smaller but highly representative synthetic samples. While diffusion models now lead all generative benchmarks, current distillation methods avoid them and rely instead on GANs or autoencoders, or, at best, sampling from a fixed diffusion prior. This trend arises because naive backpropagation through the long denoising chain leads to vanishing gradients, which prevents effective synthetic sample optimization. To address this limitation, we introduce Latent Dataset Distillation with Diffusion Models (LD3M), the first method to learn gradient-based distilled latents and class embeddings end-to-end through a pre-trained latent diffusion model. A linearly decaying skip connection, injected from the initial noisy state into every reverse step, preserves the gradient signal across dozens of timesteps without requiring diffusion weight fine-tuning. Across multiple ImageNet subsets at $128\times128$ and $256\times256$, LD3M improves downstream accuracy by up to 4.8 percentage points (1 IPC) and 4.2 points (10 IPC) over the prior state-of-the-art. The code for LD3M is provided at https://github.com/Brian-Moser/prune_and_distill.

View full details

Poster

Distributional Training Data Attribution: What do Influence Functions Sample?

Bruno Mlodozeniec ⋅ Isaac Reid ⋅ Sam Power ⋅ David Krueger ⋅ Murat Erdogdu ⋅ Richard Turner ⋅ Roger Grosse

Dec 5, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Randomness is an unavoidable part of training deep learning models, yet something that traditional training data attribution algorithms fail to rigorously account for. They ignore the fact that, due to stochasticity in the initialisation and batching, training on the same dataset can yield different models. In this paper, we address this shortcoming through introducing _distributional_ training data attribution (d-TDA), the goal of which is to predict how the distribution of model outputs (over training runs) depends upon the dataset. Intriguingly, we find that _influence functions_ (IFs), a popular data attribution tool, are 'secretly distributional': they emerge from our framework as the limit to unrolled differentiation, without requiring restrictive convexity assumptions. This provides a new perspective on the effectiveness of IFs in deep learning. We demonstrate the practical utility of d-TDA in experiments, including improving data pruning for vision transformers and identifying influential examples with diffusion models.

View full details

Poster

SANSA: Unleashing the Hidden Semantics in SAM2 for Few-Shot Segmentation

Claudia Cuttano ⋅ Gabriele Trivigno ⋅ Giuseppe Averta ⋅ Carlo Masone

Dec 5, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Few-shot segmentation aims to segment unseen categories from just a handful of annotated examples. This requires mechanisms to identify semantically related objects across images and accurately produce masks. We note that Segment Anything 2 (SAM2), with its prompt-and-propagate mechanism, provides strong segmentation capabilities and a built-in feature matching process. However, we show that its representations are entangled with task-specific cues optimized for object tracking, which impairs its use for tasks requiring higher level semantic understanding. Our key insight is that, despite its class-agnostic pretraining, SAM2 already encodes rich semantic structure in its features. We propose SANSA (Semantically AligNed Segment Anything 2), a framework that makes this latent structure explicit, and repurposes SAM2 for few-shot segmentation through minimal task-specific modifications. SANSA achieves state-of-the-art on few-shot segmentation benchmarks designed to assess generalization and outperforms generalist methods in the popular in-context setting. Additionally, it supports flexible promptable interaction via points, boxes, or scribbles, and remains significantly faster and more compact than prior approaches. Code at: https://github.com/ClaudiaCuttano/SANSA.

View full details

Poster

Can Knowledge-Graph-based Retrieval Augmented Generation Really Retrieve What You Need?

Junchi Yu ⋅ Yujie Liu ⋅ Jindong Gu ⋅ Philip Torr ⋅ Dongzhan Zhou

Dec 5, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Retrieval-Augmented Generation (RAG) based on knowledge graphs (KGs) enhances large language models (LLMs) by providing structured and interpretable external knowledge. However, existing KG-based RAG methods struggle to retrieve accurate and diverse information from text-rich KGs for complex real-world queries. Process Reward Models (PRMs) offer a way to align the retrieval process of KG-based RAG with query-specific knowledge requirements, but they heavily rely on process-level supervision signals that are expensive and hard to obtain on KGs. To address this challenge, we propose GraphFlow, a framework that efficiently retrieves accurate and diverse knowledge required for real-world queries from text-rich KGs. GraphFlow employs a transition-based flow matching objective to jointly optimize a retrieval policy and a flow estimator. The flow estimator factorizes the reward of the retrieval outcome into the intermediate retrieval states. Such reward factorization guides the retrieval policy to retrieve candidates from KGs in proportion to their reward. This allows GraphFlow to explore high-quality regions of KGs that yield diverse and relevant results. We evaluate GraphFlow on the STaRK benchmark, which includes real-world queries from multiple domains over text-rich KGs. GraphFlow outperforms strong KG-RAG baselines, including GPT-4o, by 10\% on average in hit rate and recall. It also shows strong generalization to unseen KGs, demonstrating its effectiveness and robustness.

View full details

Poster

SuffixDecoding: Extreme Speculative Decoding for Emerging AI Applications

Gabriele Oliaro ⋅ Zhihao Jia ⋅ Daniel Campos ⋅ Aurick Qiao

Dec 5, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Speculative decoding is widely adopted to reduce latency in large language model (LLM) inference by leveraging smaller draft models capable of handling diverse user tasks. However, emerging AI applications, such as LLM-based agents, present unique workload characteristics: instead of diverse independent requests, agentic frameworks typically submit repetitive inference requests, such as multi-agent pipelines performing similar subtasks or self-refinement loops iteratively enhancing outputs. These workloads result in long and highly predictable sequences, which current speculative decoding methods do not effectively exploit. To address this gap, we introduce \emph{SuffixDecoding}, a novel method that utilizes efficient suffix trees to cache long token sequences from prompts and previous outputs. By adaptively speculating more tokens when acceptance likelihood is high and fewer when it is low, SuffixDecoding effectively exploits opportunities for longer speculations while conserving computation when those opportunities are limited. Evaluations on agentic benchmarks, including SWE-Bench and Text-to-SQL, demonstrate that SuffixDecoding achieves speedups of up to 3.9$\times$, outperforming state-of-the-art methods -- 2.2$\times$ faster than model-based approaches like EAGLE-2/3 and 1.6$\times$ faster than model-free approaches such as Token Recycling. SuffixDecoding is open-sourced.

View full details

Poster

On the necessity of adaptive regularisation: Optimal anytime online learning on $\boldsymbol{\ell_p}$-balls

Emmeran Johnson ⋅ David Martínez-Rubio ⋅ Ciara Pike-Burke ⋅ Patrick Rebeschini

Dec 5, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

We study online convex optimization on $\ell_p$-balls in $\mathbb{R}^d$ for $p > 2$. While always sub-linear, the optimal regret exhibits a shift between the high-dimensional setting ($d > T$), when the dimension $d$ is greater than the time horizon $T$ and the low-dimensional setting ($d \leq T$). We show that Follow-the-Regularised-Leader (FTRL) with time-varying regularisation which is adaptive to the dimension regime is anytime optimal for all dimension regimes. Motivated by this, we ask whether it is possible to obtain anytime optimality of FTRL with fixed non-adaptive regularisation. Our main result establishes that for separable regularisers, adaptivity in the regulariser is necessary, and that any fixed regulariser will be sub-optimal in one of the two dimension regimes. Finally, we provide lower bounds which rule out sub-linear regret bounds for the linear bandit problem in sufficiently high-dimension for all $\ell_p$-balls with $p \geq 1$.

View full details

Poster

Controlling Thinking Speed in Reasoning Models

Zhengkai Lin ⋅ Zhihang Fu ⋅ Ze Chen ⋅ Chao Chen ⋅ Liang Xie ⋅ Wenxiao Wang ⋅ Deng Cai ⋅ Zheng Wang ⋅ Jieping Ye

Dec 5, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Human cognition is theorized to operate in two modes: fast, intuitive System 1 thinking and slow, deliberate System 2 thinking. While current Large Reasoning Models (LRMs) excel at System 2 thinking, their inability to perform fast thinking leads to high computational overhead and latency. In this work, we enable LRMs to approximate human intelligence through dynamic thinking speed adjustment, optimizing accuracy-efficiency trade-offs. Our approach addresses two key questions: (1) how to control thinking speed in LRMs, and (2) when to adjust it for optimal performance. For the first question, we identify the steering vector that governs slow-fast thinking transitions in LRMs' representation space. Using this vector, we achieve the first representation editing-based test-time scaling effect, outperforming existing prompt-based scaling methods. For the second question, we apply real-time difficulty estimation to signal reasoning segments of varying complexity. Combining these techniques, we propose the first reasoning strategy that enables fast processing of easy steps and deeper analysis for complex reasoning. Without any training or additional cost, our plug-and-play method yields an average +1.3\% accuracy with -8.6\% token usage across leading LRMs and advanced reasoning benchmarks. All of our algorithms are implemented based on vLLM and are expected to support broader applications and inspire future research.

View full details

Poster

MDReID: Modality-Decoupled Learning for Any-to-Any Multi-Modal Object Re-Identification

Yingying Feng ⋅ Jie Li ⋅ Jie Hu ⋅ Yukang Zhang ⋅ Lei Tan ⋅ Jiayi Ji

Dec 5, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

The challenge of inconsistent modalities in real-world applications presents significant obstacles to effective object re-identification (ReID). However, most existing approaches assume modality-matched conditions, significantly limiting their effectiveness in modality-mismatched scenarios. To overcome this limitation and achieve a more flexible ReID, we introduce MDReID to allow any-to-any image-level ReID systems. MDReID is inspired by the widely recognized perspective that modality information comprises both modality-shared features, predictable across modalities, and unpredictable modality-specific features, which are inherently modality-dependent and consist of two key components: the Modality Decoupling Module (MDM) and Modality-aware Metric Learning (MML). Specifically, MDM explicitly decomposes modality features into modality-shared and modality-specific representations, enabling effective retrieval in both modality-aligned and mismatched scenarios. MML, a tailored metric learning strategy, further enhances feature discrimination and decoupling by exploiting distributional relationships between shared and specific modality features. Extensive experiments conducted on three challenging multi-modality ReID benchmarks (RGBNT201, RGBNT100, MSVR310) consistently demonstrate the superiority of MDL. MDReID achieves significant mAP improvements of 9.8\%, 3.0\%, and 11.5\% in modality-matched scenarios, and average gains of 3.4\%, 11.8\%, and 10.9\% in modality-mismatched scenarios, respectively.

View full details

Poster

Vector Quantization in the Brain: Grid-like Codes in World Models

Xiangyuan Peng ⋅ Xingsi Dong ⋅ Si Wu

Dec 5, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

We propose Grid-like Code Quantization (GCQ), a brain-inspired method for compressing observation-action sequences into discrete representations using grid-like patterns in attractor dynamics. Unlike conventional vector quantization approaches that operate on static inputs, GCQ performs spatiotemporal compression through an action-conditioned codebook, where codewords are derived from continuous attractor neural networks and dynamically selected based on actions. This enables GCQ to jointly compress space and time, serving as a unified world model. The resulting representation supports long-horizon prediction, goal-directed planning, and inverse modeling. Experiments across diverse tasks demonstrate GCQ's effectiveness in compact encoding and downstream performance. Our work offers both a computational tool for efficient sequence modeling and a theoretical perspective on the formation of grid-like codes in neural systems.

View full details

Poster

Graph–Smoothed Bayesian Black-Box Shift Estimator and Its Information Geometry

Masanari Kimura

Dec 5, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Label shift adaptation aims to recover target class priors when the labelled source distribution $P$ and the unlabelled target distribution $Q$ share $P(X \mid Y) = Q(X \mid Y)$ but $P(Y) \neq Q(Y)$. Classical black‑box shift estimators invert an empirical confusion matrix of a frozen classifier, producing a brittle point estimate that ignores sampling noise and similarity among classes. We present Graph‑Smoothed Bayesian BBSE (GS‑B$^3$SE), a fully probabilistic alternative that places Laplacian–Gaussian priors on both target log‑priors and confusion‑matrix columns, tying them together on a label‑similarity graph. The resulting posterior is tractable with HMC or a fast block Newton–CG scheme. We prove identifiability, $N^{-1/2}$ contraction, variance bounds that shrink with the graph’s algebraic connectivity, and robustness to Laplacian misspecification. We also reinterpret GS‑B$^3$SE through information geometry, showing that it generalizes existing shift estimators.

View full details

Poster

🎧MOSPA: Human Motion Generation Driven by Spatial Audio

Shuyang Xu ⋅ Zhiyang Dou ⋅ Mingyi Shi ⋅ Liang Pan ⋅ Leo Ho ⋅ Jingbo Wang ⋅ Yuan Liu ⋅ Cheng Lin ⋅ Yuexin Ma ⋅ Wenping Wang ⋅ Taku Komura

Dec 5, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

Enabling virtual humans to dynamically and realistically respond to diverse auditory stimuli remains a key challenge in character animation, demanding the integration of perceptual modeling and motion synthesis. Despite its significance, this task remains largely unexplored. Most previous works have primarily focused on mapping modalities like speech, audio, and music to generate human motion. As of yet, these models typically overlook the impact of spatial features encoded in spatial audio signals on human motion. To bridge this gap and enable high-quality modeling of human movements in response to spatial audio, we introduce the first comprehensive "Spatial Audio-Driven Human Motion" (SAM) dataset, which contains diverse and high-quality spatial audio and motion data. For benchmarking, we develop a simple yet effective diffusion-based generative framework for human "MOtion generation driven by SPatial Audio," termed MOSPA, which faithfully captures the relationship between body motion and spatial audio through an effective fusion mechanism. Once trained, MOSPA can generate diverse realistic human motions conditioned on varying spatial audio inputs. We perform a thorough investigation of the proposed dataset and conduct extensive experiments for benchmarking, where our method achieves state-of-the-art performance on this task. Our code and model are publicly available at https://github.com/xsy27/Mospa-Acoustic-driven-Motion-Generation.git

View full details

Poster

Detecting Generated Images by Fitting Natural Image Distributions

Yonggang Zhang ⋅ Jun Nie ⋅ Xinmei Tian ⋅ Mingming Gong ⋅ Kun Zhang ⋅ Bo Han

Dec 5, 11:00 AM - 2:00 PM Exhibit Hall C,D,E

The increasing realism of generated images has raised significant concerns about their potential misuse, necessitating robust detection methods. Current approaches mainly rely on training binary classifiers, which depend heavily on the quantity and quality of available generated images. In this work, we propose a novel framework that exploits geometric differences between the data manifolds of natural and generated images. To exploit this difference, we employ a pair of functions engineered to yield consistent outputs for natural images but divergent outputs for generated ones, leveraging the property that their gradients reside in mutually orthogonal subspaces. This design enables a simple yet effective detection method: an image is identified as generated if a transformation along its data manifold induces a significant change in the loss value of a self-supervised model pre-trained on natural images. Further more, to address diminishing manifold disparities in advanced generative models, we leverage normalizing flows to amplify detectable differences by extruding generated images away from the natural image manifold. Extensive experiments demonstrate the efficacy of this method.

View full details

Poster

Sheetpedia: A 300K-Spreadsheet Corpus for Spreadsheet Intelligence and LLM Fine-Tuning

Zailong Tian ⋅ Zhuoheng Han ⋅ Houfeng Wang ⋅ Lizi Liao

Dec 5, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Spreadsheets are widely used for data analysis and reporting, yet their complex structure and formula logic pose significant challenges for AI systems. We introduce Sheetpedia, a large-scale corpus of over 290,000 diverse spreadsheets (from 324,000+ workbooks) compiled from enterprise email archives and online forums. We detail a rigorous collection and preprocessing pipeline (integrating the Enron email spreadsheet archive and the Fuse web corpus, plus a new crawl of Excel forums) to standardize formats, filter languages, and remove duplicates. Sheetpedia provides extensive coverage of real formulas and annotations – addressing a gap left by prior table datasets (e.g. web tables used in TURL or Text-to-SQL in Spider) which often lack formula semantics. We present comprehensive corpus statistics, highlighting rich formula diversity and a majority (78\%+) of English content. To demonstrate the corpus’s utility, we fine-tune large language models on Sheetpedia for two novel spreadsheet understanding tasks: Natural Language to Semantic Range (NL2SR) and Natural Language to Formula (NL2Formula). Using a rejection-sampling data generation strategy, our fine-tuned models achieve up to 97.5\% accuracy on NL2SR and 71.7\% on NL2Formula – substantially outperforming baseline approaches. Sheetpedia (to be released publicly) fills a crucial need for a large, high-quality spreadsheet benchmark, enabling more effective spreadsheet intelligence and natural language interfaces for spreadsheet tools.

View full details

Poster

TGA: True-to-Geometry Avatar Dynamic Reconstruction

Bo Guo ⋅ Sijia Wen ⋅ Ziwei Wang ⋅ Yifan Zhao

Dec 5, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Recent advances in 3D Gaussian Splatting (3DGS) have improved the visual fidelity of dynamic avatar reconstruction. However, existing methods often overlook the inherent chromatic similarity of human skin tones, leading to poor capture of intricate facial geometry under subtle appearance changes. This is caused by the affine approximation of Gaussian projection, which fails to be perspective-aware to depth-induced shear effects. To this end, we propose True-to-Geometry Avatar Dynamic Reconstruction (TGA), a perspective-aware 4D Gaussian avatar framework that sensitively captures fine-grained facial variations for accurate 3D geometry reconstruction. Specifically, to enable color-sensitive and geometry-consistent Gaussian representations under dynamic conditions, we introduce Perspective-Aware Gaussian Transformation that jointly models temporal deformations and spatial projection by integrating Jacobian-guided adaptive deformation into the homogeneous formulation. Furthermore, we develop Incremental BVH Tree Pivoting to enable fast frame-by-frame mesh extraction for 4D Gaussian representations. A dynamic Gaussian Bounding Volume Hierarchy (BVH) tree is used to model the topological relationships among points, where active ones are filtered out by BVH pivoting and subsequently re-triangulated for surface reconstruction. Extensive experiments demonstrate that TGA achieves superior geometric accuracy.

View full details

Poster

Neighbor-aware Contrastive Disambiguation for Cross-Modal Hashing with Redundant Annotations

Chao Su ⋅ Likang Peng ⋅ Yuan Sun ⋅ Dezhong Peng ⋅ Xi Peng ⋅ Xu Wang

Dec 5, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Cross-modal hashing aims to efficiently retrieve information across different modalities by mapping data into compact hash codes. However, most existing methods assume access to fully accurate supervision, which rarely holds in real-world scenarios. In fact, annotations are often redundant, i.e., each sample is associated with a set of candidate labels that includes both ground-truth labels and redundant noisy labels. Treating all annotated labels as equally valid introduces two critical issues: (1) the sparse presence of true labels within the label set is not explicitly addressed, leading to overfitting on redundant noisy annotations; (2) redundant noisy labels induce spurious similarities that distort semantic alignment across modalities and degrade the quality of the hash space. To address these challenges, we propose that effective cross-modal hashing requires explicitly identifying and leveraging the true label subset within all candidate annotations. Based on this insight, we present Neighbor-aware Contrastive Disambiguation (NACD), a novel framework designed for robust learning under redundant supervision. NACD consists of two key components. The first, Neighbor-aware Confidence Reconstruction (NACR), refines label confidence by aggregating information from cross-modal neighbors to distinguish true labels from redundant noisy ones. The second, Class-aware Robust Contrastive Hashing (CRCH), constructs reliable positive and negative pairs based on label confidence scores, thereby significantly enhancing robustness against noisy supervision. Moreover, to effectively reduce the quantization error, we incorporate a quantization loss that enforces binary constraints on the learned hash representations. Extensive experiments conducted on three large-scale multimodal benchmarks demonstrate that our method consistently outperforms state-of-the-art approaches, thereby establishing a new standard for cross-modal hashing with redundant annotations. Code is available at https://github.com/Rose-bud/NACD.

View full details

Poster

Projective Equivariant Networks via Second-order Fundamental Differential Invariants

Yikang Li ⋅ Yeqing Qiu ⋅ Yuxuan Chen ⋅ Lingshen He ⋅ Lexiang Hu ⋅ Zhouchen Lin

Dec 5, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Equivariant networks enhance model efficiency and generalization by embedding symmetry priors into their architectures. However, most existing methods, primarily based on group convolutions and steerable convolutions, face significant limitations when dealing with complex transformation groups, particularly the projective group, which plays a crucial role in vision. In this work, we tackle the challenge by constructing projective equivariant networks based on differential invariants. Using the moving frame method with a carefully selected cross section tailored for multi-dimensional functions, we derive a complete and concise set of second-order fundamental differential invariants of the projective group. We provide a rigorous analysis of the properties and transformation relationships of their underlying components, yielding a further simplified and unified set of fundamental differential invariants, which facilitates both theoretical analysis and practical applications. Building on this foundation, we develop PDINet, the first framework for deep projective equivariant networks, achieving full projective equivariance without discretizing or sampling the group. Empirical results on the projectively transformed STL-10 and Imagenette datasets show that PDINet achieves improvements of 11.39\% and 5.66\% in accuracy over the respective standard baselines under out-of-distribution settings, demonstrating its strong generalization to complex geometric transformations.

View full details

Poster

Trajectory Graph Learning: Aligning with Long Trajectories in Reinforcement Learning Without Reward Design

Yunfan Li ⋅ Eric Liu ⋅ Lin Yang

Dec 5, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Reinforcement learning (RL) often relies on manually designed reward functions, which are difficult to specify and can lead to issues such as reward hacking and suboptimal behavior. Alternatives like inverse RL and preference-based RL attempt to infer surrogate rewards from demonstrations or preferences but suffer from ambiguity and distribution mismatch. A more direct approach, inspired by imitation learning, avoids reward modeling by leveraging expert demonstrations. However, most existing methods align actions only at individual states, failing to capture the coherence of long-horizon trajectories. In this work, we study the problem of directly aligning policies with expert-labeled trajectories to preserve long-horizon behavior without relying on reward signals. Specifically, we aim to learn a policy that maximizes the probability of generating the expert trajectories. Nevertheless, we prove that, in its general form, this trajectory alignment problem is NP-complete. To address this, we propose Trajectory Graph Learning (TGL), a framework that leverages structural assumptions commonly satisfied in practice—such as bounded realizability of expert trajectories or a tree-structured MDP. These enable a graph-based policy planning algorithm that computes optimal policies in polynomial time under known dynamics. For settings with unknown dynamics, we develop a sample-efficient algorithm based on UCB-style exploration and establish sub-linear regret. Experiments on grid-world tasks demonstrate that TGL substantially outperforms standard imitation learning methods for long-trajectory planning.

View full details

Poster

Angles Don’t Lie: Unlocking Training‑Efficient RL Through the Model’s Own Signals

Qinsi Wang ⋅ Jinghan Ke ⋅ Hancheng Ye ⋅ Yueqian Lin ⋅ Yuzhe Fu ⋅ Jianyi Zhang ⋅ Kurt Keutzer ⋅ Chenfeng Xu ⋅ Yiran Chen

Dec 5, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Current Reinforcement Fine-tuning (RFT) paradigms for Large Language Models (LLMs) suffer from sample inefficiency due to the redundant exposure of identical queries under uniform data sampling. While previous work has explored curriculum learning via heuristic difficulty metrics, these strategies exhibit limitations by neglecting the intrinsic learning signals generated by the model itself, thus leading to suboptimal training regimes. In this paper, we identify a model-inherent signal termed *angle concentration* that effectively reflects an LLM's capacity to learn from specific data. We theoretically and empirically demonstrate a correlation between the angular distribution of token hidden state vectors and the resulting gradient, revealing a learning preference for data exhibiting higher angle concentration. Inspired by this finding, we propose GAIN-RL, a Gradient-driven Angle-Informed Navigated RL framework. By leveraging the model's intrinsic angle concentration signal, GAIN-RL dynamically selects training data in each epoch, ensuring consistently impactful gradient updates and thus significantly enhancing overall training efficiency. Empirical evaluations show that GAIN-RL (GRPO) achieves over a 2.5$\times$ acceleration in training efficiency across diverse mathematical and coding tasks and varying model scales. Furthermore, GAIN-RL (GRPO)'s efficient sampling yields data-efficient training, achieving better performance with half the original data compared to vanilla GRPO with full training data.

View full details

Poster

Knowledge Insulating Vision-Language-Action Models: Train Fast, Run Fast, Generalize Better

Danny Driess ⋅ Jost Springenberg ⋅ Brian Ichter ⋅ LILI YU ⋅ Adrian Li-Bell ⋅ Karl Pertsch ⋅ Allen Ren ⋅ Homer Walke ⋅ Quan Vuong ⋅ Lucy Xiaoyang Shi ⋅ Sergey Levine

Dec 5, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Vision-language-action (VLA) models provide a powerful approach to training control policies for physical systems, such as robots, by combining end-to-end learning with transfer of semantic knowledge from web-scale vision-language model (VLM) training. However, the constraints of real-time control are often at odds with the design of VLMs: the most powerful VLMs have tens or hundreds of billions of parameters, presenting an obstacle to real-time inference, and operate on discrete tokens rather than the continuous-valued outputs that are required for controlling robots. To address this challenge, recent VLA models have used specialized modules for efficient continuous control, such as action experts or continuous output heads, which typically require adding new untrained parameters to the pretrained VLM backbone. While these modules improve real-time and control capabilities, it remains an open question whether they preserve or degrade the semantic knowledge contained in the pretrained VLM, and what effect they have on the VLA training dynamics. In this paper, we study this question in the context of VLAs that include a continuous diffusion or flow matching action expert, showing that naively including such experts significantly harms both training speed and knowledge transfer. We provide an extensive analysis of various design choices, their impact on performance and knowledge transfer, and propose a technique for insulating the VLM backbone during VLA training that mitigates this issue. Videos are available at https://pi.website/research/knowledge_insulation and open-source model weights are available at https://github.com/Physical-Intelligence/openpi.

View full details

Poster

Fast MRI for All: Bridging Access Gaps by Training without Raw Data

Yasar Utku Alcalar ⋅ Merve Gulle ⋅ Mehmet Akcakaya

Dec 5, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Physics-driven deep learning (PD-DL) approaches have become popular for improved reconstruction of fast magnetic resonance imaging (MRI) scans. Though PD-DL offers higher acceleration rates than existing clinical fast MRI techniques, their use has been limited outside specialized MRI centers. A key challenge is generalization to rare pathologies or different populations, noted in multiple studies, with fine-tuning on target populations suggested for improvement. However, current approaches for PD-DL training require access to raw k-space measurements, which is typically only available at specialized MRI centers that have research agreements for such data access. This is especially an issue for rural and under-resourced areas, where commercial MRI scanners only provide access to a final reconstructed image. To tackle these challenges, we propose Compressibility-inspired Unsupervised Learning via Parallel Imaging Fidelity (CUPID) for high-quality PD-DL training using only routine clinical reconstructed images exported from an MRI scanner. CUPID evaluates output quality with a compressibility-based approach while ensuring that the output stays consistent with the clinical parallel imaging reconstruction through well-designed perturbations. Our results show CUPID achieves similar quality to established PD-DL training that requires k-space data while outperforming compressed sensing (CS) and diffusion-based generative methods. We further demonstrate its effectiveness in a zero-shot training setup for retrospectively and prospectively sub-sampled acquisitions, attesting to its minimal training burden. As an approach that radically deviates from existing strategies, CUPID presents an opportunity to provide broader access to fast MRI for remote and rural populations in an attempt to reduce the obstacles associated with this expensive imaging modality. Code is available at https://github.com/ualcalar17/CUPID.

View full details

Poster

Quantum speedup of non-linear Monte Carlo problems

Jose Blanchet ⋅ Yassine Hamoudi ⋅ Mario Szegedy ⋅ Guanyang Wang

Dec 5, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

The mean of a random variable can be understood as a *linear* functional on the space of probability distributions. Quantum computing is known to provide a quadratic speedup over classical Monte Carlo methods for mean estimation. In this paper, we investigate whether a similar quadratic speedup is achievable for estimating *non-linear* functionals of probability distributions. We propose a \textit{quantum-inside-quantum} algorithm that achieves this speedup for the broad class of nonlinear estimation problems known as nested expectations. Our algorithm improves upon the direct application of the quantum-accelerated multilevel Monte Carlo algorithm introduced by An et. al.. The existing lower bound indicates that our algorithm is optimal up to polylogarithmic factors. A key innovation of our approach is a new sequence of multilevel Monte Carlo approximations specifically designed for quantum computing, which is central to the algorithm's improved performance.

View full details

Poster

EGGS: Exchangeable 2D/3D Gaussian Splatting for Geometry-Appearance Balanced Novel View Synthesis

Yancheng Zhang ⋅ Guangyu Sun ⋅ Chen Chen

Dec 5, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Novel view synthesis (NVS) is crucial in computer vision and graphics, with wide applications in AR, VR, and autonomous driving. While 3D Gaussian Splatting (3DGS) enables real-time rendering with high appearance fidelity, it suffers from multi-view inconsistencies, limiting geometric accuracy. In contrast, 2D Gaussian Splatting (2DGS) enforces multi-view consistency but compromises texture details. To address these limitations, we propose Exchangeable Gaussian Splatting (EGGS), a hybrid representation that integrates 2D and 3D Gaussians to balance appearance and geometry. To achieve this, we introduce Hybrid Gaussian Rasterization for unified rendering, Adaptive Type Exchange for dynamic adaptation between 2D and 3D Gaussians, and Frequency-Decoupled Optimization that effectively exploits the strengths of each type of Gaussian representation. Our CUDA-accelerated implementation ensures efficient training and inference. Extensive experiments demonstrate that EGGS outperforms existing methods in rendering quality, geometric accuracy, and efficiency, providing a practical solution for high-quality NVS.

View full details

Poster

CURE: Concept Unlearning via Orthogonal Representation Editing in Diffusion Models

Shristi Das Biswas ⋅ Arani Roy ⋅ Kaushik Roy

Dec 5, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

As Text-to-Image models continue to evolve, so does the risk of generating unsafe, copyrighted, or privacy-violating content. Existing safety interventions - ranging from training data curation and model fine-tuning to inference-time filtering and guidance - often suffer from incomplete concept removal, susceptibility to jail-breaking, computational inefficiency, or collateral damage to unrelated capabilities. In this paper, we introduce CURE, a training-free concept unlearning framework that operates directly in the weight space of pre-trained diffusion models, enabling fast, interpretable, and highly specific suppression of undesired concepts. At the core of our method is the Spectral Eraser, a closed-form, orthogonal projection module that identifies discriminative subspaces using Singular Value Decomposition over token embeddings associated with the concepts to forget and retain. Intuitively, the Spectral Eraser identifies and isolates features unique to the undesired concept while preserving safe attributes. This operator is then applied in a single step update to yield an edited model in which the target concept is effectively unlearned - without retraining, supervision, or iterative optimization. To balance the trade-off between filtering toxicity and preserving unrelated concepts, we further introduce an Expansion Mechanism for spectral regularization which selectively modulates singular vectors based on their relative significance to control the strength of forgetting. All the processes above are in closed-form, guaranteeing extremely efficient erasure in only $2$ seconds. Benchmarking against prior approaches, CURE achieves a more efficient and thorough removal for targeted artistic styles, objects, identities, or explicit content, with minor damage to original generation ability and demonstrates enhanced robustness against red-teaming. Project Page at \url{https://sites.google.com/view/cure-unlearning/home}.

View full details

Poster

Fully Autonomous Neuromorphic Navigation and Dynamic Obstacle Avoidance

Xiaochen Shang ⋅ Pengwei Luo ⋅ Xinning Wang ⋅ Jiayue Zhao ⋅ Huilin Ge ⋅ Bo Dong ⋅ Xin Yang

Dec 5, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Unmanned aerial vehicles could accurately accomplish complex navigation and obstacle avoidance tasks under external control. However, enabling unmanned aerial vehicles (UAVs) to rely solely on onboard computation and sensing for real-time navigation and dynamic obstacle avoidance remains a significant challenge due to stringent latency and energy constraints. Inspired by the efficiency of biological systems, we propose a fully neuromorphic framework achieving end-to-end obstacle avoidance during navigation with an overall latency of just 2.3 milliseconds. Specifically, our bio-inspired approach enables accurate moving object detection and avoidance without requiring target recognition or trajectory computation. Additionally, we introduce the first monocular event-based pose correction dataset with over 50,000 paired and labeled event streams. We validate our system on an autonomous quadrotor using only onboard resources, demonstrating reliable navigation and avoidance of diverse obstacles moving at speeds up to 10 m/s.

View full details

Poster

Restoring Pruned Large Language Models via Lost Component Compensation

Zijian Feng ⋅ Hanzhang Zhou ⋅ Zixiao Zhu ⋅ Tianjiao Li ⋅ Chua Deryl ⋅ Lee Onn Mak ⋅ Gee Ng ⋅ Kezhi Mao

Dec 5, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Pruning is a widely used technique to reduce the size and inference cost of large language models (LLMs), but it often causes performance degradation. To mitigate this, existing restoration methods typically employ parameter-efficient fine-tuning (PEFT), such as LoRA, to recover the pruned model's performance. However, most PEFT methods are designed for dense models and overlook the distinct properties of pruned models, often resulting in suboptimal recovery. In this work, we propose a targeted restoration strategy for pruned models that restores performance while preserving their low cost and high efficiency. We observe that pruning-induced information loss is reflected in attention activations, and selectively reintroducing components of this information can significantly recover model performance. Based on this insight, we introduce RestoreLCC (Restoring Pruned LLMs via Lost Component Compensation), a plug-and-play method that contrastively probes critical attention heads via activation editing, extracts lost components from activation differences, and finally injects them back into the corresponding pruned heads for compensation and recovery. RestoreLCC is compatible with structured, semi-structured, and unstructured pruning schemes. Extensive experiments demonstrate that RestoreLCC consistently outperforms state-of-the-art baselines in both general and task-specific performance recovery, without compromising the sparsity or inference efficiency of pruned models.

View full details

Poster

Differential Privacy on Fully Dynamic Streams

Yuan Qiu ⋅ Ke Yi

Dec 5, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

A fundamental problem in differential privacy is to release privatized answers to a class of linear queries with small error. This problem has been well studied in the static case. In this paper, we consider the fully dynamic setting where items may be inserted into or deleted from the dataset over time, and we need to continually release query answers at every time instance. We present efficient black-box constructions of such dynamic differentially private mechanisms from static ones with only a polylogarithmic degradation in the utility.

View full details

Poster

Neighborhood Self-Dissimilarity Attention for Medical Image Segmentation

Junren Chen ⋅ Rui Chen ⋅ Wei Wang ⋅ Junlong Cheng ⋅ Gang Liang ⋅ zhanglei-scu ⋅ Liangyin Chen

Dec 5, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Medical image segmentation based on neural networks is pivotal in promoting digital health equity. The attention mechanism increasingly serves as a key component in modern neural networks, as it enables the network to focus on regions of interest, thus improving the segmentation accuracy in medical images. However, current attention mechanisms confront an accuracy-complexity trade-off paradox: accuracy gains demand higher computational costs, while reducing complexity sacrifices model accuracy. Such a contradiction inherently restricts the real-world deployment of attention mechanisms in resource-limited settings, thus exacerbating healthcare disparities. To overcome this dilemma, we propose a parameter-free Neighborhood Self-Dissimilarity Attention (NSDA), inspired by radiologists' diagnostic patterns of prioritizing regions exhibiting substantial differences during clinical image interpretation. Unlike pairwise-similarity-based self-attention mechanisms, NSDA constructs a size-adaptive local dissimilarity measure that quantifies element-neighborhood differences. By assigning higher attention weights to regions with larger feature differences, NSDA directs the neural network to focus on high-discrepancy regions, thus improving segmentation accuracy without adding trainable parameters directly related to computational complexity. The experimental results demonstrate the effectiveness and generalization of our method. This study presents a parameter-free attention paradigm, designed with clinical prior knowledge, to improve neural network performance for medical image analysis and contribute to digital health equity in low-resource settings. The code is available at [https://github.com/ChenJunren-Lab/Neighborhood-Self-Dissimilarity-Attention](https://github.com/ChenJunren-Lab/Neighborhood-Self-Dissimilarity-Attention).

View full details

Poster

Learning Robust Vision-Language Models from Natural Latent Spaces

Zhangyun Wang ⋅ Ni Ding ⋅ Aniket Mahanti

Dec 5, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Pre-trained vision-language models (VLMs) exhibit significant vulnerability to imperceptible adversarial perturbations. Current advanced defense strategies typically employ adversarial prompt tuning to improve the adversarial robustness of VLMs, which struggle to simultaneously maintain generalization across both natural and adversarial examples under different benchmarks and downstream tasks. We propose a collaborative adversarial prompt tuning (CoAPT) approach from pre-trained VLMs to target robust VLMs. Inspired by the image mask modeling, we adopt an improved real-time total variation algorithm to suppress and eliminate high-frequency details from images while preserving edge structures, thereby disrupting the adversarial perturbation space. Subsequently, guided by the high-level image and text representations in the latent space of the pre-trained VLMs, the corrupted natural features are restored while inheriting the superior generalization capability. Experiments on four benchmarks demonstrate that CoAPT achieves an excellent trade-off among natural generalization, adversarial robustness, and task-specific adaptation compared to state-of-the-art methods.

View full details

Poster

DexFlyWheel: A Scalable and Self-improving Data Generation Framework for Dexterous Manipulation

Kefei Zhu ⋅ Fengshuo Bai ⋅ YuanHao Xiang ⋅ Yishuai Cai ⋅ Xinglin Chen ⋅ Ruochong Li ⋅ Xingtao Wang ⋅ Hao Dong ⋅ Yaodong Yang ⋅ Xiaopeng Fan ⋅ Yuanpei Chen

Dec 5, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Dexterous manipulation is critical for advancing robot capabilities in real-world applications, yet diverse and high-quality datasets remain scarce. Existing data collection methods either rely on human teleoperation or require significant human engineering, or generate data with limited diversity, which restricts their scalability and generalization. In this paper, we introduce DexFlyWheel, a scalable data generation framework that employs a self-improving cycle to continuously enrich data diversity. Starting from efficient seed demonstrations warmup, DexFlyWheel expands the dataset through iterative cycles. Each cycle follows a closed-loop pipeline that integrates Imitation Learning (IL), residual Reinforcement Learning (RL), rollout trajectory collection, and data augmentation. Specifically, IL extracts human-like behaviors from demonstrations, and residual RL enhances policy generalization. The learned policy is then used to generate trajectories in simulation, which are further augmented across diverse environments and spatial configurations before being fed back into the next cycle. Over successive iterations, a self-improving data flywheel effect emerges, producing datasets that cover diverse scenarios and thereby scaling policy performance. Experimental results demonstrate that DexFlyWheel generates over 2,000 diverse demonstrations across four challenging tasks. Policies trained on our dataset achieve an average success rate of 81.9\% on the challenge test sets and successfully transfer to the real world through digital twin, achieving a 78.3\% success rate on dual-arm lift tasks.

View full details

Poster

Abstain Mask Retain Core: Time Series Prediction by Adaptive Masking Loss with Representation Consistency

Renzhao Liang ⋅ Sizhe Xu ⋅ Chenggang Xie ⋅ Jingru Chen ⋅ Feiyang Ren ⋅ Shu Yang ⋅ Takahiro Yabe

Dec 5, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Time series forecasting plays a pivotal role in critical domains such as energy management and financial markets. Although deep learning-based approaches (e.g., MLP, RNN, Transformer) have achieved remarkable progress, the prevailing "long-sequence information gain hypothesis" exhibits inherent limitations. Through systematic experimentation, this study reveals a counterintuitive phenomenon: appropriately truncating historical data can paradoxically enhance prediction accuracy, indicating that existing models learn substantial redundant features (e.g., noise or irrelevant fluctuations) during training, thereby compromising effective signal extraction. Building upon information bottleneck theory, we propose an innovative solution termed Adaptive Masking Loss with Representation Consistency (AMRC), which features two core components: 1) Dynamic masking loss, which adaptively identified highly discriminative temporal segments to guide gradient descent during model training; 2) Representation consistency constraint, which stabilized the mapping relationships among inputs, labels, and predictions. Experimental results demonstrate that AMRC effectively suppresses redundant feature learning while significantly improving model performance. This work not only challenges conventional assumptions in temporal modeling but also provides novel theoretical insights and methodological breakthroughs for developing efficient and robust forecasting models. We have made our code available at \url{https://github.com/MazelTovy/AMRC}.

View full details

Poster

Deciphering the Extremes: A Novel Approach for Pathological Long-tailed Recognition in Scientific Discovery

Zhe Zhao ⋅ HaiBin Wen ⋅ Xianfu Liu ⋅ Rui Mao ⋅ Pengkun Wang ⋅ Liheng Yu ⋅ Linjiang Chen ⋅ Bo An ⋅ Qingfu Zhang ⋅ Yang Wang

Dec 5, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Scientific discovery across diverse fields increasingly grapples with datasets exhibiting pathological long-tailed distributions: a few common phenomena overshadow a multitude of rare yet scientifically critical instances. Unlike standard benchmarks, these scientific datasets often feature extreme imbalance coupled with a modest number of classes and limited overall sample volume, rendering existing long-tailed recognition (LTR) techniques ineffective. Such methods, biased by majority classes or prone to overfitting on scarce tail data, frequently fail to identify the very instances—novel materials, rare disease biomarkers, faint astronomical signals—that drive scientific breakthroughs. This paper introduces a novel, end-to-end framework explicitly designed to address pathological long-tailed recognition in scientific contexts. Our approach synergizes a Balanced Supervised Contrastive Learning (B-SCL) mechanism, which enhances the representation of tail classes by dynamically re-weighting their contributions, with a Smooth Objective Regularization (SOR) strategy that manages the inherent tension between tail-class focus and overall classification performance. We introduce and analyze the real-world ZincFluor chemical dataset ($\mathcal{T}=137.54$) and synthetic benchmarks with controllable extreme imbalances (CIFAR-LT variants). Extensive evaluations demonstrate our method's superior ability to decipher these extremes. Notably, on ZincFluor, our approach achieves a Tail Top-2 accuracy of $66.84\%$, significantly outperforming existing techniques. On CIFAR-10-LT with an imbalance ratio of $1000$ ($\mathcal{T}=100$), our method achieves a tail-class accuracy of $38.99\%$, substantially leading the next best. These results underscore our framework's potential to unlock novel insights from complex, imbalanced scientific datasets, thereby accelerating discovery.

View full details

Poster

Bridging Theory and Practice in Link Representation with Graph Neural Networks

Veronica Lachi ⋅ Francesco Ferrini ⋅ Antonio Longa ⋅ Bruno Lepri ⋅ Andrea Passerini ⋅ Manfred Jaeger

Dec 5, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Graph Neural Networks (GNNs) are widely used to compute representations of node pairs for downstream tasks such as link prediction. Yet, theoretical understanding of their expressive power has focused almost entirely on graph-level representations. In this work, we shift the focus to links and provide the first comprehensive study of GNN expressiveness in link representation. We introduce a unifying framework, the $k_\phi$-$k_\rho$-$m$ framework, that subsumes existing message-passing link models and enables formal expressiveness comparisons. Using this framework, we derive a hierarchy of state-of-the-art methods and offer theoretical tools to analyze future architectures. To complement our analysis, we propose a synthetic evaluation protocol comprising the first benchmark specifically designed to assess link-level expressiveness. Finally, we ask: does expressiveness matter in practice? We use a graph symmetry metric that quantifies the difficulty of distinguishing links and show that while expressive models may underperform on standard benchmarks, they significantly outperform simpler ones as symmetry increases, highlighting the need for dataset-aware model selection.

View full details

Poster

Non-Asymptotic Analysis Of Data Augmentation For Precision Matrix Estimation

Lucas Morisset ⋅ Adrien Hardy ⋅ Alain Durmus

Dec 5, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

This paper addresses the problem of inverse covariance (also known as precision matrix) estimation in high-dimensional settings. Specifically, we focus on two classes of estimators: linear shrinkage estimators with a target proportional to the identity matrix, and estimators derived from data augmentation (DA). Here, DA refers to the common practice of enriching a dataset with artificial samples—typically generated via a generative model or through random transformations of the original data—prior to model fitting. For both classes of estimators, we derive estimators and provide concentration bounds for their quadratic error. This allows for both method comparison and hyperparameter tuning, such as selecting the optimal proportion of artificial samples. On the technical side, our analysis relies on tools from random matrix theory. We introduce a novel deterministic equivalent for generalized resolvent matrices, accommodating dependent samples with specific structure. We support our theoretical results with numerical experiments.

View full details

Poster

3D Equivariant Visuomotor Policy Learning via Spherical Projection

Boce Hu ⋅ Dian Wang ⋅ David Klee ⋅ Heng Tian ⋅ Xupeng Zhu ⋅ Haojie Huang ⋅ Robert Platt ⋅ Robin Walters

Dec 5, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Equivariant models have recently been shown to improve the data efficiency of diffusion policy by a significant margin. However, prior work that explored this direction focused primarily on point cloud inputs generated by multiple cameras fixed in the workspace. This type of point cloud input is not compatible with the now-common setting where the primary input modality is an eye-in-hand RGB camera like a GoPro. This paper closes this gap by incorporating into the diffusion policy model a process that projects features from the 2D RGB camera image onto a sphere. This enables us to reason about symmetries in $\mathrm{SO}(3)$ without explicitly reconstructing a point cloud. We perform extensive experiments in both simulation and the real world that demonstrate that our method consistently outperforms strong baselines in terms of both performance and sample efficiency. Our work, $\textbf{Image-to-Sphere Policy}$ ($\textbf{ISP}$), is the first $\mathrm{SO}(3)$-equivariant policy learning framework for robotic manipulation that works using only monocular RGB inputs.

View full details

Poster

Refinement Methods for Distributed Distribution Estimation under $\ell^p$-Losses

Deheng Yuan ⋅ Tao Guo ⋅ Zhongyi Huang

Dec 5, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Consider the communication-constrained estimation of discrete distributions under $\ell^p$ losses, where each distributed terminal holds multiple independent samples and uses limited number of bits to describe the samples. We obtain the minimax optimal rates of the problem for most parameter regimes. As a result, an elbow effect of the optimal rates at $p=2$ is clearly identified. In order to achieve the optimal rates for different parameter regimes, we introduce refinement methods and develop additional customized techniques in the estimation protocols. The general idea of the refinement methods is to first generate rough estimate by partial information and then establish refined estimate in subsequent steps guided by the rough estimate. Then customized techniques such as successive refinement, sample compression, thresholding and random hashing are leveraged to achieve the optimal rates in different parameter regimes. The optimality of the estimation protocols is shown by deriving compatible minimax lower bounds.

View full details

Poster

How do Transformers Learn Implicit Reasoning?

Jiaran Ye ⋅ Zijun Yao ⋅ Zhidian Huang ⋅ Liangming Pan ⋅ Jinxin Liu ⋅ Yushi Bai ⋅ Amy Xin ⋅ Liu Weichuan ⋅ Xiaoyin Che ⋅ Lei Hou ⋅ Juanzi Li

Dec 5, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Recent work suggests that large language models (LLMs) can perform multi-hop reasoning implicitly---producing correct answers without explicitly verbalizing intermediate steps---but the underlying mechanisms remain poorly understood. In this paper, we study how such implicit reasoning emerges by training transformers from scratch in a controlled symbolic environment. Our analysis reveals a three-stage developmental trajectory: early memorization, followed by in-distribution generalization, and eventually cross-distribution generalization. We find that training with atomic triples is not necessary but accelerates learning, and that second-hop generalization relies on query-level exposure to specific compositional structures. To interpret these behaviors, we introduce two diagnostic tools: cross-query semantic patching, which identifies semantically reusable intermediate representations, and a cosine-based representational lens, which reveals that successful reasoning correlates with the cosine-base clustering in hidden space. This clustering phenomenon in turn provides a coherent explanation for the behavioral dynamics observed across training, linking representational structure to reasoning capability. These findings provide new insights into the interpretability of implicit multi-hop reasoning in LLMs, helping to clarify how complex reasoning processes unfold internally and offering pathways to enhance the transparency of such models.

View full details

Poster

Ridge Boosting is Both Robust and Efficient

David Bruns-Smith ⋅ Zhongming Xie ⋅ Avi Feller

Dec 5, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Estimators in statistics and machine learning must typically trade off between efficiency, having low variance for a fixed target, and distributional robustness, such as \textit{multiaccuracy}, or having low bias over a range of possible targets. In this paper, we consider a simple estimator, \emph{ridge boosting}: starting with any initial predictor, perform a single boosting step with (kernel) ridge regression. Surprisingly, we show that ridge boosting simultaneously achieves both efficiency and distributional robustness: for target distribution shifts that lie within an RKHS unit ball, this estimator maintains low bias across all such shifts and has variance at the semiparametric efficiency bound for each target. In addition to bridging otherwise distinct research areas, this result has immediate practical value. Since ridge boosting uses only data from the source distribution, researchers can train a single model to obtain both robust and efficient estimates for multiple target estimands at the same time, eliminating the need to fit separate semiparametric efficient estimators for each target. We assess this approach through simulations and an application estimating the age profile of retirement income.

View full details

Poster

Transformer brain encoders explain human high-level visual responses

Hossein Adeli ⋅ Sun Minni ⋅ Nikolaus Kriegeskorte

Dec 5, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

A major goal of neuroscience is to understand brain computations during visual processing in naturalistic settings. A dominant approach is to use image-computable deep neural networks trained with different task objectives as a basis for linear encoding models. However, in addition to requiring estimation of a large number of linear encoding parameters, this approach ignores the structure of the feature maps both in the brain and the models. Recently proposed alternatives factor the linear mapping into separate sets of spatial and feature weights, thus finding static receptive fields for units, which is appropriate only for early visual areas. In this work, we employ the attention mechanism used in the transformer architecture to study how retinotopic visual features can be dynamically routed to category-selective areas in high-level visual processing. We show that this computational motif is significantly more powerful than alternative methods in predicting brain activity during natural scene viewing, across different feature basis models and modalities. We also show that this approach is inherently more interpretable as the attention-routing signals for different high-level categorical areas can be easily visualized for any input image. Given its high performance at predicting brain responses to novel images, the model deserves consideration as a candidate mechanistic model of how visual information from retinotopic maps is routed in the human brain based on the relevance of the input content to different category-selective regions. Our code is available at \href{https://github.com/Hosseinadeli/transformer_brain_encoder/}{https://github.com/Hosseinadeli/transformer\_brain\_encoder/}.

View full details

Poster

Color Conditional Generation with Sliced Wasserstein Guidance

Alexander Lobashev ⋅ Maria Larchenko ⋅ Dmitry Guskov

Dec 5, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

We propose SW-Guidance, a training-free approach for image generation conditioned on the color distribution of a reference image. While it is possible to generate an image with fixed colors by first creating an image from a text prompt and then applying a color style transfer method, this approach often results in semantically meaningless colors in the generated image. Our method solves this problem by modifying the sampling process of a diffusion model to incorporate the differentiable Sliced 1-Wasserstein distance between the color distribution of the generated image and the reference palette. Our method outperforms state-of-the-art techniques for color-conditional generation in terms of color similarity to the reference, producing images that not only match the reference colors but also maintain semantic coherence with the original text prompt. Our source code is available at https://github.com/alobashev/sw-guidance.

View full details

Poster

Generalizable Reasoning through Compositional Energy Minimization

Alexandru Oarga ⋅ Yilun Du

Dec 5, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Generalization is a key challenge in machine learning, specifically in reasoning tasks, where models are expected to solve problems more complex than those encountered during training. Existing approaches typically train reasoning models in an end-to-end fashion, directly mapping input instances to solutions. While this allows models to learn useful heuristics from data, it often results in limited generalization beyond the training distribution. In this work, we propose a novel approach to reasoning generalization by learning energy landscapes over the solution spaces of smaller, more tractable subproblems. At test time, we construct a global energy landscape for a given problem by combining the energy functions of multiple subproblems. This compositional approach enables the incorporation of additional constraints during inference, allowing the construction of energy landscapes for problems of increasing difficulty. To improve the sample quality from this newly constructed energy landscape, we introduce Parallel Energy Minimization (PEM). We evaluate our approach on a wide set of reasoning problems. Our method outperforms existing state-of-the-art methods, demonstrating its ability to generalize to larger and more complex problems. Project website can be found at: https://alexoarga.github.io/compositional_reasoning/

View full details

Poster

Enigmata: Scaling Logical Reasoning in Large Language Models with Synthetic Verifiable Puzzles

Jiangjie Chen ⋅ Qianyu He ⋅ Siyu Yuan ⋅ Aili Chen ⋅ Zhicheng Cai ⋅ Weinan Dai ⋅ Hongli Yu ⋅ Jiaze Chen ⋅ Xuefeng Li ⋅ Qiying Yu ⋅ Hao Zhou ⋅ Mingxuan Wang

Dec 5, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Large Language Models (LLMs), such as OpenAI’s o1 and DeepSeek’s R1, excel at advanced reasoning tasks like math and coding via Reinforcement Learning with Verifiable Rewards (RLVR), but still struggle with puzzles solvable by humans without domain knowledge. We introduce ENIGMATA, the first comprehensive suite tailored for improving LLMs with puzzle reasoning skills. It includes 36 tasks across 7 categories, each with: 1) a generator that produces unlimited examples with controllable difficulty, and 2) a rule-based verifier for automatic evaluation. This generator-verifier design supports scalable, multi-task RL training, fine-grained analysis, and seamless RLVR integration. We further propose ENIGMATA-Eval, a rigorous benchmark, and develop optimized multi-task RLVR strategies. Our trained model, Qwen2.5-32B-ENIGMATA, consistently surpasses o3-mini-high and o1 on the puzzle reasoning benchmarks like ENIGMATA-Eval, ARC-AGI (32.8%), and ARC-AGI 2 (0.6%). It also generalizes well to out-of-domain puzzle benchmarks and mathematical reasoning, with little multi-tasking trade-off. When trained on larger models like Seed1.5-Thinking (20B activated parameters and 200B total parameters), puzzle data from ENIGMATA further boosts SoTA performance on advanced math and STEM reasoning tasks such as AIME (2024-2025), BeyondAIME and GPQA (Diamond), showing nice generalization benefits of ENIGMATA. This work offers a unified, controllable framework for advancing logical reasoning in LLMs. Project page: https://seed-enigmata.github.io.

View full details

Poster

NormFit: A Lightweight Solution for Few-Shot Federated Learning with Non-IID Data

Azadeh Motamedi ⋅ Jae-Mo Kang ⋅ Il-Min Kim

Dec 5, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Vision–Language Models (VLMs) have recently attracted considerable attention in Federated Learning (FL) due to their strong and robust performance. In particular, few-shot adaptation with pre-trained VLMs like CLIP enhances the performance of downstream tasks. However, existing methods still suffer from substantial communication overhead, high local computational demands, and suboptimal performance under non-IID user data. To simultaneously address all those limitations, we propose NormFit, a lightweight solution that selectively fine-tunes only a very small portion of the model parameters, specifically only the Pre-LayerNorm parameters of the vision encoder within a VLM. Overcoming the existing tradeoff between performance and communication/computation efficiency in few-shot FL, NormFit sets a new benchmark by simultaneously achieving superior accuracy and substantially reduced communication and computational demands. Theoretically, we show that NormFit yields a considerably smaller generalization gap compared to tuning all LayerNorm parameters. Importantly, NormFit can function effectively as a standalone solution or integrate seamlessly with existing few-shot fine-tuning methods to further enhance their performance. Notably, NormFit offers implementation simplicity, achieving these improvements without any algorithmic modifications, changes to the underlying model architecture, or the addition of external parameters.

View full details

Poster

Direct Fisher Score Estimation for Likelihood Maximization

Sherman Khoo ⋅ Yakun Wang ⋅ Song Liu ⋅ Mark Beaumont

Dec 5, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

We study the problem of likelihood maximization when the likelihood function is intractable but model simulations are readily available. We propose a sequential, gradient-based optimization method that directly models the Fisher score based on a local score matching technique which uses simulations from a localized region around each parameter iterate. By employing a linear parameterization for the surrogate score model, our technique admits a closed-form, least-squares solution. This approach yields a fast, flexible, and efficient approximation to the Fisher score, effectively smoothing the likelihood objective and mitigating the challenges posed by complex likelihood landscapes. We provide theoretical guarantees for our score estimator, including bounds on the bias introduced by the smoothing. Empirical results on a range of synthetic and real-world problems demonstrate the superior performance of our method compared to existing benchmarks.

View full details

Poster

Reasoning Planning for Language Models

Ngoc Bao Nguyen ⋅ Trung Hieu Nguyen ⋅ Ruifeng She ⋅ Xiaojin Fu ⋅ Viet Anh Nguyen

Dec 5, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Selecting an appropriate reasoning method for a given query remains a key challenge in language model generation. Existing approaches typically generate multiple candidate responses and use an aggregation strategy to select the output answer, often assuming that more candidate answers yield higher accuracy. We revisit this assumption through a rigorous theoretical analysis, deriving accuracy bounds for standard aggregation methods under fixed generation distributions and candidate sizes. Building on these insights, we introduce EPIC, an Ensemble Planning with Contrastive learning framework to learn a shared representation space that captures both model reasoning abilities and query-method compatibility. EPIC incorporates our probability bounds as a regularizer in a utility-driven optimization that balances accuracy and computational cost. Experiments on diverse mathematical reasoning tasks show that EPIC consistently selects optimal reasoning methods, improving accuracy while reducing computational overhead. Our code can be found at https://github.com/nguyenngocbaocmt02/EPIC.

View full details

Poster

Enhancing Time Series Forecasting through Selective Representation Spaces: A Patch Perspective

Xingjian Wu ⋅ Xiangfei Qiu ⋅ Hanyin Cheng ⋅ Zhengyu Li ⋅ Jilin Hu ⋅ Chenjuan Guo ⋅ Bin Yang

Dec 5, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Time Series Forecasting has made significant progress with the help of Patching technique, which partitions time series into multiple patches to effectively retain contextual semantic information into a representation space beneficial for modeling long-term dependencies. However, conventional patching partitions a time series into adjacent patches, which causes a fixed representation space, thus resulting in insufficiently expressful representations. In this paper, we pioneer the exploration of constructing a selective representation space to flexibly include the most informative patches for forecasting. Specifically, we propose the Selective Representation Space (SRS) module, which utilizes the learnable Selective Patching and Dynamic Reassembly techniques to adaptively select and shuffle the patches from the contextual time series, aiming at fully exploiting the information of contextual time series to enhance the forecasting performance of patch-based models. To demonstrate the effectiveness of SRS module, we propose a simple yet effective SRSNet consisting of SRS and an MLP head, which achieves state-of-the-art performance on real-world datasets from multiple domains. Furthermore, as a novel plug-and-play module, SRS can also enhance the performance of existing patch-based models. The resources are available at https://github.com/decisionintelligence/SRSNet.

View full details

Poster

OmniSync: Towards Universal Lip Synchronization via Diffusion Transformers

Ziqiao Peng ⋅ Jiwen Liu ⋅ Haoxian Zhang ⋅ Xiaoqiang Liu ⋅ Songlin Tang ⋅ Pengfei Wan ⋅ Di ZHANG ⋅ Hongyan Liu ⋅ Jun He

Dec 5, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Lip synchronization is the task of aligning a speaker’s lip movements in video with corresponding speech audio, and it is essential for creating realistic, expressive video content. However, existing methods often rely on reference frames and masked-frame inpainting, which limit their robustness to identity consistency, pose variations, facial occlusions, and stylized content. In addition, since audio signals provide weaker conditioning than visual cues, lip shape leakage from the original video will affect lip sync quality. In this paper, we present OmniSync, a universal lip synchronization framework for diverse visual scenarios. Our approach introduces a mask-free training paradigm using Diffusion Transformer models for direct frame editing without explicit masks, enabling unlimited-duration inference while maintaining natural facial dynamics and preserving character identity. During inference, we propose a flow-matching-based progressive noise initialization to ensure pose and identity consistency, while allowing precise mouth-region editing. To address the weak conditioning signal of audio, we develop a Dynamic Spatiotemporal Classifier-Free Guidance (DS-CFG) mechanism that adaptively adjusts guidance strength over time and space. We also establish the AIGC-LipSync Benchmark, the first evaluation suite for lip synchronization in diverse AI-generated videos. Extensive experiments demonstrate that OmniSync significantly outperforms prior methods in both visual quality and lip sync accuracy, achieving superior results in both real-world and AI-generated videos.

View full details

Poster

AuroRA: Breaking Low-Rank Bottleneck of LoRA with Nonlinear Mapping

Haonan Dong ⋅ Wenhao Zhu ⋅ Guojie Song ⋅ Liang Wang

Dec 5, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Low-Rank Adaptation (LoRA) is a widely adopted parameter-efficient fine-tuning (PEFT) method validated across NLP and CV domains. However, LoRA faces an inherent low-rank bottleneck: narrowing its performance gap with full fine-tuning requires increasing the rank of its parameter matrix, resulting in significant parameter overhead. Recent linear LoRA variants have attempted to enhance expressiveness by introducing additional linear mappings; however, their composition remains inherently linear and fails to fundamentally improve LoRA’s representational capacity. To address this limitation, we propose \ourmethod, which incorporates an Adaptive Nonlinear Layer (ANL) between two linear projectors to capture \emph{fixed} and \emph{learnable} nonlinearities. This combination forms an {\fontfamily{lmtt}\selectfont \textbf{MLP-like structure}} with a compressed rank, enabling flexible and precise approximation of diverse target functions while theoretically guaranteeing lower approximation errors and bounded gradients. Extensive experiments on 22 datasets and 6 pretrained models demonstrate that \ourmethod: (\textbf{I}) not only matches or surpasses full fine-tuning performance with only $6.18\%\sim25\%$ of LoRA’s parameters but also (\textbf{II}) outperforms state-of-the-art PEFT methods by up to $10.88\%$ in both NLP and CV tasks, and \textbf{(III)} exhibits robust performance across various rank configurations.

View full details

Poster

Compress to Impress: Efficient LLM Adaptation Using a Single Gradient Step on 100 Samples

Shiva Sreeram ⋅ Alaa Maalouf ⋅ Pratyusha Sharma ⋅ Daniela Rus

Dec 5, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Recently, Sharma et al. (2024) suggested a method called LAyer- SElective-Rank reduction (LASER) which demonstrated that pruning high‑order components of carefully chosen LLM’s weight matrices can boost downstream accuracy—without any gradient‑based fine‑tuning. Yet LASER’s exhaustive, per‑matrix search (each requiring full‑dataset forward passes) makes it impractical for rapid deployment. We demonstrate that this overhead can be removed and find that: (i) Only a small, carefully chosen subset of matrices needs to be inspected—eliminating the layer‑by‑layer sweep, (ii) The gradient of each matrix’s singular values pinpoints which matrices merit reduction, (iii) Increasing the factorization search space by allowing matrices rows to cluster around multiple subspaces and then decomposing each cluster separately further reduces overfitting on the original training data and further lifts accuracy by up to 24.6 percentage points, and finally, (iv) we discover that evaluating on just 100 samples rather than the full training data—both for computing the indicative gradients and for measuring the final accuracy—suffices to further reduce the search time; we explain that as adaptation to downstream tasks is dominated by prompting style, not dataset size. As a results, we show that combining these findings yields a fast and robust adaptation algorithm for downstream tasks. Overall, with a single gradient step on 100 examples and a quick scan of the top candidate layers and factorization techniques, we can adapt LLMs to new datasets—entirely without fine‑tuning.

View full details

Poster

TokenSwap: A Lightweight Method to Disrupt Memorized Sequences in LLMs

Parjanya Prashant ⋅ Kaustubh Ponkshe ⋅ Babak Salimi

Dec 5, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

As language models scale, their performance improves dramatically across a wide range of tasks, but so does their tendency to memorize and regurgitate parts of their training data verbatim. This tradeoff poses serious legal, ethical, and safety concerns, especially in real-world deployments. Existing mitigation techniques, such as differential privacy or model unlearning, often require retraining or access to internal weights making them impractical for most users. In this work, we introduce TokenSwap, a lightweight, post-hoc defense designed for realistic settings where the user can only access token-level outputs. Our key insight is that while large models are necessary for high task performance, small models (e.g., DistilGPT-2) are often sufficient to assign fluent, grammatically plausible probabilities to common function words - and crucially, they memorize far less. By selectively swapping token probabilities between models, TokenSwap preserves the capabilities of large models while reducing their propensity for verbatim reproduction. Evaluations on Pythia-6.9B and Llama-3-8B show up to a 10$\times$ drop in exact memorization with negligible task degradation. Our method offers a practical, accessible solution for mitigating memorized generation in deployed LLMs. Code is available at https://github.com/parjanya20/verbatim-llm.

View full details

Poster

Towards Interpretable and Efficient Attention: Compressing All by Contracting a Few

Qishuai Wen ⋅ Zhiyuan Huang ⋅ Chun-Guang Li

Dec 5, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Attention mechanisms have achieved significant empirical success in multiple fields, but their underlying optimization objectives remain unclear yet. Moreover, the quadratic complexity of self-attention has become increasingly prohibitive. Although interpretability and efficiency are two mutually reinforcing pursuits, prior work typically investigates them separately. In this paper, we propose a unified optimization objective that derives inherently interpretable and efficient attention mechanisms through algorithm unrolling. Precisely, we construct a gradient step of the proposed objective with a set of forward-pass operations of our \emph{Contract-and-Broadcast Self-Attention} (CBSA), which compresses input tokens towards low-dimensional structures by contracting a few representatives of them. This novel mechanism can not only scale linearly by fixing the number of representatives, but also covers the instantiations of varied attention mechanisms when using different sets of representatives. We conduct extensive experiments to demonstrate comparable performance and superior advantages over black-box attention mechanisms on visual tasks. Our work sheds light on the integration of interpretability and efficiency, as well as the unified formula of attention mechanisms. Code is available at \href{https://github.com/QishuaiWen/CBSA}{this https URL}.

View full details

Poster

ResearchCodeBench: Benchmarking LLMs on Implementing Novel Machine Learning Research Code

Tianyu Hua ⋅ Harper Hua ⋅ Violet Xiang ⋅ Benjamin Klieger ⋅ Sang Truong ⋅ Weixin Liang ⋅ Fan-Yun Sun ⋅ Nick Haber

Dec 5, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Large language models (LLMs) have shown promise in transforming machine learning research, yet their capability to faithfully implement genuinely novel ideas from recent research papers—ideas unseen during pretraining—remains unclear. We introduce ResearchCodeBench, a benchmark that evaluates LLMs’ ability to translate cutting-edge ML contributions from top 2024-2025 research papers into executable code. We assessed 30+ proprietary and open-source LLMs, finding that even the best models correctly implement less than 40% of the code. We present empirical findings on performance comparison, contamination, and error patterns. By providing a rigorous evaluation platform, ResearchCodeBench enables continuous understanding and advancement of LLM-driven innovation in research code generation.

View full details

Poster

VisualQuality-R1: Reasoning-Induced Image Quality Assessment via Reinforcement Learning to Rank

Tianhe Wu ⋅ Jian Zou ⋅ Jie Liang ⋅ Lei Zhang ⋅ Kede Ma

Dec 5, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

DeepSeek-R1 has demonstrated remarkable effectiveness in incentivizing reasoning and generalization capabilities of large language models (LLMs) through reinforcement learning. Nevertheless, the potential of reasoning-induced computation has not been thoroughly explored in the context of image quality assessment (IQA), a task depending critically on visual reasoning. In this paper, we introduce VisualQuality-R1, a reasoning-induced no-reference IQA (NR-IQA) model, and we train it with reinforcement learning to rank, a learning algorithm tailored to the intrinsically relative nature of visual quality. Specifically, for a pair of images, we employ group relative policy optimization to generate multiple quality scores for each image. These estimates are used to compute comparative probabilities of one image having higher quality than the other under the Thurstone model. Rewards for each quality estimate are defined using continuous fidelity measures rather than discretized binary labels. Extensive experiments show that the proposed VisualQuality-R1 consistently outperforms discriminative deep learning-based NR-IQA models as well as a recent reasoning-induced quality regression method. Moreover, VisualQuality-R1 is capable of generating contextually rich, human-aligned quality descriptions, and supports multi-dataset training without requiring perceptual scale realignment. These features make VisualQuality-R1 especially well-suited for reliably measuring progress in a wide range of image processing tasks like super-resolution and image generation.

View full details

Poster

DMWM: Dual-Mind World Model with Long-Term Imagination

Lingyi Wang ⋅ Rashed Shelim ⋅ Walid Saad ⋅ Naren Ramakrishnan

Dec 5, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Imagination in world models is crucial for enabling agents to learn long-horizon policy in a sample-efficient manner. Existing recurrent state-space model (RSSM)-based world models depend on single-step statistical inference to capture the environment dynamics, and, hence, they are unable to perform long-term imagination tasks due to the accumulation of prediction errors. Inspired by the dual-process theory of human cognition, we propose a novel dual-mind world model (DMWM) framework that integrates logical reasoning to enable imagination with logical consistency. DMWM is composed of two components: an RSSM-based System 1 (RSSM-S1) component that handles state transitions in an intuitive manner and a logic-integrated neural network-based System 2 (LINN-S2) component that guides the imagination process through hierarchical deep logical reasoning. The inter-system feedback mechanism is designed to ensure that the imagination process follows the logical rules of the real environment. The proposed framework is evaluated on benchmark tasks that require long-term planning from the DMControl suite and robotic environment. Extensive experimental results demonstrate that the proposed framework yields significant improvements in terms of logical coherence, trial efficiency, data efficiency and long-term imagination over the state-of-the-art world models.

View full details

Poster

MoBA: Mixture of Block Attention for Long-Context LLMs

Enzhe Lu ⋅ Zhejun Jiang ⋅ Jingyuan Liu ⋅ Yulun Du ⋅ Tao Jiang ⋅ Chao Hong ⋅ Shaowei Liu ⋅ Weiran He ⋅ Enming Yuan ⋅ Yuzhi Wang ⋅ Zhiqi Huang ⋅ Huan Yuan ⋅ Suting Xu ⋅ Xinran Xu ⋅ Guokun Lai ⋅ Yanru Chen ⋅ Huabin Zheng ⋅ Junjie Yan ⋅ Jianlin Su ⋅ Yuxin Wu ⋅ Yutao Zhang ⋅ Zhilin Yang ⋅ Xinyu Zhou ⋅ Mingxing Zhang ⋅ Jiezhong Qiu

Dec 5, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Scaling the effective context length is essential for advancing large language models (LLMs) toward artificial general intelligence (AGI). However, the quadratic increase in computational complexity inherent in traditional attention mechanisms presents a prohibitive overhead. Existing approaches either impose strongly biased structures, such as sink or window attention which are task-specific, or radically modify the attention mechanism into linear approximations, whose performance in complex reasoning tasks remains inadequately explored. In this work, we propose a solution that adheres to the ``less structure'' principle, allowing the model to determine where to attend autonomously, rather than introducing predefined biases. We introduce Mixture of Block Attention (MoBA), an innovative approach that applies the principles of Mixture of Experts (MoE) to the attention mechanism. This novel architecture demonstrates superior performance on long-context tasks while offering a key advantage: the ability to seamlessly transition between full and sparse attention, enhancing efficiency without the risk of compromising performance. MoBA has already been deployed to handle actual production workloads with long-context requirements, demonstrating significant advancements in efficient attention computation for LLMs. Our code is available at https://github.com/MoonshotAI/MoBA.

View full details

Poster

Sparse VideoGen2: Accelerate Video Generation with Sparse Attention via Semantic-Aware Permutation

Shuo Yang ⋅ Haocheng Xi ⋅ Yilong Zhao ⋅ Muyang Li ⋅ Jintao Zhang ⋅ Han Cai ⋅ Yujun Lin ⋅ Xiuyu Li ⋅ Chenfeng Xu ⋅ Kelly Peng ⋅ Jianfei Chen ⋅ Song Han ⋅ Kurt Keutzer ⋅ Ion Stoica

Dec 5, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Diffusion Transformers (DiTs) are essential for video generation but suffer from significant latency due to the quadratic complexity of attention. By computing only critical tokens, sparse attention reduces computational costs and offers a promising acceleration approach. However, we identify that existing methods fail to approach optimal generation quality under the same computation budget for two reasons: (1) Inaccurate critical token identification: current methods cluster tokens based on position rather than semantics, leading to imprecise aggregated representations. (2) Excessive computation waste: critical tokens are scattered among non-critical ones, leading to wasted computation on GPUs, which are optimized for processing contiguous tokens. In this paper, we propose SVG2, a training-free framework that maximizes identification accuracy and minimizes computation waste, achieving a Pareto frontier trade-off between generation quality and efficiency. The core of SVG2 is semantic-aware permutation, which clusters and reorders tokens based on semantic similarity using k-means. This approach ensures both a precise cluster representation, improving identification accuracy, and a densified layout of critical tokens, enabling efficient computation without padding. Additionally, SVG2 integrates Top-p dynamic budget control and customized kernel implementations, achieving up to $2.30\times$ and $1.89\times$ speedup while maintaining a PSNR of up to $30$ and $26$ on HunyuanVideo and Wan 2.1, respectively. Our code is open-sourced at https://github.com/svg-project/Sparse-VideoGen.

View full details

Poster

HyPINO: Multi-Physics Neural Operators via HyperPINNs and the Method of Manufactured Solutions

Rafael Bischof ⋅ Michal Piovarci ⋅ Michael Kraus ⋅ Siddhartha Mishra ⋅ Bernd Bickel

Dec 5, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

We present HyPINO, a multi-physics neural operator designed for zero-shot generalization across a broad class of PDEs without requiring task-specific fine-tuning. Our approach combines a Swin Transformer-based hypernetwork with mixed supervision: (i) labeled data from analytical solutions generated via the Method of Manufactured Solutions (MMS), and (ii) unlabeled samples optimized using physics-informed objectives. The model maps PDE parameterizations to target Physics-Informed Neural Networks (PINNs) and can handle linear elliptic, hyperbolic, and parabolic equations in two dimensions with varying source terms, geometries, and mixed Dirichlet/Neumann boundary conditions, including interior boundaries. HyPINO achieves strong zero-shot accuracy on seven benchmark problems from PINN literature, outperforming U-Nets, Poseidon, and Physics-Informed Neural Operators (PINO). Further, we introduce an iterative refinement procedure that treats the residual of the generated PINN as "delta PDE" and performs another forward pass to generate a corrective PINN. Summing their contributions and repeating this process forms an ensemble whose combined solution progressively reduces the error on six benchmarks and achieves a >100× lower $L_2$ loss in the best case, while retaining forward-only inference. Additionally, we evaluate the fine-tuning behavior of PINNs initialized by HyPINO and show that they converge faster and to lower final error than both randomly initialized and Reptile-meta-learned PINNs on five benchmarks, performing on par on the remaining two. Our results highlight the potential of this scalable approach as a foundation for extending neural operators toward solving increasingly complex, nonlinear, and high-dimensional PDE problems. The code and model weights are publicly available at https://github.com/rbischof/hypino.

View full details

Poster

Hyperbolic Fine-Tuning for Large Language Models

Menglin Yang ⋅ Ram B ⋅ Aosong Feng ⋅ Bo Xiong ⋅ Jiahong Liu ⋅ Irwin King ⋅ Rex Ying

Dec 5, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Large language models (LLMs) have demonstrated remarkable performance across various tasks. However, it remains an open question whether the default Euclidean space is the most suitable choice for LLMs. In this study, we investigate the geometric characteristics of LLMs, focusing specifically on tokens and their embeddings. Our findings reveal that token frequency follows a power-law distribution, where high-frequency tokens (e.g., the, that ) constitute the minority, while low-frequency tokens (e.g., apple, dog) constitute the majority. Furthermore, high-frequency tokens cluster near the origin, whereas low-frequency tokens are positioned farther away in the embedding space. Additionally, token embeddings exhibit hyperbolic characteristics, indicating a latent tree-like structure within the embedding space. Motivated by these observations, we propose **HypLoRA**, an efficient fine-tuning approach that operates in hyperbolic space to exploit these underlying hierarchical structures better. HypLoRA performs low-rank adaptation directly in hyperbolic space, thereby preserving hyperbolic modeling capabilities throughout the fine-tuning process. Extensive experiments across various base models and reasoning benchmarks, specifically arithmetic and commonsense reasoning tasks, demonstrate that HypLoRA substantially improves LLM performance.

View full details

Poster

Seeing Sound, Hearing Sight: Uncovering Modality Bias and Conflict of AI models in Sound Localization

Yanhao Jia ⋅ Ji Xie ⋅ S Jivaganesh ⋅ Li Hao ⋅ Xu Wu ⋅ Mengmi Zhang

Dec 5, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Imagine hearing a dog bark and instinctively turning toward the sound—only to find a parked car, while a silent dog sits nearby. Such moments of sensory conflict challenge perception, yet humans flexibly resolve these discrepancies, prioritizing auditory cues over misleading visuals to accurately localize sounds. Despite the rapid advancement of multimodal AI models that integrate vision and sound, little is known about how these systems handle cross-modal conflicts or whether they favor one modality over another. Here, we systematically and quantitatively examine modality bias and conflict resolution in AI models for Sound Source Localization (SSL). We evaluate a wide range of state-of-the-art multimodal models and compare them against human performance in psychophysics experiments spanning six audiovisual conditions, including congruent, conflicting, and absent visual and audio cues. Our results reveal that humans consistently outperform AI in SSL and exhibit greater robustness to conflicting or absent visual information by effectively prioritizing auditory signals. In contrast, AI shows a pronounced bias toward vision, often failing to suppress irrelevant or conflicting visual input, leading to chance-level performance. To bridge this gap, we present EchoPin, a neuroscience-inspired multimodal model for SSL that emulates human auditory perception. The model is trained on our carefully curated AudioCOCO dataset, in which stereo audio signals are first rendered using a physics-based 3D simulator, then filtered with Head-Related Transfer Functions (HRTFs) to capture pinnae, head, and torso effects, and finally transformed into cochleagram representations that mimic cochlear processing. To eliminate existing biases in standard benchmark datasets, we carefully controlled the vocal object sizes, semantics, and spatial locations in the corresponding images of AudioCOCO. EchoPin outperforms existing models trained on standard audio-visual datasets. Remarkably, consistent with neuroscience findings, it exhibits a human-like localization bias, favoring horizontal (left–right) precision over vertical (up–down) precision. This asymmetry likely arises from HRTF-shaped and cochlear-modulated stereo audio and the lateral placement of human ears, highlighting how sensory input quality and physical structure jointly shape precision of multimodal representations. All code, data, and models are available \href{https://github.com/CuriseJia/SSHS}{here}.

View full details

Poster

Advanced Sign Language Video Generation with Compressed and Quantized Multi-Condition Tokenization

Cong Wang ⋅ Zexuan Deng ⋅ Zhiwei Jiang ⋅ Yafeng Yin ⋅ Fei Shen ⋅ Zifeng Cheng ⋅ Shiping Ge ⋅ Shiwei Gan ⋅ Qing Gu

Dec 5, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Sign Language Video Generation (SLVG) seeks to generate identity-preserving sign language videos from spoken language texts. Existing methods primarily rely on the single coarse condition (e.g., skeleton sequences) as the intermediary to bridge the translation model and the video generation model, which limits both the naturalness and expressiveness of the generated videos. To overcome these limitations, we propose SignViP, a novel SLVG framework that incorporate multiple fine-grained conditions for improved generation fidelity. Rather than directly translating error-prone high-dimensional conditions, SignViP adopts a discrete tokenization paradigm to integrate and represent fine-grained conditions (i.e., fine-grained poses and 3D hands). SignViP contains three core components. (1) Sign Video Diffusion Model is jointly trained with a multi-condition encoder to learn continuous embeddings that encapsulate fine-grained motion and appearance. (2) Finite Scalar Quantization (FSQ) Autoencoder is further trained to compress and quantize these embeddings into discrete tokens for compact representation of the conditions. (3) Multi-Condition Token Translator is trained to translate spoken language text to discrete multi-condition tokens. During inference, Multi-Condition Token Translator first translates the spoken language text into discrete multi-condition tokens. These tokens are then decoded to continuous embeddings by FSQ Autoencoder, which are subsequently injected into Sign Video Diffusion Model to guide video generation. Experimental results show that SignViP achieves state-of-the-art performance across metrics, including video quality, temporal coherence, and semantic fidelity. The code is available at https://github.com/umnooob/signvip/.

View full details

Poster

Right Question is Already Half the Answer: Fully Unsupervised LLM Reasoning Incentivization

Qingyang Zhang ⋅ Haitao Wu ⋅ Changqing Zhang ⋅ Peilin Zhao ⋅ Yatao Bian

Dec 5, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Existing methods to enhance the reasoning capability of large language models predominantly rely on supervised fine-tuning (SFT) followed by reinforcement learning (RL) on reasoning-specific data. These approaches critically depend on external supervisions--such as labeled reasoning traces, verified golden answers, or pre-trained reward models. In this work, we propose Entropy Minimized Policy Optimization (EMPO), which makes an early attempt at fully unsupervised LLM reasoning incentivization. By minimizing the semantic entropy of LLMs on unlabeled questions, EMPO achieves competitive performance compared to supervised counterparts. Specifically, without any supervised signals, EMPO boosts the accuracy of Qwen2.5-Math-7B Base from 33.7\% to 51.6\% on math benchmarks and improves the accuracy of Qwen2.5-7B Base from 32.1\% to 50.1\% on MMLU-Pro. Primary analysis are also provided to interpret the effectiveness of EMPO. Code is available at https://github.com/QingyangZhang/EMPO.

View full details

Poster

Memo: Training Memory-Efficient Embodied Agents with Reinforcement Learning

Gunshi Gupta ⋅ Karmesh Yadav ⋅ Zsolt Kira ⋅ Yarin Gal ⋅ Rahaf Aljundi

Dec 5, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

To enable embodied agents to operate effectively over extended timeframes, it is crucial to develop models that form and access memories to stay contextualized in their environment. In the current paradigm of training transformer-based policies for embodied sequential decision-making tasks, visual inputs often overwhelm the context limits of transformers, while humans can maintain and utilize a lifetime of experience compressed as memories. Significant compression is possible in principle, as much of the input is irrelevant and can be abstracted. However, existing approaches predominantly focus on either recurrent models with fixed-size memory or transformers with full-context reliance. In this work, we propose Memo, a transformer-based architecture and training recipe for reinforcement learning (RL) on memory-intensive, long-horizon tasks. Memo incorporates the creation and retrieval of memory by interleaving periodic summarization tokens with the inputs of a model during training. We demonstrate Memo’s effectiveness on a grid-world meta-RL benchmark and a multi-object navigation task in photo-realistic indoor settings. Memo outperforms naive long-context transformer baselines while being more compute and storage efficient. Additionally, Memo generalizes better to longer contexts at inference time and remains robust in streaming settings, where historical context must be truncated to fit inference constraints.

View full details

Poster

CAR-Flow: Condition-Aware Reparameterization Aligns Source and Target for Better Flow Matching

Chen Chen ⋅ Pengsheng Guo ⋅ Liangchen Song ⋅ Jiasen Lu ⋅ Rui Qian ⋅ Tsu-Jui Fu ⋅ Xinze Wang ⋅ Wei Liu ⋅ Yinfei Yang ⋅ Alex Schwing

Dec 5, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Conditional generative modeling aims to learn a conditional data distribution from samples containing data-condition pairs. For this, diffusion and flow-based methods have attained compelling results. These methods use a learned (flow) model to transport an initial standard Gaussian noise that ignores the condition to the conditional data distribution. The model is hence required to learn both mass transport \emph{and} conditional injection. To ease the demand on the model, we propose \emph{Condition-Aware Reparameterization for Flow Matching} (CAR-Flow) -- a lightweight, learned \emph{shift} that conditions the source, the target, or both distributions. By relocating these distributions, CAR-Flow shortens the probability path the model must learn, leading to faster training in practice. On low-dimensional synthetic data, we visualize and quantify the effects of CAR-Flow. On higher-dimensional natural image data (ImageNet-256), equipping SiT-XL/2 with CAR-Flow reduces FID from 2.07 to 1.68, while introducing less than \(0.6\%\) additional parameters.

View full details

Poster

StreamForest: Efficient Online Video Understanding with Persistent Event Memory

Xiangyu Zeng ⋅ Kefan Qiu ⋅ Qingyu Zhang ⋅ Xinhao Li ⋅ Jing Wang ⋅ Jiaxin Li ⋅ Ziang Yan ⋅ Kun Tian ⋅ Meng Tian ⋅ Xinhai Zhao ⋅ Yi Wang ⋅ Limin Wang

Dec 5, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Multimodal Large Language Models (MLLMs) have recently achieved remarkable progress in video understanding. However, their effectiveness in real-time streaming scenarios remains limited due to storage constraints of historical visual features and insufficient real-time spatiotemporal reasoning. To address these challenges, we propose StreamForest, a novel architecture specifically designed for streaming video understanding. Central to StreamForest is the Persistent Event Memory Forest, a memory mechanism that adaptively organizes video frames into multiple event-level tree structures. This process is guided by penalty functions based on temporal distance, content similarity, and merge frequency, enabling efficient long-term memory retention under limited computational resources. To enhance real-time perception, we introduce a Fine-grained Spatiotemporal Window, which captures detailed short-term visual cues to improve current scene perception. Additionally, we present OnlineIT, an instruction-tuning dataset tailored for streaming video tasks. OnlineIT significantly boosts MLLM performance in both real-time perception and future prediction. To evaluate generalization in practical applications, we introduce ODV-Bench, a new benchmark focused on real-time streaming video understanding in autonomous driving scenarios. Experimental results demonstrate that StreamForest achieves the state-of-the-art performance, with accuracies of 77.3% on StreamingBench, 60.5% on OVBench, and 55.6% on OVO-Bench. In particular, even under extreme visual token compression (limited to 1024 tokens), the model retains 96.8% of its average accuracy in eight benchmarks relative to the default setting. These results underscore the robustness, efficiency, and generalizability of StreamForest for streaming video understanding.

View full details

Poster

Towards Understanding the Mechanisms of Classifier-Free Guidance

Xiang Li ⋅ Rongrong Wang ⋅ Qing Qu

Dec 5, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Classifier-free guidance (CFG) is a core technique powering state-of-the-art image generation systems, yet its underlying mechanisms remain poorly understood. In this work, we first analyze CFG in a simplified linear diffusion model, where we show its behavior closely resembles that observed in the nonlinear case. Our analysis reveals that linear CFG improves generation quality via three distinct components: (i) a mean-shift term that approximately steers samples in the direction of class means, (ii) a positive Contrastive Principal Components (CPC) term that amplifies class-specific features, and (iii) a negative CPC term that suppresses generic features prevalent in unconditional data. We then verify these insights in real-world, nonlinear diffusion models: over a broad range of noise levels, linear CFG resembles the behavior of its nonlinear counterpart. Although the two eventually diverge at low noise levels, we discuss how the insights from the linear analysis still shed light on the CFG's mechanism within the nonlinear regime.

View full details

Poster

HBLLM: Wavelet-Enhanced High-Fidelity 1-Bit Quantization for LLMs

Ningning Chen ⋅ Weicai Ye ⋅ Ying Jiang

Dec 5, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

We introduce HBLLM, a wavelet-enhanced high-fidelity $1$-bit post-training quantization method for Large Language Models (LLMs). By leveraging Haar wavelet transforms to enhance expressive capacity through frequency decomposition, HBLLM significantly improves quantization fidelity while maintaining minimal overhead. This approach features two innovative structure-aware grouping strategies: (1) frequency-aware multi-parameter intra-row grouping and (2) $\ell_2$-norm-based saliency-driven column selection. For non-salient weights, a shared mean is employed across quantization groups within each frequency band to optimize storage efficiency. Experiments conducted on the OPT and LLaMA models demonstrate that HBLLM achieves state-of-the-art performance in $1$-bit quantization, attaining a perplexity of $6.71$ on LLaMA$2$-$13$B with an average weight storage of only $1.08$ bits. Code available at: https://github.com/Yeyke/HBLLM.

View full details

Poster

Twilight: Adaptive Attention Sparsity with Hierarchical Top-$p$ Pruning

Chaofan Lin ⋅ Jiaming Tang ⋅ Shuo Yang ⋅ Hanshuo Wang ⋅ Tian Tang ⋅ Boyu Tian ⋅ Ion Stoica ⋅ Song Han ⋅ Mingyu Gao

Dec 5, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Leveraging attention sparsity to accelerate long-context large language models (LLMs) has been of great importance recently. However, most existing sparse attention algorithms use a fixed budget of how many tokens to use in their computations. This simple static decision raises critical issues in real-world deployment because it fails to account for the dynamic nature of real-world scenarios, where the optimal balance between accuracy and efficiency can vary greatly. In this paper, we reveal a key insight that leveraging the idea of top-$p$ sampling (a.k.a., nucleus sampling) in sparse attention could enable efficient and adaptive budget decisions. Based on this, we propose Twilight, a framework that enhances any existing sparse attention algorithm with adaptive budget decision capabilities without sacrificing accuracy. Empirical results show that Twilight can adaptively prune up to 98% tokens with nearly no accuracy loss in both mid- and long-context scenarios, leading to a $1.4\times$ speedup over state-of-the-art sparse attention mechanisms.

View full details

Poster

Structured Sparse Transition Matrices to Enable State Tracking in State-Space Models

Aleksandar Terzic ⋅ Nicolas Menet ⋅ Michael Hersche ⋅ Thomas Hofmann ⋅ Abbas Rahimi

Dec 5, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Modern state-space models (SSMs) often utilize structured transition matrices which enable efficient computation but pose restrictions on the model’s expressivity, as measured in terms of the ability to emulate finite-state automata (FSA). While unstructured transition matrices are optimal in terms of expressivity, they come at a prohibitively high compute and memory cost, even for moderate state sizes. We propose a structured sparse parametrization of transition matrices in SSMs that enables FSA state tracking with provably optimal state size and depth, while keeping the computational cost of the recurrence comparable to that of diagonal SSMs. Our method, \emph{PD-SSM}, parametrizes the transition matrix as the product of a column one-hot matrix ($P$) and a complex-valued diagonal matrix ($D$). As a result, the computational cost of parallel scans scales linearly with the state size. Theoretically, the model is BIBO-stable and can emulate any $N$-state FSA with one layer of dimension $N$ and a linear readout of size $N ×N$, significantly improving on all current structured SSM guarantees. Experimentally, the model significantly outperforms a wide collection of modern SSM variants on various FSA state tracking tasks. On multivariate time-series classification, it outperforms neural controlled differential equations, a paradigm explicitly built for time-series analysis. Finally, we integrate PD-SSM into a hybrid Transformer-SSM architecture and demonstrate that the model can effectively track the states of a complex FSA in which transitions are encoded into sets of variable-length English sentences. The code is available at https://github.com/IBM/expressive-sparse-state-space-model.

View full details

Poster

Unlocking hidden biomolecular conformational landscapes in diffusion models at inference time

Daniel D. Richman ⋅ Jessica Karaguesian ⋅ Carl-Mikael Suomivuori ⋅ Ron Dror

Dec 5, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

The function of biomolecules such as proteins depends on their ability to interconvert between a wide range of structures or conformations. Researchers have endeavored for decades to develop computational methods to predict the distribution of conformations, which is far harder to determine experimentally than a static folded structure. We present ConforMix, an inference-time algorithm that enhances sampling of conformational distributions using a combination of classifier guidance, filtering, and free energy estimation. Our approach upgrades diffusion models---whether trained for static structure prediction or conformational generation---to enable more efficient discovery of conformational variability without requiring prior knowledge of major degrees of freedom. ConforMix is orthogonal to improvements in model pretraining and would benefit even a hypothetical model that perfectly reproduced the Boltzmann distribution. Remarkably, when applied to a diffusion model trained for static structure prediction, ConforMix captures structural changes including domain motion, cryptic pocket flexibility, and transporter cycling, while avoiding unphysical states. Case studies of biologically critical proteins demonstrate the scalability, accuracy, and utility of this method.

View full details

Poster

Gymnasium: A Standard Interface for Reinforcement Learning Environments

Mark Towers ⋅ Ariel Kwiatkowski ⋅ John Balis ⋅ Gianluca De Cola ⋅ Tristan Deleu ⋅ Manuel Goulão ⋅ Kallinteris Andreas ⋅ Markus Krimmel ⋅ Arjun KG ⋅ Rodrigo Perez-Vicente ⋅ J Terry ⋅ Andrea Pierré ⋅ Sander Schulhoff ⋅ Jun Jet Tai ⋅ Hannah Tan ⋅ Omar G. Younis

Dec 5, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Reinforcement Learning (RL) is a continuously growing field that has the potential to revolutionize many areas of artificial intelligence. However, despite its promise, RL research is often hindered by the lack of standardization in environment and algorithm implementations. This makes it difficult for researchers to compare and build upon each other's work, slowing down progress in the field.Gymnasium is an open-source library that provides a standard API for RL environments, aiming to tackle this issue. Gymnasium's main feature is a set of abstractions that allow for wide interoperability between environments and training algorithms, making it easier for researchers to develop and test RL algorithms. In addition, Gymnasium provides a collection of easy-to-use environments, tools for easily customizing environments, and tools to ensure the reproducibility and robustness of RL research.Through this unified framework, Gymnasium significantly streamlines the process of developing and testing RL algorithms, enabling researchers to focus more on innovation and less on implementation details. By providing a standardized platform for RL research, Gymnasium helps to drive forward the field of reinforcement learning and unlock its full potential. Gymnasium is available online at \url{https://github.com/Farama-Foundation/Gymnasium}.

View full details

Poster

DisMo: Disentangled Motion Representations for Open-World Motion Transfer

Thomas Ressler-Antal ⋅ Frank Fundel ⋅ Malek Ben Alaya ⋅ Stefan Andreas Baumann ⋅ Felix Krause ⋅ Ming Gui ⋅ Björn Ommer

Dec 5, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Recent advances in text-to-video (T2V) and image-to-video (I2V) models, have enabled the creation of visually compelling and dynamic videos from simple textual descriptions or initial frames. However, these models often fail to provide an explicit representation of motion separate from content, limiting their applicability for content creators. To address this gap, we propose DisMo, a novel paradigm for learning abstract motion representations directly from raw video data via an image-space reconstruction objective. Our representation is generic and independent of static information such as appearance, object identity, or pose. This enables open-world motion transfer, allowing motion to be transferred across semantically unrelated entities without requiring object correspondences, even between vastly different categories. Unlike prior methods, which trade off motion fidelity and prompt adherence, are overfitting to source structure or drifting from the described action, our approach disentangles motion semantics from appearance, enabling accurate transfer and faithful conditioning. Furthermore, our motion representation can be combined with any existing video generator via lightweight adapters, allowing us to effortlessly benefit from future advancements in video models. We demonstrate the effectiveness of our method through a diverse set of motion transfer tasks. Finally, we show that the learned representations are well-suited for downstream motion understanding tasks, consistently outperforming state-of-the-art video representation models such as V-JEPA in zero-shot action classification on benchmarks including Something-Something v2 and Jester. Project page: https://compvis.github.io/DisMo

View full details

Poster

PLMTrajRec: A Scalable and Generalizable Trajectory Recovery Method with Pre-trained Language Models

Tonglong Wei ⋅ Yan Lin ⋅ Youfang Lin ⋅ Shengnan Guo ⋅ Jilin Hu ⋅ Haitao Yuan ⋅ Gao Cong ⋅ Huaiyu Wan

Dec 5, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Spatiotemporal trajectory data is crucial for various traffic-related applications. However, issues such as device malfunctions and network instability often result in sparse trajectories that lose detailed movement information compared to their dense counterparts. Recovering missing points in sparse trajectories is thus essential. Despite recent progress, three challenges remain. First, the lack of large-scale dense trajectory datasets hinders the training of a trajectory recovery model. Second, the varying spatiotemporal correlations in sparse trajectories make it hard to generalize across different sampling intervals. Third, extracting road conditions for missing points is non-trivial. To address these challenges, we propose $\textit{PLMTrajRec}$, a novel trajectory recovery model. It leverages the scalability of a pre-trained language model (PLM) and can effectively recover trajectories by fine-tuning with small-scale dense trajectory datasets. To handle different sampling intervals in sparse trajectories, we first convert sampling intervals and movement features into prompts for the PLM to understand. We then introduce a trajectory encoder to unify trajectories of varying intervals into a single interval. To extract road conditions for missing points, we propose an area flow-guided implicit trajectory prompt that represents traffic conditions in each region, and a road condition passing mechanism that infers the road conditions of missing points using the observed ones. Experiments on four public trajectory datasets with three sampling intervals demonstrate the effectiveness, scalability, and generalization ability of PLMTrajRec. Code is available at https://github.com/wtl52656/PLMTrajRec.

View full details

Poster

Absolute Zero: Reinforced Self-play Reasoning with Zero Data

Andrew Zhao ⋅ Yiran Wu ⋅ Yang Yue ⋅ Tong Wu ⋅ Quentin Xu ⋅ Yang Yue ⋅ Matthieu Lin ⋅ Shenzhi Wang ⋅ Qingyun Wu ⋅ Zilong Zheng ⋅ Gao Huang

Dec 5, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Reinforcement learning with verifiable rewards (RLVR) has shown promise in enhancing the reasoning capabilities of large language models by learning directly from rule-based outcome rewards. Recent RLVR works that operate under the zero setting avoid supervision in labeling the reasoning process, but still depend on manually curated collections of questions and answers for training. The scarcity of high-quality, human-produced examples raises concerns about the long-term scalability of relying on human supervision, a challenge already evident in the domain of language model pretraining. Furthermore, in a hypothetical future where AI surpasses human intelligence, tasks provided by humans may offer limited learning potential for a superintelligent system. To address these concerns, we propose a new RLVR paradigm called Absolute Zero, in which a single model learns to propose tasks that maximize its own learning progress and improves reasoning by solving them, without relying on any external human or distillation data. Under this paradigm, we introduce the Absolute Zero Reasoner (AZR), a system that self-evolves its training curriculum and reasoning ability. AZR uses a code executor to both validate self-proposed code reasoning tasks and verify answers, serving as an unified source of verifiable feedback to guide open-ended yet grounded learning. Despite being trained entirely without external data, AZR achieves overall SOTA performance on coding and mathematical reasoning tasks, outperforming existing zero-setting models that rely on tens of thousands of in-domain human-curated examples. Furthermore, we demonstrate that AZR can be effectively applied across different model scales and is compatible with various model classes.

View full details

Poster

MARS-VFL: A Unified Benchmark for Vertical Federated Learning with Realistic Evaluation

Wei Shen ⋅ Weiqi Liu ⋅ Mingde Chen ⋅ Wenke Huang ⋅ Mang Ye

Dec 5, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Vertical Federated Learning (VFL) has emerged as a critical privacy-preserving learning paradigm, enabling collaborative model training by leveraging distributed features across clients. However, due to privacy concerns, there are few publicly available real-world datasets for evaluating VFL methods, which poses significant challenges to related research. To bridge this gap, we propose MARS-VFL, a unified benchmark for realistic VFL evaluation. It integrates data from practical applications involving collaboration across different features, maintaining compatibility with the VFL setting. Based on this, we standardize the evaluation of VFL methods from the mainstream aspects of efficiency, robustness, and security. We conduct comprehensive experiments to assess different VFL approaches, providing references for unified evaluation. Furthermore, we are the first to unify the evaluation of robustness challenges in VFL and introduce a new method for addressing robustness challenges, establishing standard baselines for future research.

View full details

Poster

Enhancing Training Data Attribution with Representational Optimization

Weiwei Sun ⋅ Haokun Liu ⋅ Nikhil Kandpal ⋅ Colin Raffel ⋅ Yiming Yang

Dec 5, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Training data attribution (TDA) methods aim to measure how training data impacts a model's predictions. While gradient-based attribution methods, such as influence functions, offer theoretical grounding, their computational costs make them impractical for large-scale applications. Representation-based approaches are far more scalable, but typically rely on heuristic embeddings that are not optimized for attribution, limiting their fidelity. To address these challenges, we propose AirRep, a scalable, representation-based approach that closes this gap by learning task-specific and model-aligned representations optimized explicitly for TDA. AirRep introduces two key innovations: a trainable encoder tuned for attribution quality, and an attention-based pooling mechanism that enables accurate estimation of group-wise influence. We train AirRep using a ranking objective over automatically constructed training subsets labeled by their empirical effect on target predictions. Experiments on instruction-tuned LLMs demonstrate that AirRep achieves performance on par with state-of-the-art gradient-based approaches while being nearly two orders of magnitude more efficient at inference time. Further analysis highlights its robustness and generalization across tasks and models. Our code is available at https://github.com/sunnweiwei/AirRep.

View full details

Poster

Shallow Diffuse: Robust and Invisible Watermarking through Low-Dim Subspaces in Diffusion Models

Wenda Li ⋅ Huijie Zhang ⋅ Qing Qu

Dec 5, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

The widespread use of AI-generated content from diffusion models has raised significant concerns regarding misinformation and copyright infringement. Watermarking is a crucial technique for identifying these AI-generated images and preventing their misuse. In this paper, we introduce *Shallow Diffuse*, a new watermarking technique that embeds robust and invisible watermarks into diffusion model outputs. Unlike existing approaches that integrate watermarking throughout the entire diffusion sampling process, *Shallow Diffuse* decouples these steps by leveraging the presence of a low-dimensional subspace in the image generation process. This method ensures that a substantial portion of the watermark lies in the null space of this subspace, effectively separating it from the image generation process. Our theoretical and empirical analyses show that this decoupling strategy greatly enhances the consistency of data generation and the detectability of the watermark. Extensive experiments further validate that *Shallow Diffuse* outperforms existing watermarking methods in terms of consistency.

View full details

Poster

DeCaFlow: A deconfounding causal generative model

Alejandro Almodóvar ⋅ Adrián Javaloy ⋅ Juan Parras ⋅ Santiago Zazo ⋅ Isabel Valera

Dec 5, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

We introduce DeCaFlow, a deconfounding causal generative model. Training once per dataset using just observational data and the underlying causal graph, DeCaFlow enables accurate causal inference on continuous variables under the presence of hidden confounders. Specifically, we extend previous results on causal estimation under hidden confounding to show that a single instance of DeCaFlow provides correct estimates for all causal queries identifiable with do-calculus, leveraging proxy variables to adjust for the causal effects when do-calculus alone is insufficient. Moreover, we show that counterfactual queries are identifiable as long as their interventional counterparts are identifiable, and thus are also correctly estimated by DeCaFlow. Our empirical results on diverse settings—including the Ecoli70 dataset, with 3 independent hidden confounders, tens of observed variables and hundreds of causal queries—show that DeCaFlow outperforms existing approaches, while demonstrating its out-of-the-box applicability to any given causal graph.

View full details

Poster

Approximate Domain Unlearning for Vision-Language Models

Kodai Kawamura ⋅ Yuta Goto ⋅ Rintaro Yanagi ⋅ Hirokatsu Kataoka ⋅ Go Irie

Dec 5, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Pre-trained Vision-Language Models (VLMs) exhibit strong generalization capabilities, enabling them to recognize a wide range of objects across diverse domains without additional training. However, they often retain irrelevant information beyond the requirements of specific target downstream tasks, raising concerns about computational efficiency and potential information leakage. This has motivated growing interest in approximate unlearning, which aims to selectively remove unnecessary knowledge while preserving overall model performance. Existing approaches to approximate unlearning have primarily focused on {\em class unlearning}, where a VLM is retrained to fail to recognize specified object classes while maintaining accuracy for others. However, merely forgetting object classes is often insufficient in practical applications. For instance, an autonomous driving system should accurately recognize {\em real} cars, while avoiding misrecognition of {\em illustrated} cars depicted in roadside advertisements as {\em real} cars, which could be hazardous. In this paper, we introduce {\em Approximate Domain Unlearning (ADU)}, a novel problem setting that requires reducing recognition accuracy for images from specified domains (e.g., {\em illustration}) while preserving accuracy for other domains (e.g., {\em real}). ADU presents new technical challenges: due to the strong domain generalization capability of pre-trained VLMs, domain distributions are highly entangled in the feature space, making naive approaches based on penalizing target domains ineffective. To tackle this limitation, we propose a novel approach that explicitly disentangles domain distributions and adaptively captures instance-specific domain information. Extensive experiments on four multi-domain benchmark datasets demonstrate that our approach significantly outperforms strong baselines built upon state-of-the-art VLM tuning techniques, paving the way for practical and fine-grained unlearning in VLMs. Code : https://kodaikawamura.github.io/Domain_Unlearning/.

View full details

Poster

Compositional Neural Network Verification via Assume-Guarantee Reasoning

Hai Duong ⋅ David Shriver ⋅ ThanhVu Nguyen ⋅ Matthew Dwyer

Dec 5, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Verifying the behavior of neural networks is necessary if developers are to confidently deploy them as parts of mission-critical systems. Toward this end, researchers have been actively developing a range of increasingly sophisticated and scalable neural network verifiers. However, scaling verification to large networks is challenging, at least in part due to the significant memory requirements of verification algorithms. In this paper, we propose an assume-guarantee compositional framework, CoVeNN, that is parameterized by an underlying verifier to generate a sequence of verification sub-problems to address this challenge. We present an iterative refinement-based strategy for computing assumptions that allow sub-problems to retain sufficient accuracy. An evaluation using 7 neural networks and a total of 140 property specifications demonstrates that CoVeNN can verify nearly 7 times more problems than state-of-the-art verifiers. CoVeNN is part of the NeuralSAT verification project: https://github.com/dynaroars/neuralsat.

View full details

Poster

Shortcut Features as Top Eigenfunctions of NTK: A Linear Neural Network Case and More

Jinwoo Lim ⋅ Suhyun Kim ⋅ Soo-Mook Moon

Dec 5, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

One of the chronic problems of deep-learning models is shortcut learning. In a case where the majority of training data are dominated by a certain feature, neural networks prefer to learn such a feature even if the feature is not generalizable outside the training set. Based on the framework of Neural Tangent Kernel (NTK), we analyzed the case of linear neural networks to derive some important properties of shortcut learning. We defined a “feature” of a neural network as an eigenfunction of NTK. Then, we found that shortcut features correspond to features with larger eigenvalues when the shortcuts stem from the imbalanced number of samples in the clustered distribution. We also showed that the features with larger eigenvalues still have a large influence on the neural network output even after training, due to data variances in the clusters. Such a preference for certain features remains even when a margin of a neural network output is controlled, which shows that the max-margin bias is not the only major reason for shortcut learning. These properties of linear neural networks are empirically extended for more complex neural networks as a two-layer ReLU FC network and a ResNet-18.

View full details

Poster

FPSAttention: Training-Aware FP8 and Sparsity Co-Design for Fast Video Diffusion

Akide Liu ⋅ Zeyu Zhang ⋅ Zhexin Li ⋅ Xuehai Bai ⋅ Yuanjie Xing ⋅ Yizeng Han ⋅ Jiasheng Tang ⋅ Jichao Wu ⋅ Mingyang Yang ⋅ Weihua Chen ⋅ Jiahao He ⋅ Yuanyu He ⋅ Fan Wang ⋅ Reza Haffari ⋅ Bohan Zhuang

Dec 5, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Diffusion generative models have become the standard for producing high-quality, coherent video content, yet their slow inference speeds and high computational demands hinder practical deployment. Although both quantization and sparsity can independently accelerate inference while maintaining generation quality, naively combining these techniques in existing training-free approaches leads to significant performance degradation, as they fail to achieve proper joint optimization. We introduce FPSAttention, a novel training-aware co-design of FP8 quantization and Sparsity for video generation, with a focus on the 3D bi-directional attention mechanism. Our approach features three key innovations: 1) A unified 3D tile-wise granularity that simultaneously supports both quantization and sparsity. 2) A denoising step-aware strategy that adapts to the noise schedule, addressing the strong correlation between quantization/sparsity errors and denoising steps. 3) A native, hardware-friendly kernel that leverages FlashAttention and is implemented with optimized Hopper architecture features, enabling highly efficient execution. Trained on Wan2.1's 1.3B and 14B models and evaluated on the vBench benchmark, FPSAttention achieves a 7.09$\times$ kernel speedup for attention operations and a 4.96$\times$ end-to-end speedup for video generation compared to the BF16 baseline at 720p resolution—without sacrificing generation quality.

View full details

Poster

Long-Tailed Recognition via Information-Preservable Two-Stage Learning

Fudong Lin ⋅ Xu Yuan

Dec 5, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

The imbalance (or long-tail) is the nature of many real-world data distributions, which often induces the undesirable bias of deep classification models toward frequent classes, resulting in poor performance for tail classes. In this paper, we propose a novel two-stage learning approach to mitigate such a majority-biased tendency while preserving valuable information within datasets. Specifically, the first stage proposes a new representation learning technique from the information theory perspective. This approach is theoretically equivalent to minimizing intra-class distance, yielding an effective and well-separated feature space. The second stage develops a novel sampling strategy that selects mathematically informative instances, able to rectify majority-biased decision boundaries without compromising a model’s overall performance. As a result, our approach achieves the state-of-the-art performance across various long-tailed benchmark datasets, validated via extensive experiments. Our code is available at https://github.com/fudong03/BNS_IPDPP.

View full details

Poster

Optimal and Provable Calibration in High-Dimensional Binary Classification: Angular Calibration and Platt Scaling

Yufan Li ⋅ Pragya Sur

Dec 5, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

We study the fundamental problem of calibrating a linear binary classifier of the form \(\sigma(\hat{w}^\top x)\), where the feature vector \(x\) is Gaussian, \(\sigma\) is a link function, and \(\hat{w}\) is an estimator of the true linear weight $w^\star$. By interpolating with a noninformative \emph{chance classifier}, we construct a well-calibrated predictor whose interpolation weight depends on the angle \(\angle(\hat{w}, w_\star)\) between the estimator \(\hat{w}\) and the true linear weight \(w_\star\). We establish that this angular calibration approach is provably well-calibrated in a high-dimensional regime where the number of samples and features both diverge, at a comparable rate. The angle \(\angle(\hat{w}, w_\star)\) can be consistently estimated. Furthermore, the resulting predictor is uniquely \emph{Bregman-optimal}, minimizing the Bregman divergence to the true label distribution within a suitable class of calibrated predictors. Our work is the first to provide a calibration strategy that satisfies both calibration and optimality properties provably in high dimensions. Additionally, we identify conditions under which a classical Platt-scaling predictor converges to our Bregman-optimal calibrated solution. Thus, Platt-scaling also inherits these desirable properties provably in high dimensions.

View full details

Poster

GSRF: Complex-Valued 3D Gaussian Splatting for Efficient Radio-Frequency Data Synthesis

Kang Yang ⋅ Gaofeng Dong ⋅ Sijie Ji ⋅ Wan Du ⋅ Mani Srivastava

Dec 5, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Synthesizing radio-frequency (RF) data given the transmitter and receiver positions, e.g., received signal strength indicator (RSSI), is critical for wireless networking and sensing applications, such as indoor localization. However, it remains challenging due to complex propagation interactions, including reflection, diffraction, and scattering. State-of-the-art neural radiance field (NeRF)-based methods achieve high-fidelity RF data synthesis but are limited by long training times and high inference latency. We introduce GSRF, a framework that extends 3D Gaussian Splatting (3DGS) from the optical domain to the RF domain, enabling efficient RF data synthesis. GSRF realizes this adaptation through three key innovations: First, it introduces complex-valued 3D Gaussians with a hybrid Fourier–Legendre basis to model directional and phase-dependent radiance. Second, it employs orthographic splatting for efficient ray–Gaussian intersection identification. Third, it incorporates a complex-valued ray tracing algorithm, executed on RF-customized CUDA kernels and grounded in wavefront propagation principles, to synthesize RF data in real time. Evaluated across various RF technologies, GSRF preserves high-fidelity RF data synthesis while achieving significant improvements in training efficiency, shorter training time, and reduced inference latency.

View full details

Poster

Causal Differentiating Concepts: Interpreting LM Behavior via Causal Representation Learning

Navita Goyal ⋅ Hal Daumé ⋅ Alexandre Drouin ⋅ Dhanya Sridhar

Dec 5, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Language model activations entangle concepts that mediate their behavior, making it difficult to interpret these factors, which has implications for generalizability and robustness. We introduce an approach for disentangling these concepts without supervision. Existing methods for concept discovery often rely on external labels, contrastive prompts, or known causal structures, which limits their scalability and biases them toward predefined, easily annotatable features. In contrast, we propose a new unsupervised algorithm that identifies causal differentiating concepts—interpretable latent directions in LM activations that must be changed to elicit a different model behavior. These concepts are discovered using a constrained contrastive learning objective, guided by the insight that eliciting a target behavior requires only sparse changes to the underlying concepts. We formalize this notion and show that, under a particular assumption about the sparsity of these causal differentiating concepts, our method learns disentangled representations that align with human-interpretable factors influencing LM decisions. We empirically show the ability of our method to recover ground-truth causal factors in synthetic and semi-synthetic settings. Additionally, we illustrate the utility of our method through a case study on refusal behavior in language models. Our approach offers a scalable and interpretable lens into the internal workings of LMs, providing a principled foundation for interpreting language model behavior.

View full details

Poster

Cloud4D: Estimating Cloud Properties at a High Spatial and Temporal Resolution

Jacob Lin ⋅ Edward Gryspeerdt ⋅ Ronald Clark

Dec 5, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

There has been great progress in improving numerical weather prediction and climate models using machine learning. However, most global models act at a kilometer-scale, making it challenging to model individual clouds and factors such as extreme precipitation, wind gusts, turbulence, and surface irradiance. Therefore, there is a need to move towards higher-resolution models, which in turn require high-resolution real-world observations that current instruments struggle to obtain. We present Cloud4D, the first learning-based framework that reconstructs a physically consistent, four–dimensional cloud state using only synchronized ground‐based cameras. Leveraging a homography-guided 2D‐to‐3D transformer, Cloud4D infers the full 3D distribution of liquid water content at 25 m spatial and 5 s temporal resolution. By tracking the 3D liquid water content retrievals over time, Cloud4D additionally estimates horizontal wind vectors. Across a two-month deployment comprising six skyward cameras, our system delivers an order-of-magnitude improvement in space-time resolution relative to state-of-the-art satellite measurements, while retaining single-digit relative error ($<10\\%$) against collocated radar measurements.

View full details

Poster

LeMiCa: Lexicographic Minimax Path Caching for Efficient Diffusion-Based Video Generation

Huanlin Gao ⋅ Ping Chen ⋅ Fuyuan Shi ⋅ Chao Tan ⋅ Zhaoxiang Liu ⋅ Fang Zhao ⋅ Kai Wang ⋅ Shiguo Lian

Dec 5, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

We present LeMiCa, a training-free and efficient acceleration framework for diffusion-based video generation. While existing caching strategies primarily focus on reducing local heuristic errors, they often overlook the accumulation of global errors, leading to noticeable content degradation between accelerated and original videos. To address this issue, we formulate cache scheduling as a directed graph with error-weighted edges and introduce a Lexicographic Minimax Path Optimization strategy that explicitly bounds the worst-case path error. This approach substantially improves the consistency of global content and style across generated frames. Extensive experiments on multiple text-to-video benchmarks demonstrate that LeMiCa delivers dual improvements in both inference speed and generation quality. Notably, our method achieves a 2.9× speedup on the Latte model and reaches an LPIPS score of 0.05 on Open-Sora, outperforming prior caching techniques. Importantly, these gains come with minimal perceptual quality degradation, making LeMiCa a robust and generalizable paradigm for accelerating diffusion-based video generation. We believe this approach can serve as a strong foundation for future research on efficient and reliable video synthesis.

View full details

Poster

When Data Can't Meet: Estimating Correlation Across Privacy Barriers

Abhinav Chakraborty ⋅ Arnab Auddy ⋅ T. Tony Cai

Dec 5, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

We consider the problem of estimating the correlation of two random variables $X$ and $Y$, where the pairs $(X,Y)$ are not observed together, but are instead separated co-ordinate-wise at two servers: server 1 contains all the $X$ observations, and server 2 contains the corresponding $Y$ observations. In this vertically distributed setting, we assume that each server has its own privacy constraints, owing to which they can only share suitably privatized statistics of their own component observations. We consider differing privacy budgets $(\varepsilon_1,\delta_1)$ and $(\varepsilon_2,\delta_2)$ for the two servers and determine the minimax optimal rates for correlation estimation allowing for both non-interactive and interactive mechanisms. We also provide correlation estimators that achieve these rates and further develop inference procedures, namely, confidence intervals, for the estimated correlations. Our results are characterized by an interesting rate in terms of the sample size $n$, $\varepsilon_1$, $\varepsilon_2$, which is strictly slower than the usual central privacy estimation rates. More interestingly, we find that the interactive mechanism is always better than its non-interactive counterpart whenever the two privacy budgets are different. Results from extensive numerical experiments support our theoretical findings.

View full details

Poster

Functional Scaling Laws in Kernel Regression: Loss Dynamics and Learning Rate Schedules

Binghui Li ⋅ Fengling Chen ⋅ Zixun Huang ⋅ Lean Wang ⋅ Lei Wu

Dec 5, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Scaling laws have emerged as a unifying lens for understanding and guiding the training of large language models (LLMs). However, existing studies predominantly focus on the final-step loss, leaving open whether the entire $\textit{loss dynamics}$ obey similar laws and, crucially, how the $\textit{learning rate schedule}$ (LRS) shapes them. We address these gaps in a controlled theoretical setting by analyzing stochastic gradient descent (SGD) on a power-law kernel regression model. The key insight is a novel $\textbf{intrinsic-time}$ viewpoint, which captures the training progress more faithfully than iteration count. We then establish a $\textbf{Functional Scaling Law (FSL)}$ that captures the full loss trajectory under arbitrary LRSs, with the schedule’s influence entering through a simple convolutional functional. We further instantiate the theory for three representative LRSs---constant, exponential decay, and warmup–stable–decay (WSD)---and derive explicit scaling relations in both data- and compute-limited regimes. These comparisons explain key empirical phenomena: (i) higher-capacity models are more data- and compute-efficient; (ii) learning-rate decay improves training efficiency; and (iii) WSD-type schedules outperform pure decay. Finally, experiments on LLMs ranging from 0.1B to 1B parameters demonstrate the practical relevance of FSL as a surrogate model for fitting and predicting loss trajectories in large-scale pre-training.

View full details

Poster

PiKE: Adaptive Data Mixing for Large-Scale Multi-Task Learning Under Low Gradient Conflicts

Zeman Li ⋅ Yuan Deng ⋅ Peilin Zhong ⋅ Meisam Razaviyayn ⋅ Vahab Mirrokni

Dec 5, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Modern foundation models are trained on diverse datasets to enhance generalization across tasks and domains. A central challenge in this process is determining how to effectively mix and sample data from multiple sources. This naturally leads to a multi-task learning (MTL) perspective. While prior work in MTL has emphasized mitigating gradient conflicts, we observe that large-scale pretraining scenarios—such as multilingual or multi-domain training—often exhibit little to no gradient conflict. Motivated by this observation, we propose $\textbf{PiKE}$ ($\textbf{P}$ositive gradient $\textbf{i}$nteraction-based $\textbf{K}$-task weights $\textbf{E}$stimator), an adaptive data mixing algorithm that dynamically adjusts sampling weights during training. PiKE exploits non-conflicting gradient interactions to minimize a near-tight upper bound on the average loss decrease at each step, while incurring negligible computational overhead. We provide theoretical convergence guarantees and show that PiKE outperforms static and non-adaptive mixing baselines. Furthermore, we extend PiKE to promote balanced learning across tasks. Extensive experiments on large-scale language model pretraining confirm that PiKE achieves faster convergence and improved downstream performance compared to existing approaches.

View full details

Poster

The Fragile Truth of Saliency: Improving LLM Input Attribution via Attention Bias Optimization

Yihua Zhang ⋅ Changsheng Wang ⋅ Yiwei Chen ⋅ Chongyu Fan ⋅ Jinghan Jia ⋅ Sijia Liu

Dec 5, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Input saliency aims to quantify the influence of input tokens on the output of large language models (LLMs), which has been widely used for prompt engineering, model interpretability, and behavior attribution. Despite the proliferation of saliency techniques, the field lacks a standardized and rigorous evaluation protocol. In this work, we introduce a stress-testing framework inspired by the needle-in-a-haystack (NIAH) setting to systematically assess the reliability of seven popular input saliency methods. Our evaluation reveals a surprising and critical flaw: existing methods consistently assign non-trivial importance to irrelevant context, and this attribution error worsens as input length increases. To address this issue, we propose a novel saliency method based on Attention Bias Optimization (ours), which explicitly optimizes the attention bias associated with each input token to quantify its causal impact on target token generation. ABO robustly outperforms existing methods by 10\sim30% in saliency accuracy across diverse NIAH tasks, maintains effectiveness up to 10K-token prompts, and enables practical applications including zero-shot detoxification, sentiment steering, and reasoning-error correction. Our findings highlight the limitations of prevalent attribution methods and establish ABO as a principled alternative for accurate token attribution.

View full details

Poster

On Universality Classes of Equivariant Networks

Marco Pacini ⋅ Gabriele Santin ⋅ Bruno Lepri ⋅ Shubhendu Trivedi

Dec 5, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Equivariant neural networks provide a principled framework for incorporating symmetry into learning architectures and have been extensively analyzed through the lens of their *separation power*, that is, the ability to distinguish inputs modulo symmetry. This notion plays a central role in settings such as graph learning, where it is often formalized via the Weisfeilern–Leman hierarchy. In contrast, the *universality* of equivariant models—their capacity to approximate target functions—remains comparatively underexplored. In this work, we investigate the approximation power of equivariant neural networks beyond separation constraints. We show that separation power does not fully capture expressivity: models with identical separation power may differ in their approximation ability. To demonstrate this, we characterize the universality classes of shallow invariant networks, providing a general framework for understanding which functions these architectures can approximate. Since equivariant models reduce to invariant ones under projection, this analysis yields sufficient conditions under which shallow equivariant networks fail to be universal. Conversely, we identify settings where shallow models do achieve separation-constrained universality. These positive results, however, depend critically on structural properties of the symmetry group, such as the existence of adequate normal subgroups, which may not hold in important cases like permutation symmetry.

View full details

Poster

The Computational Advantage of Depth in Learning High-Dimensional Hierarchical Targets

Yatin Dandi ⋅ Luca Pesce ⋅ Lenka Zdeborová ⋅ Florent Krzakala

Dec 5, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Understanding the advantages of deep neural networks trained by gradient descent (GD) compared to shallow models remains an open theoretical challenge. In this paper, we introduce a class of target functions (single and multi-index Gaussian hierarchical targets) that incorporate a hierarchy of latent subspace dimensionalities. This framework enables us to analytically study the learning dynamics and generalization performance of deep networks compared to shallow ones in the high-dimensional limit. Specifically, our main theorem shows that feature learning with GD successively reduces the effective dimensionality, transforming a high-dimensional problem into a sequence of lower-dimensional ones. This enables learning the target function with drastically less samples than with shallow networks. While the results are proven in a controlled training setting, we also discuss more common training procedures and argue that they learn through the same mechanisms. These findings open the way to further quantitative studies of the crucial role of depth in learning hierarchical structures with deep networks.

View full details

Poster

Towards a Pairwise Ranking Model with Orderliness and Monotonicity for Label Enhancement

Yunan Lu ⋅ Xixi Zhang ⋅ Yaojin Lin ⋅ Weiwei Li ⋅ Lei Yang ⋅ Xiuyi Jia

Dec 5, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Label distribution in recent years has been applied in a diverse array of complex decision-making tasks. To address the availability of label distributions, label enhancement has been established as an effective learning paradigm that aims to automatically infer label distributions from readily available multi-label data, e.g., logical labels. Recently, numerous works have demonstrated that the label ranking is significantly beneficial to label enhancement. However, these works still exhibit deficiencies in representing the probabilistic relationships between label distribution and label rankings, or fail to accommodate scenarios where multiple labels are equally important for a given instance. Therefore, we propose PROM, a pairwise ranking model with orderliness and monotonicity, to explain the probabilistic relationship between label distributions and label rankings. Specifically, we propose the monotonicity and orderliness assumptions for the probabilities of different ranking relationships and derive the mass functions for PROM, which are theoretically ensured to preserve the monotonicity and orderliness. Further, we propose a generative label enhancement algorithm based on PROM, which directly learns a label distribution predictor from the readily available multi-label data. Finally, extensive experiments demonstrate the efficacy of our proposed model.

View full details

Poster

Set Smoothness Unlocks Clarke Hyper-stationarity in Bilevel Optimization

He Chen ⋅ Jiajin Li ⋅ Anthony Man-Cho So

Dec 5, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Solving bilevel optimization (BLO) problems to global optimality is generally intractable. A common surrogate is to compute a hyper-stationary point—a stationary point of the hyper-objective function obtained by minimizing or maximizing the upper-level objective over the lower-level solution set. Existing methods, however, either provide weak notions of stationarity or require restrictive assumptions to guarantee the smoothness of hyper-objective functions. In this paper, we eliminate these impractical assumptions and show that strong (Clarke) hyper-stationarity remains computable even when the hyper-objective is nonsmooth. Our key ingredient is a new structural property, called set smoothness, which captures the variational dependence of the lower-level solution set on the upper-level variable. We prove that this property holds for a broad class of BLO problems and ensures weak convexity (resp. concavity) of pessimistic (resp. optimistic) hyper-objective functions. Building on this foundation, we show that a zeroth-order algorithm that computes approximate Clarke hyper-stationary points with non-asymptotic convergence guarantees. To the best of our knowledge, this is the first computational guarantee for Clarke-type stationarity in nonsmooth BLO. Beyond this specific application, the set smoothness property emerges as a structural concept of independent interest, with potential to inform the analysis of broader classes of optimization and variational problems.

View full details

Poster

CoRe: Benchmarking LLMs’ Code Reasoning Capabilities through Static Analysis Tasks

Danning Xie ⋅ Mingwei Zheng ⋅ Xuwei Liu ⋅ Jiannan Wang ⋅ Chengpeng Wang ⋅ Lin Tan ⋅ Xiangyu Zhang

Dec 5, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Large language models (LLMs) have been widely adopted across diverse domains of software engineering, such as code generation, program repair, and vulnerability detection. These applications require understanding beyond surface-level code patterns: value propagation, control flow, and interdependence between program elements. However, existing benchmarks primarily evaluate end-to-end outcomes, such as whether code is correctly repaired or generated, leaving the models' ability of program semantic reasoning underexplored.This work presents CoRe, a high-quality, human-verified benchmark designed to evaluate LLMs on fundamental static analysis tasks. CoRe includes 12,553 task instances spanning data dependency, control dependency, and information flow across programs written in C/C++, Java, and Python. To ensure semantic diversity and reasoning complexity, we propose a semantics-aware diverse sampling strategy that selects targets and task instances based on structural coverage and dependency depth. We evaluate 10 state-of-the-art LLMs and show that, while they perform well at identifying dependencies, models still struggle with tasks that require deeper semantic understanding and multi-step reasoning.We further conduct qualitative analyses to uncover key challenges, such as complex control structures and backward dependency patterns, offering insights into improving LLMs’ code reasoning capabilities.

View full details

Poster

QSVD: Efficient Low-rank Approximation for Unified Query-Key-Value Weight Compression in Low-Precision Vision-Language Models

Yutong Wang ⋅ Haiyu Wang ⋅ Sai Qian Zhang

Dec 5, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Vision-Language Models (VLMs) are integral to tasks such as image captioning and visual question answering, but their high computational cost, driven by large memory footprints and processing time, limits their scalability and real-time applicability. In this work, we propose leveraging Singular-Value Decomposition (SVD) over the joint query (Q), key (K), and value (V) weight matrices to reduce KV cache size and computational overhead. We in addition introduce an efficient rank allocation strategy that dynamically adjusts the SVD rank based on its impact on VLM accuracy, achieving a significant reduction in both memory usage and computational cost. Finally, we extend this approach by applying quantization to both VLM weights and activations, resulting in a highly efficient VLM. Our method outperforms previous approaches that rely solely on quantization or SVD by achieving more than $10$% accuracy improvement while consuming less hardware cost, making it better for real-time deployment on resource-constrained devices. We open source our code at https://github.com/SAI-Lab-NYU/QSVD.

View full details

Poster

Policy Compatible Skill Incremental Learning via Lazy Learning Interface

Daehee Lee ⋅ Dongsu Lee ⋅ TaeYoon Kwack ⋅ Wonje Choi ⋅ Honguk Woo

Dec 5, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Skill Incremental Learning (SIL) is the process by which an embodied agent expands and refines its skill set over time by leveraging experience gained through interaction with its environment or by the integration of additional data. SIL facilitates efficient acquisition of hierarchical policies grounded in reusable skills for downstream tasks. However, as the skill repertoire evolves, it can disrupt compatibility with existing skill-based policies, limiting their reusability and generalization. In this work, we propose SIL-C, a novel framework that ensures skill-policy compatibility, allowing improvements in incrementally learned skills to enhance the performance of downstream policies without requiring policy re-training or structural adaptation. SIL-C employs a bilateral lazy learning-based mapping technique to dynamically align the subtask space referenced by policies with the skill space decoded into agent behaviors. This enables each subtask, derived from the policy's decomposition of a complex task, to be executed by selecting an appropriate skill based on trajectory distribution similarity. We evaluate SIL-C across diverse SIL scenarios and demonstrate that it maintains compatibility between evolving skills and downstream policies while ensuring efficiency throughout the learning process.

View full details

Poster

When Thinking Fails: The Pitfalls of Reasoning for Instruction-Following in LLMs

Xiaomin Li ⋅ Zhou Yu ⋅ Zhiwei Zhang ⋅ Xupeng Chen ⋅ Ziji Zhang ⋅ Yingying Zhuang ⋅ Narayanan Sadagopan ⋅ Anurag Beniwal

Dec 5, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Reasoning-enhanced large language models (RLLMs), whether explicitly trained for reasoning or prompted via chain-of-thought (CoT), have achieved state-of-the-art performance on many complex reasoning tasks. However, we uncover a surprising and previously overlooked phenomenon: explicit CoT reasoning can significantly degrade instruction-following accuracy. Evaluating 20+ models on two benchmarks: IFEval (with simple, rule-verifiable constraints) and ComplexBench (with complex, compositional constraints), we consistently observe performance drops when CoT prompting is applied. Through large-scale case studies and an attention-based analysis, we identify common patterns where reasoning either helps (e.g., with formatting or lexical precision) or hurts (e.g., by neglecting simple constraints or introducing unnecessary content). We propose a metric, constraint attention, to quantify model focus during generation and show that CoT reasoning often diverts attention away from instruction-relevant tokens. To mitigate these effects, we introduce and evaluate four strategies: in-context learning, self-reflection, self-selective reasoning, and classifier-selective reasoning. Our results demonstrate that selective reasoning strategies, particularly classifier-selective reasoning, can substantially recover lost performance. To our knowledge, this is the first work to systematically expose reasoning-induced failures in instruction-following and offer practical mitigation strategies.

View full details

Poster

Towards Reliable Code-as-Policies: A Neuro-Symbolic Framework for Embodied Task Planning

Sanghyun Ahn ⋅ Wonje Choi ⋅ Junyong Lee ⋅ Jinwoo Park ⋅ Honguk Woo

Dec 5, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Recent advances in large language models (LLMs) have enabled the automatic generation of executable code for task planning and control in embodied agents such as robots, demonstrating the potential of LLM-based embodied intelligence. However, these LLM-based code-as-policies approaches often suffer from limited environmental grounding, particularly in dynamic or partially observable settings, leading to suboptimal task success rates due to incorrect or incomplete code generation. In this work, we propose a neuro-symbolic embodied task planning framework that incorporates explicit symbolic verification and interactive validation processes during code generation. In the validation phase, the framework generates exploratory code that actively interacts with the environment to acquire missing observations while preserving task-relevant states. This integrated process enhances the grounding of generated code, resulting in improved task reliability and success rates in complex environments. We evaluate our framework on RLBench and in real-world settings across dynamic, partially observable scenarios. Experimental results demonstrate that our framework improves task success rates by 46.2\% over Code as Policies baselines and attains over 86.8\% executability of task-relevant actions, thereby enhancing the reliability of task planning in dynamic environments.

View full details

Poster

A Principled Approach to Randomized Selection under Uncertainty: Applications to Peer Review and Grant Funding

Alexander Goldberg ⋅ Giulia Fanti ⋅ Nihar Shah

Dec 5, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Many decision-making processes involve evaluating and selecting items, including scientific peer review, job hiring, school admissions, and investment decisions. These domains feature error-prone evaluations and uncertainty about outcomes, which undermine deterministic selection rules. Consequently, randomized selection mechanisms are gaining traction. However, current randomized approaches are ad hoc and, as we prove, inappropriate for their purported objectives. We propose a principled framework for randomized decision-making based on interval estimates of item quality. We introduce MERIT (Maximin Efficient Randomized Interval Top-$k$), which maximizes the worst-case expected number of top candidates selected under uncertainty represented by overlapping intervals. MERIT provides optimal resource allocation under an interpretable robustness notion. We develop a polynomial-time, practically efficient algorithm and prove our approach satisfies desirable axiomatic properties not guaranteed by existing methods. Experiments on synthetic peer review data from grant funding and conferences demonstrate that MERIT matches existing algorithms' expected utility under fully probabilistic models while outperforming them under our worst-case formulation.

View full details

Poster

Virus Infection Attack on LLMs: Your Poisoning Can Spread "VIA" Synthetic Data

Zi Liang ⋅ Qingqing Ye ⋅ Xuan Liu ⋅ Yanyun Wang ⋅ Jianliang Xu ⋅ Haibo Hu

Dec 5, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Synthetic data refers to artificial samples generated by models. While it has been validated to significantly enhance the performance of large language models (LLMs) during training and has been widely adopted in LLM development, potential security risks it may introduce remain uninvestigated. This paper systematically evaluates the resilience of synthetic-data-integrated training paradigm for LLMs against mainstream poisoning and backdoor attacks. We reveal that such a paradigm exhibits strong resistance to existing attacks, primarily thanks to the different distribution patterns between poisoning data and queries used to generate synthetic samples. To enhance the effectiveness of these attacks and further investigate the security risks introduced by synthetic data, we introduce a novel and universal attack framework, namely, Virus Infection Attack (VIA), which enables the propagation of current attacks through synthetic data even under purely clean queries. Inspired by the principles of virus design in cybersecurity, VIA conceals the poisoning payload within a protective “shell” and strategically searches for optimal hijacking points in benign samples to maximize the likelihood of generating malicious content. Extensive experiments on both data poisoning and backdoor attacks show that VIA significantly increases the presence of poisoning content in synthetic data and correspondingly raises the attack success rate (ASR) on downstream models to levels comparable to those observed in the poisoned upstream models.

View full details

Poster

AnaCP: Toward Upper-Bound Continual Learning via Analytic Contrastive Projection

Saleh Momeni ⋅ Changnan Xiao ⋅ Bing Liu

Dec 5, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

This paper studies the problem of class-incremental learning (CIL), a core setting within continual learning where a model learns a sequence of tasks, each containing a distinct set of classes. Traditional CIL methods, which do not leverage pre-trained models (PTMs), suffer from catastrophic forgetting (CF) due to the need to incrementally learn both feature representations and the classifier. The integration of PTMs into CIL has recently led to efficient approaches that treat the PTM as a fixed feature extractor combined with analytic classifiers, achieving state-of-the-art performance. However, they still face a major limitation: the inability to continually adapt feature representations to best suit the CIL tasks, leading to suboptimal performance. To address this, we propose AnaCP (Analytic Contrastive Projection), a novel method that preserves the efficiency of analytic classifiers while enabling incremental feature adaptation without gradient-based training, thereby eliminating the CF caused by gradient updates. Our experiments show that AnaCP not only outperforms existing baselines but also achieves the accuracy level of joint training, which is regarded as the upper bound of CIL.

View full details

Poster

SpecMER: Fast Protein Generation with K-mer Guided Speculative Decoding

Thomas Walton ⋅ Darin Tsui ⋅ Aryan Musharaf ⋅ Amirali Aghazadeh

Dec 5, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Autoregressive models have transformed protein engineering by enabling the generation of novel protein sequences beyond those found in nature. However, their sequential inference introduces significant latency, limiting their utility in high-throughput protein screening. Speculative decoding accelerates generation by employing a lightweight draft model to sample tokens, which a larger target model then verifies and refines. Yet in protein sequence generation, draft models are typically agnostic to the structural and functional constraints of the target protein, leading to biologically implausible outputs and a shift in the likelihood distribution of generated sequences. We introduce SpecMER (Speculative Decoding via k-mer Guidance), a novel framework that incorporates biological, structural, and functional priors using k-mer motifs extracted from multiple sequence alignments. By scoring candidate sequences in parallel and selecting those most consistent with known biological patterns, SpecMER significantly improves sequence plausibility while retaining the efficiency of speculative decoding. SpecMER achieves 24–32% speedup over standard autoregressive decoding, along with higher acceptance rates and improved sequence likelihoods.

View full details

Poster

AGENTIF: Benchmarking Large Language Models Instruction Following Ability in Agentic Scenarios

Yunjia Qi ⋅ Hao Peng ⋅ Xiaozhi Wang ⋅ Amy Xin ⋅ Youfeng Liu ⋅ Bin Xu ⋅ Lei Hou ⋅ Juanzi Li

Dec 5, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Large Language Models (LLMs) have demonstrated advanced capabilities in real-world agentic applications. Growing research efforts aim to develop LLM-based agents to address practical demands, introducing a new challenge: agentic scenarios often involve lengthy instructions with complex constraints, such as extended system prompts and detailed tool specifications. While adherence to such instructions is crucial for agentic applications, whether LLMs can reliably follow them remains underexplored. In this paper, we introduce AgentIF, the first benchmark for systematically evaluating LLM instruction following ability in agentic scenarios. AgentIF features three key characteristics: (1) Realistic, constructed from $50$ real-world agentic applications. (2) Long, averaging $1,723$ words with a maximum of $15,630$ words. (3) Complex, averaging $11.9$ constraints per instruction, covering diverse constraint types, such as tool specifications and condition constraints.To construct AgentIF, we collect $707$ human-annotated instructions across $50$ agentic tasks from industrial application agents and open-source agentic systems. For each instruction, we annotate the associated constraints and corresponding evaluation metrics, including code-based evaluation, LLM-based evaluation, and hybrid code-LLM evaluation.We use AgentIF to systematically evaluate existing advanced LLMs. We observe that current models generally perform poorly, especially in handling complex constraint structures and tool specifications. We further conduct error analysis and analytical experiments on instruction length and meta constraints, providing some findings about the failure modes of existing LLMs. We have released the code and data to facilitate future research.

View full details

Poster

Purifying Approximate Differential Privacy with Randomized Post-processing

Yingyu Lin ⋅ Erchi Wang ⋅ Yian Ma ⋅ Yu-Xiang Wang

Dec 5, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

We propose a framework to convert $(\varepsilon, \delta)$-approximate Differential Privacy (DP) mechanisms into $(\varepsilon', 0)$-pure DP mechanisms under certain conditions, a process we call ``purification.'' This algorithmic technique leverages randomized post-processing with calibrated noise to eliminate the $\delta$ parameter while achieving near-optimal privacy-utility tradeoff for pure DP. It enables a new design strategy for pure DP algorithms: first run an approximate DP algorithm with certain conditions, and then purify. This approach allows one to leverage techniques such as strong composition and propose-test-release that require $\delta>0$ in designing pure-DP methods with $\delta=0$. We apply this framework in various settings, including Differentially Private Empirical Risk Minimization (DP-ERM), stability-based release, and query release tasks. To the best of our knowledge, this is the first work with a statistically and computationally efficient reduction from approximate DP to pure DP. Finally, we illustrate the use of this reduction for proving lower bounds under approximate DP constraints with explicit dependence in $\delta$, avoiding the sophisticated fingerprinting code construction.

View full details

Poster

Improving Bilinear RNN with Closed-loop Control

Jiaxi Hu ⋅ Yongqi Pan ⋅ Jusen Du ⋅ Disen Lan ⋅ Tang ⋅ Qingsong Wen ⋅ Yuxuan Liang ⋅ Weigao Sun

Dec 5, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Recent efficient sequence modeling methods, such as Gated DeltaNet, TTT, and RWKV-7, have achieved performance improvements by supervising the recurrent memory management through the Delta learning rule. Unlike previous state-space models (e.g., Mamba) and gated linear attentions (e.g., GLA), these models introduce interactions between the recurrent state and the key vector, resulting in a bilinear recursive structure. In this paper, we first introduce the concept of Bilinear RNNs with a comprehensive analysis on the advantages and limitations of these models. Then based on the closed-loop control theory, we propose a novel Bilinear RNN variant named Comba, which adopts a scalar-plus-low-rank state transition, with both state feedback and output feedback corrections. We also implement a hardware-efficient chunk-wise parallel kernel in Triton and train models with 340M/1.3B parameters on a large-scale corpus. Comba demonstrates its superior performance and computation efficiency on both language modeling and vision tasks.

View full details

Poster

Corporate Needs You to Find the Difference: Revisiting Submodular and Supermodular Ratio Optimization Problems

Elfarouk Harb ⋅ Yousef Yassin ⋅ Chandra Chekuri

Dec 5, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

We consider the following question: given a submodular or supermodular set function $f:2^V \to \mathbb{R}$, how should one minimize or maximize its average value $f(S)/|S|$ over non-empty subsets $S\subseteq V$? This problem generalizes several well-known objectives including Densest Subgraph (DSG), Densest Supermodular Set (DSS), and Submodular Function Minimization (SFM). Motivated by recent applications [39, 31], we formalize two new broad problems: the Unrestricted Sparsest Submodular Set (USSS) and Unrestricted Densest Supermodular Set (UDSS) which allow negative and non-monotone functions. Using classical results we observe that DSS, SFM, USSS, UDSS, and MNP are all equivalent under strongly polynomial-time reductions. This equivalence enables algorithmic cross-over: methods designed for one problem can be repurposed to solve others efficiently. In particular we use the perspective of the minimum norm point in the base polyhedron of a sub/supermodular function which, via Fujishige's results, yields the dense decomposition as a byproduct. Via this perspective we show that a recent converging heuristic for DSS, \textsc{SuperGreedy++} [15, 29], and Wolfe’s minimum norm point algorithm are both universal solvers for all of these problems. On the theoretical front, we explain the observation made in recent work [39, 31] that \textsc{SuperGreedy++} appears to work well even in settings beyond DSS. Surprisingly, we also show that this simple algorithm can be used for Submodular Function Minimization, including for example that it can act as an effective minimum $st$ cut algorithm. On the empirical front, we explore the utility of several different algorithms including Fujishige-Wolfe min-norm point algorithm for recent problems. We conduct over 400 experiments across seven problem types and large-scale synthetic and real-world datasets (up to $\approx 100$ million edges). Our results reveal that methods historically considered inefficient, such as convex-programming methods, flow-based solvers, and Fujishige-Wolfe’s algorithm, outperform state-of-the-art task-specific baselines by orders of magnitude on concrete problems like HNSN [39]. These findings challenge prevailing assumptions and demonstrate that with the right framing, general optimization algorithms can be both scalable and state-of-the-art for supermodular and submodular ratio problems.

View full details

Poster

SmallKV: Small Model Assisted Compensation of KV Cache Compression for Efficient LLM Inference

Yi Zhao ⋅ Yajuan Peng ⋅ Nguyen Cam-Tu ⋅ Zuchao Li ⋅ Xiaoliang Wang ⋅ Hai Zhao ⋅ Xiaoming Fu

Dec 5, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

KV cache eviction has emerged as an effective solution to alleviate resource constraints faced by LLMs in long-context scenarios. However, existing token-level eviction methods often overlook two critical aspects: (1) their irreversible eviction strategy fails to adapt to dynamic attention patterns during decoding (the saliency shift problem), and (2) they treat both marginally important tokens and truly unimportant tokens uniformly, despite the collective significance of marginal tokens to model performance (the marginal information over-compression problem). To address these issues, we design two compensation mechanisms based on the high similarity of attention matrices between LLMs with different scales. We propose SmallKV, a small model assisted compensation method for KV cache compression. SmallKV can maintain attention matching between different-scale LLMs to: 1) assist the larger model in perceiving globally important information of attention; and 2) use the smaller model’s attention scores to approximate those of marginal tokens in the larger model. Extensive experiments on benchmarks including GSM8K, BBH, MT-Bench, and LongBench demonstrate the effectiveness of SmallKV. Moreover, efficiency evaluations show that SmallKV achieves 1.75 - 2.56 times higher throughput than baseline methods, highlighting its potential for efficient and performant LLM inference in resource constrained environments.

View full details

Poster

Towards Multi-Table Learning: A Novel Paradigm for Complementarity Quantification and Integration

Zhang Junyu ⋅ Lizhong Ding ⋅ MinghongZhang ⋅ Ye Yuan ⋅ Xingcan Li ⋅ Pengqi Li ⋅ Tihang Xi ⋅ Guoren Wang ⋅ Changsheng Li

Dec 5, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Multi-table data integrate various entities and attributes, with potential interconnections between them. However, existing tabular learning methods often struggle to describe and leverage the underlying complementarity across distinct tables. To address this limitation, we propose the first unified paradigm for multi-table learning that systematically quantifies and integrates complementary information across tables. Specifically, we introduce a metric called complementarity strength (CS), which captures inter-table complementarity by incorporating relevance, similarity, and informativeness. For the first time, we systematically formulate the paradigm towards multi-table learning by establishing formal definitions of tasks and loss functions. Correspondingly, we present a network for multi-table learning that combines Adaptive Table encoder and Cross table Attention mechanism (ATCA-Net), achieving the simultaneous integration of complementary information from distinct tables. Extensive experiments show that ATCA-Net effectively leverages complementary information and that the CS metric accurately quantifies the richness of complementarity across multiple tables. To the best of our knowledge, this is the first work to establish theoretical and practical foundations for multi-table learning.

View full details

Poster

Uni-LoRA: One Vector is All You Need

Kaiyang Li ⋅ Shaobo Han ⋅ Qing Su ⋅ Wei Li ⋅ Zhipeng Cai ⋅ Shihao Ji

Dec 5, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Low-Rank Adaptation (LoRA) has become the de facto parameter-efficient fine-tuning (PEFT) method for large language models (LLMs) by constraining weight updates to low-rank matrices. Recent works such as Tied-LoRA, VeRA, and VB-LoRA push efficiency further by introducing additional constraints to reduce the trainable parameter space. In this paper, we show that the parameter space reduction strategies employed by these LoRA variants can be formulated within a unified framework, Uni-LoRA, where the LoRA parameter space, flattened as a high-dimensional vector space R^D, can be reconstructed through a projection from a subspace R^d, with d << D. We demonstrate that the fundamental difference among various LoRA methods lies in the choice of the projection matrix, P ∈ R^{D×d}. Most existing LoRA variants rely on layer-wise or structure-specific projections that limit cross-layer parameter sharing, thereby compromising parameter efficiency. In light of this, we introduce an efficient and theoretically grounded projection matrix that is isometric, enabling global parameter sharing and reducing computation overhead. Furthermore, under the unified view of Uni-LoRA, this design requires only a single trainable vector to reconstruct LoRA parameters for the entire LLM -- making Uni-LoRA both a unified framework and a “one-vector-only” solution. Extensive experiments on GLUE, mathematical reasoning, and instruction tuning benchmarks demonstrate that Uni-LoRA achieves state-of-the-art parameter efficiency while outperforming or matching prior approaches in predictive performance.

View full details

Poster

Critical Batch Size Revisited: A Simple Empirical Approach to Large-Batch Language Model Training

Will Merrill ⋅ Shane Arora ⋅ Dirk Groeneveld ⋅ Hanna Hajishirzi

Dec 5, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

The right batch size is important when training language models at scale: a large batch size is necessary for fast training, but a batch size that is *too large* will harm token efficiency. To navigate this tradeoff, McCandlish et al. (2018) suggest that a *critical batch size* (CBS), below which training will not substantially degrade loss, can be estimated based on the gradient noise scale during training. While their method has been adopted in practice, e.g., when training GPT-3, strong assumptions are required to justify gradient noise as a proxy for the CBS, which makes it unclear whether their approach should be trusted in practice, limiting its applicability. In this paper, we introduce a simple, empirical approach to *directly* measure the CBS and show how the CBS evolves over training. Applying our approach to the OLMo models, we find that CBS is near 0 at initialization, increases rapidly at first, and then plateaus as training progresses. Furthermore, we find that this trend holds across different model sizes (1B and 7B), suggesting CBS from small training runs can inform larger-scale training runs. Our findings about how the CBS changes over training motivate *batch size warmup* as a natural way to reliably train language models at large batch size: start the batch size small and increase it as the CBS grows. To validate this claim, we use batch size warmup to train OLMo 1B to slightly better loss than the original training run with 43% fewer gradient steps. This shows how our framework can be applied to reliably train language models at larger batch sizes, increasing data parallelism without compromising performance.

View full details

Poster

CALM-PDE: Continuous and Adaptive Convolutions for Latent Space Modeling of Time-dependent PDEs

Jan Hagnberger ⋅ Daniel Musekamp ⋅ Mathias Niepert

Dec 5, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Solving time-dependent Partial Differential Equations (PDEs) using a densely discretized spatial domain is a fundamental problem in various scientific and engineering disciplines, including modeling climate phenomena and fluid dynamics. However, performing these computations directly in the physical space often incurs significant computational costs. To address this issue, several neural surrogate models have been developed that operate in a compressed latent space to solve the PDE. While these approaches reduce computational complexity, they often use Transformer-based attention mechanisms to handle irregularly sampled domains, resulting in increased memory consumption. In contrast, convolutional neural networks allow memory-efficient encoding and decoding but are limited to regular discretizations. Motivated by these considerations, we propose CALM-PDE, a model class that efficiently solves arbitrarily discretized PDEs in a compressed latent space. We introduce a novel continuous convolution-based encoder-decoder architecture that uses an epsilon-neighborhood-constrained kernel and learns to apply the convolution operator to adaptive and optimized query points. We demonstrate the effectiveness of CALM-PDE on a diverse set of PDEs with both regularly and irregularly sampled spatial domains. CALM-PDE is competitive with or outperforms existing baseline methods while offering significant improvements in memory and inference time efficiency compared to Transformer-based methods.

View full details

Poster

Geometry Meets Incentives: Sample-Efficient Incentivized Exploration with Linear Contexts

Ben Schiffer ⋅ Mark Sellke

Dec 5, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

In the incentivized exploration model, a principal aims to explore and learn over time by interacting with a sequence of self-interested agents. It has been recently understood that the main challenge in designing incentive-compatible algorithms for this problem is to gather a moderate amount of initial data, after which one can obtain near-optimal regret via posterior sampling. With high-dimensional contexts, however, this \emph{initial exploration} phase requires exponential sample complexity in some cases, which prevents efficient learning unless initial data can be acquired exogenously. We show that these barriers to exploration disappear under mild geometric conditions on the set of available actions, in which case incentive-compatibility does not preclude regret-optimality. Namely, we consider the linear bandit model with actions in the Euclidean unit ball, and give an incentive-compatible exploration algorithm with sample complexity that scales polynomially with the dimension and other parameters.

View full details

Poster

TrajMamba: An Efficient and Semantic-rich Vehicle Trajectory Pre-training Model

Yichen Liu ⋅ Yan Lin ⋅ Shengnan Guo ⋅ Zeyu Zhou ⋅ Youfang Lin ⋅ Huaiyu Wan

Dec 5, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Vehicle GPS trajectories record how vehicles move over time, storing valuable travel semantics, including movement patterns and travel purposes. Learning travel semantics effectively and efficiently is crucial for real-world applications of trajectory data, which is hindered by two major challenges. First, travel purposes are tied to the functions of the roads and points-of-interest (POIs) involved in a trip. Such information is encoded in textual addresses and descriptions and introduces heavy computational burden to modeling. Second, real-world trajectories often contain redundant points, which harm both computational efficiency and trajectory embedding quality. To address these challenges, we propose TrajMamba, a novel approach for efficient and semantically rich vehicle trajectory learning. TrajMamba introduces a Traj-Mamba Encoder that captures movement patterns by jointly modeling both GPS and road perspectives of trajectories, enabling robust representations of continuous travel behaviors. It also incorporates a Travel Purpose-aware Pre-training procedure to integrate travel purposes into the learned embeddings without introducing extra overhead to embedding calculation. To reduce redundancy in trajectories, TrajMamba features a Knowledge Distillation Pre-training scheme to identify key trajectory points through a learnable mask generator and obtain effective compressed trajectory embeddings. Extensive experiments on two real-world datasets and three downstream tasks show that TrajMamba outperforms state-of-the-art baselines in both efficiency and accuracy.

View full details

Poster

TreeSynth: Synthesizing Diverse Data from Scratch via Tree-Guided Subspace Partitioning

Sheng Wang ⋅ Pengan CHEN ⋅ Jingqi Zhou ⋅ Qintong Li ⋅ Jingwei Dong ⋅ Jiahui Gao ⋅ Boyang XUE ⋅ Jiyue Jiang ⋅ Lingpeng Kong ⋅ Chuan Wu

Dec 5, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Model customization necessitates high-quality and diverse datasets, but acquiring such data remains time-consuming and labor-intensive. Despite the great potential of large language models (LLMs) for data synthesis, current approaches are constrained by limited seed data, model biases and low-variation prompts, resulting in limited diversity and biased distribution with the increase of data scales. To tackle this challenge, we introduce TreeSynth, a tree-guided subspace-based data synthesis approach inspired by decision trees. It constructs a spatial partitioning tree to recursively divide a task-specific full data space (i.e., root node) into numerous atomic subspaces (i.e., leaf nodes) with mutually exclusive and exhaustive attributes to ensure both distinctiveness and comprehensiveness, before synthesizing samples within each atomic subspace. This globally divide-and-synthesize method finally collects subspace samples into a comprehensive dataset, effectively circumventing repetition and space collapse to ensure the diversity of large-scale data synthesis. Furthermore, the spatial partitioning tree enables sample allocation into atomic subspaces, allowing the re-balancing of existing datasets for more balanced and comprehensive distributions. Empirically, extensive experiments across diverse benchmarks consistently validates the superior data diversity, model performance, and robust scalability of TreeSynth compared to both human-crafted datasets and peer data synthesis methods, with the average performance gain reaching 10%. Besides, the consistent improvements of TreeSynth-balanced datasets highlight its efficacious application to redistribute existing datasets for more comprehensive coverage and the induced performance enhancement. The code is available at https://github.com/cpa2001/TreeSynth.

View full details

Poster

LLM Meeting Decision Trees on Tabular Data

Hangting Ye ⋅ Jinmeng Li ⋅ He Zhao ⋅ Dandan Guo ⋅ Yi Chang

Dec 5, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Tabular data have been playing a vital role in diverse real-world fields, including healthcare, finance, etc. With the recent success of Large Language Models (LLMs), early explorations of extending LLMs to the domain of tabular data have been developed. Most of these LLM-based methods typically first serialize tabular data into natural language descriptions, and then tune LLMs or directly infer on these serialized data. However, these methods suffer from two key inherent issues: (i) data perspective: existing data serialization methods lack universal applicability for structured tabular data, and may pose privacy risks through direct textual exposure, and (ii) model perspective: LLM fine-tuning methods struggle with tabular data, and in-context learning scalability is bottle-necked by input length constraints (suitable for few-shot learning). This work explores a novel direction of integrating LLMs into tabular data through logical decision tree rules as intermediaries, proposing a decision tree enhancer with LLM-derived rule for tabular prediction, DeLTa. The proposed DeLTa avoids tabular data serialization, and can be applied to full data learning setting without LLM fine-tuning. Specifically, we leverage the reasoning ability of LLMs to redesign an improved rule given a set of decision tree rules. Furthermore, we provide a calibration method for original decision trees via new generated rule by LLM, which approximates the error correction vector to steer the original decision tree predictions in the direction of ``errors'' reducing. Finally, extensive experiments on diverse tabular benchmarks show that our method achieves state-of-the-art performance.

View full details

Poster

THUNDER: Tile-level Histopathology image UNDERstanding benchmark

Pierre Marza ⋅ Leo Fillioux ⋅ Sofiène Boutaj ⋅ KUNAL MAHATHA ⋅ Christian Desrosiers ⋅ Pablo Piantanida ⋅ Jose Dolz ⋅ Stergios Christodoulidis ⋅ Maria Vakalopoulou

Dec 5, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Progress in a research field can be hard to assess, in particular when many concurrent methods are proposed in a short period of time. This is the case in digital pathology, where many foundation models have been released recently to serve as feature extractors for tile-level images, being used in a variety of downstream tasks, both for tile- and slide-level problems. Benchmarking available methods then becomes paramount to get a clearer view of the research landscape. In particular, in critical domains such as healthcare, a benchmark should not only focus on evaluating downstream performance, but also provide insights about the main differences between methods, and importantly, further consider uncertainty and robustness to ensure a reliable usage of proposed models. For these reasons, we introduce THUNDER, a tile-level benchmark for digital pathology foundation models, allowing for efficient comparison of many models on diverse datasets with a series of downstream tasks, studying their feature spaces and assessing the robustness and uncertainty of predictions informed by their embeddings. THUNDER is a fast, easy-to-use, dynamic benchmark that can already support a large variety of state-of-the-art foundation, as well as local user-defined models for direct tile-based comparison. In this paper, we provide a comprehensive comparison of 23 foundation models on 16 different datasets covering diverse tasks, feature analysis, and robustness. The code for THUNDER is publicly available at https://github.com/MICS-Lab/thunder.

View full details

Poster

High-Performance Arithmetic Circuit Optimization via Differentiable Architecture Search

Xilin Xia ⋅ Jie Wang ⋅ Wanbo Zhang ⋅ Zhihai Wang ⋅ Mingxuan Yuan ⋅ Jianye Hao ⋅ Feng Wu

Dec 5, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Arithmetic circuit optimization remains a fundamental challenge in modern integrated circuit design. Recent advances have cast this problem within the Learning to Optimize (L2O) paradigm, where intelligent agents autonomously explore high-performance design spaces with encouraging results. However, existing approaches predominantly target coarse-grained architectural configurations, while the crucial interconnect optimization stage is often relegated to oversimplified proxy models or a heuristic approach. This disconnect undermines design quality, leading to suboptimal solutions in the circuit topology search space. To bridge this gap, we present **Arith-DAS**, a **D**ifferentiable **A**rchitecture **S**earch framework for **Arith**metic circuits. To the best of our knowledge, **Arith-DAS** is the first to formulate interconnect optimization within arithmetic circuits as a differentiable edge prediction problem over a multi-relational directed acyclic graph, enabling fine-grained, proxy-free optimization at the interconnection level. We evaluate **Arith-DAS** on a suite of representative arithmetic circuits, including multipliers and multiply-accumulate units. Experiments show substantial improvements over state-of-the-art L2O and conventional methods, achieving up to $\textbf{27.05}$% gain in hypervolume of area-delay Pareto front, a standard metric for evaluating multi-objective optimization performance. Moreover, integrating our optimized arithmetic units into large-scale AI accelerators yields up to $\textbf{6.59}$% delay reduction, demonstrating both scalability and real-world applicability.

View full details

Poster

Measuring and Controlling Solution Degeneracy across Task-Trained Recurrent Neural Networks

Ann Huang ⋅ Satpreet Harcharan Singh ⋅ Flavio Martinelli ⋅ Kanaka Rajan

Dec 5, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Task-trained recurrent neural networks (RNNs) are widely used in neuroscience and machine learning to model dynamical computations. To gain mechanistic insight into how neural systems solve tasks, prior work often reverse-engineers individual trained networks. However, different RNNs trained on the same task and achieving similar performance can exhibit strikingly different internal solutions, a phenomenon known as solution degeneracy. Here, we develop a unified framework to systematically quantify and control solution degeneracy across three levels: behavior, neural dynamics, and weight space. We apply this framework to 3,400 RNNs trained on four neuroscience-relevant tasks: flip-flop memory, sine wave generation, delayed discrimination, and path integration, while systematically varying task complexity, learning regime, network size, and regularization. We find that increased task complexity and stronger feature learning reduce degeneracy in neural dynamics but increase it in weight space, with mixed effects on behavior. In contrast, larger networks and structural regularization reduce degeneracy at all three levels. These findings empirically validate the Contravariance Principle and provide practical guidance for researchers seeking to tune the variability of RNN solutions, either to uncover shared neural mechanisms or to model the individual variability observed in biological systems. This work provides a principled framework for quantifying and controlling solution degeneracy in task-trained RNNs, offering new tools for building more interpretable and biologically grounded models of neural computation.

View full details

Poster

Environment Inference for Learning Generalizable Dynamical System

Shixuan Liu ⋅ Yue He ⋅ Haotian Wang ⋅ Wenjing Yang ⋅ Yunfei Wang ⋅ Peng Cui ⋅ Zhong Liu

Dec 5, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Data-driven methods offer efficient and robust solutions for analyzing complex dynamical systems but rely on the assumption of I.I.D. data, driving the development of generalization techniques for handling environmental differences. These techniques, however, are limited by their dependence on environment labels, which are often unavailable during training due to data acquisition challenges, privacy concerns, and environmental variability, particularly in large public datasets and privacy-sensitive domains. In response, we propose DynaInfer, a novel method that infers environment specifications by analyzing prediction errors from fixed neural networks within each training round, enabling environment assignments directly from data. We prove our algorithm effectively solves the alternating optimization problem in unlabeled scenarios and validate it through extensive experiments across diverse dynamical systems. Results show that DynaInfer outperforms existing environment assignment techniques, converges rapidly to true labels, and even achieves superior performance when environment labels are available.

View full details

Poster

Spectral Graph Neural Networks are Incomplete on Graphs with a Simple Spectrum

Snir Hordan ⋅ Maya Bechler-Speicher ⋅ Gur Lifshitz ⋅ Nadav Dym

Dec 5, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Spectral features are widely incorporated within Graph Neural Networks (GNNs) to improve their expressive power, or their ability to distinguish among non-isomorphic graphs. One popular example is the usage of graph Laplacian eigenvectors for positional encoding in MPNNs and Graph Transformers. The expressive power of such Spectrally-enhanced GNNs (SGNNs) is usually evaluated via the $k$-WL graph isomorphism test hierarchy and homomorphism counting. Yet, these frameworks align poorly with the graph spectra, yielding limited insight into SGNNs' expressive power. In this paper, we leverage a well-studied paradigm of classifying graphs by their largest eigenvalue multiplicity to introduce an expressivity hierarchy for SGNNs. We then prove that many SGNNs are incomplete even on graphs with distinct eigenvalues. To mitigate this deficiency, we adapt rotation equivariant neural networks to the graph spectra setting, yielding equiEPNN, a novel SGNN that provably improves upon contemporary SGNNs' expressivity on simple spectrum graphs. We then demonstrate that equiEPNN achieves perfect eigenvector canonicalization on ZINC, and performs favorably on image classification on MNIST-Superpixel and graph property regression on ZINC, compared to leading spectral methods.

View full details

Poster

Emergence and Evolution of Interpretable Concepts in Diffusion Models

Berk Tinaz ⋅ Zalan Fabian ⋅ Mahdi Soltanolkotabi

Dec 5, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Diffusion models have become the go-to method for text-to-image generation, producing high-quality images from pure noise. However, the inner workings of diffusion models is still largely a mystery due to their black-box nature and complex, multi-step generation process. Mechanistic interpretability techniques, such as Sparse Autoencoders (SAEs), have been successful in understanding and steering the behavior of large language models at scale. However, the great potential of SAEs has not yet been applied toward gaining insight into the intricate generative process of diffusion models. In this work, we leverage the SAE framework to probe the inner workings of a popular text-to-image diffusion model, and uncover a variety of human-interpretable concepts in its activations. Interestingly, we find that *even before the first reverse diffusion step* is completed, the final composition of the scene can be predicted surprisingly well by looking at the spatial distribution of activated concepts. Moreover, going beyond correlational analysis, we design intervention techniques aimed at manipulating image composition and style, and demonstrate that (1) in early stages of diffusion image composition can be effectively controlled, (2) in the middle stages image composition is finalized, however stylistic interventions are effective, and (3) in the final stages only minor textural details are subject to change.

View full details

Poster

Trust Region Constrained Measure Transport in Path Space for Stochastic Optimal Control and Inference

Denis Blessing ⋅ Julius Berner ⋅ Lorenz Richter ⋅ Carles Domingo i Enrich ⋅ Yuanqi Du ⋅ Arash Vahdat ⋅ Gerhard Neumann

Dec 5, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Solving stochastic optimal control problems with quadratic control costs can be viewed as approximating a target path space measure, e.g. via gradient-based optimization. In practice, however, this optimization is challenging in particular if the target measure differs substantially from the prior. In this work, we therefore approach the problem by iteratively solving constrained problems incorporating trust regions that aim for approaching the target measure gradually in a systematic way. It turns out that this trust region based strategy can be understood as a geometric annealing from the prior to the target measure, where, however, the incorporated trust regions lead to a principled and educated way of choosing the time steps in the annealing path. We demonstrate in multiple optimal control applications that our novel method can improve performance significantly, including tasks in diffusion-based sampling and fine-tuning of diffusion models.

View full details

Poster

Mesh-RFT: Enhancing Mesh Generation via Fine-grained Reinforcement Fine-Tuning

Jian Liu ⋅ Jing Xu ⋅ Song Guo ⋅ Jing Li ⋅ jingfeng Guo ⋅ Jiaao Yu ⋅ Haohan Weng ⋅ Biwen Lei ⋅ Xianghui Yang ⋅ Zhuo Chen ⋅ Fangqi Zhu ⋅ Tao Han ⋅ Chunchao Guo

Dec 5, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Existing pretrained models for 3D mesh generation often suffer from data biases and produce low-quality results, while global reinforcement learning (RL) methods rely on object-level rewards that struggle to capture local structure details. To address these challenges, we present $\textbf{Mesh-RFT}$, a novel fine-grained reinforcement fine-tuning framework that employs Masked Direct Preference Optimization (M-DPO) to enable localized refinement via quality-aware face masking. To facilitate efficient quality evaluation, we introduce an objective topology-aware scoring system to evaluate geometric integrity and topological regularity at both object and face levels through two metrics: Boundary Edge Ratio (BER) and Topology Score (TS). By integrating these metrics into a fine-grained RL strategy, Mesh-RFT becomes the first method to optimize mesh quality at the granularity of individual faces, resolving localized errors while preserving global coherence. Experiment results show that our M-DPO approach reduces Hausdorff Distance (HD) by 24.6\% and improves Topology Score (TS) by 3.8\% over pre-trained models, while outperforming global DPO methods with a 17.4\% HD reduction and 4.9\% TS gain. These results demonstrate Mesh-RFT’s ability to improve geometric integrity and topological regularity, achieving new state-of-the-art performance in production-ready mesh generation.

View full details

Poster

Progressive Inference-Time Annealing of Diffusion Models for Sampling from Boltzmann Densities

Tara Akhound-Sadegh ⋅ Jungyoon Lee ⋅ Joey Bose ⋅ Valentin De Bortoli ⋅ Arnaud Doucet ⋅ Michael Bronstein ⋅ Dominique Beaini ⋅ Siamak Ravanbakhsh ⋅ Kirill Neklyudov ⋅ Alexander Tong

Dec 5, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Sampling efficiently from a target unnormalized probability density remains a core challenge, with relevance across countless high-impact scientific applications. A promising approach towards this challenge is the design of amortized samplers that borrow key ideas, such as probability path design, from state-of-the-art generative diffusion models. However, all existing diffusion-based samplers remain unable to draw samples from distributions at the scale of even simple molecular systems. In this paper, we propose Progressive Inference-Time Annealing (PITA) a novel framework to learn diffusion-based samplers that combines two complementary interpolation techniques: I.) Annealing of the Boltzmann distribution and II.) Diffusion smoothing. PITA trains a sequence of diffusion models from high to low temperatures by sequentially training each model at progressively higher temperatures, leveraging engineered easy access to samples of the temperature-annealed target density. In the subsequent step, PITA enables simulating the trained diffusion model to *procure training samples at a lower temperature* for the next diffusion model through inference-time annealing using a novel Feynman-Kac PDE combined with Sequential Monte Carlo. Empirically, PITA enables, for the first time, equilibrium sampling of $N$-body particle systems, Alanine Dipeptide, and tripeptides in Cartesian coordinates with dramatically lower energy function evaluations.

View full details

Poster

Conditional Representation Learning for Customized Tasks

Honglin Liu ⋅ Chao Sun ⋅ Peng Hu ⋅ Yunfan Li ⋅ Xi Peng

Dec 5, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Conventional representation learning methods learn a universal representation that primarily captures dominant semantics, which may not always align with customized downstream tasks. For instance, in animal habitat analysis, researchers prioritize scene-related features, whereas universal embeddings emphasize categorical semantics, leading to suboptimal results. As a solution, existing approaches resort to supervised fine-tuning, which however incurs high computational and annotation costs. In this paper, we propose Conditional Representation Learning (CRL), aiming to extract representations tailored to arbitrary user-specified criteria. Specifically, we reveal that the semantics of a space are determined by its basis, thereby enabling a set of descriptive words to approximate the basis for a customized feature space. Building upon this insight, given a user-specified criterion, CRL first employs a large language model (LLM) to generate descriptive texts to construct the semantic basis, then projects the image representation into this conditional feature space leveraging a vision-language model (VLM). The conditional representation better captures semantics for the specific criterion, which could be utilized for multiple customized tasks. Extensive experiments on classification and retrieval tasks demonstrate the superiority and generality of the proposed CRL. The code is available at https://github.com/XLearning-SCU/2025-NeurIPS-CRL.

View full details

Poster

GaussianFusion: Gaussian-Based Multi-Sensor Fusion for End-to-End Autonomous Driving

Shuai Liu ⋅ Quanmin Liang ⋅ Zefeng Li ⋅ Boyang Li ⋅ Kai Huang

Dec 5, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Multi-sensor fusion is crucial for improving the performance and robustness of end-to-end autonomous driving systems. Existing methods predominantly adopt either attention-based flatten fusion or bird’s eye view fusion through geometric transformations. However, these approaches often suffer from limited interpretability or dense computational overhead. In this paper, we introduce GaussianFusion, a Gaussian-based multi-sensor fusion framework for end-to-end autonomous driving. Our method employs intuitive and compact Gaussian representations as intermediate carriers to aggregate information from diverse sensors. Specifically, we initialize a set of 2D Gaussians uniformly across the driving scene, where each Gaussian is parameterized by physical attributes and equipped with explicit and implicit features. These Gaussians are progressively refined by integrating multi-modal features. The explicit features capture rich semantic and spatial information about the traffic scene, while the implicit features provide complementary cues beneficial for trajectory planning. To fully exploit rich spatial and semantic information in Gaussians, we design a cascade planning head that iteratively refines trajectory predictions through interactions with Gaussians. Extensive experiments on the NAVSIM and Bench2Drive benchmarks demonstrate the effectiveness and robustness of the proposed GaussianFusion framework. The source code is included in the supplementary material and will be released publicly.

View full details

Poster

ShapeLLM-Omni: A Native Multimodal LLM for 3D Generation and Understanding

junliang ye ⋅ Zhengyi Wang ⋅ Ruowen Zhao ⋅ Shenghao Xie ⋅ Jun Zhu

Dec 5, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Recently, the powerful text-to-image capabilities of GPT-4o have led to growing appreciation for native multimodal large language models. However, its multimodal capabilities remain confined to images and text. Yet beyond images, the ability to understand and generate 3D content is equally crucial. To address this gap, we propose ShapeLLM-Omni—a native 3D large language model capable of understanding and generating 3D assets and text in any sequence. First, we train a 3D vector-quantized variational autoencoder (VQVAE), which maps 3D objects into a discrete latent space to achieve efficient and accurate shape representation and reconstruction. Building upon the 3D-aware discrete tokens, we innovatively construct a large-scale continuous training dataset named 3D-Alpaca, encompassing generation, comprehension, and editing, thus providing rich resources for future research and training. Finally, we perform instruction-based fine-tuning of the Qwen-2.5-vl-7B-Instruct model on the 3D-Alpaca dataset, equipping it with native 3D understanding and generation capabilities. Our work represents an effective step toward extending multimodal large language models with fundamental 3D intelligence, paving the way for future advances in 3D-native AI.

View full details

Poster

An Analytical Theory of Spectral Bias in the Learning Dynamics of Diffusion Models

Binxu Wang ⋅ Cengiz Pehlevan

Dec 5, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

We develop an analytical framework for understanding how the learned distribution evolves during diffusion model training. Leveraging the Gaussian equivalence principle, we derived exact solutions for the gradient-flow dynamics of weights in one or two layer linear or linear convolutional denoiser settings with arbitrary data, where linear networks converge along principal components, and convolutional networks converge along Fourier modes. Remarkably, these solutions allow us to derive the generated distribution in closed-form and its KL-divergence through training. These analytical results expose a pronounced \emph{spectral bias}, i.e. for both weights and generated distributions, the convergence time of a mode follows an inverse power law of its variance. Empirical experiments on both Gaussian and natural image datasets demonstrate that the power-law spectral bias—remain robust even when using deeper or convolutional architectures. Our results underscore the importance of the data covariance in dictating the order and rate at which diffusion models learn different modes of the data, providing potential explanations of why earlier stopping could lead to incorrect details in image generative model.

View full details

Poster

On the Value of Cross-Modal Misalignment in Multimodal Representation Learning

Yichao Cai ⋅ Yuhang Liu ⋅ Erdun Gao ⋅ Tianjiao Jiang ⋅ Zhen Zhang ⋅ Anton van den Hengel ⋅ Prof Javen Qinfeng Shi

Dec 5, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Multimodal representation learning, exemplified by multimodal contrastive learning (MMCL) using image-text pairs, aims to learn powerful representations by aligning cues across modalities. This approach relies on the core assumption that the exemplar image-text pairs constitute two representations of an identical concept. However, recent research has revealed that real-world datasets often exhibit cross-modal misalignment. There are two distinct viewpoints on how to address this issue: one suggests mitigating the misalignment, and the other leveraging it. We seek here to reconcile these seemingly opposing perspectives, and to provide a practical guide for practitioners. Using latent variable models we thus formalize cross-modal misalignment by introducing two specific mechanisms: Selection bias, where some semantic variables are absent in the text, and perturbation bias, where semantic variables are altered—both leading to misalignment in data pairs. Our theoretical analysis demonstrates that, under mild assumptions, the representations learned by MMCL capture exactly the information related to the subset of the semantic variables invariant to selection and perturbation biases. This provides a unified perspective for understanding misalignment. Based on this, we further offer actionable insights into how misalignment should inform the design of real-world ML systems. We validate our theoretical findings via extensive empirical studies on both synthetic data and real image-text datasets, shedding light on the nuanced impact of cross-modal misalignment on multimodal representation learning.

View full details

Poster

Co-Evolving LLM Coder and Unit Tester via Reinforcement Learning

Yinjie Wang ⋅ Ling Yang ⋅ Ye Tian ⋅ Ke Shen ⋅ Mengdi Wang

Dec 5, 4:30 PM - 7:30 PM Exhibit Hall C,D,E

Mathematical reasoning in large language models has been successfully incentivized through reinforcement learning with verifiable rewards, leading to improved one-shot precision. In this work, we turn our focus to the coding domain. Beyond one-shot precision, we highlight unit test generation as another key factor for enhancing coding ability, since accurate unit tests are essential for enabling self-checking and self-correction during inference. Traditional approaches for fine-tuning LLMs on unit test generation rely heavily on ground-truth code solutions in the training data. We propose CURE, a novel reinforcement learning framework with a dedicated reward design that co-evolves coding and unit test generation capabilities based on their interaction outcomes—without any ground-truth code as supervision. This approach enables flexible and scalable training and allows the unit tester to learn directly from the coder’s mistakes. Through extensive evaluations, we demonstrate that our CURE models, derived from base models of varying sizes, excel in both code generation and unit test generation. They naturally extend to downstream tasks such as test-time scaling—achieving a 6.2\% improvement over the base model—and agentic unit test generation, with a 25.1\% improvement. Our 4B model consistently outperforms Qwen3-4B while achieving 64.8\% inference efficiency in unit test generation. Notably, we also find that the CURE model can serve as an effective reward model for reinforcement learning on base models, even in the absence of any labeled supervision.

View full details