NeurIPS 2025 Posters

Poster

Beyond Scalar Rewards: An Axiomatic Framework for Lexicographic MDPs

Mehran Shakerinava · Siamak Ravanbakhsh · Adam Oberman

Dec 3, 11:00 AM - 2:00 PM Don Alberto 4

Recent work has formalized the reward hypothesis through the lens of expected utility theory, by interpreting reward as utility. Hausner's foundational work showed that dropping the continuity axiom leads to a generalization of expected utility theory where utilities are lexicographically ordered vectors of arbitrary dimension. In this paper, we extend this result by identifying a simple and practical condition under which preferences in a Markov Decision Process (MDP) cannot be represented by scalar rewards, necessitating a 2-dimensional reward function. We provide a full characterization of such reward functions, as well as the general d-dimensional case under a memorylessness assumption on preferences. Furthermore, we show that optimal policies in this setting retain many desirable properties of their scalar-reward counterparts, while in the Constrained MDP (CMDP) setting — another common multiobjective setting — they do not.
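
As a minimal illustration of what lexicographically ordered vector utilities mean (a sketch under stated assumptions, not the paper's construction), the snippet below compares hypothetical 2-dimensional returns. The names `lex_prefers`, `safe_slow`, `safe_fast`, and `unsafe_fast`, and all numeric values, are invented for this example.

```python
# Hypothetical 2-dimensional returns: (primary objective, secondary objective).
# Python tuples already compare lexicographically: the first component decides,
# and the second component is consulted only to break ties.

def lex_prefers(u, v):
    """Return True if utility vector u is strictly preferred to v."""
    return tuple(u) > tuple(v)

safe_slow = (0.0, 5.0)       # no violation of the primary objective, modest secondary return
safe_fast = (0.0, 9.0)       # same primary value, better secondary return
unsafe_fast = (-1.0, 100.0)  # worse primary value, very large secondary return

# Rank the candidates from most to least preferred.
ranking = sorted([unsafe_fast, safe_fast, safe_slow], reverse=True)
```

No fixed weighted sum of the two components reproduces this ordering for arbitrarily large secondary values. That is the sense in which dropping the continuity axiom rules out a scalar representation.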

Poster

Constant Bit-size Transformers Are Turing Complete

Qian Li · Yuyi Wang

Dec 3, 11:00 AM - 2:00 PM Don Alberto 4

We prove that any Turing machine running on inputs of arbitrary length can be simulated by a constant bit-size transformer, as long as the context window is sufficiently long. This improves on previous work, which requires scaling up either the model's precision or the number of parameters on longer inputs. Furthermore, we prove that the complexity class SPACE$[s(n)]$ exactly characterizes the expressive power of a constant bit-size transformer with a context window of length $s(n)$. Our approach relies on simulating Post machines, a Turing-complete computational model. Post machines can be modeled as automata equipped with a queue, exhibiting computational behaviors naturally aligned with those of transformers. The behavioral similarity between transformers and Post machines may offer new insights into the mechanisms underlying the reasoning abilities of transformers.
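
To make "automata equipped with a queue" concrete, here is a minimal sketch of a queue automaton. At each step it removes one symbol from the front of the queue and appends a (possibly empty) string to the back. The function `run_post_machine`, the table `DELTA`, and the toy task are hypothetical and chosen only for illustration. The sketch does not show how the paper maps such a machine onto a transformer.

```python
from collections import deque

def run_post_machine(delta, start, halt, tape, max_steps=10_000):
    """Run a queue automaton: read from the front, write to the back."""
    queue = deque(tape)
    state = start
    for _ in range(max_steps):
        if state == halt or not queue:
            break
        symbol = queue.popleft()                   # consume the oldest symbol
        state, to_append = delta[(state, symbol)]  # finite-state control
        queue.extend(to_append)                    # append new symbols at the back
    return state, "".join(queue)

# Toy machine (hypothetical): duplicate every 'a' until the end marker '#'.
DELTA = {
    ("scan", "a"): ("scan", "aa"),
    ("scan", "#"): ("done", ""),
}

final_state, final_queue = run_post_machine(DELTA, "scan", "done", "aa#")
```

On the input `aa#` this toy machine halts in state `done` with queue contents `aaaa`.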

Poster

TRUST: Test-Time Refinement using Uncertainty-Guided SSM Traverses

Sahar Dastani · Ali Bahri · Gustavo Vargas Hakim · Moslem Yazdanpanah · Mehrdad Noori · David Osowiechi · Samuel Barbeau · Ismail Ayed · Herve Lombaert · Christian Desrosiers

Dec 3, 11:00 AM - 2:00 PM Don Alberto 4

State Space Models (SSMs) have emerged as efficient alternatives to Vision Transformers (ViTs), with VMamba standing out as a pioneering architecture designed for vision tasks. However, their generalization performance degrades significantly under distribution shifts. To address this limitation, we propose TRUST (Test-Time Refinement using Uncertainty-Guided SSM Traverses), a novel test-time adaptation (TTA) method that leverages diverse traversal permutations to generate multiple causal perspectives of the input image. Model predictions serve as pseudo-labels to guide updates of the Mamba-specific parameters, and the adapted weights are averaged to integrate the learned information across traversal scans. Altogether, TRUST is the first approach that explicitly leverages the unique architectural properties of SSMs for adaptation. Experiments on seven benchmarks show that TRUST consistently improves robustness and outperforms existing TTA methods.

Poster

Spatial Understanding from Videos: Structured Prompts Meet Simulation Data

Haoyu Zhang · Meng Liu · Zaijing Li · Haokun Wen · Weili Guan · Yaowei Wang · Liqiang Nie

Dec 3, 11:00 AM - 2:00 PM Don Alberto 4

Visual-spatial understanding, the ability to infer object relationships and layouts from visual input, is fundamental to downstream tasks such as robotic navigation and embodied interaction. However, existing methods face spatial uncertainty and data scarcity, limiting the 3D spatial reasoning capability of pre-trained vision-language models (VLMs). To address these challenges, we present a unified framework for enhancing 3D spatial reasoning in pre-trained VLMs without modifying their architecture. This framework combines SpatialMind, a structured prompting strategy that decomposes complex scenes and questions into interpretable reasoning steps, with ScanForgeQA, a scalable question-answering dataset built from diverse 3D simulation scenes through an automated construction process designed for fine-tuning. Extensive experiments across multiple benchmarks demonstrate the individual and combined effectiveness of our prompting and fine-tuning strategies, and yield insights that may inspire future research on visual-spatial understanding.

Poster

PointMAC: Meta-Learned Adaptation for Robust Test-Time Point Cloud Completion

Linlian Jiang · Rui Ma · Li Gu · Ziqiang Wang · Xinxin Zuo · Yang Wang

Dec 3, 11:00 AM - 2:00 PM Don Alberto 4

Point cloud completion is essential for robust 3D perception in safety-critical applications such as robotics and augmented reality. However, existing models perform static inference and rely heavily on inductive biases learned during training, limiting their ability to adapt to novel structural patterns and sensor-induced distortions at test time. To address this limitation, we propose PointMAC, a meta-learned framework for robust test-time adaptation in point cloud completion. It enables sample-specific refinement without requiring additional supervision. Our method optimizes the completion model under two self-supervised auxiliary objectives that simulate structural and sensor-level incompleteness. A meta-auxiliary learning strategy based on Model-Agnostic Meta-Learning (MAML) ensures that adaptation driven by auxiliary objectives is consistently aligned with the primary completion task. During inference, we adapt the shared encoder on-the-fly by optimizing auxiliary losses, with the decoder kept fixed. To further stabilize adaptation, we introduce Adaptive $\lambda$-Calibration, a meta-learned mechanism for balancing gradients between primary and auxiliary objectives. Extensive experiments on synthetic, simulated, and real-world datasets demonstrate that PointMAC achieves state-of-the-art results by refining each sample individually to produce high-quality completions. To the best of our knowledge, this is the first work to apply meta-auxiliary test-time adaptation to point cloud completion.

Poster

TreeSplat: Mergeable Tree for Deformable Gaussian Splatting

Qiuhong Shen · Xingyi Yang · Xinchao Wang

Dec 3, 11:00 AM - 2:00 PM Don Alberto 4

Dynamic 3D scene reconstruction from multi-view videos demands a representation that can model complex deformations at scale. Current Gaussian Splatting-based methods often either suffer from significant computational cost due to dense MLP-based modeling or explicitly model the deformation of each Gaussian independently. However, the dynamics of objects within a scene are typically hierarchical and exhibit structural correlations. To incorporate these structural priors into the representation, we introduce TreeSplat, a tree data structure for deformable Gaussian Splatting. In TreeSplat, as the name suggests, the motions of Gaussians are represented hierarchically within a tree. Each node learns coefficients for time-varying basis functions, defining a part of the motion. The full motion for any given Gaussian is then determined by accumulating these transformations along the tree path from its leaf node to the root node. This tree is not predefined; instead, it is constructed adaptively alongside Gaussian densification, where cloning or splitting a Gaussian correspondingly creates new leaf nodes. One central property of TreeSplat is its mergeability: after optimization during training, the hierarchical motion parameters for each Gaussian can be efficiently consolidated. By performing this merging step before test time, we eliminate the need to traverse the tree explicitly for each Gaussian during rendering. This results in dramatically faster rendering, at over 200 FPS, and compact storage, while maintaining state-of-the-art rendering quality. Experiments on diverse synthetic and real-world datasets validate these advantages.
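
The mergeability idea can be sketched under a simplifying assumption that the abstract does not confirm: each node contributes a displacement that is linear in its coefficients, and contributions along the path are summed. Under that assumption, summing the coefficients once per leaf gives the same motion as traversing the tree at every time step. The basis functions, the node names, and the one-dimensional displacement are all hypothetical.

```python
import math

def basis(t):
    """Hypothetical time-varying basis functions: [1, t, sin(2*pi*t)]."""
    return [1.0, t, math.sin(2.0 * math.pi * t)]

# Tree stored as parent pointers; each node holds one coefficient per basis function.
PARENT = {"root": None, "arm": "root", "hand": "arm"}
COEFFS = {
    "root": [0.0, 1.0, 0.0],
    "arm":  [0.5, 0.0, 0.2],
    "hand": [0.1, 0.0, 0.05],
}

def path_to_root(leaf):
    """Yield the nodes from a leaf up to and including the root."""
    node = leaf
    while node is not None:
        yield node
        node = PARENT[node]

def motion_by_traversal(leaf, t):
    """Accumulate each node's contribution along the leaf-to-root path."""
    b = basis(t)
    return sum(sum(c * bi for c, bi in zip(COEFFS[n], b)) for n in path_to_root(leaf))

def merge(leaf):
    """Consolidate the path into a single coefficient vector (done once, offline)."""
    merged = [0.0] * 3
    for n in path_to_root(leaf):
        merged = [m + c for m, c in zip(merged, COEFFS[n])]
    return merged

def motion_merged(merged, t):
    """Evaluate the motion from merged coefficients, with no tree traversal."""
    return sum(c * bi for c, bi in zip(merged, basis(t)))
```

With this linear form, `motion_merged(merge("hand"), t)` equals `motion_by_traversal("hand", t)` up to floating-point rounding. Whether the paper's transformations compose in exactly this additive way is an assumption of the sketch.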

Poster

Adaptive Sigmoid Clipping for Balancing the Direction–Magnitude Mismatch Trade-off in Differentially Private Learning

Faeze Moradi Kalarde · Ali Bereyhi · Ben Liang · Min Dong

Dec 3, 11:00 AM - 2:00 PM Don Alberto 4

Differential privacy (DP) limits the impact of individual training data samples by bounding their gradient norms through clipping. Conventional clipping operations assign unequal scaling factors to sample gradients with different norms, leading to a direction mismatch between the true batch gradient and the aggregation of the clipped gradients. Applying a smaller but identical scaling factor to all sample gradients alleviates this direction mismatch; however, it intensifies the magnitude mismatch by excessively reducing the aggregation norm. This work proposes a novel clipping method, termed adaptive sigmoid (AdaSig), which uses a sigmoid function with an adjustable saturation slope to clip the sample gradients. The slope is adaptively adjusted during the training process to balance the trade-off between direction mismatch and magnitude mismatch, as the statistics of sample gradients evolve over the training iterations. Despite AdaSig’s adaptive nature, our convergence analysis demonstrates that differentially private stochastic gradient descent (DP-SGD) with AdaSig clipping retains the best-known convergence rate under non-convex loss functions. Evaluating AdaSig on sentence and image classification tasks across different datasets shows that it consistently improves learning performance compared with established clipping methods.
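
The trade-off described above can be illustrated with two toy per-sample gradients. The sketch compares conventional per-sample clipping, which scales each gradient by `min(1, c / ||g||)`, with a single identical scaling factor for all samples. It does not implement AdaSig, whose exact sigmoid form is not given in the abstract. The helper names and numeric values are hypothetical, noise addition is omitted, and gradients are assumed to be nonzero.

```python
import math

def norm(v):
    return math.sqrt(sum(x * x for x in v))

def mean(vectors):
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def cosine(u, v):
    return sum(a * b for a, b in zip(u, v)) / (norm(u) * norm(v))

def clip_per_sample(grads, c):
    """Conventional clipping: each gradient gets its own factor min(1, c / ||g||)."""
    return [[x * min(1.0, c / norm(g)) for x in g] for g in grads]

def scale_uniform(grads, c):
    """One identical factor for all samples, chosen so every norm is at most c."""
    s = min(1.0, c / max(norm(g) for g in grads))
    return [[x * s for x in g] for g in grads]

grads = [[10.0, 0.0], [0.0, 1.0]]                # one large and one small sample gradient
true_mean = mean(grads)                          # direction of the unclipped batch gradient
clipped = mean(clip_per_sample(grads, c=1.0))    # unequal factors rotate the aggregate
uniform = mean(scale_uniform(grads, c=1.0))      # same direction, smaller norm
```

Here `clipped` points away from `true_mean` (direction mismatch), while `uniform` keeps the direction but has a smaller norm than `clipped` (magnitude mismatch).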

Poster

$\epsilon$-Seg: Sparsely Supervised Semantic Segmentation of Microscopy Data

Sheida Rahnamai Kordasiabi · Damian Nogare · Florian Jug

Dec 3, 11:00 AM - 2:00 PM Don Alberto 4

Semantic segmentation of electron microscopy (EM) images of biological samples remains a challenge in the life sciences. EM data captures details of biological structures, sometimes with such complexity that even human observers can find it overwhelming. We introduce $\epsilon$-Seg, a method based on hierarchical variational autoencoders (HVAEs), employing center-region masking, sparse label contrastive learning (CL), a Gaussian mixture model (GMM) prior, and clustering-free label prediction. Center-region masking and the inpainting loss encourage the model to learn robust and representative embeddings to distinguish the desired classes, even if training labels are sparse ($0.05\%$ of the total image data or less). For optimal performance, we employ CL and a GMM prior to shape the latent space of the HVAE such that encoded input patches tend to cluster w.r.t. the semantic classes we wish to distinguish. Finally, instead of clustering latent embeddings for semantic segmentation, we propose an MLP semantic segmentation head to directly predict class labels from latent embeddings. We show empirical results of $\epsilon$-Seg and baseline methods on two dense EM datasets of biological tissues and also demonstrate the applicability of our method to fluorescence microscopy data. Our results show that $\epsilon$-Seg is capable of achieving competitive sparsely supervised segmentation results on complex biological image data, even if only limited amounts of training labels are available. Code available at https://github.com/juglab/eps-Seg.

Poster

Reconstructing Heterogeneous Biomolecules via Hierarchical Gaussian Mixtures and Part Discovery

Shayan Shekarforoush · David Lindell · Marcus Brubaker · David Fleet

Dec 3, 11:00 AM - 2:00 PM Don Alberto 4

Cryo-EM is a transformational paradigm in molecular biology where computational methods are used to infer 3D molecular structure at atomic resolution from extremely noisy 2D electron microscope images. At the forefront of research is how to model the structure when the imaged particles exhibit non-rigid conformational flexibility and compositional variation where parts are sometimes missing. We introduce a novel 3D reconstruction framework with a hierarchical Gaussian mixture model, inspired in part by Gaussian Splatting for 4D scene reconstruction. In particular, the structure of the model is grounded in an initial process that infers a part-based segmentation of the particle, providing essential inductive bias in order to handle both conformational and compositional variability. The framework, called \methodName, is shown to reveal biologically meaningful structures on complex experimental datasets, and establishes a new state-of-the-art on CryoBench, a benchmark for cryo-EM heterogeneity methods.

Poster

MoniTor: Exploiting Large Language Models with Instruction for Online Video Anomaly Detection

Shengtian Yang · Yue Feng · Yingshi Liu · Jingrou Zhang · Jie Qin

Dec 3, 11:00 AM - 2:00 PM Don Alberto 4

Video Anomaly Detection (VAD) aims to locate unusual activities or behaviors within videos. Recently, offline VAD has garnered substantial research attention, invigorated by progress in large language models (LLMs) and vision-language models (VLMs), which offer the potential for a more nuanced understanding of anomalies. However, online VAD has seldom received attention due to real-time constraints and computational intensity. In this paper, we introduce a novel Memory-based online scoring queue scheme for Training-free VAD (MoniTor) to address the inherent complexities of online VAD. Specifically, MoniTor applies a streaming input to VLMs, leveraging the capabilities of pre-trained large-scale models. To capture temporal dependencies more effectively, we incorporate a novel prediction mechanism inspired by Long Short-Term Memory (LSTM) networks. This allows the model to represent past states and leverage previous predictions to identify anomalous behaviors, and thereby to better understand the current frame. Moreover, we design a scoring queue and an anomaly prior to dynamically store recent scores and cover all anomalies in the monitoring scenario, providing guidance for LLMs to distinguish between normal and abnormal behaviors over time. We evaluate MoniTor on two large datasets (i.e., UCF-Crime and XD-Violence) containing various surveillance and real-world scenarios. The results demonstrate that MoniTor outperforms state-of-the-art methods and is competitive with weakly supervised methods without training. Code is available at https://github.com/YsTvT/MoniTor.

Poster

Relaxing partition admissibility in Cluster-DAGs: a causal calculus with arbitrary variable clustering

Clément Yvernes · Emilie Devijver · Adèle Ribeiro · Marianne Clausel · Eric Gaussier

Dec 3, 11:00 AM - 2:00 PM Don Alberto 4

Cluster DAGs (C-DAGs) provide an abstraction of causal graphs in which nodes represent clusters of variables, and edges encode both cluster-level causal relationships and dependencies arising from unobserved confounding. C-DAGs define an equivalence class of acyclic causal graphs that agree on cluster-level relationships, enabling causal reasoning at a higher level of abstraction. However, when the chosen clustering induces cycles in the resulting C-DAG, the partition is deemed inadmissible under conventional C-DAG semantics. In this work, we extend the C-DAG framework to support arbitrary variable clusterings by relaxing the partition admissibility constraint, thereby allowing cyclic C-DAG representations. We extend the notions of d-separation and causal calculus to this setting, significantly broadening the scope of causal reasoning across clusters and enabling the application of C-DAGs in previously intractable scenarios. Our calculus is both sound and atomically complete with respect to the do-calculus: all valid interventional queries at the cluster level can be derived using our rules, each corresponding to a primitive do-calculus step.

Poster

Gradient Variance Reveals Failure Modes in Flow-Based Generative Models

Teodora Reu · Sixtine Dromigny · Michael Bronstein · Francisco Vargas

Dec 3, 11:00 AM - 2:00 PM Don Alberto 4

Rectified Flows learn ODE vector fields whose trajectories are straight between source and target distributions, enabling near one-step inference. We show that this straight-path objective reveals fundamental failure modes: under deterministic training, low gradient variance drives memorization of arbitrary training pairings, even when interpolant lines between training pairs intersect. To analyze this mechanism, we study Gaussian-to-Gaussian transport and use the loss gradient variance across stochastic and deterministic regimes to characterize which vector fields optimization favors in each setting. We then show that, in a setting where all interpolating lines intersect, applying Rectified Flow yields the same specific pairings at inference as during training. More generally, we prove that a memorizing vector field exists even when training interpolants intersect, and that optimizing the straight-path objective converges to this ill-defined field. At inference, deterministic integration reproduces the exact training pairings. We validate our findings empirically on the CelebA dataset, confirming that deterministic interpolants induce memorization, while the injection of small noise restores generalization.
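
For reference, the straight-path objective mentioned above is usually written as follows. This is the standard Rectified Flow formulation; the symbols $x_0$, $x_1$, $v_\theta$ and the uniform sampling of $t$ are notation assumed here, and the paper may use different conventions.

```latex
% Linear interpolant between a source sample x_0 and a target sample x_1
x_t = (1 - t)\, x_0 + t\, x_1, \qquad t \in [0, 1]

% Straight-path regression objective for the vector field v_\theta
\mathcal{L}(\theta) = \mathbb{E}_{(x_0, x_1),\, t}
  \left\| v_\theta(x_t, t) - (x_1 - x_0) \right\|^2

% Population minimizer: the conditional mean of the pair directions
v^\star(x, t) = \mathbb{E}\left[\, x_1 - x_0 \mid x_t = x \,\right]
```

Where interpolant lines cross, the conditional mean averages over the directions of all pairs passing through that point. The abstract's analysis concerns what optimization does instead when training uses a fixed, deterministic set of pairings.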

Poster

Physics-informed Reduced Order Modeling of Time-dependent PDEs via Differentiable Solvers

Nima Hosseini Dashtbayaz · Hesam Salehipour · Adrian Butscher · Nigel Morris

Dec 3, 11:00 AM - 2:00 PM Don Alberto 4

Reduced-order modeling (ROM) of time-dependent and parameterized differential equations aims to accelerate the simulation of complex high-dimensional systems by learning a compact latent manifold representation that captures the characteristics of the solution fields and their time-dependent dynamics. Although high-fidelity numerical solvers generate the training datasets, they have thus far been excluded from the training process, causing the learned latent dynamics to drift away from the discretized governing physics. This mismatch often limits generalization and forecasting capabilities. In this work, we propose **Ph**ysics-**i**nformed **ROM** ($\Phi$-ROM) by incorporating differentiable PDE solvers into the training procedure. Specifically, the latent space dynamics and its dependence on PDE parameters are shaped directly by the governing physics encoded in the solver, ensuring a strong correspondence between the full and reduced systems. Our model outperforms state-of-the-art data-driven ROMs and other physics-informed strategies by accurately generalizing to new dynamics arising from unseen parameters, enabling long-term forecasting beyond the training horizon, maintaining continuity in both time and space, and reducing the data cost. Furthermore, $\Phi$-ROM learns to recover and forecast the solution fields even when trained or evaluated with sparse and irregular observations of the fields, providing a flexible framework for field reconstruction and data assimilation. We demonstrate the framework’s robustness across various PDE solvers and highlight its broad applicability by providing an open-source JAX implementation that is readily extensible to other PDE systems and differentiable solvers, available at https://phi-rom.github.io.

Poster

On the VC dimension of deep group convolutional neural networks

Anna Sepliarskaia · Sophie Langer · Johannes Schmidt-Hieber

Dec 3, 11:00 AM - 2:00 PM Don Alberto 4

Recent works have introduced new equivariant neural networks, motivated by their improved generalization compared to traditional deep neural networks. While experiments support this advantage, the theoretical understanding of their generalization properties remains limited. In this paper, we analyze the generalization capabilities of Group Convolutional Neural Networks (GCNNs) with the ReLU activation function through the lens of Vapnik-Chervonenkis (VC) dimension theory. We investigate how architectural factors—such as the number of layers, weights, and input dimensions—affect the VC dimension. A key challenge in our analysis is proving a lower bound on the VC dimension, for which we introduce new techniques, establishing a novel connection between GCNNs and standard deep neural networks. Additionally, we compare our derived bounds to those known for fully connected neural networks. Our results extend previous findings on the VC dimension of continuous GCNNs with two layers, offering new insights into their generalization behavior, particularly their dependence on input resolution.

Poster

The Promise of RL for Autoregressive Image Editing

Saba Ahmadi · Rabiul Awal · Ankur Sikarwar · Amirhossein Kazemnejad · Ge Ya Luo · Juan Rodriguez · Sai Rajeswar Mudumba · Siva Reddy · Chris Pal · Benno Krojer · Aishwarya Agrawal

Dec 3, 11:00 AM - 2:00 PM Don Alberto 4

While image generation techniques are now capable of producing high-quality images that respect prompts which span multiple sentences, the task of text-guided image editing remains a challenge. Even edit requests that consist of only a few words often fail to be executed correctly. We explore three strategies to enhance performance on a wide range of image editing tasks: supervised fine-tuning (SFT), reinforcement learning (RL), and Chain-of-Thought (CoT) reasoning. In order to study all these components in one consistent framework, we adopt an autoregressive multimodal model that processes textual and visual tokens in a unified manner. We find RL combined with a large multi-modal LLM verifier to be the most effective of these strategies. As a result, we release EARL: Editing with Autoregression and RL, a strong RL-based image editing model that performs competitively on a diverse range of edits compared to strong baselines, despite using much less training data. Thus, EARL pushes the frontier of autoregressive multimodal models on image editing. We release our code, training data, and trained models at https://github.com/mair-lab/EARL.

Poster

Learning Simple Interpolants for Linear Integer Arithmetic

Minchao Wu · Naoki Kobayashi

Dec 3, 11:00 AM - 2:00 PM Don Alberto 4

Craig interpolation plays a central role in formal verification tasks such as model checking, invariant generation, and abstraction refinement. In the domain of linear integer arithmetic (LIA), interpolants are crucial for deriving inductive invariants that characterize unreachable or safe program states, enabling scalable and precise reasoning about software and hardware correctness. Despite progress in interpolation algorithms, generating concise and interpretable interpolants remains a key challenge. We propose a lightweight learning-based approach to generating simple interpolants for LIA. Our model learns to lazily sample input problems directly and is complementary to existing logical methods. When Z3 is guided by our learned model, the complexity of the interpolants it produces can be reduced by up to 47.3%. For older solvers, the reduction rate can reach up to 69.1%.

Poster

Token Perturbation Guidance for Diffusion Models

Javad Rajabi · Soroush Mehraban · Seyedmorteza Sadat · Babak Taati

Dec 3, 11:00 AM - 2:00 PM Don Alberto 4

Classifier-free guidance (CFG) has become an essential component of modern diffusion models to enhance both generation quality and alignment with input conditions. However, CFG requires specific training procedures and is limited to conditional generation. To address these limitations, we propose Token Perturbation Guidance (TPG), a novel method that applies perturbation matrices directly to intermediate token representations within the diffusion network. TPG employs a norm-preserving shuffling operation to provide effective and stable guidance signals that improve generation quality without architectural changes. As a result, TPG is training-free and agnostic to input conditions, making it readily applicable to both conditional and unconditional generation. We also analyze the guidance term provided by TPG and show that its effect on sampling more closely resembles CFG compared to existing training-free guidance techniques. We extensively evaluate TPG on SDXL and Stable Diffusion 2.1, demonstrating nearly a 2x improvement in FID for unconditional generation over the SDXL baseline and showing that TPG closely matches CFG in prompt alignment. Thus, TPG represents a general, condition-agnostic guidance method that extends CFG-like benefits to a broader class of diffusion models.
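
A shuffle of token rows is one simple example of a norm-preserving perturbation. It corresponds to left-multiplying the token matrix by a permutation matrix, so the Frobenius norm is unchanged. The sketch below shows only that property. The function names and the toy token matrix are hypothetical. Where in the network TPG applies the shuffle, and how the guidance term is formed from it, are not covered here.

```python
import math
import random

def frobenius_norm(tokens):
    """Frobenius norm of a token matrix given as a list of rows."""
    return math.sqrt(sum(x * x for row in tokens for x in row))

def shuffle_tokens(tokens, seed=0):
    """Permute token rows; equivalent to applying a permutation matrix."""
    rng = random.Random(seed)
    order = list(range(len(tokens)))
    rng.shuffle(order)
    return [tokens[i] for i in order]

# Toy intermediate representation: 4 tokens with 2 features each (hypothetical values).
tokens = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
perturbed = shuffle_tokens(tokens)
```

The perturbed matrix contains exactly the same rows in a different order, so its overall norm, and each token's own norm, is preserved.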

Poster

Uncertainty-Aware Multi-Objective Reinforcement Learning-Guided Diffusion Models for 3D De Novo Molecular Design

Lianghong Chen · Dongkyu Kim · Mike Domaratzki · Pingzhao Hu

Dec 3, 11:00 AM - 2:00 PM Don Alberto 4

Designing de novo 3D molecules with desirable properties remains a fundamental challenge in drug discovery and molecular engineering. While diffusion models have demonstrated remarkable capabilities in generating high-quality 3D molecular structures, they often struggle to effectively control complex multi-objective constraints critical for real-world applications. In this study, we propose an uncertainty-aware Reinforcement Learning (RL) framework to guide the optimization of 3D molecular diffusion models toward multiple property objectives while enhancing the overall quality of the generated molecules. Our method leverages surrogate models with predictive uncertainty estimation to dynamically shape reward functions, facilitating balance across multiple optimization objectives. We comprehensively evaluate our framework across three benchmark datasets and multiple diffusion model architectures, consistently outperforming baselines for molecular quality and property optimization. Additionally, Molecular Dynamics (MD) simulations and ADMET profiling of top generated candidates indicate promising drug-like behavior and binding stability, comparable to known Epidermal Growth Factor Receptor (EGFR) inhibitors. Our results demonstrate the strong potential of RL-guided generative diffusion models for advancing automated molecular design.

Poster

Can Class-Priors Help Single-Positive Multi-Label Learning?

Biao Liu · Ning Xu · Jie Wang · Xin Geng

Dec 3, 11:00 AM - 2:00 PM Don Alberto 4

Single-positive multi-label learning (SPMLL) is a weakly supervised multi-label learning problem, where each training example is annotated with only one positive label. Existing SPMLL methods typically assign pseudo-labels to unannotated labels under the assumption that the prior probabilities of all classes are identical. However, the class-prior of each category may differ significantly in real-world scenarios, so under this unrealistic assumption the predictive model performs worse than expected in real-world applications. To alleviate this issue, a novel framework named Crisp, i.e., Class-pRiors Induced Single-Positive multi-label learning, is proposed. Specifically, a class-priors estimator is introduced, which can estimate the class-priors that are theoretically guaranteed to converge to the ground-truth class-priors. In addition, based on the estimated class-priors, an unbiased risk estimator for classification is derived, and the corresponding risk minimizer can be guaranteed to approximately converge to the optimal risk minimizer on fully supervised data. Experimental results on ten MLL benchmark datasets demonstrate the effectiveness and superiority of our method over existing SPMLL approaches.

Poster

CHPO: Constrained Hybrid-action Policy Optimization for Reinforcement Learning

Ao Zhou · Jiayi Guan · Li Shen · Fan Lu · Sanqing Qu · Junqiao Zhao · Ziqiao Wang · Ya Wu · Guang Chen

Dec 3, 11:00 AM - 2:00 PM Don Alberto 4

Constrained hybrid-action reinforcement learning (RL) promises to learn a safe policy within a parameterized action space, which is particularly valuable for safety-critical applications involving discrete-continuous hybrid action spaces. However, existing hybrid-action RL algorithms primarily focus on reward maximization and therefore face significant challenges on tasks that involve both cost constraints and hybrid action spaces. In this work, we propose a novel Constrained Hybrid-action Policy Optimization algorithm (CHPO) to address the problems of constrained hybrid-action RL. Concretely, we rethink the limitations of hybrid-action RL in handling safe tasks with parameterized action spaces and reframe the objective of constrained hybrid-action RL by introducing the concept of a Constrained Parameterized-action Markov Decision Process (CPMDP). Subsequently, we present a constrained hybrid-action policy optimization algorithm to tackle constrained hybrid-action problems and conduct theoretical analyses demonstrating that CHPO converges to the optimal solution while satisfying safety constraints. Finally, extensive experiments demonstrate that CHPO achieves competitive performance across multiple experimental tasks.

Poster

$\texttt{STRCMP}$: Integrating Graph Structural Priors with Language Models for Combinatorial Optimization

Xijun Li · Jiexiang Yang · Jinghao Wang · Bo Peng · Jianguo Yao · Haibing Guan

Dec 3, 11:00 AM - 2:00 PM Don Alberto 4

Combinatorial optimization (CO) problems, central to operations research and theoretical computer science, present significant computational challenges due to their $\mathcal{NP}$-hard nature. While large language models (LLMs) have emerged as promising tools for CO (either by directly generating solutions or by synthesizing solver-specific code), existing approaches often $\textit{neglect critical structural priors inherent to CO problems}$, leading to suboptimality and iterative inefficiency. Inspired by human experts’ success in leveraging CO structures for algorithm design, we propose $\texttt{STRCMP}$, a novel structure-aware LLM-based algorithm discovery framework that systematically integrates structural priors to enhance solution quality and solving efficiency. Our framework combines a graph neural network (GNN) for extracting structural embeddings from CO instances with an LLM conditioned on these embeddings to identify high-performing algorithms in the form of solver-specific code. This composite architecture ensures syntactic correctness, preserves problem topology, and aligns with natural language objectives, while an evolutionary refinement process iteratively optimizes the generated algorithms. Extensive evaluations across Mixed Integer Linear Programming and Boolean Satisfiability problems, using nine benchmark datasets, demonstrate that our proposed $\texttt{STRCMP}$ outperforms five strong neural and LLM-based methods by a large margin, in terms of both solution optimality and computational efficiency. The code is publicly available in the repository: https://github.com/Y-Palver/L2O-STRCMP.

Poster

ChemX: A Collection of Chemistry Datasets for Benchmarking Automated Information Extraction

Anastasia Vepreva · Julia Razlivina · Mariia Eremeyeva · Nina Gubina · Anastasia Orlova · Aleksei Dmitrenko · Kapranova Xenia · Susan Jyakhwo · Nikita Vasilev · Arsen Sarkisyan · Ivan Chernyshov · Vladimir Vinogradov · Andrei Dmitrenko

Dec 3, 11:00 AM - 2:00 PM Don Alberto 4

Despite recent advances in machine learning, many scientific discoveries in chemistry still rely on manually curated datasets extracted from the scientific literature. Automation of information extraction in specialized chemistry domains has the potential to scale up machine learning applications and improve the quality of predictions, enabling data-driven scientific discoveries at a faster pace. In this paper, we present ChemX, a collection of 10 benchmarking datasets across several domains of chemistry, providing a reliable basis for evaluating and fine-tuning automated information extraction methods. The datasets, which encompass various properties of small molecules and nanomaterials, have been manually extracted from peer-reviewed publications and systematically validated by domain experts through a cross-verification procedure that allows errors to be identified and corrected at the source. To demonstrate the utility of the resulting datasets, we evaluate the extraction performance of state-of-the-art large language models (LLMs). Moreover, we design our own agentic approach to take full control of document preprocessing before LLM-based information extraction. Finally, we apply recently emerged multi-agent systems specialized in chemistry to compare performance against strong baselines. Our empirical results highlight persistent challenges in chemical information extraction, particularly in handling domain-specific terminology, complex tabular and schematic formats, and context-dependent ambiguities. We discuss the importance of expert data validation, the nuances of the evaluation pipeline, and the prospects of automated information extraction in chemistry. We also provide open documentation, including standardized schemas and provenance metadata, as well as the code and other materials to ensure reproducibility. ChemX is poised to advance automatic information extraction in chemistry by challenging the quality and generalization capabilities of existing methods, as well as providing insights into evaluation strategies.

Poster

Non-convex entropic mean-field optimization via Best Response flow

Razvan-Andrei Lascu · Mateusz Majka

Dec 3, 11:00 AM - 2:00 PM Don Alberto 4

We study the problem of minimizing non-convex functionals on the space of probability measures, regularized by the relative entropy (KL divergence) with respect to a fixed reference measure, as well as the corresponding problem of solving entropy-regularized non-convex-non-concave min-max problems. We utilize the Best Response flow (also known in the literature as the fictitious play flow) and study how its convergence is influenced by the relation between the degree of non-convexity of the functional under consideration, the regularization parameter and the tail behaviour of the reference measure. In particular, we demonstrate how to choose the regularizer, given the non-convex functional, so that the Best Response operator becomes a contraction with respect to the $L^1$-Wasserstein distance, which ensures the existence of its unique fixed point that is then shown to be the unique global minimizer for our optimization problem. This extends recent results where the Best Response flow was applied to solve convex optimization problems regularized by the relative entropy with respect to arbitrary reference measures, and with arbitrary values of the regularization parameter. Our results explain precisely how the assumption of convexity can be relaxed, at the expense of making a specific choice of the regularizer. Additionally, we demonstrate how these results can be applied in reinforcement learning in the context of policy optimization for Markov Decision Processes and Markov games with softmax parametrized policies in the mean-field regime.
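
As a sketch of the objects involved, the regularized problem, the Best Response operator, and the flow are commonly written as follows. The notation ($F$ for the functional, $\sigma$ for the regularization parameter, $\pi$ for the reference measure, $\Psi$ for the operator, and $\alpha > 0$ for the flow rate) is assumed here for illustration and is not taken from the abstract.

```latex
\min_{\mu \in \mathcal{P}(\mathcal{X})} \; F(\mu) + \sigma \, \mathrm{KL}(\mu \,\|\, \pi),
\qquad
\Psi[\mu](dx) \;\propto\; \exp\!\Big(-\tfrac{1}{\sigma}\,\frac{\delta F}{\delta \mu}(\mu, x)\Big)\, \pi(dx),
\qquad
\frac{d\mu_t}{dt} = \alpha \,\big(\Psi[\mu_t] - \mu_t\big).
```

In this notation, a fixed point $\mu^* = \Psi[\mu^*]$ satisfies the first-order condition $\frac{\delta F}{\delta \mu}(\mu^*, x) + \sigma \log \frac{d\mu^*}{d\pi}(x) = \text{const}$, which is why a contraction property of $\Psi$ is the natural route to a unique minimizer.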

Poster

Strassen Attention, Split VC Dimension and Compositionality in Transformers

Alexander Kozachinskiy · Felipe Urrutia · Hector Orellana · Tomasz Steifer · Germán Pizarro · Matías Fuentes · Francisco Meza Vásquez · Cristian Buc Calderon · Cristobal Rojas

Dec 3, 11:00 AM - 2:00 PM Don Alberto 4

We propose the first method to show theoretical limitations for one-layer softmax transformers with arbitrarily many precision bits (even infinite). We establish those limitations for three tasks that require advanced reasoning. The first task, Match 3 (Sanford et al., 2023), requires looking at all possible token triplets in an input sequence. The second and third tasks address compositionality-based reasoning: function composition (Peng et al., 2024) and binary relations composition, respectively. We formally prove the inability of one-layer softmax Transformers to solve any of these tasks. To overcome these limitations, we introduce Strassen attention and prove that, equipped with this mechanism, a one-layer transformer can in principle solve all these tasks. Importantly, we show that it enjoys sub-cubic running-time complexity, making it more scalable than similar previously proposed mechanisms, such as higher-order attention (Sanford et al., 2023). To complement our theoretical findings, we experimentally studied Strassen attention and compared it against standard attention (Vaswani et al., 2017), higher-order attention (Sanford et al., 2023), and triangular attention (Bergen et al., 2021). Our results help to disentangle all these attention mechanisms, highlighting their strengths and limitations. In particular, Strassen attention outperforms standard attention significantly on all the tasks. Altogether, understanding the theoretical limitations can guide research towards scalable attention mechanisms that improve the reasoning abilities of Transformers.
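
To make the "all possible token triplets" requirement concrete, here is a brute-force reference for a Match 3 style task. The modular-sum formulation (position `i` is labelled 1 when some pair `j`, `k` gives a sum divisible by `modulus`) is our reading of Sanford et al. (2023) and should be treated as an assumption; the function name and signature are hypothetical. It shows the task only, not Strassen attention.

```python
from itertools import product


def match3(tokens: list[int], modulus: int) -> list[int]:
    """Brute-force Match 3 (assumed formulation): out[i] = 1 iff some j, k
    satisfy tokens[i] + tokens[j] + tokens[k] == 0 (mod modulus)."""
    n = len(tokens)
    out = []
    for i in range(n):
        # Scan every ordered pair (j, k); indices may repeat or equal i.
        hit = any(
            (tokens[i] + tokens[j] + tokens[k]) % modulus == 0
            for j, k in product(range(n), repeat=2)
        )
        out.append(int(hit))
    return out
```

The nested scan costs cubic time in the sequence length, which is the baseline that a sub-cubic mechanism improves on.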

Poster

Abstract Counterfactuals for Language Model Agents

Edoardo Pona · Milad Kazemi Mehrabadi · Yali Du · David Watson · Nicola Paoletti

Dec 3, 11:00 AM - 2:00 PM Don Alberto 4

Counterfactual inference is a powerful tool for analysing and evaluating autonomous agents, but its application to language model (LM) agents remains challenging. Existing work on counterfactuals in LMs has primarily focused on token-level counterfactuals, which are often inadequate for LM agents due to their open-ended action spaces. Unlike traditional agents with fixed, clearly defined action spaces, the actions of LM agents are often implicit in the strings they output, making their action spaces difficult to define and interpret. Furthermore, the meanings of individual tokens can shift depending on the context, adding complexity to token-level reasoning and sometimes leading to biased or meaningless counterfactuals. We introduce *Abstract Counterfactuals*, a framework that emphasises high-level characteristics of actions and interactions within an environment, enabling counterfactual reasoning tailored to user-relevant features. Our experiments demonstrate that the approach produces consistent and meaningful counterfactuals while minimising the undesired side effects of token-level methods. We conduct experiments on text-based games and counterfactual text generation, while considering both token-level and latent-space interventions.

Poster

SPARKE: Scalable Prompt-Aware Diversity and Novelty Guidance in Diffusion Models via RKE Score

Mohammad Jalali · Haoyu Lei · Amin Gohari · Farzan Farnia

Dec 3, 11:00 AM - 2:00 PM Don Alberto 4

Diffusion models have demonstrated remarkable success in high-fidelity image synthesis and prompt-guided generative modeling. However, ensuring adequate diversity in generated samples of prompt-guided diffusion models remains a challenge, particularly when the prompts span a broad semantic spectrum and the diversity of generated data needs to be evaluated in a prompt-aware fashion across semantically similar prompts. Recent methods have introduced guidance via diversity measures to encourage more varied generations. In this work, we extend the diversity measure-based approaches by proposing the *S*calable *P*rompt-*A*ware *R*ényi *K*ernel *E*ntropy Diversity Guidance (*SPARKE*) method for prompt-aware diversity guidance. SPARKE utilizes conditional entropy for diversity guidance, which dynamically conditions diversity measurement on similar prompts and enables prompt-aware diversity control. While the entropy-based guidance approach enhances prompt-aware diversity, its reliance on the matrix-based entropy scores poses computational challenges in large-scale generation settings. To address this, we focus on the special case of *Conditional latent RKE Score Guidance*, reducing entropy computation and gradient-based optimization complexity from the $\mathcal{O}(n^3)$ of general entropy measures to $\mathcal{O}(n)$. The reduced computational complexity allows for diversity-guided sampling over potentially thousands of generation rounds on different prompts. We numerically test the SPARKE method on several text-to-image diffusion models, demonstrating that the proposed method improves the prompt-aware diversity of the generated data without incurring significant computational costs. We release our code on the project page: [https://mjalali.github.io/SPARKE/](https://mjalali.github.io/SPARKE).
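
For orientation, the sketch below computes a plain (unconditional) order-2 Rényi kernel entropy score, $1 / \lVert K/n \rVert_F^2$, with a Gaussian kernel. The kernel choice, the bandwidth, and the function names are assumptions made for illustration. This is not the conditional latent guidance procedure from the abstract, and the naive double loop is quadratic in the number of samples rather than $\mathcal{O}(n)$.

```python
import math


def gaussian_kernel(x, y, bandwidth):
    """Gaussian kernel with k(x, x) = 1 (assumed kernel choice)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))


def rke_score(samples, bandwidth=1.0):
    """Order-2 Renyi kernel entropy score: 1 / ||K / n||_F^2.

    Ranges from 1 (all samples identical) to n (all samples far apart).
    """
    n = len(samples)
    frob_sq = sum(
        (gaussian_kernel(x, y, bandwidth) / n) ** 2
        for x in samples
        for y in samples
    )
    return 1.0 / frob_sq
```

Read as an effective number of modes, the score gives a guidance term something to push upward when generations collapse onto near-duplicates.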

Poster

Exploring Semantic-constrained Adversarial Example with Instruction Uncertainty Reduction

Jin Hu · Jiakai Wang · linna Jing · Haolin Li · Liu haodong · Haotong Qin · Aishan Liu · Ke Xu · Xianglong Liu

Dec 3, 11:00 AM - 2:00 PM Don Alberto 4

Recently, semantically constrained adversarial examples (SemanticAE), which are directly generated from natural language instructions, have become a promising avenue for future research due to their flexible attacking forms, but have not been thoroughly explored yet. To generate SemanticAEs, current methods fall short of satisfactory attacking ability as the key underlying factors of semantic uncertainty in human instructions, such as *referring diversity*, *descriptive incompleteness*, and *boundary ambiguity*, have not been fully investigated. To tackle the issues, this paper develops a multi-dimensional **ins**truction **u**ncertainty **r**eduction (**InSUR**) framework to generate more satisfactory SemanticAE, *i.e.*, transferable, adaptive, and effective. Specifically, in the dimension of the sampling method, we propose the residual-driven attacking direction stabilization to alleviate the unstable adversarial optimization caused by the diversity of language references. By coarsely predicting the language-guided sampling process, the optimization process will be stabilized by the designed ResAdv-DDIM sampler, therefore releasing the transferable and robust adversarial capability of multi-step diffusion models. In task modeling, we propose the context-encoded attacking scenario constraint to supplement the missing knowledge from incomplete human instructions. Guidance masking and renderer integration are proposed to regulate the constraints of 2D/3D SemanticAE, activating stronger scenario-adapted attacks. Moreover, in the dimension of generator evaluation, we propose the semantic-abstracted attacking evaluation enhancement by clarifying the evaluation boundary based on the label taxonomy, facilitating the development of more effective SemanticAE generators. Extensive experiments demonstrate the superiority of the transfer attack performance of InSUR.
Besides, it is worth highlighting that we realize the reference-free generation of semantically constrained 3D adversarial examples by utilizing language-guided 3D generation models for the first time.

Poster

Global Minimizers of $\ell^p$-Regularized Objectives Yield the Sparsest ReLU Neural Networks

Julia Nakhleh · Robert Nowak

Dec 3, 11:00 AM - 2:00 PM Don Alberto 4

Overparameterized neural networks can interpolate a given dataset in many different ways, prompting the fundamental question: which among these solutions should we prefer, and what explicit regularization strategies will provably yield these solutions? This paper addresses the challenge of finding the sparsest interpolating ReLU network—i.e., the network with the fewest nonzero parameters or neurons—a goal with wide-ranging implications for efficiency, generalization, interpretability, theory, and model compression. Unlike post hoc pruning approaches, we propose a continuous, almost-everywhere differentiable training objective whose global minima are guaranteed to correspond to the sparsest single-hidden-layer ReLU networks that fit the data. This result marks a conceptual advance: it recasts the combinatorial problem of sparse interpolation as a smooth optimization task, potentially enabling the use of gradient-based training methods. Our objective is based on minimizing $\ell^p$ quasinorms of the weights for $0 < p < 1$, a classical sparsity-promoting strategy in finite-dimensional settings. However, applying these ideas to neural networks presents new challenges: the function class is infinite-dimensional, and the weights are learned using a highly nonconvex objective. We prove that, under our formulation, global minimizers correspond exactly to sparsest solutions. Our work lays a foundation for understanding when and how continuous sparsity-inducing objectives can be leveraged to recover sparse networks through training.
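
A small numeric illustration of why an $\ell^p$ penalty with $0 < p < 1$ favors sparsity: two weight vectors with the same $\ell^1$ norm receive different penalties, and the sparser one is cheaper. The function name and the example vectors are hypothetical, and the penalty shown is the generic $\sum_i |w_i|^p$, not the paper's specific network objective.

```python
def lp_penalty(weights, p):
    """Sum of |w|^p; for 0 < p < 1 this is the p-th power of the l^p quasinorm."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")
    return sum(abs(w) ** p for w in weights)


sparse = [1.0, 0.0]  # one nonzero weight
dense = [0.5, 0.5]   # same l^1 norm, two nonzero weights

if __name__ == "__main__":
    print(lp_penalty(sparse, 0.5), lp_penalty(dense, 0.5))
```

Because $|w|^p$ is concave on $[0, \infty)$ for $p < 1$, splitting a fixed amount of mass across more coordinates always raises the penalty, whereas the $\ell^1$ norm cannot tell the two vectors apart.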

Poster

BADiff: Bandwidth Adaptive Diffusion Model

Xi Zhang · Hanwei Zhu · Yan Zhong · Jiamang Wang · Weisi Lin

Dec 3, 11:00 AM - 2:00 PM Don Alberto 4

In this work, we propose a novel framework to enable diffusion models to adapt their generation quality based on real-time network bandwidth constraints. Traditional diffusion models produce high-fidelity images by performing a fixed number of denoising steps, regardless of downstream transmission limitations. However, in practical cloud-to-device scenarios, limited bandwidth often necessitates heavy compression, leading to loss of fine textures and wasted computation. To address this, we introduce a joint end-to-end training strategy where the diffusion model is conditioned on a target quality level derived from the available bandwidth. During training, the model learns to adaptively modulate the denoising process, enabling early-stop sampling that maintains perceptual quality appropriate to the target transmission condition. Our method requires minimal architectural changes and leverages a lightweight quality embedding to guide the denoising trajectory. Experimental results demonstrate that our approach significantly improves the visual fidelity of bandwidth-adapted generations compared to naive early-stopping, offering a promising solution for efficient image delivery in bandwidth-constrained environments. Code is available at: https://github.com/xzhang9308/BADiff.

Poster

Care-PD: A Multi-Site Anonymized Clinical Dataset for Parkinson’s Disease Gait Assessment

Vida Adeli · Ivan Klabučar · Javad Rajabi · Benjamin Filtjens · Soroush Mehraban · Diwei Wang · Trung Hieu Hoang · Minh Do · Hyewon Seo · Candice MULLER · Daniel Coelho · Claudia de Oliveira · Pieter Ginis · Moran Gilat · Alice Nieuwboer · Joke Spildooren · J. Mckay · Hyeokhyen Kwon · Gari Clifford · Christine Esper · Stewart Factor · Imari Genias · Amirhossein Dadashzadeh · Leia Shum · Alan Whone · Majid Mirmehdi · Andrea Iaboni · Babak Taati

Dec 3, 11:00 AM - 2:00 PM Don Alberto 4

Objective gait assessment in Parkinson’s Disease (PD) is limited by the absence of large, diverse, and clinically annotated motion datasets. We introduce Care-PD, the largest publicly available archive of 3D mesh gait data for PD, and the first multi-site collection spanning 9 cohorts from 8 clinical centers. All recordings (RGB video or motion capture) are converted into anonymized SMPL meshes via a harmonized preprocessing pipeline. Care-PD supports two key benchmarks: supervised clinical score prediction (estimating Unified Parkinson’s Disease Rating Scale, UPDRS, gait scores) and unsupervised motion pretext tasks (2D-to-3D keypoint lifting and full-body 3D reconstruction). Clinical prediction is evaluated under four generalization protocols: within-dataset, cross-dataset, leave-one-dataset-out, and multi-dataset in-domain adaptation. To assess clinical relevance, we compare state-of-the-art motion encoders with a traditional gait-feature baseline, finding that encoders consistently outperform handcrafted features. Pretraining on Care-PD reduces MPJPE (from 60.8 mm to 7.5 mm) and boosts PD severity macro-F1 by 17%, underscoring the value of clinically curated, diverse training data. Care-PD and all benchmark code are released for non-commercial research (Code, Data).

Poster

A Learning-Augmented Approach to Online Allocation Problems

Ilan Cohen · Debmalya Panigrahi

Dec 3, 11:00 AM - 2:00 PM Don Alberto 4

In online allocation problems, an algorithm must choose from a set of options at each step, where each option incurs a set of costs/rewards associated with a set of $d$ agents. The goal is to minimize/maximize a function of the accumulated costs/rewards assigned to the agents over the course of the entire allocation process. Such problems are common in combinatorial optimization, including minimization problems such as machine scheduling and network routing, as well as maximization problems such as fair allocation for welfare maximization. In this paper, we develop a general learning-augmented algorithmic framework for online allocation problems that produces a nearly optimal solution using only a single $d$-dimensional vector of learned weights. Using this general framework, we derive learning-augmented online algorithms for a broad range of application problems in routing, scheduling, and fair allocation. Our main tool is convex programming duality, which may also have further implications for learning-augmented algorithms in the future.

Poster

DERD-Net: Learning Depth from Event-based Ray Densities

Diego de Oliveira Hitzges · Suman Ghosh · Guillermo Gallego

Dec 3, 11:00 AM - 2:00 PM Don Alberto 4

Event cameras offer a promising avenue for multi-view stereo depth estimation and Simultaneous Localization And Mapping (SLAM) due to their ability to detect blur-free 3D edges at high speed and over broad illumination conditions. However, traditional deep learning frameworks designed for conventional cameras struggle with the asynchronous, stream-like nature of event data, as their architectures are optimized for discrete, image-like inputs. We propose a scalable, flexible and adaptable framework for pixel-wise depth estimation with event cameras in both monocular and stereo setups. The 3D scene structure is encoded into disparity space images (DSIs), representing spatial densities of rays obtained by back-projecting events into space via known camera poses. Our neural network processes local subregions of the DSIs combining 3D convolutions and a recurrent structure to recognize valuable patterns for depth prediction. Local processing enables fast inference with full parallelization and ensures constant ultra-low model complexity and memory costs, regardless of camera resolution. Experiments on standard benchmarks (MVSEC and DSEC datasets) demonstrate unprecedented effectiveness: (i) using purely monocular data, our method achieves comparable results to existing stereo methods; (ii) when applied to stereo data, it strongly outperforms all state-of-the-art (SOTA) approaches, reducing the mean absolute error by at least 42%; (iii) our method also allows for increases in depth completeness by more than 3-fold while still yielding a reduction in median absolute error of at least 30%. Given its remarkable performance and effective processing of event-data, our framework holds strong potential to become a standard approach for using deep learning for event-based depth estimation and SLAM.

Poster

One Sample is Enough to Make Conformal Prediction Robust

Soroush H. Zargarbashi · Mohammad Sadegh Akhondzadeh · Aleksandar Bojchevski

Dec 3, 11:00 AM - 2:00 PM Don Alberto 4

For any black-box model, conformal prediction (CP) returns prediction *sets* guaranteed to include the true label with high, adjustable probability. Robust CP (RCP) extends the guarantee to worst-case noise up to a pre-defined magnitude. For RCP, a well-established approach is to use randomized smoothing, since it is applicable to any black-box model and provides smaller sets compared to deterministic methods. However, smoothing-based robustness requires many model forward passes per input, which is computationally expensive. We show that conformal prediction attains some robustness even with *a single forward pass on a randomly perturbed input*. Using any binary certificate, we propose a single-sample robust CP (RCP1). Our approach returns robust sets with smaller average set size compared to SOTA methods, which use many (e.g. $\sim 100$) passes per input. Our key insight is to certify the conformal procedure itself rather than individual conformity scores. Our approach is agnostic to the task (classification and regression). We further extend our approach to smoothing-based robust conformal risk control.
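
As background for the guarantee described above, here is a minimal sketch of vanilla split conformal prediction, the non-robust baseline. It does not implement RCP1 or any certificate. The function names are hypothetical, and scores are assumed to be nonconformity scores, where lower means the label fits better.

```python
import math


def conformal_threshold(cal_scores, alpha):
    """Finite-sample-corrected (1 - alpha) quantile of calibration scores."""
    n = len(cal_scores)
    rank = math.ceil((n + 1) * (1.0 - alpha))  # 1-indexed order statistic
    if rank > n:
        return math.inf  # too few calibration points: include every label
    return sorted(cal_scores)[rank - 1]


def prediction_set(label_scores, threshold):
    """Keep every label whose nonconformity score is within the threshold."""
    return {label for label, s in label_scores.items() if s <= threshold}
```

With exchangeable calibration and test data, the resulting set contains the true label with probability at least $1 - \alpha$. Robust variants keep this guarantee when the test input is perturbed, typically at the price of larger sets.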

Poster

scSplit: Bringing Severity Cognizance to Image Decomposition in Fluorescence Microscopy

Ashesh Ashesh · Florian Jug

Dec 3, 11:00 AM - 2:00 PM Don Alberto 4

Fluorescence microscopy, while being a key driver for progress in the life sciences, is also subject to technical limitations. To overcome them, computational multiplexing techniques have recently been proposed, which allow multiple cellular structures to be captured in a single image and later be unmixed. Existing image decomposition methods are trained on a set of superimposed input images and the respective unmixed target images. It is critical to note that the relative strength (mixing ratio) of the superimposed images for a given input is a priori unknown. However, existing methods are trained on a fixed intensity ratio of superimposed inputs, making them not cognizant of the range of relative intensities that can occur in fluorescence microscopy. In this work, we propose a novel method called scSplit that is cognizant of the severity of the above-mentioned mixing ratio. Our idea is based on InDI, a popular iterative method for image restoration, and an ideal starting point to embrace the unknown mixing ratio in any given input. We introduce (i) a suitably trained regressor network that predicts the degradation level (mixing asymmetry) of a given input image and (ii) a degradation-specific normalization module, enabling degradation-aware inference across all mixing ratios. We show that this method solves two relevant tasks in fluorescence microscopy, namely image splitting and bleedthrough removal, and empirically demonstrate the applicability of scSplit on 5 public datasets. The source code with pre-trained models is hosted at https://github.com/juglab/scSplit/.

Poster

Enhancing Temporal Understanding in Video-LLMs through Stacked Temporal Attention in Vision Encoders

Ali Rasekh (Leibniz University Hannover, L3S Research Center) · Erfan Soula · Omid Daliran · Simon Gottschalk · Mohsen Fayyaz

Dec 3, 11:00 AM - 2:00 PM Don Alberto 4

Despite significant advances in Multimodal Large Language Models (MLLMs), understanding complex temporal dynamics in videos remains a major challenge. Our experiments show that current Video Large Language Model (Video-LLM) architectures have critical limitations in temporal understanding, struggling with tasks that require detailed comprehension of action sequences and temporal progression. In this work, we propose a Video-LLM architecture that introduces stacked temporal attention modules directly within the vision encoder. This design incorporates temporal attention into the vision encoder, enabling the model to better capture the progression of actions and the relationships between frames before passing visual tokens to the LLM. Our results show that this approach significantly improves temporal reasoning and outperforms existing models in video question answering tasks, specifically in action recognition. We improve on benchmarks including VITATECS, MVBench, and Video-MME by up to +5.5%. By enhancing the vision encoder with temporal structure, we address a critical gap in video understanding for Video-LLMs. Project page and code are available at: https://alirasekh.github.io/STAVEQ2/

Poster

Understanding Prompt Tuning and In-Context Learning via Meta-Learning

Tim Genewein · Kevin Li · Jordi Grau-Moya · Anian Ruoss · Laurent Orseau · Marcus Hutter

Dec 3, 11:00 AM - 2:00 PM Don Alberto 4

Prompting is one of the main ways to adapt a pretrained model to target tasks. Besides manually constructing prompts, many prompt optimization methods have been proposed in the literature. Method development is mainly empirically driven, with less emphasis on a conceptual understanding of prompting. In this paper we discuss how optimal prompting can be understood through a Bayesian view, which also implies some fundamental limitations of prompting that can only be overcome by tuning weights. The paper explains in detail how meta-trained neural networks behave as Bayesian predictors over the pretraining distribution, whose hallmark feature is rapid in-context adaptation. Optimal prompting can be studied formally as conditioning these Bayesian predictors, yielding criteria for target tasks where optimal prompting is and is not possible. We support the theory with educational experiments on LSTMs and Transformers, where we compare different versions of prefix-tuning and different weight-tuning methods. We also confirm that soft prefixes, which are sequences of real-valued vectors outside the token alphabet, can lead to very effective prompts for trained and even untrained networks by manipulating activations in ways that are not achievable by hard tokens. This adds an important mechanistic aspect beyond the conceptual Bayesian theory.
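
The Bayesian view can be illustrated with a toy predictor: a mixture over a few Bernoulli "tasks" whose next-token prediction is the posterior predictive given the prefix. The setup (two coin biases, a uniform prior, binary tokens) and the function name are assumptions for illustration and are not the paper's experiments.

```python
def posterior_predictive(biases, prior, prefix):
    """P(next token = 1 | prefix) for a Bayesian mixture over Bernoulli tasks."""
    weights = list(prior)
    for token in prefix:
        # Bayes update: multiply each task's weight by its token likelihood.
        weights = [
            w * (b if token == 1 else 1.0 - b)
            for w, b in zip(weights, biases)
        ]
    total = sum(weights)
    return sum(w * b for w, b in zip(weights, biases)) / total


if __name__ == "__main__":
    biases, prior = [0.2, 0.8], [0.5, 0.5]
    print(posterior_predictive(biases, prior, []))         # no prompt
    print(posterior_predictive(biases, prior, [1, 1, 1]))  # prompt of three 1s
```

In this toy model a prompt only reweights tasks already in the mixture, so the prediction always stays between 0.2 and 0.8. A target task with bias 0.95 cannot be reached by any hard-token prefix, which is the kind of limitation that would call for changing the weights instead.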

Poster

Enhancing Sample Selection Against Label Noise by Cutting Mislabeled Easy Examples

Suqin Yuan · Lei Feng · Bo Han · Tongliang Liu

Dec 3, 11:00 AM - 2:00 PM Don Alberto 4

Sample selection is a prevalent approach in learning with noisy labels, aiming to identify confident samples for training. Although existing sample selection methods have achieved decent results by reducing the noise rate of the selected subset, they often overlook that not all mislabeled examples harm the model's performance equally. In this paper, we demonstrate that mislabeled examples correctly predicted by the model early in the training process are particularly harmful to model performance. We refer to these examples as Mislabeled Easy Examples (MEEs). To address this, we propose Early Cutting, which introduces a recalibration step that employs the model's later training state to re-select the confident subset identified early in training, thereby avoiding misleading confidence from early learning and effectively filtering out MEEs. Experiments on the CIFAR, WebVision, and full ImageNet-1k datasets demonstrate that our method effectively improves sample selection and model performance by reducing MEEs.

Poster

BNMusic: Blending Environmental Noises into Personalized Music

Chi Zuo · Martin Møller · Pablo Martínez-Nuevo · Huayang Huang · Yu Wu · Ye Zhu

Dec 3, 11:00 AM - 2:00 PM Don Alberto 4

When listeners are disturbed by environmental noises, acoustic masking is a conventional audio-engineering technique for reducing the annoyance: it seeks to cover up the noises with other dominant yet less intrusive sounds. However, misalignment between the dominant sound and the noise, such as mismatched downbeats, often requires an excessive volume increase to achieve effective masking. Motivated by recent advances in cross-modal generation, in this work, we introduce an alternative method to acoustic masking, aiming to reduce the noticeability of environmental noises by blending them into personalized music generated based on user-provided text prompts. Following the paradigm of music generation using mel-spectrogram representations, we propose a Blending Noises into Personalized Music (BNMusic) framework with two key stages. The first stage synthesizes a complete piece of music in a mel-spectrogram representation that encapsulates the musical essence of the noise. In the second stage, we adaptively amplify the generated music segment to further reduce noise perception and enhance the blending effectiveness, while preserving auditory quality. Our experiments with comprehensive evaluations on MusicBench, EPIC-SOUNDS, and ESC-50 demonstrate the effectiveness of our framework, highlighting the ability to blend environmental noise with rhythmically aligned, adaptively amplified, and enjoyable music segments, minimizing the noticeability of the noise, thereby improving overall acoustic experiences. Project page: https://d-fas.github.io/BNMusic_page/.

Poster

Feature Distillation is the Better Choice for Model-Heterogeneous Federated Learning

Yichen Li · Xiuying Wang · Wenchao Xu · Haozhao Wang · Yining Qi · Jiahua Dong · Ruixuan Li

Dec 3, 11:00 AM - 2:00 PM Don Alberto 4

Model-Heterogeneous Federated Learning (Hetero-FL) has attracted growing attention for its ability to aggregate knowledge from heterogeneous models while keeping private data locally. To better aggregate knowledge from clients, ensemble distillation, as a widely used and effective technique, is often employed after global aggregation to enhance the performance of the global model. However, simply combining Hetero-FL and ensemble distillation does not always yield promising results and can make the training process unstable. The reason is that existing methods primarily focus on logit distillation, which, while being model-agnostic with softmax predictions, fails to compensate for the knowledge bias arising from heterogeneous models. To tackle this challenge, we propose a stable and efficient Feature Distillation for model-heterogeneous Federated learning, dubbed FedFD, that can incorporate aligned feature information via orthogonal projection to integrate knowledge from heterogeneous models better. Specifically, a new feature-based ensemble federated knowledge distillation paradigm is proposed. The global model on the server needs to maintain a projection layer for each client-side model architecture to align the features separately. Orthogonal techniques are employed to re-parameterize the projection layer to mitigate knowledge bias from heterogeneous models and thus maximize the distilled knowledge. Extensive experiments show that FedFD achieves superior performance compared to state-of-the-art methods.

Poster

Manipulating Feature Visualizations with Gradient Slingshots

Dilyara Bareeva · Marina Höhne · Alexander Warnecke · Lukas Pirch · Klaus-Robert Müller · Konrad Rieck · Sebastian Lapuschkin · Kirill Bykov

Dec 3, 11:00 AM - 2:00 PM Don Alberto 4

Feature Visualization (FV) is a widely used technique for interpreting concepts learned by Deep Neural Networks (DNNs), which synthesizes input patterns that maximally activate a given feature. Despite its popularity, the trustworthiness of FV explanations has received limited attention. We introduce Gradient Slingshots, a novel method that enables FV manipulation without modifying model architecture or significantly degrading performance. By shaping new trajectories in off-distribution regions of a feature's activation landscape, we coerce the optimization process to converge to a predefined visualization. We evaluate our approach on several DNN architectures, demonstrating its ability to replace faithful FVs with arbitrary targets. These results expose a critical vulnerability: auditors relying solely on FV may accept entirely fabricated explanations. To mitigate this risk, we propose a straightforward defense and quantitatively demonstrate its effectiveness.

Poster

Mixtures of Subspaces for Bandwidth Efficient Context Parallel Training

Sameera Ramasinghe · Thalaiyasingam Ajanthan · Hadi Mohaghegh Dolatabadi · Gil Avraham · Violetta Shevchenko · Yan Zuo · Chamin Hewa Koneputugodage · Alexander Long

Dec 3, 11:00 AM - 2:00 PM Don Alberto 4

Pretraining language models with extended context windows enhances their ability to leverage rich information during generation. Existing methods split input sequences into chunks, broadcast them across multiple devices, and compute attention block by block, which incurs significant communication overhead. While feasible in high-speed clusters, these methods are impractical for decentralized training over low-bandwidth connections. We propose a compression method for communication-efficient context parallelism in decentralized settings, achieving a remarkable compression rate of over 95% with negligible overhead and no loss in convergence. Our key insight is to exploit the intrinsic low-rank structure of activation outputs by dynamically constraining them to learned mixtures of subspaces via efficient reparameterizations. We demonstrate scaling billion-parameter decentralized models to context lengths exceeding 100K tokens on networks as slow as 300 Mbps, matching the wall-clock convergence speed of centralized models on 100 Gbps interconnects.

Poster

Capturing Individual Human Preferences with Reward Features

Andre Barreto · Vincent Dumoulin · Yiran Mao · Mark Rowland · Nicolas Perez-Nieves · Bobak Shahriari · Yann Dauphin · Doina Precup · Hugo Larochelle

Dec 3, 11:00 AM - 2:00 PM Don Alberto 4

Reinforcement learning from human feedback usually models preferences using a reward function that does not distinguish between people. We argue that this is unlikely to be a good design choice in contexts with high potential for disagreement, like in the training of large language models. We formalise and analyse the problem of learning a reward model that can be specialised to a user. Using the principle of empirical risk minimisation, we derive a probably approximately correct (PAC) bound showing the dependency of the approximation error on the number of training examples, as usual, and also on the number of human raters who provided feedback on them. Based on our theoretical findings, we discuss how to best collect pairwise preference data and argue that adaptive reward models should be beneficial when there is considerable disagreement among users. We also propose a concrete architecture for an adaptive reward model. Our approach leverages the observation that individual preferences can be captured as a linear combination of a set of general reward features. We show how to learn such features and subsequently use them to quickly adapt the reward model to a specific individual, even if their preferences are not reflected in the training data. We present experiments with large language models illustrating our theoretical results and comparing the proposed architecture with a non-adaptive baseline. Consistent with our analysis, the benefits provided by our model increase with the number of raters and the heterogeneity of their preferences. We also show that our model compares favourably to adaptive counterparts, including those performing in-context personalisation.
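The adaptation step described above, in which a user's reward is a linear combination of shared reward features, can be sketched as follows. This is a minimal illustration, not the authors' architecture: it assumes the features are already learned and frozen, that pairwise preferences follow a Bradley-Terry (logistic) model, and that plain gradient ascent is used. The function name `adapt_user_weights` and its parameters are hypothetical.

```python
import numpy as np

def adapt_user_weights(phi_preferred, phi_rejected, lr=0.5, steps=500):
    """Fit per-user weights w so that reward(x) = w . phi(x) explains the
    user's pairwise choices under an assumed Bradley-Terry model.

    phi_preferred, phi_rejected: arrays of shape (n_pairs, n_features)
    holding frozen reward features of the chosen and rejected responses.
    """
    diff = phi_preferred - phi_rejected          # feature difference per pair
    w = np.zeros(diff.shape[1])
    for _ in range(steps):
        # Probability the model assigns to the observed preference.
        p = 1.0 / (1.0 + np.exp(-(diff @ w)))
        # Gradient of the mean log-likelihood: (1 - p) * diff, averaged.
        grad = diff.T @ (1.0 - p) / len(diff)
        w += lr * grad                           # gradient ascent step
    return w
```

Because only the low-dimensional weight vector is fitted, a handful of comparisons from a new user is in principle enough to specialise the reward model, which matches the quick-adaptation setting the abstract describes.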

View full details

Poster

Attack by Yourself: Effective and Unnoticeable Multi-Category Graph Backdoor Attacks with Subgraph Triggers Pool

Jiangtong Li · Dongyi Liu · Kun Zhu · Dawei Cheng · changjun jiang

Dec 3, 11:00 AM - 2:00 PM Don Alberto 4

Graph Neural Networks (GNNs) have achieved significant success in various real-world applications, including social networks, finance systems, and traffic management. Recent research highlights their vulnerability to backdoor attacks in node classification, where GNNs trained on a poisoned graph misclassify a test node only when specific triggers are attached. These studies typically focus on single attack categories and use adaptive trigger generators to create node-specific triggers. However, adaptive trigger generators typically have a simple structure and limited parameters, and lack category-aware graph knowledge, which makes them struggle to handle backdoor attacks across multiple categories as the number of target categories increases. We address this gap by proposing a novel approach for Effective and Unnoticeable Multi-Category (EUMC) graph backdoor attacks, leveraging subgraphs from the attacked graph as category-aware triggers to precisely control the target category. To ensure the effectiveness of our method, we construct a Multi-Category Subgraph Triggers Pool (MC-STP) using the subgraphs of the attacked graph as triggers. We then exploit the attachment probability shifts of each subgraph trigger as category-aware priors for target category determination. Moreover, we develop a "select then attach" strategy that connects suitable category-aware triggers to attacked nodes for unnoticeability. Extensive experiments across different real-world datasets confirm the efficacy of our method in conducting multi-category graph backdoor attacks on various GNN models and defense strategies.

View full details

Poster

AugGen: Synthetic Augmentation using Diffusion Models Can Improve Recognition

Parsa Rahimi · Damien Teney · Sébastien Marcel

Dec 3, 11:00 AM - 2:00 PM Don Alberto 4

The increasing reliance on large-scale datasets in machine learning poses significant privacy and ethical challenges, particularly in sensitive domains such as face recognition. Synthetic data generation offers a promising alternative; however, most existing methods depend heavily on external datasets or pre-trained models, increasing complexity and resource demands. In this paper, we introduce AugGen, a self-contained synthetic augmentation technique. AugGen strategically samples from a class-conditional generative model trained exclusively on the target FR dataset, eliminating the need for external resources. Evaluated across 8 FR benchmarks, including IJB-C and IJB-B, our method achieves 1–12% performance improvements, outperforming models trained solely on real data and surpassing state-of-the-art synthetic data generation approaches, while using less real data. Notably, these gains often exceed those from architectural modifications, underscoring the value of synthetic augmentation in data-limited scenarios. Our findings demonstrate that carefully integrated synthetic data can both mitigate privacy constraints and substantially enhance discriminative performance in face recognition. Code and datasets will be made publicly available upon publication.

View full details

Poster

Predictable Scale (Part II) --- Farseer: A Refined Scaling Law in LLMs

Houyi Li · Wenzhen Zheng · Qiufeng Wang · Zhenyu Ding · Haoying Wang · Zili Wang · Shijie Xuyang · Ning DING · Shuigeng Zhou · Xiangyu Zhang · Daxin Jiang

Dec 3, 11:00 AM - 2:00 PM Don Alberto 4

Training Large Language Models (LLMs) is prohibitively expensive, creating a critical scaling gap where insights from small-scale experiments often fail to transfer to resource-intensive production systems, thereby hindering efficient innovation. To bridge this, we introduce Farseer, a novel and refined scaling law offering enhanced predictive accuracy across scales. By systematically constructing a model loss surface $L(N,D)$, Farseer achieves a significantly better fit to empirical data than prior laws (e.g., Chinchilla's law). Our methodology yields accurate, robust, and highly generalizable predictions, demonstrating excellent extrapolation capabilities, outperforming Chinchilla's law, whose extrapolation error is 433% higher. This allows for the reliable evaluation of competing training strategies across all $(N,D)$ settings, enabling conclusions from small-scale ablation studies to be confidently extrapolated to predict large-scale performance. Furthermore, Farseer provides new insights into optimal compute allocation, better reflecting the nuanced demands of modern LLM training. To validate our approach, we trained an extensive suite of approximately 1,000 LLMs across diverse scales and configurations, consuming roughly 3 million NVIDIA H100 GPU hours. To foster further research, we are comprehensively open-sourcing all code, data, results (https://github.com/Farseer-Scaling-Law/Farseer), all training logs (https://wandb.ai/billzid/Farseer?nw=nwuserbillzid), and all models used in scaling law fitting (https://huggingface.co/Farseer-Scaling-Law).
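For context, the baseline that the abstract compares against, Chinchilla's law, is usually written in the parametric form below, where $N$ is the parameter count, $D$ the number of training tokens, and $E, A, B, \alpha, \beta$ are fitted constants. Farseer's own functional form is not stated in the abstract and is not reproduced here.

```latex
L(N, D) = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}
```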

View full details

Poster

GRAVER: Generative Graph Vocabularies for Robust Graph Foundation Models Fine-tuning

Haonan Yuan · Qingyun Sun · Junhua Shi · Xingcheng Fu · Bryan Hooi · Jianxin Li · Philip S Yu

Dec 3, 11:00 AM - 2:00 PM Don Alberto 4

Inspired by the remarkable success of foundation models in language and vision, Graph Foundation Models (GFMs) hold significant promise for broad applicability across diverse graph tasks and domains. However, existing GFMs struggle with unstable few-shot fine-tuning, where both performance and adaptation efficiency exhibit significant fluctuations caused by the randomness in the support sample selection and structural discrepancies between the pre-trained and target graphs. How to fine-tune GFMs robustly and efficiently to enable trustworthy knowledge transfer across domains and tasks is the major challenge. In this paper, we propose GRAVER, a novel Generative gRAph VocabulariEs for Robust GFM fine-tuning framework that tackles the aforementioned instability via generative augmentations. Specifically, to identify transferable units, we analyze and extract key class-specific subgraph patterns by ego-graph disentanglement and validate their transferability both theoretically and empirically. To enable effective pre-training across diverse domains, we leverage a universal task template based on ego-graph similarity and construct graph vocabularies via graphon-based generative experts. To facilitate robust and efficient prompt fine-tuning, we grave the support samples with in-context vocabularies, where the lightweight MoE-CoE network attentively routes knowledge from source domains. Extensive experiments demonstrate the superiority of GRAVER in effectiveness, robustness, and efficiency on downstream few-shot node and graph classification tasks compared with 15 state-of-the-art baselines.

View full details

Poster

NegoCollab: A Common Representation Negotiation Approach for Heterogeneous Collaborative Perception

CONGZHANG SHAO · Quan Yuan · Guiyang Luo · Yue Hu · Danni Wang · Liu Yilin · Rui Pan · Bo Chen · Jinglin Li

Dec 3, 11:00 AM - 2:00 PM Don Alberto 4

Collaborative perception expands the perception range by sharing information among agents, effectively improving task performance. Immutable heterogeneity poses a significant challenge in collaborative perception, as participating agents may employ different and fixed perception models. This leads to domain gaps in the intermediate features shared among agents, consequently degrading collaborative performance. Aligning the features of all agents to a common representation can eliminate domain gaps with low training cost. However, in existing methods, the common representation is designated as the representation of a specific agent, making it difficult for agents with significant domain discrepancies from this specific agent to achieve proper alignment. This paper proposes NegoCollab, a heterogeneous collaboration method based on negotiated common representation. It achieves bidirectional transformation of each modality's features between local representation space and common representation space through paired sender-receiver, thereby eliminating domain gaps. The common representation in NegoCollab is negotiated from local representations of each modality's agent via a negotiator introduced during training, effectively reducing inherent domain discrepancies with each local representation. Furthermore, to better align local representations with the multimodal common representation, we introduce both structural alignment loss and pragmatic alignment loss alongside the conventional distribution alignment loss during supervised training, enabling comprehensive knowledge distillation from the common representation to the senders. The experimental results demonstrate that NegoCollab significantly outperforms existing methods in common representation-based collaboration approaches. 
The negotiation-based mechanism for acquiring common representations provides more diverse and reliable alternatives for establishing the common representations required in heterogeneous collaborative perception.

View full details

Poster

Majority of the Bests: Improving Best-of-N via Bootstrapping

Amin Rakhsha · Kanika Madan · Tianyu Zhang · Amir-massoud Farahmand · Amir Khasahmadi

Dec 3, 11:00 AM - 2:00 PM Don Alberto 4

Sampling multiple outputs from a Large Language Model (LLM) and selecting the most frequent (Self-consistency) or highest-scoring (Best-of-N) candidate is a popular approach to achieve higher accuracy in tasks with discrete final answers. Best-of-N (BoN) selects the output with the highest reward, and with perfect rewards, it often achieves near-perfect accuracy. With imperfect rewards from reward models, however, BoN fails to reliably find the correct answer and its performance degrades drastically. We consider the distribution of BoN’s outputs and highlight that, although the correct answer does not usually have a probability close to one under imperfect rewards, it is often the most likely outcome. This suggests that the mode of this distribution can be more reliably correct than a sample from it. Based on this idea, we propose Majority-of-the-Bests (MoB), a novel selection mechanism that estimates the output distribution of BoN via bootstrapping and selects its mode. Experimental results across five benchmarks, three different base LLMs, and two reward models demonstrate consistent improvements over BoN in 25 out of 30 setups. We also provide theoretical results for the consistency of the bootstrapping. MoB serves as a simple, yet strong alternative to BoN and self-consistency, and more broadly, motivates further research in more nuanced selection mechanisms.
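The selection rule described above (estimate the distribution of BoN's output by bootstrapping, then return its mode) can be sketched as follows. This is a minimal sketch, not the authors' implementation: it assumes each sampled output has already been reduced to a discrete final answer with a scalar reward, and the names `best_of_n` and `majority_of_the_bests`, the subset size `m`, and the resample count `num_bootstrap` are illustrative assumptions.

```python
import random
from collections import Counter

def best_of_n(answers, rewards):
    """Return the answer whose sample has the highest reward."""
    best_idx = max(range(len(answers)), key=lambda i: rewards[i])
    return answers[best_idx]

def majority_of_the_bests(answers, rewards, m, num_bootstrap=1000, seed=0):
    """Approximate the mode of the BoN output distribution by bootstrapping.

    Each round draws m samples with replacement from the N available
    (answer, reward) pairs, runs BoN on that subset, and records the winner.
    The most frequent winner across rounds is returned.
    """
    rng = random.Random(seed)
    n = len(answers)
    winners = Counter()
    for _ in range(num_bootstrap):
        idx = [rng.randrange(n) for _ in range(m)]
        winner = best_of_n([answers[i] for i in idx],
                           [rewards[i] for i in idx])
        winners[winner] += 1
    return winners.most_common(1)[0][0]
```

As an illustration, take six samples that share one answer at moderate reward and a single sample with a different answer that the reward model scores highest. Plain BoN returns the outlier, whereas bootstrap subsets of size 3 omit it most of the time, so the mode tends to be the repeated answer.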

View full details

Poster

Tight Lower Bounds and Improved Convergence in Performative Prediction

Pedram Khorsandi · Rushil Gupta · Mehrnaz Mofakhami · Simon Lacoste-Julien · Gauthier Gidel

Dec 3, 11:00 AM - 2:00 PM Don Alberto 4

Performative prediction is a framework accounting for the shift in the data distribution induced by the prediction of a model deployed in the real world. Ensuring convergence to a stable solution (one at which the post-deployment data distribution no longer changes) is crucial in settings where model predictions can influence future data. This paper, for the first time, extends the Repeated Risk Minimization (RRM) algorithm class by utilizing historical datasets from previous retraining snapshots, yielding a class of algorithms that we call Affine Risk Minimizers that converges to a performatively stable point for a broader class of problems. We introduce a new upper bound for methods that use only the final iteration of the dataset and prove for the first time the tightness of both this new bound and the previously existing bounds within the same regime. We also prove that our new algorithm class can surpass the lower bound for standard RRM, thus breaking the prior lower bound, and empirically observe faster convergence to the stable point on various performative prediction benchmarks. We offer at the same time the first lower bound analysis for RRM within the class of Affine Risk Minimizers, quantifying the potential improvements in convergence speed that could be achieved with other variants in our scheme.
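Standard RRM, the baseline that the abstract extends, can be illustrated on a toy problem. The sketch below is an assumption-laden example rather than anything from the paper: the data distribution is taken to be a Gaussian whose mean shifts linearly with the deployed parameter, $\mathcal{N}(\mu + \varepsilon\theta, 1)$, and the loss is squared error, so each retraining step reduces to a sample mean. The names `mu`, `eps`, and `repeated_risk_minimization` are hypothetical, and the Affine Risk Minimizers proposed in the paper (which also reuse earlier datasets) are not implemented.

```python
import numpy as np

def repeated_risk_minimization(mu=1.0, eps=0.5, steps=50,
                               n_samples=10_000, seed=0):
    """Run plain RRM on a toy location family.

    Deploying theta induces data z ~ N(mu + eps * theta, 1). Retraining on
    that data with squared loss gives theta_next = mean(z). For |eps| < 1
    the iterates contract toward the stable point mu / (1 - eps).
    """
    rng = np.random.default_rng(seed)
    theta = 0.0
    for _ in range(steps):
        # Data distribution induced by the currently deployed model.
        z = rng.normal(loc=mu + eps * theta, scale=1.0, size=n_samples)
        # Risk minimizer for squared loss on the latest dataset only.
        theta = z.mean()
    return theta
```

In this toy setting the fixed point solves $\theta = \mu + \varepsilon\theta$, so the iterates settle near $\mu / (1 - \varepsilon)$, up to sampling noise.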

View full details

Poster

Efficient Knowledge Transfer in Federated Recommendation for Joint Venture Ecosystem

Yichen Li · Yijing Shan · YI LIU · Haozhao Wang · Cheng Wang · wangshi.ww · Yi Wang · Ruixuan Li

Dec 3, 4:30 PM - 7:30 PM Don Alberto 4

The current Federated Recommendation System (FedRS) focuses on personalized recommendation services and assumes clients are personalized IoT devices (e.g., mobile phones). In this paper, we dive deeply into new but practical FedRS applications within the joint venture ecosystem. Subsidiaries engage as participants with their users and items. However, in such a situation, merely exchanging item embeddings is insufficient, as user bases always exhibit both overlaps and exclusive segments, demonstrating the complexity of user information. Meanwhile, directly uploading user information is a violation of privacy and unacceptable. To tackle the above challenges, we propose an efficient and privacy-enhanced federated recommendation method for the joint venture ecosystem (FR-JVE), in which each client transfers more common knowledge from other clients with a distilled user rating preference from the local dataset. More specifically, we first transform the local data into a new format and apply model inversion techniques to distill the rating preference with frozen user gradients before the federated training. Then, a bridge function is employed on each client side to align the local rating preference and aggregated global preference in a privacy-friendly manner. Finally, each client matches similar users to make a better prediction for overlapped users. From a theoretical perspective, we analyze how effectively FR-JVE can guarantee user privacy. Empirically, we show that FR-JVE achieves superior performance compared to state-of-the-art methods.

View full details

Poster

No Object Is an Island: Enhancing 3D Semantic Segmentation Generalization with Diffusion Models

Fan Li · Xuan Wang · Xuanbin Wang · Zhaoxiang Zhang · Yuelei Xu

Dec 3, 4:30 PM - 7:30 PM Don Alberto 4

Enhancing the cross-domain generalization of 3D semantic segmentation is a pivotal task in computer vision that has recently gained increasing attention. Most existing methods, whether using consistency regularization or cross-modal feature fusion, focus solely on individual objects while overlooking implicit semantic dependencies among them, resulting in the loss of useful semantic information. Inspired by the diffusion model's ability to flexibly compose diverse objects into high-quality images across varying domains, we seek to harness its capacity for capturing underlying contextual distributions and spatial arrangements among objects to address the challenging task of cross-domain 3D semantic segmentation. In this paper, we propose a novel cross-modal learning framework based on diffusion models to enhance the generalization of 3D semantic segmentation, named XDiff3D. XDiff3D comprises three key ingredients: (1) constructing object agent queries from diffusion features to aggregate instance semantic information; (2) decoupling fine-grained local details from object agent queries to prevent interference with 3D semantic representation; (3) leveraging object agent queries as an interface to enhance the modeling of object semantic dependencies in 3D representations. Extensive experiments validate the effectiveness of our method, achieving state-of-the-art performance across multiple benchmarks in different task settings. Code is available at https://github.com/FanLiHub/XDiff3D.

View full details

Poster

VPO: Reasoning Preferences Optimization Based on $\mathcal{V}$-Usable Information

Zecheng Wang · Chunshan Li · Yupeng Zhang · Han Liu · Bingning Wang · Dianhui Chu · Dianbo Sui

Dec 3, 4:30 PM - 7:30 PM Don Alberto 4

Direct Preference Optimization (DPO) is a widely used preference optimization algorithm in large language model (LLM) alignment, which reparameterizes the reward function in reinforcement learning with human feedback (RLHF) without requiring a separate reward model. However, during the DPO training process, when a large negative gradient is applied to low-confidence samples, LLMs with a softmax output head tend to squeeze the confidence in the model's output distribution towards the highest-confidence sentence, which may lead to a decrease in the confidence of both preference and non-preference samples, while increasing the confidence of unrelated tokens. This phenomenon becomes more complex in reasoning tasks. In this work, focusing on reasoning tasks, we propose VPO, a negative gradient constraint method for human non-preference samples based on $\mathcal{V}$-usable information. By using $\mathcal{V}$-usable information to measure the similarity between preference pairs and selectively constrain the negative gradient, VPO can alleviate the squeezing effect of DPO, enhance alignment with the generation objective, and maintain the model's ability to distinguish between preference and non-preference samples. We compare VPO with DPO and its latest variants on mathematical reasoning tasks using the Llama 3.1 and Qwen 2.5 series, including both Base and Instruct models. Our results demonstrate that VPO consistently and significantly outperforms existing methods. Specifically, on Qwen2.5-7B-Base, VPO achieves 7.80% and 13.25% improvement over DPO on MATH500 and AMC23, respectively. We also conduct ablation experiments and in-depth analysis on VPO to explain its effectiveness and rationale.
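For reference, the standard DPO objective that the abstract builds on is reproduced below, with $y_w$ the preferred response, $y_l$ the non-preferred response, $\pi_{\mathrm{ref}}$ the frozen reference policy, $\sigma$ the logistic function, and $\beta$ a temperature. The "negative gradient" discussed above arises from the second term, which pushes down the likelihood of $y_l$. This is the baseline loss only; VPO's constrained variant is not specified in the abstract and is not shown.

```latex
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}})
  = -\,\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}
    \left[ \log \sigma\!\left(
      \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
      - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
    \right) \right]
```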

View full details

Poster

Tighter CMI-Based Generalization Bounds via Stochastic Projection and Quantization

Milad Sefidgaran · Kimia Nadjahi · Abdellatif Zaidi

Dec 3, 4:30 PM - 7:30 PM Don Alberto 4

In this paper, we leverage stochastic projection and lossy compression to establish new conditional mutual information (CMI) bounds on the generalization error of statistical learning algorithms. It is shown that these bounds are generally tighter than the existing ones. In particular, we prove that for certain problem instances for which existing MI and CMI bounds were recently shown in Attias et al. [2024] and Livni [2023] to become vacuous or fail to describe the right generalization behavior, our bounds yield suitable generalization guarantees of the order of $\mathcal{O}(1/\sqrt{n})$, where $n$ is the size of the training dataset. Furthermore, we use our bounds to investigate the problem of data "memorization" raised in those works, which asserts that there are learning problem instances for which, for any learning algorithm that has good prediction, there exist distributions under which the algorithm must "memorize" a big fraction of the training dataset. We show that for every learning algorithm, there exists an auxiliary algorithm that does not memorize and which yields comparable generalization error for any data distribution. In part, this shows that memorization is not necessary for good generalization.

View full details

Poster

Diffusion-Guided Graph Data Augmentation

Maria Marrium · Arif Mahmood · Muhammad Haris Khan · M. Shakeel · Wenxiong Kang

Dec 3, 4:30 PM - 7:30 PM Don Alberto 4

Graph Neural Networks (GNNs) have achieved remarkable success in a wide range of applications. However, when trained on limited or low-diversity datasets, GNNs are prone to overfitting and memorization, which impacts their generalization. To address this, graph data augmentation (GDA) has become a crucial task to enhance the performance and generalization of GNNs. Traditional GDA methods employ simple transformations that result in limited performance gains. Although recent diffusion-based augmentation methods offer improved results, they are sparse, task-specific, and constrained by class labels. In this work, we propose a more general and effective diffusion-based GDA framework that is task-agnostic and label-free. For better training stability and reduced computational cost, we employ a graph variational auto-encoder (GVAE) to learn a compact latent graph representation. A diffusion model is used in the learned latent space to generate both consistent and diverse augmentations. For a fixed augmentation budget, our algorithm selects a subset of samples that would benefit the most from the augmentation. To further improve performance, we also perform test-time augmentation, leveraged by the label-free nature of our method. Thanks to the efficient utilization of GVAE and latent diffusion, our algorithm significantly enhances machine learning safety measures, including calibration, robustness to corruptions, and prediction consistency. Moreover, our method has shown improved robustness against four types of adversarial attacks and achieves better generalization performance. To demonstrate the effectiveness of the proposed method, we compare it with 30 existing methods on 12 benchmark datasets across node classification, link prediction, and graph classification in various learning settings, including semi-supervised, supervised, and long-tailed data distributions. The code will soon be made publicly available.

View full details

Poster

3DID: Direct 3D Inverse Design for Aerodynamics with Physics-Aware Optimization

Yuze Hao · Linchao Zhu · Yi Yang

Dec 3, 4:30 PM - 7:30 PM Don Alberto 4

Inverse design aims to design the input variables of a physical system to optimize a specified objective function, typically formulated as a search or optimization problem. However, in 3D domains, the design space grows exponentially, rendering exhaustive grid-based searches infeasible. Recent advances in deep learning have accelerated inverse design by providing powerful generative priors and differentiable surrogate models. Nevertheless, current methods tend to approximate the 3D design space using 2D projections or fine-tune existing 3D shapes. These approaches sacrifice volumetric detail and constrain design exploration, preventing true 3D design from scratch. In this paper, we propose a 3D Inverse Design (3DID) framework that directly navigates the 3D design space by coupling a continuous latent representation with a physics-aware optimization strategy. We first learn a unified physics–geometry embedding that compactly captures shape and physical field data in a continuous latent space. Then, we introduce a two-stage strategy to perform physics-aware optimization. In the first stage, a gradient-guided diffusion sampler explores the global latent manifold. In the second stage, an objective-driven, topology-preserving refinement further sculpts each candidate toward the target objective. This enables 3DID to generate high-fidelity 3D geometries, outperforming existing methods in both solution quality and design versatility.

View full details

Poster

An Evidence-Based Post-Hoc Adjustment Framework for Anomaly Detection Under Data Contamination

Sukanya Patra · Souhaib Ben Taieb

Dec 3, 4:30 PM - 7:30 PM Don Alberto 4

Unsupervised anomaly detection (AD) methods typically assume clean training data, yet real-world datasets often contain undetected or mislabeled anomalies, leading to significant performance degradation. Existing solutions require access to the training pipelines, data, or prior knowledge of the proportions of anomalies in the data, limiting their real-world applicability. To address this challenge, we propose EPHAD, a simple yet effective test-time adaptation framework that updates the outputs of AD models trained on contaminated datasets using evidence gathered at test time. Our approach integrates the prior knowledge captured by the AD model trained on contaminated datasets with evidence derived from multimodal foundation models like Contrastive Language-Image Pre-training (CLIP), classical AD methods like the Local Outlier Factor, or domain-specific knowledge. We illustrate the intuition behind EPHAD using a synthetic toy example and validate its effectiveness through comprehensive experiments across eight visual AD datasets, twenty-six tabular AD datasets, and a real-world industrial AD dataset. Additionally, we conduct an ablation study to analyse hyperparameter influence and robustness to varying contamination levels, demonstrating the versatility and robustness of EPHAD across diverse AD models and evidence pairs. To ensure reproducibility, our code is publicly available at https://github.com/sukanyapatra1997/EPHAD.

View full details

Poster

NAUTILUS: A Large Multimodal Model for Underwater Scene Understanding

Wei Xu · Cheng Wang · Dingkang Liang · Zongchuang Zhao · Xingyu Jiang · Peng Zhang · Xiang Bai

Dec 3, 4:30 PM - 7:30 PM Don Alberto 4

Underwater exploration offers critical insights into our planet and attracts increasing attention for its broader applications in resource exploration, national security, etc. We study the underwater scene understanding methods, which aim to achieve automated underwater exploration. The underwater scene understanding task demands multi-task perceptions from multiple granularities. However, the absence of large-scale underwater multi-task instruction-tuning datasets hinders the progress of this research. To bridge this gap, we construct NautData, a dataset containing 1.45 M image-text pairs supporting eight underwater scene understanding tasks. It enables the development and thorough evaluation of the underwater scene understanding models. Underwater image degradation is a widely recognized challenge that interferes with underwater tasks. To improve the robustness of underwater scene understanding, we introduce physical priors derived from underwater imaging models and propose a plug-and-play vision feature enhancement (VFE) module, which explicitly restores clear underwater information. We integrate this module into renowned baselines LLaVA-1.5 and Qwen2.5-VL and build our underwater LMM, NAUTILUS. Experiments conducted on the NautData and public underwater datasets demonstrate the effectiveness of the VFE module, consistently improving the performance of both baselines on the majority of supported tasks, thus ensuring the superiority of NAUTILUS in the underwater scene understanding area. Data and models are available at https://github.com/H-EmbodVis/NAUTILUS.

View full details

Poster

Visual Structures Help Visual Reasoning: Addressing the Binding Problem in LVLMs

Amirmohammad Izadi · Mohammadali Banayeeanzade · Fatemeh Askari · Ali Rahimiakbar · Mohammad Vahedi · Hosein Hasani · Mahdieh Soleymani

Dec 3, 4:30 PM - 7:30 PM Don Alberto 4

Despite progress in Large Vision-Language Models (LVLMs), their capacity for visual reasoning is often limited by the binding problem: the failure to reliably associate perceptual features with their correct visual referents. This limitation underlies persistent errors in tasks such as counting, visual search, scene description, and spatial relationship understanding. A key factor is that current LVLMs process visual features largely in parallel, lacking mechanisms for spatially grounded, serial attention. This paper introduces Visual Input Structure for Enhanced Reasoning (VISER), a simple, effective method that augments visual inputs with low-level spatial structures and pairs them with a textual prompt that encourages sequential, spatially-aware parsing. We empirically demonstrate substantial performance improvements across core visual reasoning tasks, using only a single-query inference. Specifically, VISER improves GPT-4o performance on visual search, counting, and spatial relationship tasks by 25.0%, 26.8%, and 9.5%, respectively, and reduces edit distance error in scene description by 0.32 on 2D datasets. Furthermore, we find that the visual modification is essential for these gains; purely textual strategies, including Chain-of-Thought prompting, are insufficient and can even degrade performance. VISER underscores the importance of visual input design over purely linguistically based reasoning strategies and suggests that visual structuring is a powerful and general approach for enhancing compositional and spatial reasoning in LVLMs.

View full details

Poster

Transcending Cost-Quality Tradeoff in Agent Serving via Session-Awareness

Yanyu Ren · Li Chen · Dan Li · Xizheng Wang · Zhiyuan Wu · Yukai Miao · Yu Bai

Dec 3, 4:30 PM - 7:30 PM Don Alberto 4

Large Language Model (LLM) agents are capable of task execution across various domains by autonomously interacting with environments and refining LLM responses based on feedback. However, existing model serving systems are not optimized for the unique demands of serving agents. Compared to classic model serving, agent serving has different characteristics: predictable request pattern, increasing quality requirement, and unique prompt formatting. We identify a key problem for agent serving: LLM serving systems lack session-awareness. They neither perform effective KV cache management nor precisely select the cheapest yet competent model in each round. This leads to a cost-quality tradeoff, and we identify an opportunity to surpass it in an agent serving system. To this end, we introduce AgServe for AGile AGent SERVing. AgServe features a session-aware server that boosts KV cache reuse via Estimated-Time-of-Arrival-based eviction and in-place positional embedding calibration, a quality-aware client that performs session-aware model cascading through real-time quality assessment, and a dynamic resource scheduler that maximizes GPU utilization. With AgServe, we allow agents to select and upgrade models during the session lifetime, and to achieve similar quality at much lower costs, effectively transcending the tradeoff. Extensive experiments on real testbeds demonstrate that AgServe (1) achieves response quality comparable to GPT-4o at 16.5% of the cost, and (2) delivers a 1.8$\times$ improvement in quality relative to the tradeoff curve.

View full details

Poster

Don’t Give Up on Democratizing AI for the Wrong Reasons

Annette Zimmermann · Andrew Zeppa · Srijan Pandey · Kenneth Diao

Dec 3, 4:30 PM - 7:30 PM Don Alberto 4

The claim that the AI community, or society at large, should ‘democratize AI’ has attracted considerable critical attention and controversy. Two core problems have arisen and remain unsolved: conceptual disagreement persists about what democratizing AI means; normative disagreement persists over whether democratizing AI is ethically and politically desirable. We identify eight common AI democratization traps: democratization-skeptical arguments that seem plausible at first glance, but turn out to be misconceptions. We develop arguments about how to resist each trap. We conclude that, while AI democratization may well have drawbacks, we should be cautious about dismissing AI democratization prematurely and for the wrong reasons. We offer a constructive roadmap for developing alternative conceptual and normative approaches to democratizing AI that successfully avoid the traps.

View full details

Poster

PolyMath: Evaluating Mathematical Reasoning in Multilingual Contexts

Yiming Wang · Pei Zhang · Jialong Tang · Hao-Ran Wei · Baosong Yang · Rui Wang · Chenshu Sun · Feitong Sun · Jiran Zhang · Junxuan Wu · Qiqian Cang · Yichang Zhang · Fei Huang · Junyang Lin · Fei Huang · Jingren Zhou

Dec 3, 4:30 PM - 7:30 PM Don Alberto 4

In this paper, we introduce PolyMath, a multilingual mathematical reasoning benchmark covering 18 languages and 4 easy-to-hard difficulty levels. Our benchmark ensures difficulty comprehensiveness, language diversity, and high-quality translation, making it a highly discriminative multilingual mathematical benchmark in the era of reasoning LLMs. We conduct a comprehensive evaluation for advanced LLMs and find that even Qwen-3-235B-A22B-Thinking and Gemini-2.5-pro achieve only 54.6 and 52.2 benchmark scores, with about 40% accuracy under the highest level. From a language perspective, our benchmark reveals several key challenges of LLMs in multilingual reasoning: (1) Reasoning performance varies widely across languages for current LLMs; (2) Input-output language consistency is low in reasoning LLMs and may be correlated with performance; (3) The thinking length differs significantly by language for current LLMs. Additionally, we demonstrate that controlling the output language in the instructions has the potential to affect reasoning performance, especially for some low-resource languages, suggesting a promising direction for improving multilingual capabilities in LLMs.

View full details

Poster

PolypSense3D: A Multi-Source Benchmark Dataset for Depth-Aware Polyp Size Measurement in Endoscopy

Ruyu Liu · Lin Wang · Zhou Mingming · Jianhua Zhang · ZHANG HAOYU · Xiufeng Liu · Xu Cheng · Sixian Chan · Shen yanbin · Dai Sheng · Yuping Yan · Yaochu Jin · Lingjuan Lyu

Dec 3, 4:30 PM - 7:30 PM Don Alberto 4

Accurate polyp sizing during endoscopy is crucial for cancer risk assessment but is hindered by subjective methods and inadequate datasets lacking integrated 2D appearance, 3D structure, and real-world size information. We introduce PolypSense3D, the first multi-source benchmark dataset specifically targeting depth-aware polyp size measurement. It uniquely integrates over 43,000 frames from virtual simulations, physical phantoms, and clinical sequences, providing synchronized RGB, dense/sparse depth, segmentation masks, camera parameters, and millimeter-scale size labels derived via a novel forceps-assisted in-vivo annotation technique. To establish its value, we benchmark state-of-the-art segmentation and depth estimation models. Results quantify significant domain gaps between simulated/phantom and clinical data and reveal substantial error propagation from perception stages to final size estimation, with the best fully automated pipelines achieving an average Mean Absolute Error (MAE) of 0.95 mm on the clinical data subset. Publicly released under CC BY-SA 4.0 with code and evaluation protocols, PolypSense3D offers a standardized platform to accelerate research in robust, clinically relevant quantitative endoscopic vision. The benchmark dataset and code are available at: https://github.com/HNUicda/PolypSense3D and https://doi.org/10.7910/DVN/K13H89.

View full details

Poster

OpenGU: A Comprehensive Benchmark for Graph Unlearning

Bowen Fan · Yuming Ai · Xunkai Li · Zhilin Guo · LEI ZHU · Guang Zeng · Rong-Hua Li · Guoren Wang

Dec 3, 4:30 PM - 7:30 PM Don Alberto 4

Graph Machine Learning is essential for understanding and analyzing relational data. However, privacy-sensitive applications demand the ability to efficiently remove sensitive information from trained graph neural networks (GNNs), avoiding the unnecessary time and space overhead caused by retraining models from scratch. To address this issue, Graph Unlearning (GU) has emerged as a critical solution to support dynamic graph updates while ensuring privacy compliance. Unlike machine unlearning in computer vision or other fields, GU faces unique difficulties due to the non-Euclidean nature of graph data and the recursive message-passing mechanism of GNNs. Additionally, the diversity of downstream tasks and the complexity of unlearning requests further amplify these challenges. Despite the proliferation of diverse GU strategies, the absence of a benchmark providing fair comparisons for GU, and the limited flexibility in combining downstream tasks and unlearning requests, have yielded inconsistencies in evaluations, hindering the development of this domain. To fill this gap, we present OpenGU, the first GU benchmark, where 16 SOTA GU algorithms and 37 multi-domain datasets are integrated, enabling various downstream tasks with 13 GNN backbones when responding to flexible unlearning requests. Through extensive experimentation, we have drawn 10 crucial conclusions about existing GU methods, while also gaining valuable insights into their limitations, shedding light on potential avenues for future research. Our code is available at https://github.com/bwfan-bit/OpenGU.

View full details

Poster

World Models as Reference Trajectories for Rapid Motor Adaptation

Carlos Stein Brito · Daniel McNamee

Dec 3, 4:30 PM - 7:30 PM Don Alberto 4

Learned control policies often fail when deployed in real-world environments with changing dynamics. When system dynamics shift unexpectedly, performance degrades until models are retrained on new data. We introduce Reflexive World Models (RWM), a dual control framework that uses world model predictions as implicit reference trajectories for rapid adaptation. Our method separates the control problem into long-term reward maximization through reinforcement learning and robust motor execution through reward-free rapid control in latent space. This dual architecture achieves significantly faster adaptation with low online computational cost compared to model-based RL baselines, while maintaining near-optimal performance. The approach combines the benefits of flexible policy learning through reinforcement learning with rapid error correction capabilities, providing a theoretically grounded method for maintaining performance in high-dimensional continuous control tasks under varying dynamics.

View full details

Poster

SYMPHONY: Synergistic Multi-agent Planning with Heterogeneous Language Model Assembly

Wei Zhu · Zhiwen Tang · Kun Yue

Dec 3, 4:30 PM - 7:30 PM Don Alberto 4

Recent advancements have increasingly focused on leveraging large language models (LLMs) to construct autonomous agents for complex problem-solving tasks. However, existing approaches predominantly employ a single-agent framework to generate search branches and estimate rewards during Monte Carlo Tree Search (MCTS) planning. This single-agent paradigm inherently limits exploration capabilities, often resulting in insufficient diversity among generated branches and suboptimal planning performance. To overcome these limitations, we propose $\textbf{SY}$nergistic $\textbf{M}$ulti-agent $\textbf{P}$lanning with $\textbf{H}$eter$\textbf{O}$geneous la$\textbf{N}$guage model assembl$\textbf{Y}$ ($\textbf{SYMPHONY}$), a novel multi-agent planning framework that integrates a pool of heterogeneous language model-based agents. By leveraging diverse reasoning patterns across agents, SYMPHONY enhances rollout diversity and facilitates more effective exploration. Empirical results across multiple benchmark tasks show that SYMPHONY achieves strong performance even when instantiated with open-source LLMs deployable on consumer-grade hardware. When enhanced with cloud-based LLMs accessible via API, SYMPHONY demonstrates further improvements, outperforming existing state-of-the-art baselines and underscoring the effectiveness of heterogeneous multi-agent coordination in planning tasks.

View full details

Poster

Learning to Plan Like the Human Brain via Visuospatial Perception and Semantic-Episodic Synergistic Decision-Making

Tianyuan Jia · Ziyu Li · Qing Li · Xiuxing Li · Xiang Li · Chen Wei · Li Yao · Xia Wu

Dec 3, 4:30 PM - 7:30 PM Don Alberto 4

Motion planning in high-dimensional continuous spaces remains challenging due to complex environments and computational constraints. Although learning-based planners, especially graph neural network (GNN)-based ones, have significantly improved planning performance, they still struggle with inaccurate graph construction and limited structural reasoning, constraining search efficiency and path quality. The human brain exhibits efficient planning through a two-stage Perception-Decision model. First, egocentric spatial representations are constructed from visual and proprioceptive input, and then semantic–episodic synergy is leveraged to support decision-making under uncertainty. Inspired by this process, we propose NeuroMP, a brain-inspired planning framework that learns to plan like the human brain. NeuroMP integrates a Perceptive Segment Selector, inspired by visuospatial perception, to construct safer graphs, and a Global Alignment Heuristic that guides search in weakly connected graphs by modeling semantic-episodic synergistic decision-making. Experimental results demonstrate that NeuroMP significantly outperforms existing planning methods in efficiency and quality while maintaining a high success rate.

View full details

Poster

Dr. RAW: Towards General High-Level Vision from RAW with Efficient Task Conditioning

Wenjun Huang · Ziteng Cui · Yinqiang Zheng · Yirui He · Tatsuya Harada · Mohsen Imani

Dec 3, 4:30 PM - 7:30 PM Don Alberto 4

We introduce Dr. RAW, a unified and tuning-efficient framework for high-level computer vision tasks directly operating on camera RAW data. Unlike previous approaches that optimize image signal processing (ISP) pipelines and fully fine-tune networks for each task, Dr. RAW achieves state-of-the-art performance with minimal parameter updates. At the input stage, we apply lightweight pre-processing modules, sensor and illumination mapping, followed by re-mosaicing, to mitigate data inconsistencies stemming from sensor variation and lighting. At the network level, we introduce task-specific adaptation through two modules: Sensor Prior Prompts (SPP) and Low-Rank Adaptation (LoRA). SPP injects sensor-aware conditioning into the network via learnable prompts derived from imaging priors, while LoRA enables efficient task-specific tuning by updating only low-rank matrices in key backbone layers. Despite minimal tuning, our method delivers superior results across four RAW-based tasks (object detection, semantic segmentation, instance segmentation, and pose estimation) on nine datasets encompassing low-light and over-exposed conditions. By harnessing the intrinsic physical cues of RAW data alongside parameter-efficient techniques, our method advances RAW-based vision systems, achieving both high accuracy and computational economy. We will release our source code.

View full details

Poster

PlanarGS: High-Fidelity Indoor 3D Gaussian Splatting Guided by Vision-Language Planar Priors

Xirui Jin · Renbiao Jin · Boying Li · Danping Zou · Wenxian Yu

Dec 3, 4:30 PM - 7:30 PM Don Alberto 4

Three-dimensional Gaussian Splatting (3DGS) has recently emerged as an efficient representation for novel-view synthesis, achieving impressive visual quality. However, in scenes dominated by large and low-texture regions, common in indoor environments, the photometric loss used to optimize 3DGS yields ambiguous geometry and fails to recover high-fidelity 3D surfaces. To overcome this limitation, we introduce PlanarGS, a 3DGS-based framework tailored for indoor scene reconstruction. Specifically, we design a pipeline for Language-Prompted Planar Priors (LP3) that employs a pretrained vision-language segmentation model and refines its region proposals via cross-view fusion and inspection with geometric priors. 3D Gaussians in our framework are optimized with two additional terms: a planar prior supervision term that enforces planar consistency, and a geometric prior supervision term that steers the Gaussians toward the depth and normal cues. We have conducted extensive experiments on standard indoor benchmarks. The results show that PlanarGS reconstructs accurate and detailed 3D surfaces, consistently outperforming state-of-the-art methods by a large margin. Project page: https://planargs.github.io

View full details

Poster

Sampling-Efficient Test-Time Scaling: Self-Estimating the Best-of-N Sampling in Early Decoding

Yiming Wang · Pei Zhang · Siyuan Huang · Baosong Yang · Zhuosheng Zhang · Fei Huang · Rui Wang

Dec 3, 4:30 PM - 7:30 PM Don Alberto 4

Test-time scaling enhances large language model performance by allocating additional compute resources during decoding. Best-of-$N$ (BoN) sampling serves as a common sampling-based scaling technique, broadening the search space in parallel to find better solutions from the model distribution. However, its cost–performance trade-off is still underexplored. Two main challenges limit the efficiency of BoN sampling: (1) Generating $N$ full samples consumes substantial GPU memory, reducing inference capacity under limited resources. (2) Reward models add extra memory and latency overhead, and training strong reward models introduces potential training data costs. Although some studies have explored efficiency improvements, none have addressed both challenges at once. To address this gap, we propose **Self-Truncation Best-of-$N$ (ST-BoN)**, a decoding method that avoids fully generating all $N$ samples and eliminates the need for reward models. It leverages early sampling consistency in the model’s internal states to identify the most promising path and truncate suboptimal ones. In terms of cost, ST-BoN reduces dynamic GPU memory usage by over 80% and inference latency by 50%. In terms of cost–performance trade-off, ST-BoN achieves the same performance as Full-BoN while saving computational cost by 70%–80%, and under the same cost, it can improve accuracy by 3–4 points.
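
For orientation, the Full-BoN baseline that the abstract compares against can be sketched in a few lines. This is a minimal illustration of plain Best-of-$N$ selection, not of ST-BoN itself: the `generate` and `score` callables and the toy stand-ins below are hypothetical, and a real system would sample from a language model and score with a reward model.

```python
import random

def best_of_n(generate, score, n, seed=0):
    """Draw n candidates from `generate` and return the highest-scoring one."""
    rng = random.Random(seed)
    candidates = [generate(rng) for _ in range(n)]
    return max(candidates, key=score)

# Toy stand-ins (assumptions): a "sample" is a number in [0, 1),
# and the "reward" prefers values close to 0.5.
def toy_generate(rng):
    return rng.random()

def toy_score(x):
    return -abs(x - 0.5)

best = best_of_n(toy_generate, toy_score, n=8)
```

Every one of the `n` candidates is generated in full and scored externally here, which is the memory and reward-model overhead the abstract says ST-BoN is designed to avoid.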

View full details

Poster

Empirical Study on Robustness and Resilience in Cooperative Multi-Agent Reinforcement Learning

Simin Li · Zihao Mao · Hanxiao Li · Zonglei Jing · Zhuohang bian · Jun Guo · Li Wang · Zhuoran Han · Ruixiao Xu · Xin Yu · Chengdong Ma · Yuqing Ma · Bo An · Yaodong Yang · Weifeng Lv · Xianglong Liu

Dec 3, 4:30 PM - 7:30 PM Don Alberto 4

In cooperative Multi-Agent Reinforcement Learning (MARL), it is a common practice to tune hyperparameters in ideal simulated environments to maximize cooperative performance. However, policies tuned for cooperation often fail to maintain robustness and resilience under real-world uncertainties. Building trustworthy MARL systems requires a deep understanding of robustness, which ensures stability under uncertainties, and resilience, the ability to recover from disruptions, a concept extensively studied in control systems but largely overlooked in MARL. In this paper, we present a large-scale empirical study comprising over 82,620 experiments to evaluate cooperation, robustness, and resilience in MARL across 4 real-world environments, 13 uncertainty types, and 15 hyperparameters. Our key findings are: (1) Under mild uncertainty, optimizing cooperation improves robustness and resilience, but this link weakens as perturbations intensify. Robustness and resilience also vary by algorithm and uncertainty type. (2) Robustness and resilience do not generalize across uncertainty modalities or agent scopes: policies robust to action noise for all agents may fail under observation noise on a single agent. (3) Hyperparameter tuning is critical for trustworthy MARL: surprisingly, standard practices like parameter sharing, GAE, and PopArt can hurt robustness, while early stopping, high critic learning rates, and Leaky ReLU consistently help. By optimizing hyperparameters only, we observe substantial improvement in cooperation, robustness and resilience across all MARL backbones, with the phenomenon also generalizing to robust MARL methods across these backbones.

View full details

Poster

Statistical Parity with Exponential Weights

Stephen Pasteris · Chris Hicks · Vasilios Mavroudis

Dec 3, 4:30 PM - 7:30 PM Don Alberto 4

Statistical parity is one of the most foundational constraints in algorithmic fairness and privacy. In this paper, we show that statistical parity can be enforced efficiently in the contextual bandit setting while retaining strong performance guarantees. Specifically, we present a meta-algorithm that transforms any efficient implementation of Hedge (or, equivalently, any discrete Bayesian inference algorithm) into an efficient contextual bandit algorithm that guarantees exact statistical parity on every trial. Compared to any comparator that satisfies the same statistical parity constraint, the algorithm achieves the same asymptotic regret bound as running the equivalent instance of Exp4 for each group. We also address the scenario where the target parity distribution is unknown and must be estimated online. Finally, using online-to-batch conversion, we extend our approach to the batch classification setting - achieving exact statistical parity there as well, whilst attaining excellent generalisation bounds. We believe these batch bounds to be a significant contribution to the literature in their own right.
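
The Hedge building block mentioned above is the standard exponential-weights update. The sketch below shows only that textbook update on a fixed set of experts; it does not attempt the paper's meta-algorithm or its parity guarantee, and the expert losses and learning rate `eta` are made-up values for illustration.

```python
import math

def hedge_update(weights, losses, eta):
    """One exponential-weights step: scale each weight by exp(-eta * loss), then renormalize."""
    scaled = [w * math.exp(-eta * loss) for w, loss in zip(weights, losses)]
    total = sum(scaled)
    return [s / total for s in scaled]

# Three experts with a uniform prior; expert 0 always incurs the smallest loss.
weights = [1 / 3, 1 / 3, 1 / 3]
for _ in range(20):
    weights = hedge_update(weights, losses=[0.0, 0.5, 1.0], eta=0.5)
```

With `eta = 1` and losses equal to negative log-likelihoods, the same update is a discrete Bayesian posterior update, which is the equivalence the abstract alludes to.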

View full details

Poster

Flow-Based Policy for Online Reinforcement Learning

Lei Lv · Yunfei Li · Yu Luo · Fuchun Sun · Tao Kong · Jiafeng Xu · Xiao Ma

Dec 3, 4:30 PM - 7:30 PM Don Alberto 4

We present $\textbf{FlowRL}$, a novel framework for online reinforcement learning that integrates flow-based policy representation with Wasserstein-2-regularized optimization. We argue that in addition to training signals, enhancing the expressiveness of the policy class is crucial for the performance gains in RL. Flow-based generative models offer such potential, excelling at capturing complex, multimodal action distributions. However, their direct application in online RL is challenging due to a fundamental objective mismatch: standard flow training optimizes for static data imitation, while RL requires value-based policy optimization through a dynamic buffer, leading to difficult optimization landscapes. FlowRL first models policies via a state-dependent velocity field, generating actions through deterministic ODE integration from noise. We derive a constrained policy search objective that jointly maximizes Q through the flow policy while bounding the Wasserstein-2 distance to a behavior-optimal policy implicitly derived from the replay buffer. This formulation effectively aligns the flow optimization with the RL objective, enabling efficient and value-aware policy learning despite the complexity of the policy class. Empirical evaluations on DMControl and HumanoidBench demonstrate that FlowRL achieves competitive performance in online reinforcement learning benchmarks.
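
The phrase "generating actions through deterministic ODE integration from noise" can be made concrete with a small Euler integrator. This is a sketch under stated assumptions: `toy_velocity` is a hand-written, hypothetical field standing in for a learned network, and the step count and fixed noise vector are arbitrary choices, not values from the paper.

```python
def integrate_flow(velocity, state, noise, steps=10):
    """Euler-integrate da/dt = velocity(state, a, t) from t=0 to t=1, starting at `noise`."""
    action = list(noise)
    dt = 1.0 / steps
    for k in range(steps):
        t = k * dt
        v = velocity(state, action, t)
        action = [a + dt * vi for a, vi in zip(action, v)]
    return action

def toy_velocity(state, action, t):
    # Hypothetical field: points straight at a state-dependent target action,
    # scaled so that the trajectory arrives at the target when t reaches 1.
    target = [0.5 * s for s in state]
    return [(g - a) / (1.0 - t) for g, a in zip(target, action)]

# In practice `noise` would be sampled (e.g. from a Gaussian); it is fixed here for clarity.
action = integrate_flow(toy_velocity, state=[1.0, -2.0], noise=[0.3, 0.9])
```

Given the same state and the same noise, the integrator always returns the same action; all randomness in the policy comes from the initial noise draw.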

View full details

Poster

SubTrack++ : Gradient Subspace Tracking for Scalable LLM Training

Sahar Rajabi · Nayeema Nonta · Sirisha Rambhatla

Dec 3, 4:30 PM - 7:30 PM Don Alberto 4

Training large language models (LLMs) is highly resource-intensive due to their massive number of parameters and the overhead of optimizer states. While recent work has aimed to reduce memory consumption, such efforts often entail trade-offs among memory efficiency, training time, and model performance. Yet, true democratization of LLMs requires simultaneous progress across all three dimensions. To this end, we propose SubTrack++ that leverages Grassmannian gradient subspace tracking combined with projection-aware optimizers, enabling Adam’s internal statistics to adapt to subspace changes. Additionally, employing recovery scaling, a technique that restores information lost through low-rank projections, further enhances model performance. Our method demonstrates SOTA convergence by exploiting Grassmannian geometry, reducing pre-training wall-time by up to 65% and fine-tuning time by 36% compared to existing SOTA methods, while maintaining the same memory footprint. Code is at https://github.com/criticalml-uw/SubTrack.

View full details

Poster

Identifying Macro Causal Effects in C-DMGs over DMGs

Simon Ferreira · Charles Assaad

Dec 3, 4:30 PM - 7:30 PM Don Alberto 4

The do-calculus is a sound and complete tool for identifying causal effects in acyclic directed mixed graphs (ADMGs) induced by structural causal models (SCMs). However, in many real-world applications, especially in high-dimensional settings, constructing a fully specified ADMG is often infeasible. This limitation has led to growing interest in partially specified causal representations, particularly through cluster-directed mixed graphs (C-DMGs), which group variables into clusters and offer a more abstract yet practical view of causal dependencies. While these representations can include cycles, recent work has shown that the do-calculus remains sound and complete for identifying macro-level causal effects in C-DMGs over ADMGs under the assumption that all cluster sizes are greater than 1. Nevertheless, real-world systems often exhibit cyclic causal dynamics at the structural level. To account for this, input-output structural causal models (ioSCMs) have been introduced as a generalization of SCMs that allow for cycles. ioSCMs induce another type of graph structure known as a directed mixed graph (DMG). Analogous to the ADMG setting, one can define C-DMGs over DMGs as high-level representations of causal relations among clusters of variables. In this paper, we prove that, unlike in the ADMG setting, the do-calculus is unconditionally sound and complete for identifying macro causal effects in C-DMGs over DMGs. Furthermore, we show that the graphical criteria for non-identifiability of macro causal effects previously established for C-DMGs over ADMGs naturally extend to a subset of C-DMGs over DMGs.

View full details

Poster

FrameShield: Adversarially Robust Video Anomaly Detection

Mojtaba Nafez · Mobina Poulaei · Nikan Vasei · Bardia moakhar · Mohammad Sabokrou · Mohammad Hossein Rohban

Dec 3, 4:30 PM - 7:30 PM Don Alberto 4

Weakly Supervised Video Anomaly Detection (WSVAD) has achieved notable advancements, yet existing models remain vulnerable to adversarial attacks, limiting their reliability. Due to the inherent constraints of weak supervision (where only video-level labels are provided despite the need for frame-level predictions), traditional adversarial defense mechanisms, such as adversarial training, are not effective since video-level adversarial perturbations are typically weak and inadequate. To address this limitation, pseudo-labels generated directly from the model can enable frame-level adversarial training; however, these pseudo-labels are inherently noisy, significantly degrading performance. We therefore introduce a novel Pseudo-Anomaly Generation method called Spatiotemporal Region Distortion (SRD), which creates synthetic anomalies by applying severe augmentations to localized regions in normal videos while preserving temporal consistency. Integrating these precisely annotated synthetic anomalies with the noisy pseudo-labels substantially reduces label noise, enabling effective adversarial training. Extensive experiments demonstrate that our method significantly enhances the robustness of WSVAD models against adversarial attacks, outperforming state-of-the-art methods by an average of 71.0% in overall AUROC performance across multiple benchmarks. The implementation and code are publicly available at FrameShield (GitHub).

View full details

Poster

Adaptive Batch-Wise Sample Scheduling for Direct Preference Optimization

Zixuan Huang · Yikun Ban · Lean Fu · Xiaojie Li · Zhongxiang Dai · Jianxin Li · deqing wang

Dec 3, 4:30 PM - 7:30 PM Don Alberto 4

Direct Preference Optimization (DPO) has emerged as an effective approach for aligning large language models (LLMs) with human preferences. However, its performance is highly dependent on the quality of the underlying human preference data. To address this bottleneck, prior work has explored various data selection strategies, but these methods often overlook the impact of the evolving states of the language model during the optimization process. In this paper, we introduce a novel problem: Sample Scheduling for DPO, which aims to dynamically and adaptively schedule training samples based on the model's evolving batch-wise states throughout preference optimization. To solve this problem, we propose SamS, an efficient and effective algorithm that adaptively selects samples in each training batch based on the LLM's learning feedback to maximize the potential generalization performance. Notably, without modifying the core DPO algorithm, simply integrating SamS significantly improves performance across tasks, with minimal additional computational overhead. This work points to a promising new direction for improving LLM alignment through batch-wise sample selection, with potential generalization to RLHF and broader supervised learning paradigms.
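
Since SamS is described as leaving the core DPO algorithm untouched, it may help to recall the per-pair objective that it schedules samples for. The sketch below is the standard DPO loss on summed sequence log-probabilities; the function name, argument names, and the default `beta` are illustrative assumptions, and nothing here represents SamS's selection rule.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-pair DPO loss: -log sigmoid(beta * (chosen log-ratio - rejected log-ratio))."""
    chosen_ratio = policy_chosen_logp - ref_chosen_logp
    rejected_ratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_ratio - rejected_ratio)
    # -log(sigmoid(m)) equals log(1 + exp(-m)); this form avoids overflow for large |m|.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))

# The policy prefers the chosen response more strongly than the reference does.
loss = dpo_loss(policy_chosen_logp=-4.0, policy_rejected_logp=-9.0,
                ref_chosen_logp=-5.0, ref_rejected_logp=-6.0)
```

When the policy and the reference agree exactly, the margin is zero and the loss is log 2; the loss falls below that as the policy widens the gap in favor of the chosen response.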

View full details

Poster

AutoPartGen: Autoregressive 3D Part Generation and Discovery

Minghao Chen · Jianyuan Wang · Roman Shapovalov · Tom Monnier · Hyunyoung Jung · Dilin Wang · Rakesh Ranjan · Iro Laina · Andrea Vedaldi

Dec 3, 4:30 PM - 7:30 PM Don Alberto 4

We introduce AutoPartGen, a model that generates objects composed of 3D parts in an autoregressive manner. This model can take as input an image of an object, 2D masks of the object's parts, or an existing 3D object, and generate a corresponding compositional 3D reconstruction. Our approach builds upon 3DShape2VecSet, a recent latent 3D representation with powerful geometric expressiveness. We observe that this latent space exhibits strong compositional properties, making it particularly well-suited for part-based generation tasks. Specifically, AutoPartGen generates object parts autoregressively, predicting one part at a time while conditioning on previously generated parts and additional inputs, such as 2D images, masks, or 3D objects. This process continues until the model decides that all parts have been generated, thus determining automatically the type and number of parts. The resulting parts can be seamlessly assembled into coherent objects or scenes without requiring additional optimization. We evaluate both the overall 3D generation capabilities and the part-level generation quality of AutoPartGen, demonstrating that it achieves state-of-the-art performance in 3D part generation.

View full details

Poster

Addressing Mark Imbalance in Integration-free Marked Temporal Point Processes

Sishun Liu · KE DENG · Yongli Ren · Yan Wang · Xiuzhen Zhang

Dec 3, 4:30 PM - 7:30 PM Don Alberto 4

Marked Temporal Point Process (MTPP) has been well studied to model the event distribution in marked event streams, which can be used to predict the mark and arrival time of the next event. However, existing studies overlook that the distribution of event marks is highly imbalanced in many real-world applications, with some marks being frequent but others rare. The imbalance poses a significant challenge to the performance of the next event prediction, especially for events of rare marks. To address this issue, we propose a thresholding method, which learns thresholds to tune the mark probability normalized by the mark's prior probability to optimize mark prediction, rather than predicting the mark directly based on the mark probability as in existing studies. In conjunction with this method, we predict the mark first and then the time. In particular, we develop a novel neural Marked Temporal Point Process (MTPP) model to support effective time sampling and estimation of mark probability without computationally expensive numerical improper integration. Extensive experiments on real-world datasets demonstrate the superior performance of our solution against various baselines for the next event mark and time prediction. The code is available at https://github.com/undes1red/IFNMTPP.

View full details

Poster

Towards Unified Multimodal Interleaved Generation via Group Relative Policy Optimization

Ming Nie · Chunwei Wang · Jianhua Han · Hang Xu · Li Zhang

Dec 3, 4:30 PM - 7:30 PM Don Alberto 4

Unified vision-language models have made significant progress in multimodal understanding and generation, yet they largely fall short in producing multimodal interleaved outputs, which is a crucial capability for tasks like visual storytelling and step-by-step visual reasoning. In this work, we propose a reinforcement learning-based post-training strategy to unlock this capability in existing unified models, without relying on large-scale multimodal interleaved datasets. We begin with a warm-up stage using a hybrid dataset comprising curated interleaved sequences and limited data for multimodal understanding and text-to-image generation, which exposes the model to interleaved generation patterns while preserving its pretrained capabilities. To further refine interleaved generation, we propose a unified policy optimization framework that extends Group Relative Policy Optimization (GRPO) to the multimodal setting. Our approach jointly models text and image generation within a single decoding trajectory and optimizes it with our novel hybrid rewards covering textual relevance, visual-text alignment, and structural fidelity. Additionally, we incorporate process-level rewards to provide step-wise guidance, enhancing training efficiency in complex multimodal tasks. Experiments on MMIE and InterleavedBench demonstrate that our approach significantly enhances the quality and coherence of multimodal interleaved generation.
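
The "group relative" part of GRPO refers to scoring each sampled output against the other outputs drawn for the same prompt. A minimal sketch of that normalization follows; it uses the population standard deviation and a small `eps` as assumptions, and the reward values are invented. The paper's hybrid and process-level rewards are not modeled here.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the mean and standard deviation of its own group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Rewards for four outputs sampled for the same prompt (made-up values).
advantages = group_relative_advantages([1.0, 0.0, 0.5, 0.5])
```

Outputs that beat their group average receive positive advantages and the rest receive negative ones, so no separate value network is needed to form a baseline.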

View full details

Poster

Localized Data Shapley: Accelerating Valuation for Nearest Neighbor Algorithms

Guangyi Zhang · Yanhao Wang · Chengliang Chai · Qiyu Liu · Wei Wang

Dec 3, 4:30 PM - 7:30 PM Don Alberto 4

Data Shapley values provide a principled approach for quantifying the contribution of individual training examples to machine learning models. However, computing these values often requires computational complexity that is exponential in the data size, and this has led researchers to pursue efficient algorithms tailored to specific machine learning models. Building on the prior success of the Shapley valuation for $K$-nearest neighbor (KNN) models, in this paper, we introduce a localized data Shapley framework that significantly accelerates the valuation of data points. Our approach leverages the distance-based local structure in the data space to decompose the global valuation problem into smaller, localized computations. Our primary contribution is an efficient valuation algorithm for a threshold-based KNN variant, and we show that it provides provable speedups over the baseline under mild assumptions. Extensive experiments on real-life datasets demonstrate that our methods achieve a substantial speedup compared to previous approaches.
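
To see why exact data Shapley values are expensive, the brute-force definition can be written out directly. The sketch below averages marginal contributions over all orderings of a three-point toy training set with a 1-nearest-neighbor utility on a single test point; the data, the utility, and the function names are illustrative assumptions, and this is the naive baseline rather than the localized algorithm from the paper.

```python
from itertools import permutations

def exact_shapley(n_points, utility):
    """Average each point's marginal contribution over every ordering of the data."""
    values = [0.0] * n_points
    orderings = list(permutations(range(n_points)))
    for order in orderings:
        coalition = []
        prev = utility(coalition)
        for i in order:
            coalition.append(i)
            cur = utility(coalition)
            values[i] += cur - prev
            prev = cur
    return [v / len(orderings) for v in values]

# Toy training set of (x, label) pairs and one test point at x = 0.0 with label 1.
train = [(0.1, 1), (0.4, 0), (0.9, 1)]

def one_nn_utility(coalition):
    # Utility is 1 if the nearest training point in the coalition has the test label.
    if not coalition:
        return 0.0
    nearest = min(coalition, key=lambda i: abs(train[i][0] - 0.0))
    return 1.0 if train[nearest][1] == 1 else 0.0

values = exact_shapley(len(train), one_nn_utility)
```

The loop visits all n! orderings, so this approach is only feasible for a handful of points, which is the cost that model-specific algorithms such as those for KNN aim to avoid.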

View full details

Poster

Pessimistic Data Integration for Policy Evaluation

Xiangkun Wu · Ting Li · Gholamali Aminian · Armin Behnamnia · Hamid Rabiee · Chengchun Shi

Dec 3, 4:30 PM - 7:30 PM Don Alberto 4

This paper studies how to integrate historical control data with experimental data to enhance A/B testing, while addressing the distributional shift between historical and experimental datasets. We propose a pessimistic data integration method that combines two causal effect estimators constructed based on experimental and historical datasets. Our main idea is to conceptualize the weight function for this combination as a policy so that existing pessimistic policy learning algorithms are applicable to learn the optimal weight that minimizes the resulting weighted estimator's mean squared error. Additionally, we conduct comprehensive theoretical and empirical analyses to compare our method against various baseline estimators across five scenarios. Both our theoretical and numerical findings demonstrate that the proposed estimator achieves near-optimal performance across all scenarios.
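
To make the weighting idea concrete, the sketch below works through the simplest version of the trade-off: a convex combination of an experimental estimator and a historical estimator, and the weight that minimizes its mean squared error. It assumes the two estimators are independent, the experimental one is unbiased, and the variances and historical bias are known. These are simplifying assumptions of this example; in practice the bias is unknown, which is where the pessimistic policy learning described in the abstract comes in.

```python
def weighted_mse(w, var_exp, var_hist, bias_hist):
    """MSE of w * tau_exp + (1 - w) * tau_hist under the stated assumptions."""
    variance = w ** 2 * var_exp + (1 - w) ** 2 * var_hist
    squared_bias = ((1 - w) * bias_hist) ** 2
    return variance + squared_bias


def oracle_weight(var_exp, var_hist, bias_hist):
    """Weight on the experimental estimator that minimizes weighted_mse.

    Setting the derivative to zero gives
    w* = (var_hist + bias^2) / (var_exp + var_hist + bias^2).
    """
    shifted = var_hist + bias_hist ** 2
    return shifted / (var_exp + shifted)


# With no distributional shift the two sources are pooled by precision;
# as the historical bias grows, the weight moves toward the experiment.
w_no_shift = oracle_weight(var_exp=1.0, var_hist=0.25, bias_hist=0.0)
w_large_shift = oracle_weight(var_exp=1.0, var_hist=0.25, bias_hist=3.0)
```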

View full details

Poster

Detecting Data Deviations in Electronic Health Records

Kaiping Zheng · Horng-Ruey Chua · Beng Chin Ooi

Dec 3, 4:30 PM - 7:30 PM Don Alberto 4

Data deviations in electronic health records (EHR) refer to discrepancies between recorded entries and a patient’s actual physiological state, indicating a decline in EHR data fidelity. Such deviations can result from pre-analytical variability, documentation errors, or unvalidated data sources. Effectively detecting data deviations is clinically valuable for identifying erroneous records, excluding them from downstream clinical workflows, and informing corrective actions. Despite its importance and practical relevance, this problem remains largely underexplored in existing research. To bridge this gap, we propose a bi-level knowledge distillation approach centered on a task-agnostic formulation of EHR data fidelity as an intrinsic measure of data reliability. Our approach performs layered knowledge distillation in two levels: from a computation-intensive, task-specific data Shapley oracle to a neural oracle for individual tasks, and then to a unified EHR data fidelity predictor. This design enables the integration of task-specific insights into a holistic assessment of a patient’s EHR data fidelity from a multi-task perspective. By tracking the outputs of this learned predictor, we detect potential data deviations in EHR data. Experiments on both real-world EHR data from National University Hospital in Singapore and the public MIMIC-III dataset consistently validate the effectiveness of our approach in detecting data deviations in EHR data. Case studies further demonstrate its practical value in identifying clinically meaningful data deviations.

View full details

Poster

Hierarchical Self-Attention: Generalizing Neural Attention Mechanics to Multi-Scale Problems

Saeed Amizadeh · Sara Abdali · Yinheng Li · Kazuhito Koishida

Dec 3, 4:30 PM - 7:30 PM Don Alberto 4

Transformers and their attention mechanism have been revolutionary in the field of Machine Learning. While originally proposed for language data, they quickly found their way to other data modalities, such as images, videos, and graphs, with various signal geometries. Despite this versatility, generalizing the attention mechanism to scenarios where data is presented at different scales from potentially different modalities is not straightforward. The attempts to incorporate hierarchy and multi-modality within transformers are largely based on ad hoc heuristics, which are not seamlessly generalizable to similar problems with potentially different structures. To address this problem, in this paper, we take a fundamentally different approach: we first propose a mathematical construct to represent multi-modal, multi-scale data. We then mathematically derive the neural attention mechanics for the proposed construct from the first principle of entropy minimization. We show that the derived formulation is optimal in the sense of being the closest to the standard Softmax attention while incorporating the inductive biases originating from the hierarchical/geometric information of the problem. We further propose an efficient algorithm based on dynamic programming to compute our derived attention mechanism. By incorporating it within transformers, we show that the proposed hierarchical attention mechanism can not only be employed to train transformer models in hierarchical/multi-modal settings from scratch, but can also be used to inject hierarchical information into classical, pre-trained transformer models post training, resulting in more efficient models in a zero-shot manner.

View full details

Poster

Implicit Modeling for Transferability Estimation of Vision Foundation Models

Yaoyan Zheng · Huiqun Wang · Nan Zhou · Di Huang

Dec 3, 4:30 PM - 7:30 PM Don Alberto 4

Transferability estimation identifies the best pre-trained models for downstream tasks without incurring the high computational cost of full fine-tuning. This capability facilitates deployment and advances the pre-training and fine-tuning paradigm. However, existing methods often struggle to accurately assess transferability for emerging pre-trained models with diverse architectures, training strategies, and task alignments. In this work, we propose Implicit Transferability Modeling (ITM), a novel framework that implicitly models each model’s intrinsic transferability, coupled with a Divide-and-Conquer Variational Approximation (DVA) strategy to efficiently approximate embedding space evolution. This design enables generalization across a broader range of models and downstream tasks. Extensive experiments on a comprehensive benchmark—spanning extensive training regimes and a wider variety of model types—demonstrate that ITM consistently outperforms existing methods in terms of stability, effectiveness, and efficiency.

View full details

Poster

FedIGL: Federated Invariant Graph Learning for Non-IID Graphs

Lingren Wang · Wenxuan Tu · Jiaxin Wang · Xiong Wang · Jieren Cheng · Jingxin Liu

Dec 3, 4:30 PM - 7:30 PM Don Alberto 4

Federated Graph Learning (FGL) shows superiority in cross-domain graph training while preserving data privacy. Existing approaches usually assume shared generic knowledge (e.g., prototypes, spectral features) via aggregating local structures statistically to alleviate structural heterogeneity. However, imposing overly strict assumptions about the presumed correlation between structural features and the global objective often fails in generalizing to local tasks, leading to suboptimal performance. To tackle this issue, we propose a Federated Invariant Graph Learning (FedIGL) framework based on invariant learning, which effectively disrupts spurious correlations and further mines the invariant factors across different distributions. Specifically, a server-side global model is trained to capture client-agnostic subgraph patterns shared across clients, whereas client-side models specialize in client-specific subgraph patterns. Subsequently, without compromising privacy, we propose a novel Bi-Gradient Regularization strategy that introduces gradient constraints to guide the model in identifying client-agnostic and client-specific subgraph patterns for better graph representations. Extensive experiments on graph-level clustering and classification tasks demonstrate the superiority of FedIGL against its competitors.

View full details

Poster

Enhancing GUI Agent with Uncertainty-Aware Self-Trained Evaluator

Gongwei Chen · Lirong Jie · Lexiao Zou · Weili Guan · Miao Zhang · Liqiang Nie

Dec 3, 4:30 PM - 7:30 PM Don Alberto 4

Benefiting from the availability of extensive navigation trajectories, both manually and automatically annotated, current graphical user interface (GUI) agents have achieved remarkable advancements in performance. However, these annotated datasets often contain substantial noise, which impedes effective agent training and underscores the necessity for rigorous trajectory quality assessment. In contrast to existing prompting-based evaluators that rely on proprietary multimodal large language models (MLLMs), we propose an Uncertainty-aware Reinforced Self-Training (URST) framework to train lightweight MLLMs for efficient and reliable trajectory evaluation. URST iteratively fine-tunes MLLMs using their own generated thoughts and judgments to enable self-improvement, while its uncertainty-aware sampling strategy ensures the selection of the most informative training examples. To further enhance reasoning and judgment capabilities, we propose a simplified group policy optimization approach that effectively leverages diverse positive and negative samples for evaluator learning. Our evaluator demonstrates superior judgment performance across both in-domain and out-of-domain datasets. When used to filter navigation datasets, it consistently leads to performance improvements in training GUI agents.

View full details

Poster

PCA++: How Uniformity Induces Robustness to Background Noise in Contrastive Learning

Mingqi Wu · Qiang Sun · Archer Yang

Dec 3, 4:30 PM - 7:30 PM Don Alberto 4

High-dimensional data often conceal low-dimensional signals beneath structured background noise, limiting standard PCA. Motivated by contrastive learning, we address the problem of recovering shared signal subspaces from positive pairs--paired observations sharing the same signal but differing in background. Our baseline, PCA+, uses alignment-only contrastive learning and succeeds when background variation is mild, but fails under strong noise or high-dimensional regimes. To address this, we introduce PCA++, a hard uniformity-constrained contrastive PCA that enforces identity covariance on projected features. PCA++ has a closed-form solution via a generalized eigenproblem, remains stable in high dimensions, and provably regularizes against background interference. We provide exact high-dimensional asymptotics in both fixed-aspect-ratio and growing-spike regimes, showing uniformity’s role in robust signal recovery. Empirically, PCA++ outperforms standard PCA and alignment-only PCA+ on simulations, corrupted-MNIST, and single-cell transcriptomics, reliably recovering condition-invariant structure. More broadly, we clarify uniformity’s role in contrastive learning—showing that explicit feature dispersion defends against structured noise and enhances robustness.

View full details

Poster

Test-Time Adaptive Object Detection with Foundation Model

Yingjie Gao · Yanan Zhang · Zhi Cai · Di Huang

Dec 3, 4:30 PM - 7:30 PM Don Alberto 4

In recent years, test-time adaptive object detection has attracted increasing attention due to its unique advantages in online domain adaptation, which aligns more closely with real-world application scenarios. However, existing approaches heavily rely on source-derived statistical characteristics while making the strong assumption that the source and target domains share an identical category space. In this paper, we propose the first foundation model-powered test-time adaptive object detection method that eliminates the need for source data entirely and overcomes traditional closed-set limitations. Specifically, we design a Multi-modal Prompt-based Mean-Teacher framework for vision-language detector-driven test-time adaptation, which incorporates text and visual prompt tuning to adapt both language and vision representation spaces on the test data in a parameter-efficient manner. Correspondingly, we propose a Test-time Warm-start strategy tailored for the visual prompts to effectively preserve the representation capability of the vision branch. Furthermore, to guarantee high-quality pseudo-labels in every test batch, we maintain an Instance Dynamic Memory (IDM) module that stores high-quality pseudo-labels from previous test samples, and propose two novel strategies, Memory Enhancement and Memory Hallucination, to leverage IDM's high-quality instances for enhancing original predictions and hallucinating images without available pseudo-labels, respectively. Extensive experiments on cross-corruption and cross-dataset benchmarks demonstrate that our method consistently outperforms previous state-of-the-art methods, and can adapt to arbitrary cross-domain and cross-category target data. Code is available at https://github.com/gaoyingjay/ttaod_foundation.

View full details

Poster

Proximalized Preference Optimization for Diverse Feedback Types: A Decomposed Perspective on DPO

Kaiyang Guo · Yinchuan Li · Zhitang Chen

Dec 4, 11:00 AM - 2:00 PM Don Alberto 4

Direct alignment methods typically train large language models (LLMs) by contrasting the likelihoods of preferred and dispreferred responses. While effective for matching relative preferences, these methods have been widely observed to depress the absolute likelihoods of example responses. Consequently, aligned models often exhibit behaviors that deviate from expected patterns, resembling the well‑known reward‑hacking effect even in the absence of an explicit reward model. This phenomenon exposes a fundamental limitation of contrastive alignment, which we characterize as likelihood underdetermination. In this work, we revisit direct preference optimization (DPO)—the seminal direct alignment method—and show that its loss admits a principled decomposition. The resulting reformulation not only extends naturally to a broader range of feedback types, but also sheds light on the origin of likelihood underdetermination. In particular, we identify that the standard DPO implementation implicitly oversimplifies a regularizer in the reformulated loss, and restoring its full version effectively resolves the underdetermination. Building on these insights, we introduce PRoximalized PReference Optimization (PRO), a unified alignment method that handles diverse feedback types while eliminating likelihood underdetermination through an efficient approximation of the full regularizer. Empirical evaluations demonstrate the consistent superiority of PRO over existing methods across pairwise, binary and scalar feedback.
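
The underdetermination described above can be seen directly in the standard DPO loss, which depends only on the difference between the preferred and dispreferred log-likelihood ratios. The sketch below evaluates that standard loss on scalar sequence log-probabilities. The numeric values and the default `beta` are illustrative assumptions, and the code shows the original DPO objective, not the PRO method introduced in the paper.

```python
import math


def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard DPO loss for one (preferred, dispreferred) pair.

    logp_* are policy log-likelihoods of whole responses and ref_logp_*
    are the same quantities under the frozen reference model.
    """
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return math.log1p(math.exp(-margin))  # equals -log(sigmoid(margin))


# Two policies with the same relative preference but very different
# absolute likelihoods for both responses.
loss_a = dpo_loss(logp_w=-10.0, logp_l=-12.0, ref_logp_w=-11.0, ref_logp_l=-11.0)
loss_b = dpo_loss(logp_w=-15.0, logp_l=-17.0, ref_logp_w=-11.0, ref_logp_l=-11.0)
```

Both calls return the same loss even though the second policy assigns far lower likelihood to the preferred response, so the contrastive objective alone does not pin down absolute likelihoods.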

View full details

Poster

StelLA: Subspace Learning in Low-rank Adaptation using Stiefel Manifold

Zhizhong Li · Sina Sajadmanesh · Jingtao Li · Lingjuan Lyu

Dec 4, 11:00 AM - 2:00 PM Don Alberto 4

Low-rank adaptation (LoRA) has been widely adopted as a parameter-efficient technique for fine-tuning large-scale pre-trained models. However, it still lags behind full fine-tuning in performance, partly due to its insufficient exploitation of the geometric structure underlying low-rank manifolds. In this paper, we propose a geometry-aware extension of LoRA that uses a three-factor decomposition $USV^\top$. Analogous to the structure of singular value decomposition (SVD), it separates the adapter's input and output subspaces, $V$ and $U$, from the scaling factor $S$. Our method constrains $U$ and $V$ to lie on the Stiefel manifold, ensuring their orthonormality throughout the training. To optimize on the Stiefel manifold, we employ a flexible and modular geometric optimization design that converts any Euclidean optimizer to a Riemannian one. It enables efficient subspace learning while remaining compatible with existing fine-tuning pipelines. Empirical results across a wide range of downstream tasks, including commonsense reasoning, math and code generation, image classification, and image generation, demonstrate the superior performance of our approach against the recent state-of-the-art variants of LoRA. Code is available at https://github.com/SonyResearch/stella.

View full details

Poster

Test-Time Adaptation of Vision-Language Models for Open-Vocabulary Semantic Segmentation

Mehrdad Noori · David OSOWIECHI · Gustavo Vargas Hakim · Ali Bahri · Moslem Yazdanpanah · Sahar Dastani · Farzad Beizaee · Ismail Ayed · Christian Desrosiers

Dec 4, 11:00 AM - 2:00 PM Don Alberto 4

Recently, test-time adaptation (TTA) has attracted wide interest in the context of vision-language models (VLMs) for image classification. However, to the best of our knowledge, the problem has been completely overlooked in dense prediction tasks such as Open-Vocabulary Semantic Segmentation (OVSS). In response, we propose a novel TTA method tailored to adapting VLMs for segmentation during test time. Unlike TTA methods for image classification, our Multi-Level and Multi-Prompt (MLMP) entropy minimization integrates features from intermediate vision-encoder layers and is performed with different text-prompt templates at both the global CLS token and local pixel-wise levels. Our approach can be used as a plug-and-play module for any segmentation network, does not require additional training data or labels, and remains effective even with a single test sample. Furthermore, we introduce a comprehensive OVSS TTA benchmark suite, which integrates a rigorous evaluation protocol, nine segmentation datasets, 15 common synthetic corruptions, and additional real and rendered domain shifts, with a total of 87 distinct test scenarios, establishing a standardized and comprehensive testbed for future TTA research in open-vocabulary segmentation. Our experiments on this suite demonstrate that our segmentation-tailored method consistently delivers significant gains over direct adoption of TTA classification baselines. Code and data are available at https://github.com/dosowiechi/MLMP.

View full details

Poster

Hawaii: Hierarchical Visual Knowledge Transfer for Efficient Vision-Language Models

Yimu Wang · Mozhgan Nasr Azadani · Sean Sedwards · Krzysztof Czarnecki

Dec 4, 11:00 AM - 2:00 PM Don Alberto 4

Improving the visual understanding ability of vision-language models (VLMs) is crucial for enhancing their performance across various tasks. While using multiple pretrained visual experts has shown great promise, it often incurs significant computational costs during training and inference. To address this challenge, we propose HAWAII, a novel framework that distills knowledge from multiple visual experts into a single vision encoder, enabling it to inherit the complementary strengths of several experts with minimal computational overhead. To mitigate conflicts among different teachers and switch between teacher-specific knowledge, instead of using a fixed set of adapters for multiple teachers, we propose to use teacher-specific Low-Rank Adaptation (LoRA) adapters with a corresponding router. Each adapter is aligned with a specific teacher, avoiding noisy guidance during distillation. To enable efficient knowledge distillation, we propose fine-grained and coarse-grained distillation. At the fine-grained level, token importance scores are employed to adaptively emphasize the most informative tokens from each teacher. At the coarse-grained level, we summarize the knowledge from multiple teachers and transfer it to the student using a set of general-knowledge LoRA adapters with a router. Extensive experiments on various vision-language tasks demonstrate the superiority of HAWAII over popular open-source VLMs.

View full details

Poster

Online Multi-Class Selection with Group Fairness Guarantee

Faraz Zargari · Hossein Jazi · Lyndon Hallett · Bo Sun · Xiaoqi Tan

Dec 4, 11:00 AM - 2:00 PM Don Alberto 4

We study the online multi-class selection problem with group fairness guarantees, where limited resources must be allocated to sequentially arriving agents. Our work addresses two key limitations in the existing literature. First, we introduce a novel lossless rounding scheme that ensures the integral algorithm achieves the same expected performance as any fractional solution. Second, we explicitly address the challenges introduced by agents who belong to multiple classes. To this end, we develop a randomized algorithm based on a relax-and-round framework. The algorithm first computes a fractional solution using a resource reservation approach---referred to as the set-aside mechanism---to enforce fairness across classes. The subsequent rounding step preserves these fairness guarantees without degrading performance. Additionally, we propose a learning-augmented variant that incorporates untrusted machine-learned predictions to better balance fairness and efficiency in practical settings.

View full details

Poster

Orthogonal Contrastive Learning for Multi-Representation fMRI Analysis

Tony Yousefnezhad

Dec 4, 11:00 AM - 2:00 PM Don Alberto 4

Task-based functional magnetic resonance imaging (fMRI) provides invaluable insights into human cognition but faces critical hurdles—low signal-to-noise ratio, high dimensionality, limited sample sizes, and costly data acquisition—that are amplified when integrating datasets across subjects or sites. This paper introduces orthogonal contrastive learning (OCL), a unified multi-representation framework for multi-subject fMRI analysis that aligns neural responses without requiring temporal preprocessing or uniform time-series lengths across subjects or sites. OCL employs two identical encoders: an online network trained with a contrastive loss that pulls together same-stimulus responses and pushes apart different-stimulus responses, and a target network whose weights track the online network via exponential moving average to stabilize learning. Each OCL network layer combines QR decomposition for orthogonal feature extraction, locality-sensitive hashing (LSH) to produce compact subject-specific signatures, positional encoding to embed temporal structure alongside spatial features, and a transformer encoder to generate discriminative, stimulus-aligned embeddings. We further enhance OCL with an unsupervised pretraining stage on fMRI-like synthetic data and demonstrate a transfer-learning workflow for multi-site studies. Across extensive experiments on multi-subject and multi-site fMRI benchmarks, OCL consistently outperforms state-of-the-art alignment and analysis methods in both representation quality and downstream classification accuracy.
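
The online/target pairing described above relies on an exponential moving average (EMA) of weights. The sketch below shows that update on flat lists of parameters. The function name and the `decay` value are assumptions for illustration, and the QR, LSH, and transformer components of OCL are not modeled here.

```python
def ema_update(target, online, decay=0.99):
    """Move each target weight a small step toward the online weight."""
    return [decay * t + (1.0 - decay) * o for t, o in zip(target, online)]


# The target network trails the online network and changes smoothly.
target_weights = [0.0, 0.0]
online_weights = [1.0, -2.0]
for _ in range(3):
    target_weights = ema_update(target_weights, online_weights, decay=0.9)
```

A decay close to 1 makes the target change slowly, which is the usual reason such a network provides a more stable learning signal than the online network itself.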

View full details

Poster

Near-Optimal Experiment Design in Linear non-Gaussian Cyclic Models

Ehsan Sharifian · Saber Salehkaleybar · Negar Kiyavash

Dec 4, 11:00 AM - 2:00 PM Don Alberto 4

We study the problem of causal structure learning from a combination of observational and interventional data generated by a linear non-Gaussian structural equation model that might contain cycles. Recent results show that using mere observational data identifies the causal graph only up to a permutation-equivalence class. We obtain a combinatorial characterization of this class by showing that each equivalence class corresponds to a perfect matching in a bipartite graph. This bipartite representation allows us to analyze how interventions modify or constrain the matchings. Specifically, we show that each atomic intervention reveals one edge of the true matching and eliminates all incompatible causal graphs. Consequently, we formalize the optimal experiment design task as an adaptive stochastic optimization problem over the set of equivalence classes with a natural reward function that quantifies how many graphs are eliminated from the equivalence class by an intervention. We show that this reward function is adaptive submodular and provide a greedy policy with a provable near-optimal performance guarantee. A key technical challenge is to efficiently estimate the reward function without having to explicitly enumerate all the graphs in the equivalence class. We propose a sampling-based estimator using random matchings and analyze its bias and concentration behavior. Our simulation results show that performing a small number of interventions guided by our stochastic optimization framework recovers the true underlying causal structure.

View full details

Poster

When Kernels Multiply, Clusters Unify: Fusing Embeddings with the Kronecker Product

Youqi WU · Jingwei Zhang · Farzan Farnia

Dec 4, 11:00 AM - 2:00 PM Don Alberto 4

State-of-the-art embeddings often capture distinct yet complementary discriminative features: For instance, one image embedding model may excel at distinguishing fine-grained textures, while another focuses on object-level structure. Motivated by this observation, we propose a principled approach to fuse such complementary representations through kernel multiplication. Multiplying the kernel similarity functions of two embeddings allows their discriminative structures to interact, producing a fused representation whose kernel encodes the union of the clusters identified by each parent embedding. This formulation also provides a natural way to construct joint kernels for paired multi-modal data (e.g., image–text tuples), where the product of modality-specific kernels inherits structure from both domains. We highlight that this kernel product is mathematically realized via the Kronecker product of the embedding feature maps, yielding our proposed KrossFuse framework for embedding fusion. To address the computational cost of the resulting high-dimensional Kronecker space, we further develop RP-KrossFuse, a scalable variant that leverages random projections for efficient approximation. As a key application, we use this framework to bridge the performance gap between cross-modal embeddings (e.g., CLIP, BLIP) and unimodal experts (e.g., DINOv2, E5). Experiments show that RP-KrossFuse effectively integrates these models, enhancing modality-specific performance while preserving cross-modal alignment. The project code is available at https://github.com/yokiwuuu/KrossFuse.
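
The identity behind this fusion is that the inner product of Kronecker-product features equals the product of the individual inner products. The sketch below checks this for linear kernels on small hand-written vectors. The helper names and the toy embeddings are assumptions for illustration, and the random-projection step of RP-KrossFuse is not shown.

```python
def dot(a, b):
    """Linear kernel: inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def kron(a, b):
    """Kronecker product of two vectors, of length len(a) * len(b)."""
    return [x * y for x in a for y in b]


# Two samples, each embedded by two different (hypothetical) models.
x_emb1, x_emb2 = [0.5, 1.0], [1.0, 0.0, -1.0]
y_emb1, y_emb2 = [2.0, -1.0], [0.5, 0.5, 1.0]

product_of_kernels = dot(x_emb1, y_emb1) * dot(x_emb2, y_emb2)
kernel_of_fused = dot(kron(x_emb1, x_emb2), kron(y_emb1, y_emb2))
```

The fused feature has dimension equal to the product of the two embedding dimensions, which is the cost that motivates a random-projection approximation.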

View full details

Poster

SSRB: Direct Natural Language Querying to Massive Heterogeneous Semi-Structured Data

Xin Zhang · Mingxin Li · Yanzhao Zhang · Dingkun Long · Yongqi Li · Yinghui Li · Pengjun Xie · Meishan Zhang · Wenjie Li · Min Zhang · Philip S Yu

Dec 4, 11:00 AM - 2:00 PM Don Alberto 4

Searching over semi-structured data with natural language (NL) queries has attracted sustained attention, enabling broader audiences to access information easily. As more applications, such as LLM agents and RAG systems, emerge to search and interact with semi-structured data, two major challenges have become evident: (1) the increasing diversity of domains and schema variations, making domain-customized solutions prohibitively costly; (2) the growing complexity of NL queries, which combine both exact field matching conditions and fuzzy semantic requirements, often involving multiple fields and implicit reasoning. These challenges make formal language querying or keyword-based search insufficient. In this work, we explore neural retrievers as a unified non-formal querying solution that directly indexes semi-structured collections and understands NL queries. We employ LLM-based automatic evaluation and build a large-scale semi-structured retrieval benchmark (SSRB) using LLM generation and filtering, containing 14M semi-structured objects from 99 different schemas across 6 domains, along with 8,485 test queries that combine both exact and fuzzy matching conditions. Our systematic evaluation of popular retrievers shows that current state-of-the-art models can achieve acceptable performance, yet they still lack precise understanding of matching constraints. In-domain training of dense retrievers, however, can significantly improve performance. We believe that our SSRB could serve as a valuable resource for future research in this area, and we hope to inspire further exploration of semi-structured retrieval with complex queries.

View full details

Poster

MUVR: A Multi-Modal Untrimmed Video Retrieval Benchmark with Multi-Level Visual Correspondence

Yue Feng · Jinwei Hu · Qijia Lu · Jiawei Niu · Li Tan · Shuo Yuan · Ziyi Yan · Yizhen Jia · Qingzhi He · Shiping Ge · Ethan Chen · Wentong Li · Limin Wang · Jie Qin

Dec 4, 11:00 AM - 2:00 PM Don Alberto 4

We propose the Multi-modal Untrimmed Video Retrieval task, along with a new benchmark (MUVR) to advance video retrieval for long-video platforms. MUVR aims to retrieve untrimmed videos containing relevant segments using multi-modal queries. It has the following features: 1) Practical retrieval paradigm: MUVR supports video-centric multi-modal queries, expressing fine-grained retrieval needs through long text descriptions, video tag prompts, and mask prompts. It adopts a one-to-many retrieval paradigm and focuses on untrimmed videos, tailored for long-video platform applications. 2) Multi-level visual correspondence: To cover common video categories (e.g., news, travel, dance) and precisely define retrieval matching criteria, we construct multi-level visual correspondence based on the core video content (e.g., news events, travel locations, dance moves) that users are interested in and want to retrieve. It covers six levels: copy, event, scene, instance, action, and others. 3) Comprehensive evaluation criteria: We develop 3 versions of MUVR (i.e., Base, Filter, QA). MUVR-Base/Filter evaluates retrieval models, while MUVR-QA assesses MLLMs in a question-answering format. We also propose a Reranking Score to evaluate the reranking ability of MLLMs. MUVR consists of 53K untrimmed videos from the video platform Bilibili, with 1,050 multi-modal queries and 84K matches. Extensive evaluations of 3 state-of-the-art video retrieval models, 6 image-based VLMs, and 10 MLLMs are conducted. MUVR reveals the limitations of retrieval methods in processing untrimmed videos and multi-modal queries, as well as those of MLLMs in multi-video understanding and reranking. Our code and benchmark are available at https://github.com/debby-0527/MUVR.

View full details

Poster

Limitations of Normalization in Attention

Timur Mudarisov · Mikhail Burtsev · Tatiana Petrova · Radu State

Dec 4, 11:00 AM - 2:00 PM Don Alberto 4

This paper investigates the limitations of normalization in attention mechanisms. We begin with a theoretical framework that enables the identification of the model's selective ability and the geometric separation involved in token selection. Our analysis includes explicit bounds on distances and separation criteria for token vectors under softmax scaling. Through experiments with a pre-trained GPT-2 model, we empirically validate our theoretical results and analyze key behaviors of the attention mechanism. Notably, we demonstrate that as the number of selected tokens increases, the model's ability to distinguish informative tokens declines, often converging toward a uniform selection pattern. We also show that gradient sensitivity under softmax normalization presents challenges during training, especially at low temperature settings. These findings advance the current understanding of softmax-based attention mechanisms and motivate the need for more robust normalization and selection strategies in future attention architectures.
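
One elementary way to see the drift toward uniform selection: if the logits over n tokens span a range of at most Δ, the largest softmax weight is at most e^(Δ/T) / (e^(Δ/T) + n - 1) at temperature T, which shrinks as n grows. The sketch below checks this simple bound numerically. It is an illustration under a bounded-logit assumption, not one of the bounds derived in the paper, and the function names are chosen for this example.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Numerically stable softmax with a temperature parameter."""
    m = max(logits)
    exps = [math.exp((z - m) / temperature) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def max_weight_bound(n, logit_range, temperature=1.0):
    """Upper bound on the largest softmax weight among n tokens."""
    gap = math.exp(logit_range / temperature)
    return gap / (gap + n - 1)


random.seed(0)
logit_range = 4.0
for n in (8, 64, 512):
    logits = [random.uniform(0.0, logit_range) for _ in range(n)]
    largest = max(softmax(logits))
    bound = max_weight_bound(n, logit_range)
    print(n, round(largest, 4), round(bound, 4))
```

Lowering the temperature enlarges the effective range Δ/T and loosens the bound, which is consistent with sharper but more gradient-sensitive attention at low temperatures.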

View full details

Poster

A unified framework for establishing the universal approximation of transformer-type architectures

Jingpu Cheng · Ting Lin · Zuowei Shen · Qianxiao Li

Dec 4, 11:00 AM - 2:00 PM Don Alberto 4

We investigate the universal approximation property (UAP) of transformer-type architectures, providing a unified theoretical framework that extends prior results on residual networks to models incorporating attention mechanisms. Our work identifies token distinguishability as a fundamental requirement for UAP and introduces a general sufficient condition that applies to a broad class of architectures. Leveraging an analyticity assumption on the attention layer, we can significantly simplify the verification of this condition, providing a non-constructive approach to establishing UAP for such architectures. We demonstrate the applicability of our framework by proving UAP for transformers with various attention mechanisms, including kernel-based and sparse ones. The corollaries of our results either generalize prior works or establish UAP for architectures not previously covered. Furthermore, our framework offers a principled foundation for designing novel transformer architectures with inherent UAP guarantees, including those with specific functional symmetries. We provide examples to illustrate these insights.

View full details

Poster

A machine learning approach that beats Rubik's cubes

Alexander Chervov · Kirill Khoruzhii · Nikita Bukhal · Jalal Naghiyev · Vladislav Zamkovoy · Ivan Koltsov · Lyudmila Cheldieva · Arsenii Sychev · Arsenii Lenin · Mark Obozov · Egor Urvanov · Alexey Romanov

Dec 4, 11:00 AM - 2:00 PM Don Alberto 4

The paper proposes a novel machine learning-based approach to the pathfinding problem on extremely large graphs. This method leverages diffusion distance estimation via a neural network and uses beam search for pathfinding. We demonstrate its efficiency by finding solutions for 4x4x4 and 5x5x5 Rubik's cubes with unprecedentedly short solution lengths, outperforming all available solvers and introducing the first machine learning solver beyond the 3x3x3 case. In particular, it surpasses every single case of the combined best results in the Kaggle Santa 2023 challenge, which involved over 1,000 teams. For the 3x3x3 Rubik's cube, our approach achieves an optimality rate exceeding 98%, matching the performance of task-specific solvers and significantly outperforming prior solutions such as DeepCubeA (60.3%) and EfficientCube (69.6%). Our solution in its current implementation is approximately 25.6 times faster in solving 3x3x3 Rubik's cubes while requiring up to 8.5 times less model training time than the most efficient state-of-the-art competitor. Finally, it is demonstrated that even a single agent trained using a relatively small number of examples can robustly solve a broad range of puzzles represented by Cayley graphs of size up to $10^{145}$, confirming the generality of the proposed method.

View full details

Poster

Information Theoretic Learning for Diffusion Models with Warm Start

Yirong Shen · Lu GAN · Cong Ling

Dec 4, 11:00 AM - 2:00 PM Don Alberto 4

Generative models that maximize model likelihood have gained traction in many practical settings. Among them, perturbation-based approaches underpin many state-of-the-art likelihood estimation models, yet they often face slow convergence and limited theoretical understanding. In this paper, we derive a tighter likelihood bound for noise-driven models to improve both the accuracy and efficiency of maximum likelihood learning. Our key insight extends the classical Kullback–Leibler (KL) divergence–Fisher information relationship to arbitrary noise perturbations, going beyond the Gaussian assumption and enabling structured noise distributions. This formulation allows flexible use of randomized noise distributions that naturally account for sensor artifacts, quantization effects, and data distribution smoothing, while remaining compatible with standard diffusion training. Treating the diffusion process as a Gaussian channel, we further express the mismatched entropy between data and model, showing that the proposed objective upper-bounds the negative log-likelihood (NLL). In experiments, our models achieve competitive NLL on CIFAR-10 and state-of-the-art results on ImageNet across multiple resolutions, all without data augmentation, and the framework extends naturally to discrete data.

View full details

Poster

Trust Region Reward Optimization and Proximal Inverse Reward Optimization Algorithm

Yang Chen · Menglin Zou · Jiaqi Zhang · Yitan Zhang · Junyi Yang · Gaël Gendron · Libo Zhang · Jiamou Liu · Michael Witbrock

Dec 4, 11:00 AM - 2:00 PM Don Alberto 4

Inverse Reinforcement Learning (IRL) learns a reward function to explain expert demonstrations. Modern IRL methods often use the adversarial (minimax) formulation that alternates between reward and policy optimization, which often leads to unstable training. Recent non-adversarial IRL approaches improve stability by jointly learning reward and policy via energy-based formulations but lack formal guarantees. This work bridges this gap. We first present a unified view showing canonical non-adversarial methods explicitly or implicitly maximize the likelihood of expert behavior, which is equivalent to minimizing the expected return gap. This insight leads to our main contribution: Trust Region Reward Optimization (TRRO), a framework that guarantees monotonic improvement in this likelihood via a Minorization-Maximization process. We instantiate TRRO into Proximal Inverse Reward Optimization (PIRO), a practical and stable IRL algorithm. Theoretically, TRRO provides the IRL counterpart to the stability guarantees of Trust Region Policy Optimization (TRPO) in forward RL. Empirically, PIRO matches or surpasses state-of-the-art baselines in reward recovery and policy imitation, with high sample efficiency, on MuJoCo and Gym-Robotics benchmarks and a real-world animal behavior modeling task.

View full details

Poster

The Atlas of In-Context Learning: How Attention Heads Shape In-Context Retrieval Augmentation

Patrick Kahardipraja · Reduan Achtibat · Thomas Wiegand · Wojciech Samek · Sebastian Lapuschkin

Dec 4, 11:00 AM - 2:00 PM Don Alberto 4

Large language models are able to exploit in-context learning to access external knowledge beyond their training data through retrieval-augmentation. While promising, its inner workings remain unclear. In this work, we shed light on the mechanism of in-context retrieval augmentation for question answering by viewing a prompt as a composition of informational components. We propose an attribution-based method to identify specialized attention heads, revealing in-context heads that comprehend instructions and retrieve relevant contextual information, and parametric heads that store entities' relational knowledge. To better understand their roles, we extract function vectors and modify their attention weights to show how they can influence the answer generation process. Finally, we leverage the gained insights to trace the sources of knowledge used during inference, paving the way towards safer and more transparent language models.

View full details

Poster

Multi-Objective Hyperparameter Selection via Hypothesis Testing on Reliability Graphs

Amirmohammad Farzaneh · Osvaldo Simeone

Dec 4, 11:00 AM - 2:00 PM Don Alberto 4

The selection of hyperparameters, such as prompt templates in large language models (LLMs), must often strike a balance between reliability and cost. In many cases, structural relationships between the expected reliability levels of the hyperparameters can be inferred from prior information and held-out data -- e.g., longer prompt templates may be more detailed and thus more reliable. However, existing hyperparameter selection methods either do not provide formal reliability guarantees or are unable to incorporate structured knowledge in the hyperparameter space. This paper introduces reliability graph-based Pareto testing (RG-PT), a novel multi-objective hyperparameter selection framework that maintains formal reliability guarantees in terms of false discovery rate (FDR), while accounting for known relationships among hyperparameters via a directed acyclic graph. Edges in the graph reflect expected reliability and cost trade-offs among hyperparameters, which are inferred via the Bradley-Terry (BT) ranking model from prior information and held-out data. Experimental evaluations demonstrate that RG-PT significantly outperforms existing methods such as learn-then-test (LTT) and Pareto testing (PT) through a more efficient exploration of the hyperparameter space.
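
For readers unfamiliar with the Bradley-Terry model mentioned in this abstract, the following is a minimal generic sketch of fitting BT scores from pairwise outcomes by gradient ascent. It is not the RG-PT procedure: the comparison data, the learning rate, and the function names are hypothetical, and how the paper turns fitted scores into graph edges is not specified here.

```python
import math


def bt_win_probability(score_i, score_j):
    """P(item i is preferred over item j) under Bradley-Terry log-strengths."""
    return 1.0 / (1.0 + math.exp(score_j - score_i))


def fit_bradley_terry(n_items, comparisons, lr=0.1, steps=2000):
    """Fit BT log-strengths by gradient ascent on the log-likelihood.

    comparisons: list of (winner_index, loser_index) pairs.
    Scores are re-centred each step because BT is shift-invariant.
    """
    scores = [0.0] * n_items
    for _ in range(steps):
        grad = [0.0] * n_items
        for winner, loser in comparisons:
            p = bt_win_probability(scores[winner], scores[loser])
            grad[winner] += 1.0 - p  # d/ds_winner of log sigmoid(s_w - s_l)
            grad[loser] -= 1.0 - p   # d/ds_loser of the same term
        scores = [s + lr * g for s, g in zip(scores, grad)]
        mean = sum(scores) / n_items
        scores = [s - mean for s in scores]
    return scores


if __name__ == "__main__":
    # Hypothetical outcomes: config 0 usually beats 1, and 1 usually beats 2.
    data = [(0, 1)] * 3 + [(1, 0)] + [(1, 2)] * 3 + [(2, 1)]
    print(fit_bradley_terry(3, data))
```

One plausible use of such scores is to order hyperparameter configurations by expected reliability before building a directed acyclic graph, but that step is an assumption of this sketch, not a description of the authors' method.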

View full details

Poster

Fair Matroid Selection

Kiarash Banihashem · MohammadTaghi Hajiaghayi · Danny Mittal

Dec 4, 11:00 AM - 2:00 PM Don Alberto 4

We investigate the problem of sequentially selecting elements of an unknown matroid in an online manner to form an independent set, with the goal of maximizing the minimum probability of acceptance across all elements, a property we define as $f$-fairness. Under adversarial arrival orders, we design an $\alpha(\ln(k)+1)$-fair algorithm, where $\alpha$ is the arboricity of the matroid and $k$ is the rank, a result that is nearly optimal. For laminar matroids, we develop a $(2\alpha-1)$-fair algorithm, which is optimal up to constant factors, achieved through a novel online coloring scheme. In the random arrival order setting, we achieve a $(4+o(1))\alpha$-fair algorithm for graphic matroids, matching the optimal result up to constant factors, relying on a novel technique for learning a degeneracy ordering using a sampled subset of edges. We further generalize our result to $p$-matchoids, obtaining a $\beta(p\ln k+1)$-fair algorithm for the adversarial arrival model, where $\beta$ is the optimal offline fairness. Notably, all our results can be extended to a setting with no prior knowledge of the matroid with only a logarithmic increase in the fairness factor.

View full details

Poster

V2V: Scaling Event-Based Vision through Efficient Video-to-Voxel Simulation

Hanyue Lou · Jinxiu Liang · Minggui Teng · Yi Wang · Boxin Shi

Dec 4, 11:00 AM - 2:00 PM Don Alberto 4

Event-based cameras offer unique advantages such as high temporal resolution, high dynamic range, and low power consumption. However, the massive storage requirements and I/O burdens of existing synthetic data generation pipelines and the scarcity of real data prevent event-based training datasets from scaling up, limiting the development and generalization capabilities of event vision models. To address this challenge, we introduce Video-to-Voxel (V2V), an approach that directly converts conventional video frames into event-based voxel grid representations, bypassing the storage-intensive event stream generation entirely. V2V enables a 150× reduction in storage requirements while supporting on-the-fly parameter randomization for enhanced model robustness. Leveraging this efficiency, we train several video reconstruction and optical flow estimation model architectures on 10,000 diverse videos totaling 52 hours—an order of magnitude larger than existing event datasets, yielding substantial improvements.

View full details

Poster

Doubly Robust Alignment for Large Language Models

Erhan Xu · Kai Ye · Hongyi Zhou · Luhan Zhu · Francesco Quinzan · Chengchun Shi

Dec 4, 11:00 AM - 2:00 PM Don Alberto 4

This paper studies reinforcement learning from human feedback (RLHF) for aligning large language models with human preferences. While RLHF has demonstrated promising results, many algorithms are highly sensitive to misspecifications in the underlying preference model (e.g., the Bradley-Terry model), the reference policy, or the reward function, resulting in undesirable fine-tuning. To address model misspecification, we propose a doubly robust preference optimization algorithm that remains consistent when either the preference model or the reference policy is correctly specified (without requiring both). Our proposal demonstrates superior and more robust performance than state-of-the-art algorithms, both in theory and in practice. The code is available at https://github.com/DRPO4LLM/DRPO4LLM

View full details

Poster

Shapley-Based Data Valuation for Weighted $k$-Nearest Neighbors

Guangyi Zhang · Qiyu Liu · Aristides Gionis

Dec 4, 11:00 AM - 2:00 PM Don Alberto 4

Data valuation quantifies the impact of individual data points on model performance, and Shapley values provide a principled approach to this important task due to their desirable axiomatic properties, albeit with high computational complexity. Recent breakthroughs have enabled fast computation of exact Shapley values for unweighted $k$-nearest neighbor ($k$NN) classifiers. However, extending this to weighted $k$NN models has remained a significant open challenge. The state-of-the-art methods either require quadratic time complexity or resort to approximation via sampling. In this paper, we show that a conceptually simple but overlooked approach --- data duplication --- can be applied to this problem, yielding a natural variant of weighted $k$NN-Shapley. However, a straightforward application of the data-duplication idea leads to increased data size and prohibitive computational and memory costs. We develop an efficient algorithm that avoids materializing the duplicated dataset by exploiting the structural properties of weighted $k$NN models, reducing the complexity to near-linear time in the original data size. Besides, we establish theoretical foundations for this approach through axiomatic characterization of the resulting values, and empirically validate the effectiveness and efficiency of our method.

View full details

Poster

One-Step is Enough: Sparse Autoencoders for Text-to-Image Diffusion Models

Viacheslav Surkov · Chris Wendler · Antonio Mari · Mikhail Terekhov · Justin Deschenaux · Robert West · Caglar Gulcehre · David Bau

Dec 4, 11:00 AM - 2:00 PM Don Alberto 4

For large language models (LLMs), sparse autoencoders (SAEs) have been shown to decompose intermediate representations that often are not interpretable directly into sparse sums of interpretable features, facilitating better control and subsequent analysis. However, similar analyses and approaches have been lacking for text-to-image models. We investigate the possibility of using SAEs to learn interpretable features for SDXL Turbo, a few-step text-to-image diffusion model. To this end, we train SAEs on the updates performed by transformer blocks within SDXL Turbo's denoising U-net in its 1-step setting. Interestingly, we find that they generalize to 4-step SDXL Turbo and even to the multi-step SDXL base model (i.e., a different model) without additional training. In addition, we show that their learned features are interpretable, causally influence the generation process, and reveal specialization among the blocks. We do so by creating RIEBench, a representation-based image editing benchmark, for editing images while they are generated by turning on and off individual SAE features. This allows us to track which transformer blocks' features are the most impactful depending on the edit category. Our work is the first investigation of SAEs for interpretability in text-to-image diffusion models and our results establish SAEs as a promising approach for understanding and manipulating the internal mechanisms of text-to-image models.

View full details

Poster

Fast Rate Bounds for Multi-Task and Meta-Learning with Different Sample Sizes

Hossein Zakerinia · Christoph Lampert

Dec 4, 11:00 AM - 2:00 PM Don Alberto 4

We present new fast-rate PAC-Bayesian generalization bounds for multi-task and meta-learning in the unbalanced setting, i.e. when the tasks have training sets of different sizes, as is typically the case in real-world scenarios. Previously, only standard-rate bounds were known for this situation, while fast-rate bounds were limited to the setting where all training sets are of equal size. Our new bounds are numerically computable as well as interpretable, and we demonstrate their flexibility in handling a number of cases where they give stronger guarantees than previous bounds. Besides the bounds themselves, we also make conceptual contributions: we demonstrate that the unbalanced multi-task setting has different statistical properties than the balanced situation, specifically that proofs from the balanced situation do not carry over to the unbalanced setting. Additionally, we shed light on the fact that the unbalanced situation allows two meaningful definitions of multi-task risk, depending on whether all tasks should be considered equally important or if sample-rich tasks should receive more weight than sample-poor ones.

View full details

Poster

Toward a Unified Geometry Understanding: Riemannian Diffusion Framework for Graph Generation and Prediction

Yisen Gao · Xingcheng Fu · Qingyun Sun · Jianxin Li · Xianxian LI

Dec 4, 11:00 AM - 2:00 PM Don Alberto 4

Graph diffusion models have made significant progress in learning structured graph data and have demonstrated strong potential for predictive tasks. Existing approaches typically embed node, edge, and graph-level features into a unified latent space, modeling prediction tasks including classification and regression as a form of conditional generation. However, due to the non-Euclidean nature of graph data, features of different curvatures are entangled in the same latent space without releasing their geometric potential. To address this issue, we aim to construct an ideal Riemannian diffusion model to capture distinct manifold signatures of complex graph data and learn their distribution. This goal faces two challenges: numerical instability caused by exponential mapping during the encoding process and manifold deviation during diffusion generation. To address these challenges, we propose GeoMancer: a novel Riemannian graph diffusion framework for both generation and prediction tasks. To mitigate numerical instability, we replace exponential mapping with an isometric-invariant Riemannian gyrokernel approach and decouple multi-level features onto their respective task-specific manifolds to learn optimal representations. To address manifold deviation, we introduce a manifold-constrained diffusion method and a self-guided strategy for unconditional generation, ensuring that the generated data remains aligned with the manifold signature. Extensive experiments validate the effectiveness of our approach, demonstrating superior performance across a variety of tasks.

View full details

Poster

IMPACT: Irregular Multi-Patch Adversarial Composition Based on Two‑Phase Optimization

Zenghui Yang · Xingquan Zuo · Hai Huang · Gang Chen · Xinchao Zhao · Tianle Zhang

Dec 4, 11:00 AM - 2:00 PM Don Alberto 4

Deep neural networks have become foundational in various applications but remain vulnerable to adversarial patch attacks. Crafting effective adversarial patches is inherently challenging due to the combinatorial complexity involved in jointly optimizing critical factors such as patch shape, location, number, and content. Existing approaches often simplify this optimization by addressing each factor independently, which limits their effectiveness. To tackle this significant challenge, we introduce a novel and flexible adversarial attack framework termed IMPACT (Irregular Multi-Patch Adversarial Composition based on Two-phase optimization). IMPACT uniquely enables comprehensive optimization of all essential patch factors using gradient-free methods. Specifically, we propose a novel dimensionality reduction encoding scheme that substantially lowers computational complexity while preserving expressive power. Leveraging this encoding, we further develop a two-phase optimization framework: phase 1 employs differential evolution for joint optimization of patch mask and content, while phase 2 refines patch content using an evolutionary strategy for enhanced precision. Additionally, we introduce a new aggregation algorithm explicitly designed to produce contiguous, irregular patches by merging localized regions, ensuring physical applicability. Extensive experiments demonstrate that our method significantly outperforms several state-of-the-art approaches, highlighting the critical benefit of jointly optimizing all patch factors in adversarial patch attacks.

View full details

Poster

Enhancing Privacy in Multimodal Federated Learning with Information Theory

Tianzhe Xiao · Yichen Li · Yining Qi · YI LIU · wangshi.ww · Haozhao Wang · Yi Wang · Ruixuan Li

Dec 4, 11:00 AM - 2:00 PM Don Alberto 4

Multimodal federated learning (MMFL) has gained increasing popularity due to its ability to leverage the correlation between various modalities while preserving data privacy for different clients. However, recent studies show that correlation between modalities increases the vulnerability of federated learning to Gradient Inversion Attacks (GIA). The complicated situation of MMFL privacy preservation can be summarized as follows: 1) different modalities transmit different amounts of information and thus require different protection strengths; 2) correlation between modalities should be taken into account. This paper introduces an information-theoretic perspective to analyze the privacy leaked in the process of MMFL and proposes a more reasonable protection method, Sec-MMFL, which assesses the information leakage possibility of each modality via conditional mutual information and adjusts the corresponding protection strength. Moreover, we use mutual information to reduce the cross-modality information leakage in MMFL. Experiments show that our method provides more balanced and comprehensive protection at an acceptable cost.

View full details

Poster

InstructHOI: Context-Aware Instruction for Multi-Modal Reasoning in Human-Object Interaction Detection

Jinguo Luo · Weihong Ren · Quanlong Zheng · Yanhao Zhang · Zhenlong Yuan · Zhiyong Wang · Haonan Lu · Honghai LIU

Dec 4, 11:00 AM - 2:00 PM Don Alberto 4

Recently, Large Foundation Models (LFMs), e.g., CLIP and GPT, have significantly advanced Human-Object Interaction (HOI) detection, due to their superior generalization and transferability. Prior HOI detectors typically employ single- or multi-modal prompts to generate discriminative representations for HOIs from pretrained LFMs. However, such prompt-based approaches focus on transferring HOI-specific knowledge but leave unexplored the potential reasoning capabilities of LFMs, which can provide informative context for ambiguous and open-world interaction recognition. In this paper, we propose InstructHOI, a novel method that leverages context-aware instructions to guide multi-modal reasoning for HOI detection. Specifically, to bridge the knowledge gap and enhance reasoning abilities, we first perform HOI-domain fine-tuning on a pretrained multi-modal LFM, using a generated dataset with 140K interaction-reasoning image-text pairs. Then, we develop a Context-aware Instruction Generator (CIG) to guide interaction reasoning. Unlike traditional language-only instructions, CIG first mines visual interactive context at the human-object level, which is then fused with linguistic instructions, forming multi-modal reasoning guidance. Furthermore, an Interest Token Selector (ITS) is adopted to adaptively filter image tokens based on context-aware instructions, thereby aligning the reasoning process with interaction regions. Extensive experiments on two public benchmarks demonstrate that our proposed method outperforms state-of-the-art methods under both supervised and zero-shot settings.

View full details

Poster

PT-MoE: An Efficient Finetuning Framework for Integrating Mixture-of-Experts into Prompt Tuning

Zongqian Li · Yixuan Su · Nigel Collier

Dec 4, 11:00 AM - 2:00 PM Don Alberto 4

Parameter-efficient fine-tuning (PEFT) methods have shown promise in adapting large language models, yet existing approaches exhibit counter-intuitive phenomena: integrating either matrix decomposition or mixture-of-experts (MoE) individually decreases performance across tasks, though decomposition improves results on specific domains despite reducing parameters, while MoE increases parameter count without a corresponding decrease in training efficiency. Motivated by these observations and the modular nature of prompt tuning (PT), we propose PT-MoE, a novel framework that integrates matrix decomposition with MoE routing for efficient PT. Evaluation results across 17 datasets demonstrate that PT-MoE achieves state-of-the-art performance in both question answering (QA) and mathematical problem solving tasks, improving F1 score by 1.49 points over PT and 2.13 points over LoRA in QA tasks, while improving mathematical accuracy by 10.75 points over PT and 0.44 points over LoRA, all while using 25% fewer parameters than LoRA. Our analysis reveals that while PT methods generally excel in QA tasks and LoRA-based methods in math datasets, the integration of matrix decomposition and MoE in PT-MoE yields complementary benefits: decomposition enables efficient parameter sharing across experts while MoE provides dynamic adaptation, collectively enabling PT-MoE to demonstrate cross-task consistency and generalization abilities. These findings, along with ablation studies on routing mechanisms and architectural components, provide insights for future PEFT methods.

View full details

Poster

Unifying and Enhancing Graph Transformers via a Hierarchical Mask Framework

Yujie Xing · Xiao Wang · Bin Wu · Hai Huang · Chuan Shi

Dec 4, 11:00 AM - 2:00 PM Don Alberto 4

Graph Transformers (GTs) have emerged as a powerful paradigm for graph representation learning due to their ability to model diverse node interactions. However, existing GTs often rely on intricate architectural designs tailored to specific interactions, limiting their flexibility. To address this, we propose a unified hierarchical mask framework that reveals an underlying equivalence between model architecture and attention mask construction. This framework enables a consistent modeling paradigm by capturing diverse interactions through carefully designed attention masks. Theoretical analysis under this framework demonstrates that the probability of correct classification positively correlates with the receptive field size and label consistency, leading to a fundamental design principle: an effective attention mask should ensure both a sufficiently large receptive field and a high level of label consistency. While no single existing mask satisfies this principle across all scenarios, our analysis reveals that hierarchical masks offer complementary strengths, motivating their effective integration. We then introduce M$^3$Dphormer, a Mixture-of-Experts based Graph Transformer with Multi-Level Masking and Dual Attention Computation. M$^3$Dphormer incorporates three theoretically grounded hierarchical masks and employs a bi-level expert routing mechanism to adaptively integrate multi-level interaction information. To ensure scalability, we further introduce a dual attention computation scheme that dynamically switches between dense and sparse modes based on local mask sparsity. Extensive experiments across multiple benchmarks demonstrate that M$^3$Dphormer achieves state-of-the-art performance, validating the effectiveness of our unified framework and model design.

View full details

Poster

HoloLLM: Multisensory Foundation Model for Language-Grounded Human Sensing and Reasoning

Chuhao Zhou · Jianfei Yang

Dec 4, 11:00 AM - 2:00 PM Don Alberto 4

Embodied agents operating in smart homes must understand human behavior through diverse sensory inputs and communicate via natural language. While Vision-Language Models (VLMs) have enabled impressive language-grounded perception, their reliance on visual data limits robustness in real-world scenarios with occlusions, poor lighting, or privacy constraints. In this paper, we introduce HoloLLM, a Multimodal Large Language Model (MLLM) that integrates uncommon but powerful sensing modalities, such as LiDAR, infrared, mmWave radar, and WiFi, to enable seamless human perception and reasoning across heterogeneous environments. We address two key challenges: (1) the scarcity of aligned modality-text data for rare sensors, and (2) the heterogeneity of their physical signal representations. To overcome these, we design a Universal Modality-Injection Projector (UMIP) that enhances pre-aligned modality embeddings with fine-grained, text-aligned features from tailored encoders via coarse-to-fine cross-attention without introducing significant alignment overhead. We further introduce a human-VLM collaborative data curation pipeline to generate paired textual annotations for sensing datasets. Extensive experiments on two newly constructed benchmarks show that HoloLLM significantly outperforms existing MLLMs, improving language-grounded human sensing accuracy by up to 30%. This work establishes a new foundation for real-world, language-informed multisensory embodied intelligence.

View full details

Poster

Improving the Straight-Through Estimator with Zeroth-Order Information

Ningfeng Yang · Tor Aamodt

Dec 4, 11:00 AM - 2:00 PM Don Alberto 4

We study the problem of training neural networks with quantized parameters. Learning low-precision quantized parameters by enabling computation of gradients via the Straight-Through Estimator (STE) can be challenging. While the STE enables back-propagation, which is a first-order method, recent works have explored the use of zeroth-order (ZO) gradient descent for fine-tuning. We note that the STE provides high-quality biased gradients, and ZO gradients are unbiased but can be expensive. We thus propose First-Order-Guided Zeroth-Order Gradient Descent (FOGZO) that reduces STE bias while reducing computations relative to ZO methods. Empirically, we show FOGZO improves the tradeoff between quality and training time in Quantization-Aware Pre-Training. Specifically, versus STE at the same number of iterations, we show a 1-8% accuracy improvement for DeiT Tiny/Small, 1-2% accuracy improvement on ResNet 18/50, and 1-22 perplexity point improvement for LLaMA models with up to 0.3 billion parameters. For the same loss, FOGZO yields a 796$\times$ reduction in computation versus n-SPSA for a 2-layer MLP on MNIST. Code is available at https://github.com/1733116199/fogzo
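
To make the two ingredients named in this abstract concrete, here is a minimal sketch of a straight-through gradient and a one-sample SPSA (zeroth-order) gradient for a toy linear model. It does not implement FOGZO, whose combination rule is not described in the abstract. The uniform quantizer, the squared-error loss, and all function names are illustrative assumptions.

```python
import random


def quantize(w, step=0.5):
    """Uniform quantizer: snap w to the nearest multiple of `step`."""
    return step * round(w / step)


def quantized_loss(weights, x, y, step=0.5):
    """Squared error of a linear model that uses quantized weights."""
    pred = sum(quantize(w, step) * xi for w, xi in zip(weights, x))
    return (pred - y) ** 2


def ste_gradient(weights, x, y, step=0.5):
    """Straight-through estimate: differentiate as if quantize were identity.

    The true derivative of `quantize` is zero almost everywhere, so this
    first-order estimate is cheap but biased.
    """
    pred = sum(quantize(w, step) * xi for w, xi in zip(weights, x))
    return [2.0 * (pred - y) * xi for xi in x]


def spsa_gradient(f, weights, c=0.5, rng=random):
    """One-sample SPSA estimate using only two evaluations of f."""
    delta = [rng.choice((-1.0, 1.0)) for _ in weights]
    plus = [w + c * d for w, d in zip(weights, delta)]
    minus = [w - c * d for w, d in zip(weights, delta)]
    scale = (f(plus) - f(minus)) / (2.0 * c)
    # For +/-1 perturbations, dividing by delta_i equals multiplying by it.
    return [scale * d for d in delta]


if __name__ == "__main__":
    w, x, y = [0.2, 0.9], [1.0, 2.0], 1.0
    print("STE :", ste_gradient(w, x, y))
    print("SPSA:", spsa_gradient(lambda v: quantized_loss(v, x, y), w))
```

A single SPSA sample is noisy, and many samples are typically averaged, which is the cost the abstract refers to when it calls ZO gradients expensive.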

View full details

Poster

Global Prompt Refinement with Non-Interfering Attention Masking for One-Shot Federated Learning

Zhuang Qi · Yu Pan · Lei Meng · Sijin Zhou · Han Yu · Xiaoxiao Li · Xiangxu Meng

Dec 4, 11:00 AM - 2:00 PM Don Alberto 4

Federated Prompt Learning (FPL) enables communication-efficient adaptation by tuning lightweight prompts on top of frozen pre-trained models. Existing FPL methods typically rely on global information, which is only available after the second training round, to facilitate collaboration among client models. Therefore, they are inherently dependent on multi-round communication to fully exhibit their strengths. Moreover, existing one-shot federated learning methods typically focus on fitting seen tasks, but lack cross-task generalization. To bridge this gap, we propose the global prompt refinement with non-interfering attention masking (GPR-NIAM) method for one-shot FPL. The core idea is to design a masking mechanism that restricts excessive interaction between the original text embeddings and the learnable prompt embeddings. GPR-NIAM achieves this through the collaboration of two key modules. Firstly, the attention isolation module suppresses attention from the learnable prompt tokens to the original text tokens, and reweights the reverse attention which preserves generalization across tasks. Secondly, the cross-silo collaborative refinement module integrates decentralized visual knowledge into a unified base and calibrates the global prompt through multi-source cross-modal knowledge alignment, further mitigating the inconsistency caused by data heterogeneity. Extensive experiments conducted on ten benchmark datasets under two tasks show that GPR-NIAM outperforms eight state-of-the-art methods in both class-level and domain-level generalization.

View full details

Poster

Soft-consensual Federated Learning for Data Heterogeneity via Multiple Paths

Sheng Huang · Lele Fu · Fanghua Ye · Tianchi Liao · Bowen Deng · zhangchuanfu · Chuan Chen

Dec 4, 11:00 AM - 2:00 PM Don Alberto 4

Federated learning enables collaborative training while preserving the privacy of all participants. However, the heterogeneity in data distribution across multiple training nodes poses significant challenges to the construction of federated models. Prior studies were dedicated to mitigating the effects of data heterogeneity by using global information as a blueprint and restricting the local update of the model for reaching a "hard consensus". But this practice makes it difficult to balance local and global information, and it neglects to negotiate amicably between local and global models to reach mutually agreeable results, called "soft consensus". In this paper, a multiple-path solving method is proposed to balance global and local features and combine these two feature preference paths to reach a soft consensus. Rather than relying on global information as the sole criterion, a negotiation process is employed to address the same objective by accommodating diverse feature preferences, thereby facilitating the discovery of a more plausible solution through multiple distinct pathways. Considering the overwhelming power of local features during local training, a swapping strategy is applied to weaken them to balance the solution paths. Moreover, to minimize the additional communication cost caused by the introduction of multiple paths, the solution of the task network is converted into data adaptation to reduce the amount of parameter transmission. Extensive experiments are conducted to demonstrate the advantages of the proposed method.

View full details

Poster

Train with Perturbation, Infer after Merging: A Two-Stage Framework for Continual Learning

Haomiao Qiu · Miao Zhang · Ziyue Qiao · Liqiang Nie

Dec 4, 11:00 AM - 2:00 PM Don Alberto 4

Continual Learning (CL) aims to enable models to continuously acquire new knowledge from a sequence of tasks while avoiding the forgetting of learned information. However, existing CL methods only rely on the parameters of the most recent task for inference, which makes them susceptible to catastrophic forgetting. Inspired by the recent success of model merging techniques, we propose Perturb-and-Merge (P&M), a novel continual learning framework that integrates model merging into the CL paradigm to mitigate forgetting. Specifically, after training on each task, P&M constructs a new model by forming a convex combination of the previous model and the newly trained task-specific model. Through theoretical analysis, we minimize the total loss increase across all tasks and derive a closed-form solution for the merging coefficient under mild assumptions. To further improve the performance of the merged model, we observe that the degradation introduced during merging can be alleviated by a regularization term composed of the task vector and the Hessian matrix of the loss function. Interestingly, we show that this term can be efficiently approximated using second-order symmetric finite differences, and a stochastic perturbation strategy along the task vector direction is accordingly devised which incurs no additional forward or backward passes while providing an effective approximation of the regularization term. Finally, we combine P&M with LoRA, a parameter-efficient fine-tuning method, to reduce memory overhead. Our proposed approach achieves state-of-the-art performance on several continual learning benchmark datasets. The code is available at https://github.com/qhmiao/P-M-for-Continual-Learning.
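
The two ingredients named in this abstract, a convex combination of parameter vectors and a second-order symmetric finite difference along the task vector, can be illustrated on a toy problem. The sketch below is not the authors' implementation: the quadratic `toy_loss`, the coefficient `alpha=0.25`, and all function names are assumptions chosen for illustration. It evaluates the loss explicitly, whereas the abstract describes a stochastic perturbation scheme that avoids extra passes.

```python
def merge(prev, new, alpha):
    """Convex combination (1 - alpha) * prev + alpha * new, elementwise."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return [(1.0 - alpha) * p + alpha * n for p, n in zip(prev, new)]


def curvature_along(loss, theta, direction, eps=1e-3):
    """Symmetric finite-difference estimate of d^T H d at theta."""
    plus = [t + eps * d for t, d in zip(theta, direction)]
    minus = [t - eps * d for t, d in zip(theta, direction)]
    return (loss(plus) + loss(minus) - 2.0 * loss(theta)) / eps ** 2


def toy_loss(w):
    # Hypothetical quadratic loss with Hessian diag(1, 4).
    return 0.5 * (w[0] ** 2 + 4.0 * w[1] ** 2)


theta_prev = [1.0, 1.0]   # parameters after earlier tasks (made up)
theta_new = [3.0, 0.0]    # parameters after the new task (made up)
task_vector = [n - p for p, n in zip(theta_prev, theta_new)]

merged = merge(theta_prev, theta_new, alpha=0.25)
curvature = curvature_along(toy_loss, theta_prev, task_vector)
```

For this quadratic the exact value of d^T H d with d = (2, -1) is 1·4 + 4·1 = 8, so `curvature` should agree with 8 up to floating-point error.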

View full details

Poster

Dynamic Diffusion Schrödinger Bridge in Astrophysical Observational Inversions

Ye Zhu · Duo Xu · Zhiwei Deng · Jonathan Tan · Olga Russakovsky

Dec 4, 11:00 AM - 2:00 PM Don Alberto 4

We study Diffusion Schrödinger Bridge (DSB) models in the context of dynamical astrophysical systems, specifically tackling observational inverse prediction tasks within Giant Molecular Clouds (GMCs) for star formation. We introduce the Astro-DSB model, a variant of DSB with the pairwise domain assumption tailored for astrophysical dynamics. By investigating its learning process and prediction performance in both physically simulated data and in real observations (the Taurus B213 data), we present two main takeaways. First, from the astrophysical perspective, our proposed paired DSB method improves interpretability, learning efficiency, and prediction performance over conventional astrostatistical and other machine learning methods. Second, from the generative modeling perspective, probabilistic generative modeling reveals improvements over discriminative pixel-to-pixel modeling in Out-Of-Distribution (OOD) testing cases of physical simulations with unseen initial conditions and different dominant physical processes. Our study expands research into diffusion models beyond the traditional visual synthesis application and provides evidence of the models' learning abilities beyond pure data statistics, paving a path for future physics-aware generative models which can align dynamics between machine learning and real (astro)physical systems.

View full details

Poster

Statistical inference for Linear Stochastic Approximation with Markovian Noise

Sergey Samsonov · Marina Sheshukova · Eric Moulines · Alexey Naumov

Dec 4, 11:00 AM - 2:00 PM Don Alberto 4

In this paper we derive non-asymptotic Berry–Esseen bounds for Polyak–Ruppert averaged iterates of the Linear Stochastic Approximation (LSA) algorithm driven by the Markovian noise. Our analysis yields $O(n^{-1/4})$ convergence rates to the Gaussian limit in the Kolmogorov distance. We further establish the non-asymptotic validity of a multiplier block bootstrap procedure for constructing the confidence intervals, guaranteeing consistent inference under Markovian sampling. Our work provides the first non-asymptotic guarantees on the rate of convergence of bootstrap-based confidence intervals for stochastic approximation with Markov noise. Moreover, we recover the classical rate of order $\mathcal{O}(n^{-1/8})$ up to logarithmic factors for estimating the asymptotic variance of the iterates of the LSA algorithm.
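
As a concrete picture of the object being analyzed, the following is a minimal scalar sketch of Linear Stochastic Approximation with Polyak-Ruppert averaging under Markovian noise. Everything specific here is an assumption for illustration, not taken from the paper: the two-state chain, the values of `A` and `b`, and the step-size schedule `0.5 / sqrt(k)`.

```python
import random


def lsa_polyak_ruppert(n_steps, seed=0):
    """Run scalar LSA driven by a two-state Markov chain; return the averaged iterate."""
    rng = random.Random(seed)
    A = (0.5, 1.5)   # state-dependent coefficient A(z); stationary mean is 1.0
    b = (0.5, 3.5)   # state-dependent target b(z); stationary mean is 2.0
    stay = 0.9       # probability of staying in the current state (symmetric chain)
    z = 0
    theta = 0.0
    running_sum = 0.0
    for k in range(1, n_steps + 1):
        step = 0.5 / k ** 0.5
        # LSA update: theta <- theta - step * (A(z) * theta - b(z))
        theta -= step * (A[z] * theta - b[z])
        running_sum += theta
        if rng.random() > stay:
            z = 1 - z
    return running_sum / n_steps


theta_bar = lsa_polyak_ruppert(20000)
```

The chain is symmetric, so its stationary distribution is uniform and the target solves mean(A) · θ = mean(b), that is θ* = 2. The averaged iterate `theta_bar` should land near that value; the rates quoted in the abstract concern how fast its fluctuations become Gaussian.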

View full details

Poster

Revealing Multimodal Causality with Large Language Models

Jin Li · Shoujin Wang · Qi Zhang · Feng Liu · Tongliang Liu · Longbing Cao · Shui Yu · Fang Chen

Dec 4, 11:00 AM - 2:00 PM Don Alberto 4

Uncovering cause-and-effect mechanisms from data is fundamental to scientific progress. While large language models (LLMs) show promise for enhancing causal discovery (CD) from unstructured data, their application to the increasingly prevalent multimodal setting remains a critical challenge. Even with the advent of multimodal LLMs (MLLMs), their efficacy in multimodal CD is hindered by two primary limitations: (1) difficulty in exploring intra- and inter-modal interactions for comprehensive causal variable identification; and (2) insufficiency to handle structural ambiguities with purely observational data. To address these challenges, we propose MLLM-CD, a novel framework for multimodal causal discovery from unstructured data. It consists of three key components: (1) a novel contrastive factor discovery module to identify genuine multimodal factors based on the interactions explored from contrastive sample pairs; (2) a statistical causal structure discovery module to infer causal relationships among discovered factors; and (3) an iterative multimodal counterfactual reasoning module to refine the discovery outcomes iteratively by incorporating the world knowledge and reasoning capabilities of MLLMs. Extensive experiments on both synthetic and real-world datasets demonstrate the effectiveness of the proposed MLLM-CD in revealing genuine factors and causal relationships among them from multimodal unstructured data. The implementation code and data are available at https://github.com/JinLi-i/MLLM-CD.

View full details

Poster

Computational Hardness of Reinforcement Learning with Partial $q^{\pi}$-Realizability

Shayan Karimi · Xiaoqi Tan

Dec 4, 11:00 AM - 2:00 PM Don Alberto 4

This paper investigates the computational complexity of reinforcement learning within a novel linear function approximation regime, termed partial $q^{\pi}$-realizability. In this framework, the objective is to learn an $\epsilon$-optimal policy with respect to a predefined policy set $\Pi$, under the assumption that all value functions corresponding to policies in $\Pi$ are linearly realizable. This framework adopts assumptions that are weaker than those in the $q^{\pi}$-realizability setting yet stronger than those in the $q^*$-realizability setup. As a result, it provides a more practical model for reinforcement learning scenarios where function approximation naturally arises. We prove that learning an $\epsilon$-optimal policy in this newly defined setting is computationally hard. More specifically, we establish NP-hardness under a parameterized greedy policy set (i.e., argmax) and, further, show that, unless NP = RP, an exponential lower bound (exponential in feature vector dimension) holds when the policy set contains softmax policies, under the Randomized Exponential Time Hypothesis. Our hardness results mirror those obtained in the $q^*$-realizability settings, and suggest that computational difficulty persists even when the policy class $\Pi$ is expanded beyond the optimal policy, reinforcing the unbreakable nature of the computational hardness result regarding partial $q^{\pi}$-realizability under two important policy sets. To establish our negative result, our primary technical contribution is a reduction from two complexity problems, $\delta$-Max-3SAT and $\delta$-Max-3SAT($b$), to instances of our problem settings: GLinear-$\kappa$-RL (under the greedy policy set) and SLinear-$\kappa$-RL (under the softmax policy set), respectively. Our findings indicate that positive computational results are generally unattainable in the context of partial $q^{\pi}$-realizability, in sharp contrast to the $q^{\pi}$-realizability setting under a generative access model.

View full details

Poster

Beyond Pairwise Connections: Extracting High-Order Functional Brain Network Structures under Global Constraints

Ling Zhan · Junjie Huang · Xiaoyao Yu · Wenyu Chen · Tao Jia

Dec 4, 11:00 AM - 2:00 PM Don Alberto 4

Functional brain network (FBN) modeling often relies on local pairwise interactions, whose limitation in capturing high-order dependencies is theoretically analyzed in this paper. Meanwhile, the computational burden and heuristic nature of current hypergraph modeling approaches hinder end-to-end learning of FBN structures directly from data distributions. To address this, we propose to extract high-order FBN structures under global constraints, and implement this as a Global Constraints oriented Multi-resolution (GCM) FBN structure learning framework. It incorporates 4 types of global constraint (signal synchronization, subject identity, expected edge numbers, and data labels) to enable learning FBN structures for 4 distinct levels (sample/subject/group/project) of modeling resolution. Experimental results demonstrate that GCM achieves up to a 30.6% improvement in relative accuracy and a 96.3% reduction in computational time across 5 datasets and 2 task settings, compared to 9 baselines and 10 state-of-the-art methods. Extensive experiments validate the contributions of individual components and highlight the interpretability of GCM. This work offers a novel perspective on FBN structure learning and provides a foundation for interdisciplinary applications in cognitive neuroscience. Code is publicly available on https://github.com/lzhan94swu/GCM.

View full details

Poster

ECO: Evolving Core Knowledge for Efficient Transfer

Fu Feng · Yucheng Xie · Ruixiao Shi · Jianlu Shen · Jingq Wang · Xin Geng

Dec 4, 11:00 AM - 2:00 PM Don Alberto 4

Knowledge in modern neural networks is often entangled and structurally opaque, making current transfer methods—typically based on reusing entire parameter sets—inefficient and inflexible. Efforts to improve flexibility by reusing partial parameters frequently depend on handcrafted heuristics or rigid structural assumptions, which constrain generalization. In contrast, biological evolution enables efficient knowledge transfer by encoding only essential information into genes through iterative refinement under environmental pressure. Inspired by this principle, we propose ECO, a framework that Evolves COre knowledge into modular, reusable neural components—termed learngenes—through similar evolutionary dynamics. To this end, we redefine learngenes as neural circuits and introduce Genetic Transfer Learning (GTL), a biologically inspired paradigm that establishes a genetic mechanism within neural networks in the context of supervised learning. GTL simulates evolutionary processes by generating diverse network populations, selecting high-performing individuals, and transferring their learngenes to subsequent generations. Through iterative refinement, GTL enables learngenes to accumulate transferable common knowledge. Extensive experiments show that ECO achieves efficient initialization and strong generalization across diverse models and tasks, while significantly reducing computational and memory costs compared to conventional methods.

View full details

Poster

Spurious-Aware Prototype Refinement for Reliable Out-of-Distribution Detection

Reihaneh Zohrabi · Hosein Hasani · Mahdieh Soleymani · Anna Rohrbach · Marcus Rohrbach · Mohammad Hossein Rohban

Dec 4, 4:30 PM - 7:30 PM Don Alberto 4

Out-of-distribution (OOD) detection is crucial for ensuring the reliability and safety of machine learning models in real-world applications, where they frequently face data distributions unseen during training. Despite progress, existing methods are often vulnerable to spurious correlations that mislead models and compromise robustness. To address this, we propose SPROD, a novel prototype-based OOD detection approach that explicitly addresses the challenge posed by unknown spurious correlations. Our post-hoc method refines class prototypes to mitigate bias from spurious features without additional data or hyperparameter tuning, and is broadly applicable across diverse backbones and OOD detection settings. We conduct a comprehensive spurious correlation OOD detection benchmarking, comparing our method against existing approaches and demonstrating its superior performance across challenging OOD datasets, such as CelebA, Waterbirds, UrbanCars, Spurious Imagenet, and the newly introduced Animals MetaCoCo. On average, SPROD improves AUROC by 4.8% and FPR@95 by 9.4% over the second best.

View full details

Poster

S-Crescendo: A Nested Transformer Weaving Framework for Scalable Nonlinear System in S-Domain Representation

Junlang Huang · Chen Hao · Li Luo · Yong Cai · Lexin Zhang · Tianhao Ma · Yitian Zhang · Zhong Guan

Dec 4, 4:30 PM - 7:30 PM Don Alberto 4

Simulation of high-order nonlinear systems requires extensive computational resources, especially in modern VLSI backend design where bifurcation-induced instability and chaos-like transient behaviors pose challenges. We present S-Crescendo, a nested transformer weaving framework that synergizes S-domain with neural operators for scalable time-domain prediction in high-order nonlinear networks, alleviating the computational bottlenecks of conventional solvers based on the Newton-Raphson method. By leveraging the partial-fraction decomposition of an n-th order transfer function into first-order modal terms with repeated poles and residues, our method bypasses the conventional Jacobian matrix-based iterations and efficiently reduces computational complexity from cubic $O(n^3)$ to linear $O(n)$. The proposed architecture seamlessly integrates an S-domain encoder with an attention-based correction operator to simultaneously isolate dominant response and adaptively capture higher-order non-linearities. Validated on order-1 to order-10 networks, our method achieves up to 0.99 test-set $R^2$ accuracy against HSPICE golden waveforms and accelerates simulation by up to 18$\times$, providing a scalable, physics-aware framework for high-dimensional nonlinear modeling.
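
The linear-cost evaluation mentioned in this abstract rests on a standard fact: once a transfer function is written in partial fractions, its time response is a sum of first-order modes. The sketch below shows only that textbook step, under the simplifying assumption of distinct real poles (the abstract also covers repeated poles). The example system $H(s) = 1/((s+1)(s+2))$ and the function name are hypothetical, and nothing here reproduces the paper's neural architecture.

```python
import math


def impulse_response(poles, residues, t):
    """Evaluate h(t) = sum_i r_i * exp(p_i * t), one term per first-order mode."""
    return sum(r * math.exp(p * t) for p, r in zip(poles, residues))


# H(s) = 1 / ((s + 1)(s + 2)) = 1 / (s + 1) - 1 / (s + 2)
poles = [-1.0, -2.0]
residues = [1.0, -1.0]

h_at_0 = impulse_response(poles, residues, 0.0)
h_at_1 = impulse_response(poles, residues, 1.0)
expected_at_1 = math.exp(-1.0) - math.exp(-2.0)
```

Each time point costs one exponential per pole, which is where the $O(n)$ count for an n-th order system comes from in this simplified setting.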

View full details

Poster

Rope to Nope and Back Again: A New Hybrid Attention Strategy

Bowen Yang · Bharat Venkitesh · Dwaraknath Gnaneshwar Talupuru · Hangyu Lin · David Cairuz · Phil Blunsom · Acyr Locatelli

Dec 4, 4:30 PM - 7:30 PM Don Alberto 4

Long-context large language models (LLMs) have achieved remarkable advancements, driven by techniques like Rotary Position Embedding (RoPE) (Su et al., 2023) and its extensions (Chen et al., 2023; Liu et al., 2024c; Peng et al., 2023). By adjusting RoPE parameters and incorporating training data with extended contexts, we can train performant models with considerably longer input sequences. However, existing RoPE-based methods exhibit performance limitations when applied to extended context lengths. This paper presents a comprehensive analysis of various attention mechanisms, including RoPE, No Positional Embedding (NoPE), and Query-Key Normalization (QK-Norm), identifying their strengths and shortcomings in long-context modeling. Our investigation identifies distinctive attention patterns in these methods and highlights their impact on long-context performance, providing valuable insights for architectural design. Building on these findings, we propose a novel architecture featuring a hybrid attention mechanism that integrates global and local attention spans. This design not only surpasses conventional RoPE-based transformer models with full attention in both long and short context tasks but also delivers substantial efficiency gains during training and inference.

View full details

Poster

GeoLLaVA-8K: Scaling Remote-Sensing Multimodal Large Language Models to 8K Resolution

Fengxiang Wang · Mingshuo Chen · Yueying Li · Di Wang · Haotian Wang · Zonghao Guo · Zefan Wang · Shan Boqi · Long Lan · Yulin Wang · Hongzhen Wang · Wenjing Yang · Bo Du · Jing Zhang

Dec 4, 4:30 PM - 7:30 PM Don Alberto 4

Ultra-high-resolution (UHR) remote sensing (RS) imagery offers valuable data for Earth observation but poses challenges for existing multimodal foundation models due to two key bottlenecks: (1) limited availability of UHR training data, and (2) token explosion caused by the large image size. To address data scarcity, we introduce **SuperRS-VQA** (avg. 8,376$\times$8,376) and **HighRS-VQA** (avg. 2,000$\times$1,912), the highest-resolution vision-language datasets in RS to date, covering 22 real-world dialogue tasks. To mitigate token explosion, our pilot studies reveal significant redundancy in RS images: crucial information is concentrated in a small subset of object-centric tokens, while pruning background tokens (e.g., ocean or forest) can even improve performance. Motivated by these findings, we propose two strategies: *Background Token Pruning* and *Anchored Token Selection*, to reduce the memory footprint while preserving key semantics. Integrating these techniques, we introduce **GeoLLaVA-8K**, the first RS-focused multimodal large language model capable of handling inputs up to 8K$\times$8K resolution, built on the LLaVA framework. Trained on SuperRS-VQA and HighRS-VQA, GeoLLaVA-8K sets a new state-of-the-art on the XLRS-Bench. Datasets and code were released at https://github.com/MiliLab/GeoLLaVA-8K.

View full details

Poster

Robust Graph Condensation via Classification Complexity Mitigation

Jiayi Luo · Qingyun Sun · Beining Yang · Haonan Yuan · Xingcheng Fu · Yanbiao Ma · Jianxin Li · Philip S Yu

Dec 4, 4:30 PM - 7:30 PM Don Alberto 4

Graph condensation (GC) has gained significant attention for its ability to synthesize smaller yet informative graphs. However, existing studies often overlook the robustness of GC in scenarios where the original graph is corrupted. In such cases, we observe that the performance of GC deteriorates significantly, while existing robust graph learning technologies offer only limited effectiveness. Through both empirical investigation and theoretical analysis, we reveal that GC is inherently an intrinsic-dimension-reducing process, synthesizing a condensed graph with lower classification complexity. Although this property is critical for effective GC performance, it remains highly vulnerable to adversarial perturbations. To tackle this vulnerability and improve GC robustness, we adopt the geometry perspective of graph data manifold and propose a novel Manifold-constrained Robust Graph Condensation framework named MRGC. Specifically, we introduce three graph data manifold learning modules that guide the condensed graph to lie within a smooth, low-dimensional manifold with minimal class ambiguity, thereby preserving the classification complexity reduction capability of GC and ensuring robust performance under universal adversarial attacks. Extensive experiments demonstrate the robustness of MRGC across diverse attack scenarios.

View full details

Poster

How Different from the Past? Spatio-Temporal Time Series Forecasting with Self-Supervised Deviation Learning

Haotian Gao · Zheng Dong · Jiawei Yong · Shintaro Fukushima · Kenjiro Taura · Renhe Jiang

Dec 4, 4:30 PM - 7:30 PM Don Alberto 4

Spatio-temporal forecasting is essential for real-world applications such as traffic management and urban computing. Although recent methods have shown improved accuracy, they often fail to account for dynamic deviations between current inputs and historical patterns. These deviations contain critical signals that can significantly affect model performance. To fill this gap, we propose $\textbf{ST-SSDL}$, a $\underline{S}$patio-$\underline{T}$emporal time series forecasting framework that incorporates a $\underline{S}$elf-$\underline{S}$upervised $\underline{D}$eviation $\underline{L}$earning scheme to capture and utilize such deviations. ST-SSDL anchors each input to its historical average and discretizes the latent space using learnable prototypes that represent typical spatio-temporal patterns. Two auxiliary objectives are proposed to refine this structure: a contrastive loss that enhances inter-prototype discriminability and a deviation loss that regularizes the distance consistency between input representations and corresponding prototypes to quantify deviation. Optimized jointly with the forecasting objective, these components guide the model to organize its hidden space and improve generalization across diverse input conditions. Experiments on six benchmark datasets show that ST-SSDL consistently outperforms state-of-the-art baselines across multiple metrics. Visualizations further demonstrate its ability to adaptively respond to varying levels of deviation in complex spatio-temporal scenarios. Our code and datasets are available at https://github.com/Jimmy-7664/ST-SSDL.

View full details

Poster

Breakthrough Sensor-Limited Single View: Towards Implicit Temporal Dynamics for Time Series Domain Adaptation

Mingyang Liu · Xinyang Chen · Xiucheng Li · Weili Guan · Liqiang Nie

Dec 4, 4:30 PM - 7:30 PM Don Alberto 4

Unsupervised domain adaptation has emerged as a pivotal paradigm for mitigating distribution shifts in time series analysis. The fundamental challenge in time series domain adaptation arises from the entanglement of domain shifts and intricate temporal patterns. Crucially, the latent continuous-time dynamics, which are often inaccessible due to sensor constraints, are only partially observable through discrete time series from an explicit sensor-limited single view. This partial observability hinders the modeling of intricate temporal patterns, impeding domain invariant representation learning. To mitigate the limitation, we propose EDEN (multiple Explicit Domain Enhanced adaptation Network), expanding the raw dataset to multi-scale explicit domains, multi-subspace explicit domains and multi-segment explicit domains. EDEN enhances domain adaptation with three coordinated modules tailored to integrate multiple explicit domains: (1) Multi-Scale Curriculum Adaptation implements progressive domain alignment from coarse-scale to fine-scale. (2) Quality-Aware Feature Fusion evaluates feature quality in multi-subspace explicit domains and adaptively integrates temporal-frequency features. (3) Temporal Coherence Learning enforces segment-level consistency with multi-segment explicit domains. The representation enriched by multiple explicit domains bridges the gap between partially observed discrete samples and the underlying implicit temporal dynamics, enabling more accurate approximation of implicit temporal patterns for effective cross-domain adaptation. Our comprehensive evaluation across 6 time series benchmarks demonstrates EDEN's consistent superiority, achieving average accuracy improvements of 4.8% over state-of-the-art methods in cross-domain scenarios. Code is available at the anonymous link: .

View full details

Poster

SPICED: A Synaptic Homeostasis-Inspired Framework for Unsupervised Continual EEG Decoding

Yangxuan Zhou · Sha Zhao · Jiquan Wang · Haiteng Jiang · Shijian Li · Tao Li · Gang Pan

Dec 4, 4:30 PM - 7:30 PM Don Alberto 4

The human brain achieves a dynamic stability-plasticity balance through synaptic homeostasis, a self-regulatory mechanism that stabilizes critical memory traces while preserving optimal learning capacities. Inspired by this biological principle, we propose SPICED: a neuromorphic framework that integrates the synaptic homeostasis mechanism for unsupervised continual EEG decoding, particularly addressing practical scenarios where new individuals with inter-individual variability emerge continually. SPICED comprises a novel synaptic network that enables dynamic expansion during continual adaptation through three bio-inspired neural mechanisms: (1) critical memory reactivation, which mimics brain functional specificity and selectively activates task-relevant memories to facilitate adaptation; (2) synaptic consolidation, which strengthens these reactivated critical memory traces and enhances their replay prioritizations for further adaptations; and (3) synaptic renormalization, which is periodically triggered to weaken global memory traces to preserve learning capacities. The interplay within synaptic homeostasis dynamically strengthens task-discriminative memory traces and weakens detrimental memories. By integrating these mechanisms with a continual learning system, SPICED preferentially replays task-discriminative memory traces that exhibit strong associations with newly emerging individuals, thereby achieving robust adaptations. Meanwhile, SPICED effectively mitigates catastrophic forgetting by suppressing the replay prioritization of detrimental memories during long-term continual learning. Validated on three EEG datasets, SPICED shows its effectiveness. More importantly, SPICED bridges biological neural mechanisms and artificial intelligence through synaptic homeostasis, providing insights into the broader applicability of bio-inspired principles.

View full details

Poster

RankMatch: A Novel Approach to Semi-Supervised Label Distribution Learning Leveraging Rank Correlation between Labels

Zhiqiang Kou · Yucheng Xie · Hailin Wang · Junyang Chen · Jingq Wang · Ming-Kun Xie · Shuo Chen · Yuheng Jia · Tongliang Liu · Xin Geng

Dec 4, 4:30 PM - 7:30 PM Don Alberto 4

Pseudo label based semi-supervised learning (SSL) for single-label and multi-label classification tasks has been extensively studied; however, semi-supervised label distribution learning (SSLDL) remains a largely unexplored area. Existing SSL methods fail in SSLDL because the pseudo-labels they generate only ensure overall similarity to the ground truth but do not preserve the ranking relationships between true labels, as they rely solely on KL divergence as the loss function during training. These skewed pseudo-labels lead the model to learn incorrect semantic relationships, resulting in reduced performance accuracy. To address these issues, we propose a novel SSLDL method called \textit{RankMatch}. \textit{RankMatch} fully considers the ranking relationships between different labels during the training phase with labeled data to generate higher-quality pseudo-labels. Furthermore, our key observation is that a flexible utilization of pseudo-labels can enhance SSLDL performance. Specifically, focusing solely on the ranking relationships between labels while disregarding their margins helps prevent model overfitting. Theoretically, we prove that incorporating ranking correlations enhances SSLDL performance and establish generalization error bounds for \textit{RankMatch}. Finally, extensive real-world experiments validate its effectiveness.

View full details

Poster

Local Curvature Descent: Squeezing More Curvature out of Standard and Polyak Gradient Descent

Peter Richtarik · Simone Maria Giancola · Dymitr Lubczyk · Robin Yadav

Dec 4, 4:30 PM - 7:30 PM Don Alberto 4

We contribute to the growing body of knowledge on more powerful and adaptive stepsizes for convex optimization, empowered by local curvature information. We do not go the route of fully-fledged second-order methods, which require the expensive computation of the Hessian. Instead, our key observation is that, for some problems (e.g., when minimizing the sum of squares of absolutely convex functions), local curvature information is readily available, and can be used to obtain surprisingly powerful matrix-valued stepsizes, and meaningful theory. In particular, we develop three new methods — LCD1, LCD2, and LCD3 — where the abbreviation stands for local curvature descent. While LCD1 generalizes gradient descent with fixed stepsize, LCD2 generalizes gradient descent with Polyak stepsize. Our methods enhance these classical gradient descent baselines with local curvature information, and our theory recovers the known rates in the special case when no curvature information is used. Our last method, LCD3, is a variable-metric version of LCD2; this feature leads to a closed-form expression for the iterates. Our empirical results are encouraging and show that the local curvature descent improves upon gradient descent.
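
For readers unfamiliar with the baseline that LCD2 is said to generalize, here is a minimal sketch of plain gradient descent with the classical Polyak stepsize, $\gamma_k = (f(x_k) - f^\star) / \|\nabla f(x_k)\|^2$. It does not implement LCD1, LCD2, or LCD3 and uses no curvature information. The test function, the starting point, and the assumption that the optimal value `f_star` is known are all illustrative choices.

```python
def polyak_gd(f, grad, x, f_star, n_steps):
    """Gradient descent with the classical Polyak stepsize (f_star assumed known)."""
    for _ in range(n_steps):
        g = grad(x)
        g_norm_sq = sum(gi * gi for gi in g)
        if g_norm_sq == 0.0:
            break  # stationary point reached
        step = (f(x) - f_star) / g_norm_sq
        x = [xi - step * gi for xi, gi in zip(x, g)]
    return x


def f(x):
    # Hypothetical ill-conditioned convex quadratic with minimum value 0 at the origin.
    return 0.5 * (x[0] ** 2 + 10.0 * x[1] ** 2)


def grad(x):
    return [x[0], 10.0 * x[1]]


x_start = [1.0, 1.0]
x_final = polyak_gd(f, grad, x_start, f_star=0.0, n_steps=500)
```

The stepsize adapts to the current function gap without any tuning, which is the property the abstract builds on when it adds matrix-valued curvature terms.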

View full details

Poster

LLM at Network Edge: A Layer-wise Efficient Federated Fine-tuning Approach

Jinglong Shen · Nan Cheng · Wenchao Xu · Haozhao Wang · Yifan guo · Jiajie Xu

Dec 4, 4:30 PM - 7:30 PM Don Alberto 4

Fine-tuning large language models (LLMs) poses significant computational burdens, especially in federated learning (FL) settings. We introduce Layer-wise Efficient Federated Fine-tuning (LEFF), a novel method designed to enhance the efficiency of FL fine-tuning while preserving model performance and minimizing client-side computational overhead. LEFF strategically selects layers for fine-tuning based on client computational capacity, thereby mitigating the straggler effect prevalent in heterogeneous environments. Furthermore, LEFF incorporates an importance-driven layer sampling mechanism, prioritizing layers with greater influence on model performance. Theoretical analysis demonstrates that LEFF achieves a convergence rate of $\mathcal{O}(1/\sqrt{T})$. Extensive experiments on diverse datasets demonstrate that LEFF attains superior computational efficiency and model performance compared to existing federated fine-tuning methods, particularly under heterogeneous conditions.

View full details

Poster

Leveraging robust optimization for llm alignment under distribution shifts

Mingye Zhu · Yi Liu · Zheren Fu · Yongdong Zhang · Zhendong Mao

Dec 4, 4:30 PM - 7:30 PM Don Alberto 4

Preference alignment methods are increasingly critical for steering large language models (LLMs) to generate outputs consistent with human values. While recent approaches often rely on synthetic data generated by LLMs for scalability and cost-efficiency reasons, this reliance can introduce distributional shifts that undermine the nuanced representation of human preferences needed for desirable outputs. In this paper, we propose a novel distribution-aware optimization framework that improves preference alignment despite such shifts. Our approach first leverages well-learned classifiers to assign a calibration value to each training sample, quantifying its alignment with the target human-preferred distribution. These values are then incorporated into a robust optimization objective that minimizes the worst-case loss over regions of the data space most relevant to human preferences. By explicitly focusing optimization on the target distribution, our approach mitigates the impact of distributional mismatch and improves the generation of responses that better reflect intended values.

View full details

Poster

Unsupervised Federated Graph Learning

Lele Fu · Tianchi Liao · Sheng Huang · Bowen Deng · zhangchuanfu · Shirui Pan · Chuan Chen

Dec 4, 4:30 PM - 7:30 PM Don Alberto 4

Federated graph learning (FGL) is a privacy-preserving paradigm for modeling distributed graph data, designed to train a powerful global graph neural network. Existing FGL methods predominantly rely on label information during training, while effective FGL in an unsupervised setting remains largely unexplored. In this paper, we address two key challenges in unsupervised FGL: 1) Local models tend to converge in divergent directions due to the lack of shared semantic information across clients, so the first challenge is how to align representation spaces among multiple clients. 2) Conventional federated weighted aggregation easily degrades the performance of the global model, which raises the second challenge, namely how to adaptively learn the global model parameters. In response to these two challenges, we propose a tailored framework named FedPAM, which is composed of two modules: Representation Space Alignment (RSA) and Adaptive Global Parameter Learning (AGPL). RSA leverages a set of learnable anchors to define the global representation space; local subgraphs are then aligned with them through fused Gromov-Wasserstein optimal transport, achieving representation space alignment across clients. AGPL stacks local model parameters into third-order tensors and adaptively integrates the global model parameters in a low-rank tensor space, which facilitates fusing high-order knowledge among clients. Extensive experiments on eight graph datasets demonstrate that the proposed FedPAM is superior to classical and SOTA methods.

Poster

Learning Grouped Lattice Vector Quantizers for Low-Bit LLM Compression

Xi Zhang · Xiaolin Wu · Jiamang Wang · Weisi Lin

Dec 4, 4:30 PM - 7:30 PM Don Alberto 4

Large Language Models (LLMs) have demonstrated remarkable capabilities but typically require extensive computational resources and memory for inference. Post-training quantization (PTQ) can effectively reduce these demands by storing weights in lower bit-width formats. However, standard uniform quantization often leads to notable performance degradation, particularly in low-bit scenarios. In this work, we introduce a Grouped Lattice Vector Quantization (GLVQ) framework that assigns each group of weights a customized lattice codebook, defined by a learnable generation matrix. To address the non-differentiability of the quantization process, we adopt Babai rounding to approximate nearest-lattice-point search during training, which enables stable optimization of the generation matrices. Once trained, decoding reduces to a simple matrix-vector multiplication, yielding an efficient and practical quantization pipeline. Experiments on multiple benchmarks show that our approach achieves a better trade-off between model size and accuracy compared to existing post-training quantization baselines, highlighting its effectiveness in deploying large models under stringent resource constraints. Our source code is available on GitHub: https://github.com/xzhang9308/GLVQ.
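
As an illustration of the two operations named in this abstract, the following is a minimal sketch of textbook Babai rounding and of decoding by a matrix-vector product. It is not taken from the GLVQ code base: the function names, the 2x2 generation matrix `G`, and the sample weight vector are assumptions chosen only for the example, and the columns of `G` are assumed to be the lattice basis vectors.

```python
import numpy as np

def babai_round(x, G):
    """Approximate nearest-lattice-point search (Babai rounding).

    Expresses x in the lattice basis given by the columns of G and
    rounds each real-valued coordinate to the nearest integer.
    """
    coords = np.linalg.solve(G, x)       # real coordinates of x in the basis
    return np.rint(coords).astype(int)   # integer lattice coordinates

def decode(z, G):
    """Decoding is a single matrix-vector product."""
    return G @ z

# Hypothetical generation matrix and weight group (illustrative values only).
G = np.array([[1.0, 0.5],
              [0.0, 0.75]])
w = np.array([0.9, 1.6])

z = babai_round(w, G)    # integer code stored for this weight group
w_hat = decode(z, G)     # reconstructed weights
print(z, w_hat)
```

Babai rounding returns a nearby lattice point rather than a guaranteed nearest one; its accuracy depends on how close to orthogonal the basis in `G` is.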

Poster

Pin the Tail on the Model: Blindfolded Repair of User-Flagged Failures in Text-to-Image Services

Gefei Tan · Ali Shahin Shamsabadi · Ellen Kolesnikova · Hamed Haddadi · Xiao Wang

Dec 4, 4:30 PM - 7:30 PM Don Alberto 4

Diffusion models are increasingly deployed in real-world text-to-image services. These models, however, encode implicit assumptions about the world based on web-scraped image-caption pairs used during training. Over time, such assumptions may become outdated, incorrect, or socially biased, leading to failures where the generated images misalign with users' expectations or evolving societal norms. Identifying and fixing such failures is challenging and, thus, a valuable asset for service providers, as failures often emerge post-deployment and demand specialized expertise and resources to resolve them. In this work, we introduce SURE, the first end-to-end framework that SecUrely REpairs failures flagged by users of diffusion-based services. SURE enables the service provider to securely collaborate with an external third-party specialized in model repairing (i.e., Model Repair Institute) without compromising the confidentiality of user feedback, the service provider's proprietary model, or the Model Repair Institute's proprietary repairing knowledge. To achieve the best possible efficiency, we propose a co-design of a model editing algorithm with a customized two-party cryptographic protocol. Our experiments show that SURE is highly practical: SURE securely and effectively repairs all 32 layers of Stable Diffusion v1.4 in under 17 seconds (four orders of magnitude more efficient than a general baseline). Our results demonstrate that practical, secure model repair is attainable for large-scale, modern diffusion services.

Poster

Explainably Safe Reinforcement Learning

Sabine Rieder · Stefan Pranger · Debraj Chakraborty · Jan Kretinsky · Bettina Könighofer

Dec 4, 4:30 PM - 7:30 PM Don Alberto 4

Trust in a decision-making system requires both safety guarantees and the ability to interpret and understand its behavior. This is particularly important for learned systems, whose decision-making processes are often highly opaque. Shielding is a prominent model-based technique for enforcing safety in reinforcement learning. However, because shields are automatically synthesized using rigorous formal methods, their decisions are often similarly difficult for humans to interpret. Recently, decision trees have become a customary representation for controllers and policies. However, since shields are inherently non-deterministic, their decision tree representations become too large to be explainable in practice. To address this challenge, we propose a novel approach for explainable safe RL that enhances trust by providing human-interpretable explanations of the shield's decisions. Our method represents the shielding policy as a hierarchy of decision trees, offering top-down, case-based explanations. At design time, we use a world model to analyze the safety risks of executing actions in given states. Based on this risk analysis, we construct both the shield and a high-level decision tree that classifies states into risk categories (safe, critical, dangerous, unsafe), providing an initial explanation of why a given situation may be safety-critical. At runtime, we generate localized decision trees that explain which actions are allowed and why others are deemed unsafe. Altogether, our method makes the safety aspect of safe-by-shielding reinforcement learning explainable. Our framework requires no additional information beyond what is already used for shielding, incurs minimal overhead, and can be readily integrated into existing shielded RL pipelines. In our experiments, we compute explanations using decision trees that are several orders of magnitude smaller than the original shield.

Poster

Trained Mamba Emulates Online Gradient Descent in In-Context Linear Regression

Jiarui Jiang · Wei Huang · Miao Zhang · Taiji Suzuki · Liqiang Nie

Dec 4, 4:30 PM - 7:30 PM Don Alberto 4

State-space models (SSMs), particularly Mamba, have emerged as an efficient Transformer alternative with linear complexity for long-sequence modeling. Recent empirical works demonstrate that Mamba's in-context learning (ICL) capabilities are competitive with those of Transformers, a critical capacity for large foundation models. However, theoretical understanding of Mamba's ICL remains limited, restricting deeper insights into its underlying mechanisms. Even fundamental tasks such as linear regression ICL, widely studied as a standard theoretical benchmark for Transformers, have not been thoroughly analyzed in the context of Mamba. To address this gap, we study the training dynamics of Mamba on the linear regression ICL task. By developing novel techniques tackling non-convex optimization with gradient descent related to Mamba's structure, we establish an exponential convergence rate to the ICL solution, and derive a loss bound that is comparable to that of Transformers. Importantly, our results reveal that Mamba can perform a variant of online gradient descent to learn the latent function in context. This mechanism is different from that of Transformers, which are typically understood to achieve ICL through gradient descent emulation. The theoretical results are verified by experimental simulation.
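
For readers unfamiliar with the term, the sketch below shows plain online gradient descent on a noise-free linear regression task, where each in-context example is seen once and triggers one update. It illustrates only the generic algorithm: the specific variant that the paper attributes to trained Mamba is not given in this abstract, and the learning rate, dimensions, and data below are assumptions made for the example.

```python
import numpy as np

def online_gd(X, y, lr=0.05):
    """Online gradient descent for the squared loss 0.5 * (w @ x - y)**2.

    Processes the examples one at a time, in order, with a single
    gradient step per example.
    """
    w = np.zeros(X.shape[1])
    for x_t, y_t in zip(X, y):
        err = w @ x_t - y_t      # prediction error on the current example
        w = w - lr * err * x_t   # gradient step using this example only
    return w

# Illustrative task: recover a hypothetical latent linear function.
rng = np.random.default_rng(0)
w_true = np.array([1.0, -2.0, 0.5])
X = rng.normal(size=(2000, 3))
y = X @ w_true

w_hat = online_gd(X, y)
print(np.round(w_hat, 3))
```

In contrast, batch gradient descent would use all examples in every step; the online form needs only the current example and a running state, which is closer in shape to a recurrent update.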

Poster

Reasoning Is Not a Race: When Stopping Early Beats Going Deeper

Mohan Zhang · Jiaxuan Gao · Shusheng Xu · YI WU

Dec 4, 4:30 PM - 7:30 PM Don Alberto 4

We study the use of Process Reward Models (PRMs) for guiding Long Chain-of-Thought (CoT) reasoning in large language models. Although PRMs deliver fine-grained feedback in standard tasks, PRM-guided beam search does not consistently outperform PRM-free approaches in long CoT reasoning. We trace this shortfall to a "step quality degradation": the expected step quality shows concave behavior, yielding unimodal or monotonically declining trends. To counteract this, we propose Z-Score Guided Early Stopping (ZGES), which halts search at the detected quality peak using local PRM-reward z-scores. Across multiple math benchmarks and model scales, ZGES outperforms both standard PRM-guided beam search and the PRM-free methods. Ablation studies further highlight the advantages and robustness of ZGES's adaptive stopping mechanism.
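
The abstract does not spell out the stopping rule, so the following is only a hypothetical sketch of how a local z-score could detect a quality peak. The window size, threshold, function name, and reward values are all assumptions for illustration and should not be read as the ZGES algorithm itself.

```python
import numpy as np

def zscore_stop_index(rewards, window=4, threshold=-1.0):
    """Return the index of the step to stop at (hypothetical rule).

    At each step, the current reward is compared with the mean and
    standard deviation of the preceding `window` rewards. If its z-score
    falls below `threshold`, quality is assumed to have passed its peak,
    and the best step seen so far is returned.
    """
    rewards = np.asarray(rewards, dtype=float)
    for t in range(window, len(rewards)):
        recent = rewards[t - window:t]
        std = recent.std()
        if std == 0:
            continue  # no variation in the window, z-score undefined
        z = (rewards[t] - recent.mean()) / std
        if z < threshold:
            return int(np.argmax(rewards[:t]))
    return len(rewards) - 1  # never triggered: keep the final step

# Illustrative unimodal reward trace: rises, peaks, then degrades.
rewards = [0.50, 0.60, 0.70, 0.80, 0.78, 0.40, 0.30]
stop = zscore_stop_index(rewards)
print(stop)
```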

Poster

Leveraging Importance Sampling to Detach Alignment Modules from Large Language Models

Yi Liu · Dianqing Liu · Mingye Zhu · Junbo Guo · Yongdong Zhang · Zhendong Mao

Dec 4, 4:30 PM - 7:30 PM Don Alberto 4

The widespread adoption of large language models (LLMs) across industries has increased the demand for high-quality and customizable outputs. However, traditional alignment methods often require retraining large pretrained models, making it difficult to quickly adapt and optimize LLMs for diverse applications. To address this limitation, we propose a novel Residual Alignment Model (RAM) that formalizes the alignment process as a type of importance sampling. In this framework, the unaligned upstream model serves as the proposal distribution, while the alignment process is framed as secondary sampling based on an autoregressive alignment module that acts as an estimator of the importance weights. This design enables a natural detachment of the alignment module from the target aligned model, improving flexibility and scalability. Based on this model, we derive an efficient sequence-level training strategy for the alignment module, which operates independently of the proposal module. Additionally, we develop a resampling algorithm with iterative token-level decoding to address the common first-token latency issue in comparable methods. Experimental evaluations on two leading open-source LLMs across diverse tasks, including instruction following, domain adaptation, and preference optimization, demonstrate that our approach consistently outperforms baseline models.
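
The importance-sampling view can be illustrated with generic sampling-importance-resampling on a toy three-outcome distribution: candidates are drawn from a proposal and one is re-drawn in proportion to its importance weight. This is a sketch of the general technique only, not the paper's token-level decoding algorithm; the distributions, candidate count, and function name are assumptions, with `proposal` standing in for the upstream model and `ratio` standing in for what the alignment module would estimate.

```python
import numpy as np

def importance_resample(candidates, weights, rng):
    """Pick one candidate with probability proportional to its weight."""
    w = np.asarray(weights, dtype=float)
    idx = rng.choice(len(candidates), p=w / w.sum())
    return candidates[idx]

rng = np.random.default_rng(0)
proposal = np.array([0.5, 0.3, 0.2])   # stand-in for the unaligned upstream model
target = np.array([0.1, 0.3, 0.6])     # stand-in for the aligned distribution
ratio = target / proposal              # importance weights (assumed known here)

draws = []
for _ in range(5000):
    cands = rng.choice(3, size=16, p=proposal)            # sample from the proposal
    draws.append(importance_resample(cands, ratio[cands], rng))

freq = np.bincount(draws, minlength=3) / len(draws)
print(np.round(freq, 2))   # approaches `target` as the candidate count grows
```

With a finite number of candidates the resampled distribution is only approximately the target; the approximation improves as more candidates are drawn per step.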

Poster

PhysDrive: A Multimodal Remote Physiological Measurement Dataset for In-vehicle Driver Monitoring

Wang · Xiao Yang · Qingyong Hu · Jack Tang · Can Liu · Dengbo He · Yuntao Wang · Yingcong Chen · Kaishun Wu

Dec 4, 4:30 PM - 7:30 PM Don Alberto 4

Robust and unobtrusive in-vehicle physiological monitoring is crucial for ensuring driving safety and user experience. While remote physiological measurement (RPM) offers a promising non-invasive solution, its translation to real-world driving scenarios is critically constrained by the scarcity of comprehensive datasets. Existing resources are often limited in scale, modality diversity, the breadth of biometric annotations, and the range of captured conditions, thereby omitting inherent real-world challenges in driving. Here, we present PhysDrive, the first large-scale multimodal dataset for contactless in-vehicle physiological sensing with dedicated consideration of various modality settings and driving factors. PhysDrive collects data from 48 drivers, including synchronized RGB, near-infrared camera, and raw mmWave radar data, accompanied by six synchronized ground truths (ECG, BVP, Respiration, HR, RR, and SpO2). It covers a wide spectrum of naturalistic driving conditions, including driver motions, dynamic natural light, vehicle types, and road conditions. We extensively evaluate both signal‑processing and deep‑learning methods on PhysDrive, establishing a comprehensive benchmark across all modalities, and release full open‑source code with compatibility for mainstream public toolboxes. We envision PhysDrive will serve as a foundational resource and accelerate research on multimodal driver monitoring and smart‑cockpit systems.

Poster

Robust Federated Finetuning of LLMs via Alternating Optimization of LoRA

Shuangyi Chen · Yuanxin Guo · Yue Ju · Hardik Dalal · Zhongwen Zhu · Ashish Khisti

Dec 4, 4:30 PM - 7:30 PM Don Alberto 4

Parameter-Efficient Fine-Tuning (PEFT) methods like Low-Rank Adaptation (LoRA) optimize federated training by reducing computational and communication costs. We propose RoLoRA, a federated framework using alternating optimization to fine-tune LoRA adapters. Our approach emphasizes the importance of learning up and down projection matrices to enhance expressiveness and robustness. We use both theoretical analysis and extensive experiments to demonstrate the advantages of RoLoRA over prior approaches that either generate imperfect model updates or limit expressiveness of the model. We provide a theoretical analysis on a linear model to highlight the importance of learning both the down-projection and up-projection matrices in LoRA. We validate the insights on a non-linear model and separately provide a convergence proof under general conditions. To bridge theory and practice, we conduct extensive experimental evaluations on language models, including RoBERTa-Large and Llama-2-7B, on diverse tasks and FL settings to demonstrate the advantages of RoLoRA over other methods.
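
To make the idea of alternating optimization concrete, here is a minimal sketch on a linear model with a rank-2 adapter `B @ A`: in even rounds every client updates only `A`, in odd rounds only `B`, and the server averages the results. The abstract does not describe the actual RoLoRA update schedule, so the single local gradient step, the learning rate, the initialization, and all names below are assumptions for illustration.

```python
import numpy as np

def local_step(A, B, X, Y, update_A, lr):
    """One local gradient step on A or on B for the model Y ~ X @ (B @ A).T."""
    # Gradient of 0.5 * mean squared error with respect to W = B @ A.
    G = (X @ (B @ A).T - Y).T @ X / len(X)
    if update_A:
        return A - lr * B.T @ G, B   # B stays frozen this round
    return A, B - lr * G @ A.T       # A stays frozen this round

def loss(A, B, clients):
    return np.mean([0.5 * np.mean(np.sum((X @ (B @ A).T - Y) ** 2, axis=1))
                    for X, Y in clients])

rng = np.random.default_rng(0)
d, m, r = 6, 4, 2
W_true = rng.normal(size=(m, r)) @ rng.normal(size=(r, d))  # hypothetical target
clients = []
for _ in range(3):
    X = rng.normal(size=(50, d))
    clients.append((X, X @ W_true.T))

A = 0.1 * rng.normal(size=(r, d))   # down-projection
B = 0.1 * rng.normal(size=(m, r))   # up-projection
initial = loss(A, B, clients)

for rnd in range(400):
    update_A = rnd % 2 == 0
    results = [local_step(A, B, X, Y, update_A, lr=0.05) for X, Y in clients]
    # Server-side averaging; only one factor actually changed this round.
    A = np.mean([a for a, _ in results], axis=0)
    B = np.mean([b for _, b in results], axis=0)

final = loss(A, B, clients)
print(initial, final)
```

One reason such a schedule is appealing: with one factor frozen, the product `B @ A` is linear in the other, so averaging the clients' factors equals averaging their model updates exactly.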

Poster

MSTAR: Box-free Multi-query Scene Text Retrieval with Attention Recycling

Liang Yin · Xudong Xie · Zhang Li · Xiang Bai · Yuliang Liu

Dec 4, 4:30 PM - 7:30 PM Don Alberto 4

Scene text retrieval has made significant progress with the assistance of accurate text localization. However, existing approaches typically require costly bounding box annotations for training. Besides, they mostly adopt a customized retrieval strategy but struggle to unify various types of queries to meet diverse retrieval needs. To address these issues, we introduce Multi-query Scene Text retrieval with Attention Recycling (MSTAR), a box-free approach for scene text retrieval. It incorporates progressive vision embedding to dynamically capture the multi-grained representation of texts and harmonizes free-style text queries with style-aware instructions. Additionally, a multi-instance matching module is integrated to enhance vision-language alignment. Furthermore, we build the Multi-Query Text Retrieval (MQTR) dataset, the first benchmark designed to evaluate the multi-query scene text retrieval capability of models, comprising four query types and 16k images. Extensive experiments demonstrate the superiority of our method across seven public datasets and the MQTR dataset. Notably, MSTAR marginally surpasses the previous state-of-the-art model by 6.4% in MAP on Total-Text while eliminating box annotation costs. Moreover, on the MQTR benchmark, MSTAR significantly outperforms the previous models by an average of 8.5%. The code and datasets are available at https://github.com/yingift/MSTAR.

Poster

More Than Generation: Unifying Generation and Depth Estimation via Text-to-Image Diffusion Models

Hongkai Lin · Dingkang Liang · Mingyang Du · Xin Zhou · Xiang Bai

Dec 5, 11:00 AM - 2:00 PM Don Alberto 4

Generative depth estimation methods leverage the rich visual priors stored in pretrained text-to-image diffusion models, demonstrating astonishing zero-shot capability. However, parameter updates during training lead to catastrophic degradation in the image generation capability of the pretrained model. We introduce MERGE, a unified model for image generation and depth estimation, starting from a pretrained text-to-image model with fixed parameters. MERGE demonstrates that the pretrained text-to-image model can do more than image generation and can be extended to depth estimation effortlessly. Specifically, MERGE introduces a plug-and-play framework that enables seamless switching between image generation and depth estimation modes through simple and pluggable converters. Meanwhile, we propose a Group Reuse Mechanism to encourage parameter reuse and improve the utilization of the additional learnable parameters. MERGE unleashes the powerful depth estimation capability of the pretrained text-to-image model while preserving its original image generation ability. Compared to other unified models for image generation and depth estimation, MERGE achieves state-of-the-art performance across multiple depth estimation benchmarks. The code and model will be made available.

Poster

Light-Weight Diffusion Multiplier and Uncertainty Quantification for Fourier Neural Operators

Albert Matveev · Sanmitra Ghosh · Aamal Hussain · James-Michael Leahy · Michalis Michaelides

Dec 5, 11:00 AM - 2:00 PM Don Alberto 4

Operator learning is a powerful paradigm for solving partial differential equations, with Fourier Neural Operators serving as a widely adopted foundation. However, FNOs face significant scalability challenges due to overparameterization and offer no native uncertainty quantification -- a key requirement for reliable scientific and engineering applications. Instead, neural operators rely on post hoc UQ methods that ignore geometric inductive biases. In this work, we introduce DINOZAUR: a diffusion-based neural operator parametrization with uncertainty quantification. Inspired by the structure of the heat kernel, DINOZAUR replaces the dense tensor multiplier in FNOs with a dimensionality-independent diffusion multiplier that has a single learnable time parameter per channel, drastically reducing parameter count and memory footprint without compromising predictive performance. By defining priors over those time parameters, we cast DINOZAUR as a Bayesian neural operator to yield spatially correlated outputs and calibrated uncertainty estimates. Our method achieves competitive or superior performance across several PDE benchmarks while providing efficient uncertainty quantification.
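
As a rough illustration of a heat-kernel-style spectral multiplier with one time parameter per channel, the sketch below filters a periodic 1D signal by multiplying its Fourier coefficients by exp(-t |k|^2). This is the textbook heat kernel on a unit-length periodic domain, used here as an assumption; the exact DINOZAUR parametrization is not given in the abstract, and the function name, grid, and values of `t` are illustrative.

```python
import numpy as np

def diffusion_multiplier_layer(u, t):
    """Apply a per-channel heat-kernel multiplier in Fourier space.

    u: array of shape (channels, n), real samples on a periodic grid of length 1.
    t: array of shape (channels,), one non-negative diffusion time per channel.
    """
    n = u.shape[-1]
    k = 2 * np.pi * np.fft.rfftfreq(n, d=1.0 / n)   # angular wavenumbers
    mult = np.exp(-t[:, None] * k[None, :] ** 2)    # depends on |k|^2 only
    u_hat = np.fft.rfft(u, axis=-1)
    return np.fft.irfft(u_hat * mult, n=n, axis=-1)

n = 64
x = np.arange(n) / n
u = np.stack([np.sin(2 * np.pi * x), np.sin(2 * np.pi * x)])
t = np.array([0.01, 0.0])   # channel 0 is smoothed, channel 1 is left unchanged

out = diffusion_multiplier_layer(u, t)
print(out[0].max(), out[1].max())
```

Because the multiplier depends on the wavenumber only through its squared magnitude, the number of learnable values per channel stays at one regardless of the spatial dimension, in contrast with a dense tensor of per-mode weights.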

Poster

AngleRoCL: Angle-Robust Concept Learning for Physically View-Invariant Adversarial Patches

Wenjun Ji · Yuxiang Fu · Luyang Ying · Deng-Ping Fan · Yuyi Wang · Ming-Ming Cheng · Ivor Tsang · Qing Guo

Dec 5, 11:00 AM - 2:00 PM Don Alberto 4

Cutting-edge works have demonstrated that text-to-image (T2I) diffusion models can generate adversarial patches that mislead state-of-the-art object detectors in the physical world, revealing detectors' vulnerabilities and risks. However, these methods neglect the T2I patches' attack effectiveness when observed from different views in the physical world (i.e., angle robustness of the T2I adversarial patches). In this paper, we study the angle robustness of T2I adversarial patches comprehensively, revealing their angle-robustness issues and demonstrating that texts significantly affect the angle robustness of generated patches, while task-specific linguistic instructions fail to enhance it. Motivated by these studies, we introduce Angle-Robust Concept Learning (AngleRoCL), a simple and flexible approach that learns a generalizable concept (i.e., text embeddings in implementation) representing the capability of generating angle-robust patches. The learned concept can be incorporated into textual prompts and guides T2I models to generate patches whose attack effectiveness is inherently resistant to viewpoint variations. Through extensive simulation and physical-world experiments on five SOTA detectors across multiple views, we demonstrate that AngleRoCL significantly enhances the angle robustness of T2I adversarial patches compared to baseline methods. Our patches maintain high attack success rates even under challenging viewing conditions, with over 50% average relative improvement in attack effectiveness across multiple angles. This research advances the understanding of physically angle-robust patches and provides insights into the relationship between textual concepts and physical properties in T2I-generated content. We released our code at https://github.com/tsingqguo/anglerocl.

Poster

Nearly-Linear Time Private Hypothesis Selection with the Optimal Approximation Factor

Maryam Aliakbarpour · Zhan Shi · Ria Stevens · Vincent Wang

Dec 5, 11:00 AM - 2:00 PM Don Alberto 4

Estimating the density of a distribution from its samples is a fundamental problem in statistics. Hypothesis selection addresses the setting where, in addition to a sample set, we are given $n$ candidate distributions (referred to as hypotheses), and the goal is to determine which one best describes the underlying data distribution. This problem is known to be solvable very efficiently, requiring roughly $O(\log n)$ samples and running in $\tilde{O}(n)$ time. The quality of the output is measured via the total variation distance to the unknown distribution, and the approximation factor of the algorithm determines how large this distance is compared to the optimal distance achieved by the best candidate hypothesis. It is known that $\alpha = 3$ is the optimal approximation factor for this problem. We study hypothesis selection under the constraint of differential privacy. We propose a differentially private algorithm in the central model that runs in nearly linear time with respect to the number of hypotheses, achieves the optimal approximation factor, and incurs only a modest increase in sample complexity, which remains polylogarithmic in $n$. This resolves an open question posed by [Bun, Kamath, Steinke, Wu, NeurIPS 2019]. Prior to our work, existing upper bounds required quadratic time.

Poster

SHAP values via sparse Fourier representation

Ali Gorji · Andisheh Amrollahi · Andreas Krause

Dec 5, 11:00 AM - 2:00 PM Don Alberto 4

SHAP (SHapley Additive exPlanations) values are a widely used method for local feature attribution in interpretable and explainable AI. We propose an efficient two-stage algorithm for computing SHAP values for both black-box settings and tree-based models. Motivated by spectral bias in real-world predictors, we first approximate models using compact Fourier representations, exactly for trees and approximately for black-box models. In the second stage, we introduce a closed-form formula for exactly computing SHAP values using the Fourier representation, which "linearizes" the computation into a simple summation and is amenable to parallelization. As the Fourier approximation is computed only once, our method enables amortized SHAP value computation, achieving significant speedups over existing methods and a tunable trade-off between efficiency and precision.

Poster

The Hawthorne Effect in Reasoning Models: Evaluating and Steering Test Awareness

Sahar Abdelnabi · Ahmed Salem

Dec 5, 11:00 AM - 2:00 PM Don Alberto 4

Reasoning-focused LLMs sometimes alter their behavior when they detect that they are being evaluated—which can lead them to optimize for test-passing performance or to comply more readily with harmful prompts if real-world consequences appear absent. We present the first quantitative study of how such “test awareness” impacts model behavior, particularly its performance on safety-related tasks. We introduce a white-box probing framework that (i) linearly identifies awareness-related activations and (ii) steers models toward or away from test awareness while monitoring downstream performance. We apply our method to different state-of-the-art open-weight reasoning LLMs across both realistic and hypothetical tasks (denoting tests or simulations). Our results demonstrate that test awareness significantly impacts safety alignment (such as compliance with harmful requests and conforming to stereotypes) with effects varying in both magnitude and direction across models. By providing control over this latent effect, our work aims to provide a stress-test mechanism and increase trust in how we perform safety evaluations.

Poster

Mysteries of the Deep: Role of Intermediate Representations in Out of Distribution Detection

Ignacio Meza De la Jara · Cristian Rodriguez-Opazo · Damien Teney · Damith Ranasinghe · Ehsan Abbasnejad

Dec 5, 11:00 AM - 2:00 PM Don Alberto 4

Out-of-distribution (OOD) detection is essential for reliably deploying machine learning models in the wild. Yet, most methods treat large pre-trained models as monolithic encoders and rely solely on their final-layer representations for detection. We challenge this wisdom. We reveal that the intermediate layers of pre-trained models, shaped by residual connections that subtly transform input projections, can encode surprisingly rich and diverse signals for detecting distributional shifts. Importantly, to exploit latent representation diversity across layers, we introduce an entropy-based criterion to automatically identify layers offering the most complementary information in a training-free setting, without access to OOD data. We show that selectively incorporating these intermediate representations can increase the accuracy of OOD detection by up to 10% in far-OOD and over 7% in near-OOD benchmarks compared to state-of-the-art training-free methods across various model architectures and training objectives. Our findings reveal a new avenue for OOD detection research and uncover the impact of various training objectives and model architectures on confidence-based OOD detection methods.

Poster

Reward-Aware Proto-Representations in Reinforcement Learning

Hon Tik Tse · Siddarth Chandrasekar · Marlos C. Machado

Dec 5, 11:00 AM - 2:00 PM Don Alberto 4

In recent years, the successor representation (SR) has attracted increasing attention in reinforcement learning (RL), and it has been used to address some of its key challenges, such as exploration, credit assignment, and generalization. The SR can be seen as representing the underlying credit assignment structure of the environment by implicitly encoding its induced transition dynamics. However, the SR is reward-agnostic. In this paper, we discuss a similar representation that also takes into account the reward dynamics of the problem. We study the default representation (DR), a recently proposed representation with limited theoretical (and empirical) analysis. Here, we lay some of the theoretical foundation underlying the DR in the tabular case by (1) deriving dynamic programming and (2) temporal-difference methods to learn the DR, (3) characterizing the basis for the vector space of the DR, and (4) formally extending the DR to the function approximation case through default features. Empirically, we analyze the benefits of the DR in many of the settings in which the SR has been applied, including (1) reward shaping, (2) option discovery, (3) exploration, and (4) transfer learning. Our results show that, compared to the SR, the DR gives rise to qualitatively different, reward-aware behaviour and quantitatively better performance in several settings.
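
As background for the comparison made in this abstract, the sketch below computes the successor representation in closed form for a small tabular Markov chain, using the standard identity SR = (I - gamma * P)^(-1). It covers only the SR; the default representation is not implemented here because its definition is not stated in the abstract. The transition matrix, discount, and reward vector are made-up values for illustration.

```python
import numpy as np

def successor_representation(P, gamma):
    """Closed-form SR for a fixed policy with state-transition matrix P.

    Entry (s, s') is the expected discounted number of visits to s'
    when starting from s: sum over t of gamma**t * P**t.
    """
    n = P.shape[0]
    return np.linalg.inv(np.eye(n) - gamma * P)

# Hypothetical 3-state chain that drifts toward an absorbing last state.
P = np.array([[0.9, 0.1, 0.0],
              [0.0, 0.9, 0.1],
              [0.0, 0.0, 1.0]])
gamma = 0.95

sr = successor_representation(P, gamma)

# The SR itself ignores rewards; values follow by a single product.
r = np.array([0.0, 0.0, 1.0])
v = sr @ r
print(np.round(sr, 2))
print(np.round(v, 2))
```

The SR matrix here is computed without ever looking at `r`, which is the sense in which it is reward-agnostic: any reward vector can be plugged in afterwards.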

Poster

GLNCD: Graph-Level Novel Category Discovery

Bowen Deng · Lele Fu · Sheng Huang · Tianchi Liao · Jialong Chen · Zhang Tao · Chuan Chen

Dec 5, 11:00 AM - 2:00 PM Don Alberto 4

Graph classification has long assumed a closed-world setting, limiting its applicability to real-world scenarios where new categories often emerge. To address this limitation, we introduce Graph-Level Novel Category Discovery (GLNCD), a new task aimed at identifying unseen graph categories without supervision from novel classes. We first adapt classical Novel Category Discovery (NCD) methods for images to the graph domain and evaluate these baseline methods on four diverse graph datasets curated for the GLNCD task. Our analysis reveals that these methods suffer a notable performance degradation compared to their image-based counterparts, due to two key challenges: (1) insufficient utilization of structural information in graph self-supervised learning (SSL), and (2) ineffective pseudo-labeling strategies based on ranking statistics (RS) that neglect graph structure. To alleviate these issues, we propose ProtoFGW-NCD, a framework consisting of two core components: ProtoFGW-CL, a novel graph SSL framework, and FGW-RS, a structure-aware pseudo-labeling method. Both components employ a differentiable Fused Gromov-Wasserstein (FGW) distance to effectively compare graphs by incorporating structural information. These components are built upon learnable prototype graphs, which enable efficient, parallel FGW-based graph comparisons and capture representative patterns within graph datasets. Experiments on four GLNCD benchmark datasets demonstrate the effectiveness of ProtoFGW-NCD.

Poster

When Less Language is More: Language-Reasoning Disentanglement Makes LLMs Better Multilingual Reasoners

Weixiang Zhao · Jiahe Guo · Yang Deng · Tongtong Wu · Wenxuan Zhang · Yulin Hu · Xingyu Sui · Yanyan Zhao · Wanxiang Che · Bing Qin · Tat-Seng Chua · Ting Liu

Dec 5, 11:00 AM - 2:00 PM Don Alberto 4

Multilingual reasoning remains a significant challenge for large language models (LLMs), with performance disproportionately favoring high-resource languages. Drawing inspiration from cognitive neuroscience, which suggests that human reasoning functions largely independently of language processing, we hypothesize that LLMs similarly encode reasoning and language as separable components that can be disentangled to enhance multilingual reasoning. To evaluate this, we perform a causal intervention by ablating language-specific representations at inference time. Experiments on 10 open-weight LLMs spanning 11 typologically diverse languages show that this language-specific ablation consistently boosts multilingual reasoning performance. Layer-wise analyses further confirm that language and reasoning representations can be effectively disentangled throughout the model, yielding improved multilingual reasoning capabilities, while preserving top-layer language features remains essential for maintaining linguistic fidelity. Compared to post-training methods such as supervised fine-tuning or reinforcement learning, our training-free language-reasoning disentanglement achieves comparable or superior results with minimal computational overhead. These findings shed light on the internal mechanisms underlying multilingual reasoning in LLMs and suggest a lightweight and interpretable strategy for improving cross-lingual generalization.

Poster

Aligning Evaluation with Clinical Priorities: Calibration, Label Shift, and Error Costs

Gerardo Flores · Alyssa H. Smith · Julia Fukuyama · Ashia Wilson

Dec 5, 11:00 AM - 2:00 PM Don Alberto 4

Machine learning-based decision support systems are increasingly deployed in clinical settings, where probabilistic scoring functions are used to inform and prioritize patient management decisions. However, widely used scoring rules, such as accuracy and AUC-ROC, fail to adequately reflect key clinical priorities, including calibration, robustness to distributional shifts, and sensitivity to asymmetric error costs. In this work, we propose a principled yet practical evaluation framework for selecting calibrated thresholded classifiers that explicitly accounts for uncertainty in class prevalences and domain-specific cost asymmetries. Building on the theory of proper scoring rules, particularly the Schervish representation, we derive an adjusted variant of cross-entropy (log score) that averages cost-weighted performance over clinically relevant ranges of class balance. The resulting evaluation is simple to apply, sensitive to clinical deployment conditions, and designed to prioritize models that are both calibrated and robust to real-world variations.

Poster

DrivingRecon: Large 4D Gaussian Reconstruction Model For Autonomous Driving

Hao LU · Tianshuo Xu · Wenzhao Zheng · Yunpeng Zhang · Wei Zhan · Dalong Du · Masayoshi TOMIZUKA · Kurt Keutzer · Yingcong Chen

Dec 5, 11:00 AM - 2:00 PM Don Alberto 4

Large reconstruction models have made remarkable progress and can directly predict 3D or 4D representations for unseen scenes and objects. However, current work has not systematically explored the potential of large reconstruction models in the field of autonomous driving. To fill this gap, we introduce the Large 4D Gaussian Reconstruction Model (DrivingRecon). With an elaborate yet simple framework design, it not only ensures efficient and high-quality reconstruction, but also provides potential for downstream tasks. There are two core contributions. First, the Prune and Dilate Block (PD-Block) is proposed to prune redundant and overlapping Gaussian points and dilate Gaussian points for complex objects. Second, dynamic and static decoupling is tailored to better learn temporally consistent geometry across time. Experimental results demonstrate that DrivingRecon significantly improves scene reconstruction quality compared to existing methods. Furthermore, we explore applications of DrivingRecon in model pre-training, vehicle type adaptation, and scene editing. Our code will be made available.

View full details

Poster

CymbaDiff: Structured Spatial Diffusion for Sketch-based 3D Semantic Urban Scene Generation

Li Liang · Bo Miao · Xinyu Wang · NAVEED AKHTAR · Jordan Vice · Ajmal Mian

Dec 5, 11:00 AM - 2:00 PM Don Alberto 4

Outdoor 3D semantic scene generation produces realistic and semantically rich environments for applications such as urban simulation and autonomous driving. However, advances in this direction are constrained by the absence of publicly available, well-annotated datasets. We introduce SketchSem3D, the first large‑scale benchmark for generating 3D outdoor semantic scenes from abstract freehand sketches and pseudo‑labeled annotations of satellite images. SketchSem3D includes two subsets, Sketch-based SemanticKITTI and Sketch-based KITTI-360 (containing LiDAR voxels along with their corresponding sketches and annotated satellite images), to enable standardized, rigorous, and diverse evaluations. We also propose Cylinder Mamba Diffusion (CymbaDiff), which significantly enhances spatial coherence in outdoor 3D scene generation. CymbaDiff imposes structured spatial ordering, explicitly captures cylindrical continuity and vertical hierarchy, and preserves both physical neighborhood relationships and global context within the generated scenes. Extensive experiments on SketchSem3D demonstrate that CymbaDiff achieves superior semantic consistency, spatial realism, and cross-dataset generalization. The code and dataset will be made available.

View full details

Poster

Mesh Interpolation Graph Network for Dynamic and Spatially Irregular Global Weather Forecasting

Zinan Zheng · Yang Liu · Jia Li

Dec 5, 11:00 AM - 2:00 PM Don Alberto 4

Graph neural networks have shown promising results in weather forecasting, which is critical for human activities such as agricultural planning and extreme weather preparation. However, most studies focus on finite and local areas for training, overlooking the influence of broader areas and limiting their ability to generalize effectively. Thus, in this work, we study global weather forecasting, where observations are irregularly distributed and dynamically varying in practice, requiring the model to generalize to unobserved locations. To address such challenges, we propose a general Mesh Interpolation Graph Network (MIGN) that models forecasting at irregularly distributed weather stations and consists of two key designs: (1) learning spatially irregular data with a regular mesh interpolation network to align the data; (2) leveraging a parametric spherical harmonics location embedding to further enhance spatial generalization ability. Extensive experiments on an up-to-date observation dataset show that MIGN significantly outperforms existing data-driven models. In addition, we show that MIGN has spatial generalization ability and is capable of generalizing to previously unseen stations.

View full details

Poster

The Quest for Universal Master Key Filters in DS-CNNs

Zahra Babaiee · Peyman M. Kiasari · Daniela Rus · Radu Grosu

Dec 5, 11:00 AM - 2:00 PM Don Alberto 4

A recent study has proposed the "Master Key Filters Hypothesis" for convolutional neural network filters. This paper extends this hypothesis by radically constraining its scope to a single set of just 8 universal filters that depthwise separable convolutional networks inherently converge to. While conventional DS-CNNs employ thousands of distinct trained filters, our analysis reveals these filters are predominantly linear shifts (ax+b) of our discovered universal set. Through systematic unsupervised search, we extracted these fundamental patterns across different architectures and datasets. Remarkably, networks initialized with these 8 unique frozen filters achieve over 80% ImageNet accuracy, and even outperform models with thousands of trainable parameters when applied to smaller datasets. The identified master key filters closely match Difference of Gaussians (DoGs), Gaussians, and their derivatives, structures that are not only fundamental to classical image processing but also strikingly similar to receptive fields in mammalian visual systems. Our findings provide compelling evidence that depthwise convolutional layers naturally gravitate toward this fundamental set of spatial operators regardless of task or architecture. This work offers new insights for understanding generalization and transfer learning through the universal language of these master key filters.
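The claim that trained filters are "linear shifts (ax+b)" of a master filter can be illustrated with a least-squares fit. The sketch below is only an illustration and not the authors' search procedure: the helper names `gaussian_kernel` and `fit_linear_shift`, the 7x7 kernel size, and the Gaussian master filter are all assumptions made for the example.

```python
import numpy as np

def gaussian_kernel(size=7, sigma=1.5):
    """Hypothetical 'master' filter: a normalized 2D Gaussian."""
    ax = np.arange(size) - (size - 1) / 2
    xx, yy = np.meshgrid(ax, ax)
    k = np.exp(-(xx**2 + yy**2) / (2 * sigma**2))
    return k / k.sum()

def fit_linear_shift(filt, master):
    """Find a, b minimizing ||a * master + b - filt||_2 over all filter entries."""
    A = np.stack([master.ravel(), np.ones(master.size)], axis=1)
    (a, b), *_ = np.linalg.lstsq(A, filt.ravel(), rcond=None)
    residual = np.linalg.norm(A @ np.array([a, b]) - filt.ravel())
    return a, b, residual

master = gaussian_kernel()
trained = -2.5 * master + 0.1          # synthetic filter that is an exact linear shift
a, b, residual = fit_linear_shift(trained, master)
```

A small residual relative to the filter's norm would indicate that the filter is well explained by a scaled and offset copy of the master filter.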

View full details

Poster

AI Progress Should Be Measured by Capability-Per-Resource, Not Scale Alone: A Framework for Gradient-Guided Resource Allocation in LLMs

David McCoy · Yulun Wu · Zachary Butzin-Dozier

Dec 5, 11:00 AM - 2:00 PM Don Alberto 4

This position paper challenges the "scaling fundamentalism" dominating AI research, where unbounded growth in model size and computation has led to unsustainable environmental impacts and widening resource inequality. We argue that LLM development should be fundamentally reoriented toward capability-per-resource rather than capability alone. We present a theoretical framework demonstrating that resource-allocation decisions guided by gradient influence patterns can dramatically improve efficiency throughout the AI lifecycle. Our analysis shows that in transformer-based models, where a small fraction of parameters exert outsized influence (following heavy-tailed distributions), three critical insights emerge: (1) updating only high-influence parameters strictly outperforms full-parameter tuning on a performance-per-resource basis; (2) simple gradient norms provide computationally efficient proxies for identifying these high-influence components; and (3) coordinated parameter and data selection yields multiplicative efficiency gains, potentially reducing resource requirements by orders of magnitude. Building on these theoretical foundations, we propose a two-stage paradigm—marginal-return pretraining for foundation developers and influence-guided adaptation for downstream users—bridged by gradient blueprints, metadata describing which parameters matter most for various tasks. This capability-per-resource perspective transforms what were once considered pragmatic hardware workarounds into theoretically optimal strategies, democratizing access to cutting-edge AI capabilities while significantly reducing environmental impact. By embedding resource consciousness into how we develop, adapt, and evaluate models, we can reshape AI progress toward a more sustainable and equitable future.

View full details

Poster

HypoBootstrap: A Bootstrapping Framework for Inductive Reasoning

Si Chen · Yifei Li · Richong Zhang

Dec 5, 11:00 AM - 2:00 PM Don Alberto 4

Inductive reasoning, which infers general rules from observed evidence, is one of the most critical abilities of intelligence. Previous works have succeeded with formal languages but suffer from onerous and error-prone conversions between a particular formal language and the working language. With the emergence of large language models (LLMs), direct reasoning in various kinds of languages, especially natural languages, has become feasible without involving a formal language. However, existing LLM-based inductive reasoning usually relies on the LLM's intrinsic generation ability, which is prone to hallucination and lacks systematic guidance grounded in the nature of inductive reasoning. To this end, we propose HypoBootstrap, an integrated framework for inductive reasoning that both generates and confirms hypotheses in a bootstrapping manner. For hypothesis generation, we propose a novel bootstrapping strategy that successively generates object hypotheses, relational hypotheses, and functional hypotheses, which helps the LLM observe the evidence from trivial patterns to non-trivial patterns. For hypothesis confirmation, we utilize Glymour's theory of bootstrap confirmation, a hypothesis confirmation theory from the philosophy of science that can confirm a set of hypotheses. We use its principles to confirm the object hypotheses, relational hypotheses, and functional hypotheses. Empirical studies on four inductive reasoning scenarios of different natures, involving causal induction, concept learning, grammar learning, and abstract reasoning, demonstrate that HypoBootstrap significantly outperforms existing methods.

View full details

Poster

COALA: Numerically Stable and Efficient Framework for Context-Aware Low-Rank Approximation

Uliana Parkina · Maxim Rakhuba

Dec 5, 11:00 AM - 2:00 PM Don Alberto 4

Recent studies suggest that context-aware low-rank approximation is a useful tool for compression and fine-tuning of modern large-scale neural networks. In this type of approximation, a norm is weighted by a matrix of input activations, significantly improving metrics over the unweighted case. Nevertheless, existing methods for neural networks suffer from numerical instabilities due to their reliance on classical formulas involving explicit Gram matrix computation and its subsequent inversion. We demonstrate that this can degrade the approximation quality or cause numerically singular matrices. To address these limitations, we propose a novel inversion-free regularized framework that is based entirely on stable decompositions and overcomes the numerical pitfalls of prior art. Our method can handle a range of challenging scenarios: (1) when calibration matrices exceed GPU memory capacity, (2) when input activation matrices are nearly singular, and even (3) when insufficient data prevents unique approximation. For the latter, we prove that our solution converges to a desired approximation and derive explicit error bounds.
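To make the weighted problem concrete, here is a minimal sketch of one inversion-free way to minimize the activation-weighted error ||(W - W')X||_F over rank-r matrices W', using only a QR factorization and an SVD. It is an illustration of the general idea and should not be read as the COALA algorithm itself (it omits regularization and out-of-memory handling); the function name `context_aware_low_rank` and the shape conventions are assumptions.

```python
import numpy as np

def context_aware_low_rank(W, X, rank):
    """Rank-`rank` W_r minimizing ||(W - W_r) @ X||_F without forming or inverting X @ X.T.

    W: (n_out, n_in) weight matrix; X: (n_in, n_samples) input activations.
    """
    # X.T = Q R with orthonormal columns in Q, so ||(W - W_r) X||_F = ||(W - W_r) R.T||_F.
    R = np.linalg.qr(X.T, mode="r")
    # Best rank-r approximation of W @ R.T comes from its leading left singular vectors.
    U, s, _ = np.linalg.svd(W @ R.T, full_matrices=False)
    U_r = U[:, :rank]
    # Projecting W onto span(U_r) attains the optimal weighted error, with no inverse of R.
    return U_r @ (U_r.T @ W), s

rng = np.random.default_rng(0)
W = rng.standard_normal((6, 8))
X = rng.standard_normal((8, 50))
W_r, s = context_aware_low_rank(W, X, rank=3)
weighted_err = np.linalg.norm((W - W_r) @ X)
```

Because the result is obtained by projection rather than by multiplying with an inverse, the same code runs when X is rank-deficient, although the minimizer is then no longer unique.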

View full details

Poster

A Closer Look at Graph Transformers: Cross-Aggregation and Beyond

Jiaming Zhuo · Ziyi Ma · Yintong Lu · Yuwei Liu · Kun Fu · Di Jin · Chuan Wang · Wu Wenning · Zhen Wang · Xiaochun Cao · Liang Yang

Dec 5, 11:00 AM - 2:00 PM Don Alberto 4

Graph Transformers (GTs), which effectively capture long-range dependencies and structural biases simultaneously, have recently emerged as promising alternatives to traditional Graph Neural Networks (GNNs). Advanced approaches for GTs to leverage topology information involve integrating GNN modules or modulating node attributes using positional encodings. Unfortunately, the underlying mechanism driving their effectiveness remains insufficiently understood. In this paper, we revisit these strategies and uncover a shared underlying mechanism—Cross Aggregation—that effectively captures the interaction between graph topology and node attributes. Building on this insight, we propose the Universal Graph Cross-attention Transformer (UGCFormer), a universal GT framework with linear computational complexity. The idea is to interactively learn the representations of graph topology and node attributes through a linearized Dual Cross-attention (DCA) module. In theory, this module can adaptively capture interactions between these two types of graph information, thereby achieving effective aggregation. To alleviate overfitting arising from the dual-channel design, we introduce a consistency constraint that enforces representational alignment. Extensive evaluations on multiple benchmark datasets demonstrate the effectiveness and efficiency of UGCFormer.

View full details

Poster

Bridging Time and Linguistics: LLMs as Time Series Analyzer through Symbolization and Segmentation

Jianyang Qin · Chaoyang Li · Jinhao Cui · Lingzhi Wang · Zhao Liu · Qing Liao

Dec 5, 11:00 AM - 2:00 PM Don Alberto 4

Recent studies reveal that Large Language Models (LLMs) exhibit strong sequential reasoning capabilities, allowing them to replace specialized time-series models and serve as foundation models for complex time-series analysis. To activate the capabilities of LLMs for time-series tasks, numerous studies have attempted to bridge the gap between time series and linguistics by aligning textual representations with time-series patterns. However, it is non-trivial to losslessly capture the infinite time-domain variability using natural language, which leads to suboptimal alignment performance. Beyond representation, existing methods often overlook contextual differences: semantics in time series are conveyed by consecutive points, whereas in text they are conveyed by individual tokens. To address these issues, we propose S$^2$TS-LLM, a simple yet effective framework to repurpose LLMs for universal time series analysis through the following two main paradigms: (i) a spectral symbolization paradigm transforms time series into frequency-domain representations characterized by a fixed number of components and prominent amplitudes, which enables a limited set of symbols to effectively abstract key frequency features; (ii) a contextual segmentation paradigm partitions the sequence into blocks based on temporal patterns and reassigns positional encodings accordingly, thereby mitigating the structural mismatch between time series and natural language. Together, these paradigms bootstrap the LLMs' perception of temporal patterns and structures, effectively bridging time series and linguistics. Extensive experiments show that S$^2$TS-LLM can serve as a powerful time series analyzer, outperforming state-of-the-art methods across time series tasks.

View full details

Poster

UniDomain: Pretraining a Unified PDDL Domain from Real-World Demonstrations for Generalizable Robot Task Planning

Haoming Ye · Yunxiao Xiao · Cewu Lu · Panpan Cai

Dec 5, 11:00 AM - 2:00 PM Don Alberto 4

Robotic task planning in real-world environments requires reasoning over implicit constraints from language and vision. While LLMs and VLMs offer strong priors, they struggle with long-horizon structure and symbolic grounding. Existing methods that combine LLMs with symbolic planning often rely on handcrafted or narrow domains, limiting generalization. We propose UniDomain, a framework that pre-trains a PDDL domain from robot manipulation demonstrations and applies it for online robotic task planning. It extracts atomic domains from 12,393 manipulation videos to form a unified domain with 3,137 operators, 2,875 predicates, and 16,481 causal edges. Given a target class of tasks, it retrieves relevant atomics from the unified domain and systematically fuses them into high-quality meta-domains for zero-shot planning. Experiments on diverse real-world tasks show that UniDomain solves complex, unseen tasks in a zero-shot manner, achieving up to 58% higher task success and 160% improvement in plan optimality over state-of-the-art LLM and LLM-PDDL baselines.

View full details

Poster

SSIMBaD: Sigma Scaling with SSIM-Guided Balanced Diffusion for AnimeFace Colorization

Junpyo Seo · HanbinKoo · jieun yook · Byung-Ro Moon

Dec 5, 11:00 AM - 2:00 PM Don Alberto 4

We propose a novel diffusion-based framework for automatic colorization of Anime-style facial sketches, which preserves the structural fidelity of the input sketch while effectively transferring stylistic attributes from a reference image. Our approach builds upon recent continuous-time diffusion models, but departs from traditional methods that rely on predefined noise schedules, which often fail to maintain perceptual consistency across the generative trajectory. To address this, we introduce SSIMBaD (Sigma Scaling with SSIM-Guided Balanced Diffusion), a sigma-space transformation that ensures linear alignment of perceptual degradation, as measured by structural similarity. This perceptual scaling enforces uniform visual difficulty across timesteps, enabling more balanced and faithful reconstructions. On a large-scale Anime face dataset, SSIMBaD attains state-of-the-art structural fidelity and strong perceptual quality, with robust generalization to diverse styles and structural variations.

View full details

Poster

THD-BAR: Topology Hierarchical Derived Brain Autoregressive Modeling for EEG Generic Representations

Wenchao Yang · Weidong Yan · Wenkang Liu · Yulan Ma · Yang Li

Dec 5, 11:00 AM - 2:00 PM Don Alberto 4

Large-scale pre-trained models hold significant potential for learning universal EEG representations. However, most existing methods, particularly autoregressive (AR) frameworks, primarily rely on straightforward temporal sequencing of multi-channel EEG data, which fails to capture the rich physiological characteristics inherent to EEG signals. Moreover, their time-centered modeling approach also limits the effective representation of the dynamic spatial topology of brain activity. To address these challenges and fully exploit the potential of large-scale EEG models, we propose a novel Topology Hierarchical Derived Brain Autoregressive Modeling (THD-BAR) for EEG generic representations. The core innovation of THD-BAR lies in the introduction of the Brain Topology Hierarchy (BTH), which establishes a multi-scale spatial order for EEG channels. This hierarchical structure enables a redefinition of autoregressive learning as a "next-scale-time prediction" problem, effectively capturing both spatial and temporal dynamics. Based on BTH, we design a Topology-Hierarchical Vector Quantized-Variational Autoencoder (THVQ-VAE) for multi-scale tokenization and develop an enhanced Brain Autoregressive (BAR) module with specialized masking strategies for prediction. Through extensive large-scale pre-training on 17 datasets, followed by rigorous validation on 10 downstream datasets spanning 5 distinct tasks, THD-BAR consistently outperforms existing methods. These results highlight the superior generalization and modeling capabilities of our proposed approach.

View full details

Poster

Resource-Constrained Federated Continual Learning: What Does Matter?

Yichen Li · Yuying Wang · Jiahua Dong · Haozhao Wang · Yining Qi · Rui Zhang · Ruixuan Li

Dec 5, 11:00 AM - 2:00 PM Don Alberto 4

Federated Continual Learning (FCL) aims to enable sequential privacy-preserving model training on streams of incoming data that vary across edge devices by preserving previous knowledge while adapting to new data. Current FCL literature focuses on restricted data privacy and access to previously seen data while imposing no constraints on the training overhead. This is unreasonable for FCL applications in real-world scenarios, where edge devices are primarily constrained by resources such as storage, computational budget, and label rate. We revisit this problem with a large-scale benchmark and analyze the performance of state-of-the-art FCL approaches under different resource-constrained settings. Our experiments cover various typical FCL techniques and six datasets in two incremental learning scenarios (Class-IL and Domain-IL). Through extensive experiments amounting to over 1,000 GPU hours, we find that, under resource-constrained settings, existing FCL approaches, without exception, fail to achieve the expected performance. Our conclusions remain consistent in the sensitivity analysis. This suggests that most existing FCL methods are too resource-dependent for real-world deployment. Moreover, we study the performance of typical FCL techniques with resource constraints and shed light on future research directions in FCL.

View full details

Poster

Second-order Optimization under Heavy-Tailed Noise: Hessian Clipping and Sample Complexity Limits

Abdurakhmon Sadiev · Peter Richtarik · Ilyas Fatkhullin

Dec 5, 11:00 AM - 2:00 PM Don Alberto 4

Heavy-tailed noise is pervasive in modern machine learning applications, arising from data heterogeneity, outliers, and non-stationary stochastic environments. While second-order methods can significantly accelerate convergence in light-tailed or bounded-noise settings, such algorithms are often brittle and lack guarantees under heavy-tailed noise—precisely the regimes where robustness is most critical. In this work, we take a first step toward a theoretical understanding of second-order optimization under heavy-tailed noise. We consider a setting where stochastic gradients and Hessians have only bounded $p$-th moments, for some $p\in (1,2]$, and establish tight lower bounds on the sample complexity of any second-order method. We then develop a variant of normalized stochastic gradient descent that leverages second-order information and provably matches these lower bounds. To address the instability caused by large deviations, we introduce a novel algorithm based on gradient and Hessian clipping, and prove high-probability upper bounds that nearly match the fundamental limits. Our results provide the first comprehensive sample complexity characterization for second-order optimization under heavy-tailed noise. This positions Hessian clipping as a robust and theoretically sound strategy for second-order algorithm design in heavy-tailed regimes.
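As a rough illustration of what clipping stochastic gradients and Hessians can look like, the sketch below rescales any estimate whose norm exceeds a threshold. This is a generic norm-clipping operator under stated assumptions and not necessarily the exact operator analyzed in the paper: the Frobenius norm for matrices, the thresholds, and the Student-t noise model are all choices made for the example.

```python
import numpy as np

def clip_by_norm(x, threshold):
    """Return x unchanged if ||x|| <= threshold, else rescale it to norm `threshold`.

    np.linalg.norm is the Euclidean norm for vectors and the Frobenius norm for matrices.
    """
    norm = np.linalg.norm(x)
    if norm <= threshold:
        return x
    return x * (threshold / norm)

rng = np.random.default_rng(0)
# Student-t noise with 1.5 degrees of freedom has finite p-th moments only for p < 1.5.
g = rng.standard_t(df=1.5, size=10)             # heavy-tailed stochastic gradient sample
H = rng.standard_t(df=1.5, size=(10, 10))
H = 0.5 * (H + H.T)                             # symmetrize the stochastic Hessian sample

g_clipped = clip_by_norm(g, threshold=5.0)
H_clipped = clip_by_norm(H, threshold=20.0)     # scaling preserves symmetry
```

Clipping bounds the influence of any single extreme sample at the cost of some bias, which is the trade-off that high-probability analyses of clipped methods need to control.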

View full details

Poster

3D Gaussian Flats: Hybrid 2D/3D Photometric Scene Reconstruction

Maria Taktasheva · Lily Goli · Alessandro Fiorini · Zhen Li · Daniel Rebain · Andrea Tagliasacchi

Dec 5, 11:00 AM - 2:00 PM Don Alberto 4

Recent advances in radiance fields and novel view synthesis enable creation of realistic digital twins from photographs. However, current methods struggle with flat, texture-less surfaces, creating uneven and semi-transparent reconstructions, due to an ill-conditioned photometric reconstruction objective. Surface reconstruction methods solve this issue but sacrifice visual quality. We propose a novel hybrid 2D/3D representation that jointly optimizes constrained planar (2D) Gaussians for modeling flat surfaces and freeform (3D) Gaussians for the rest of the scene. Our end-to-end approach dynamically detects and refines planar regions, improving both visual fidelity and geometric accuracy. It achieves state-of-the-art depth estimation on ScanNet++ and ScanNetv2, and excels at mesh extraction without overfitting to a specific camera model, showing its effectiveness in producing high-quality reconstruction of indoor scenes.

View full details

Poster

Teaching Language Models to Evolve with Users: Dynamic Profile Modeling for Personalized Alignment

Weixiang Zhao · Xingyu Sui · Yulin Hu · Jiahe Guo · Haixiao Liu · Biye Li · Yanyan Zhao · Bing Qin · Ting Liu

Dec 5, 11:00 AM - 2:00 PM Don Alberto 4

Personalized alignment is essential for enabling large language models (LLMs) to engage effectively in user-centric dialogue. While recent prompt-based and offline optimization methods offer preliminary solutions, they fall short in cold-start scenarios and long-term personalization due to their inherently static and shallow designs. In this work, we introduce the Reinforcement Learning for Personalized Alignment (RLPA) framework, in which an LLM interacts with a simulated user model to iteratively infer and refine user profiles through dialogue. The training process is guided by a dual-level reward structure: the Profile Reward encourages accurate construction of user representations, while the Response Reward incentivizes generation of responses consistent with the inferred profile. We instantiate RLPA by fine-tuning Qwen-2.5-3B-Instruct, resulting in Qwen-RLPA, which achieves state-of-the-art performance in personalized dialogue. Empirical evaluations demonstrate that Qwen-RLPA consistently outperforms prompting and offline fine-tuning baselines, and even surpasses advanced commercial models such as Claude-3.5 and GPT-4o. Further analysis highlights Qwen-RLPA's robustness in reconciling conflicting user preferences, sustaining long-term personalization and delivering more efficient inference compared to recent reasoning-focused LLMs. These results emphasize the potential of dynamic profile inference as a more effective paradigm for building personalized dialogue systems.

View full details

Poster

Connectome-Based Modelling Reveals Orientation Maps in the Drosophila Optic Lobe

Jia Nuo Liew · Shenghan Lin · Bowen Chen · Wei Zhang · Xiaowei Zhu · Wei Zhang · Xiaolin Hu

Dec 5, 11:00 AM - 2:00 PM Don Alberto 4

The ability to extract oriented edges from visual input is a core computation across animal vision systems. Orientation maps, long associated with the layered architecture of the mammalian visual cortex, systematically organise neurons by their preferred edge orientation. Despite lacking cortical structures, the Drosophila melanogaster brain contains feature-selective neurons and exhibits complex visual detection capacity, raising the question of whether map-like vision representations can emerge without cortical infrastructure. We integrate a complete fruit fly brain connectome with biologically grounded spiking neuron models to simulate neuroprocessing in the fly visual system. By driving the network with oriented stimuli and analysing downstream responses, we show that coherent orientation maps can emerge from purely connectome-constrained dynamics. These results suggest that species of independent origin could evolve similar visual structures.

View full details

Poster

Tru-POMDP: Task Planning Under Uncertainty via Tree of Hypotheses and Open-Ended POMDPs

Wenjing Tang · Xinyu He · Yongxi Huang · Yunxiao Xiao · Cewu Lu · Panpan Cai

Dec 5, 11:00 AM - 2:00 PM Don Alberto 4

Task planning under uncertainty is essential for home-service robots operating in the real world. Tasks involve ambiguous human instructions, hidden or unknown object locations, and open-vocabulary object types, leading to significant open-ended uncertainty and a boundlessly large planning space. To address these challenges, we propose Tru-POMDP, a planner that combines structured belief generation using Large Language Models (LLMs) with principled POMDP planning. Tru-POMDP introduces a hierarchical Tree of Hypotheses (TOH), which systematically queries an LLM to construct high-quality particle beliefs over possible world states and human goals. We further formulate an open-ended POMDP model that enables rigorous Bayesian belief tracking and efficient belief-space planning over these LLM-generated hypotheses. Experiments on complex object rearrangement tasks across diverse kitchen environments show that Tru-POMDP significantly outperforms state-of-the-art LLM-based and LLM-tree-search hybrid planners, achieving higher success rates with significantly better plans, stronger robustness to ambiguity and occlusion, and greater planning efficiency.
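The Bayesian belief tracking over a particle belief mentioned above can be sketched as a reweighting of hypotheses by observation likelihood. The example is a toy under stated assumptions and not the Tru-POMDP implementation: the hypotheses, the prior weights (which in the paper's setting would come from an LLM), the 0.9 detection probability, and the names `likelihood` and `update_belief` are all hypothetical.

```python
def likelihood(observation, hypothesis):
    """P(observation | hypothesis) for a toy 'open a container and look' sensor model."""
    location, found = observation
    if hypothesis == location:
        p_found = 0.9      # assumed chance of spotting the object when it is there
    else:
        p_found = 0.0      # the object cannot be seen where it is not
    return p_found if found else 1.0 - p_found

def update_belief(particles, weights, observation):
    """Bayes rule on a weighted particle set: reweight by likelihood, then normalize."""
    new_weights = [w * likelihood(observation, p) for p, w in zip(particles, weights)]
    total = sum(new_weights)
    if total == 0.0:
        raise ValueError("observation is impossible under every hypothesis")
    return [w / total for w in new_weights]

# Hypotheses about where the cup is, with assumed prior weights.
particles = ["cabinet", "fridge", "table"]
prior = [0.5, 0.3, 0.2]

# The robot opens the cabinet and does not see the cup.
posterior = update_belief(particles, prior, ("cabinet", False))
```

After the failed search, probability mass shifts from the cabinet hypothesis to the remaining locations, which is what lets a belief-space planner decide where to look next.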

View full details

Poster

Rectifying Soft-Label Entangled Bias in Long-Tailed Dataset Distillation

Chenyang Jiang · Hang Zhao · Xinyu Zhang · Zhengcen Li · Qiben Shan · Shaocong Wu · Jingyong Su

Dec 5, 11:00 AM - 2:00 PM Don Alberto 4

Dataset distillation compresses large-scale datasets into compact, highly informative synthetic data, significantly reducing storage and training costs. However, existing research primarily focuses on balanced datasets and struggles to perform under real-world long-tailed distributions. In this work, we emphasize the critical role of soft labels in long-tailed dataset distillation and uncover the underlying mechanisms contributing to performance degradation. Specifically, we derive an imbalance-aware generalization bound for models trained on the distilled dataset. We then identify two primary sources of soft-label bias, which originate from the distillation model and the distilled images, through systematic perturbation of the data imbalance levels. To address this, we propose ADSA, an Adaptive Soft-label Alignment module that calibrates the entangled biases. This lightweight module integrates seamlessly into existing distillation pipelines and consistently improves performance. On ImageNet-1k-LT with EDC and IPC=50, ADSA improves tail-class accuracy by up to 11.8% and raises overall accuracy to 41.4%. Extensive experiments demonstrate that ADSA provides a robust and generalizable solution under limited label budgets and across a range of distillation techniques.

View full details

Poster

Sample-Efficient Tabular Self-Play for Offline Robust Reinforcement Learning

Na Li · Zewu Zheng · Wei Ni · Hangguan Shan · Wenjie Zhang · Xinyu Li

Dec 5, 11:00 AM - 2:00 PM Don Alberto 4

Multi-agent reinforcement learning (MARL), as a thriving field, explores how multiple agents independently make decisions in a shared dynamic environment. Due to environmental uncertainties, policies in MARL must remain robust to tackle the sim-to-real gap. We focus on robust two-player zero-sum Markov games (TZMGs) in offline settings, specifically on tabular robust TZMGs (RTZMGs). We propose a model-based algorithm (RTZ-VI-LCB) for offline RTZMGs, which is optimistic robust value iteration combined with a data-driven Bernstein-style penalty term for robust value estimation. By accounting for distribution shifts in the historical dataset, the proposed algorithm establishes near-optimal sample complexity guarantees under partial coverage and environmental uncertainty. An information-theoretic lower bound is developed to confirm the tightness of our algorithm's sample complexity, which is optimal regarding both state and action spaces. To the best of our knowledge, RTZ-VI-LCB is the first to attain this optimality, sets a new benchmark for offline RTZMGs, and is validated experimentally.

View full details

Poster

Memory-Augmented Potential Field Theory: A Framework for Adaptive Control in Non-Convex Domains

Dongzhe Zheng · Wenjie Mei

Dec 5, 11:00 AM - 2:00 PM Don Alberto 4

Stochastic optimal control methods often struggle in complex non-convex landscapes, frequently becoming trapped in local optima due to their inability to learn from historical trajectory data. This paper introduces Memory-Augmented Potential Field Theory, a unified mathematical framework that integrates historical experience into stochastic optimal control. Our approach dynamically constructs memory-based potential fields that identify and encode key topological features of the state space, enabling controllers to automatically learn from past experiences and adapt their optimization strategy. We provide a theoretical analysis showing that memory-augmented potential fields possess non-convex escape properties, asymptotic convergence characteristics, and computational efficiency. We implement this theoretical framework in a Memory-Augmented Model Predictive Path Integral (MPPI) controller that demonstrates significantly improved performance in challenging non-convex environments. The framework represents a generalizable approach to experience-based learning within control systems (especially robotic dynamics), enhancing their ability to navigate complex state spaces without requiring specialized domain knowledge or extensive offline training.

View full details

Poster

Reduction-based Pseudo-label Generation for Instance-dependent Partial Label Learning

Congyu Qiao · Ning Xu · Yihao Hu · Xin Geng

Dec 5, 11:00 AM - 2:00 PM Don Alberto 4

Instance-dependent Partial Label Learning (ID-PLL) aims to learn a multi-class predictive model given training instances annotated with candidate labels related to features, among which the correct labels are fixed but unknown. Previous works leverage the identification capability of the training model itself to iteratively refine supervision information. However, these methods overlook a critical aspect of ID-PLL: within the original label space, the model may fail to distinguish some incorrect candidate labels that are strongly correlated with features from correct labels. This leads to poor-quality supervision signals and creates a bottleneck in the training process. In this paper, we propose to leverage reduction-based pseudo-labels to alleviate the influence of incorrect candidate labels and train our predictive model to overcome this bottleneck. Specifically, reduction-based pseudo-labels are generated by performing weighted aggregation on the outputs of a multi-branch auxiliary model, with each branch trained in a label subspace that excludes certain labels. This approach ensures that each branch explicitly avoids the disturbance of the excluded labels, allowing the pseudo-labels provided for instances troubled by these excluded labels to benefit from the unaffected branches. Theoretically, we demonstrate that reduction-based pseudo-labels exhibit greater consistency with the Bayes optimal classifier compared to pseudo-labels directly generated from the training predictive model.

View full details

Poster

Compressed and Smooth Latent Space for Text Diffusion Modeling

Viacheslav Meshchaninov · Egor Chimbulatov · Alexander Shabalin · Aleksandr Abramov · Dmitry Vetrov

Dec 5, 11:00 AM - 2:00 PM Don Alberto 4

Autoregressive language models dominate modern text generation, yet their sequential nature introduces fundamental limitations: decoding is slow, and maintaining global coherence remains challenging. Diffusion models offer a promising alternative by enabling parallel generation and flexible control; however, their application to text generation is hindered by the high dimensionality of token-level representations. We introduce Cosmos, a novel approach to text generation that operates entirely in a compressed, smooth latent space tailored specifically for diffusion. This space is learned using an autoencoder trained simultaneously for token-level reconstruction and alignment with frozen activations from a pretrained language encoder, providing robust semantic grounding and enabling effective perturbation-based augmentations. Empirically, we demonstrate that text representations can be compressed up to $8\times$ while maintaining generation quality comparable to token-level diffusion models. Furthermore, increasing the latent sequence length allows Cosmos to surpass both diffusion-based and autoregressive baselines. We evaluate Cosmos on four diverse generative tasks including story generation, question generation, summarization, and detoxification and compare it with various generative paradigms. Cosmos achieves comparable or superior generation quality while offering more than $2\times$ faster inference. Code is released at https://github.com/MeshchaninovViacheslav/cosmos.

View full details

Poster

AdaLRS: Loss-Guided Adaptive Learning Rate Search for Efficient Foundation Model Pretraining

Hongyuan Dong · Dingkang Yang · Xiao Liang · ChaoFeng · Ran Jiao

Dec 5, 11:00 AM - 2:00 PM Don Alberto 4

Learning rate is widely regarded as crucial for effective foundation model pretraining. Recent research explores and demonstrates the transferability of learning rate configurations across varying model sizes, dataset sizes, and other settings. Nevertheless, these approaches are constrained to specific training scenarios and typically necessitate extensive hyperparameter tuning on proxy models. In this work, we propose AdaLRS, a plug-and-play adaptive learning rate search algorithm that conducts online optimal learning rate search via optimizing loss descent velocities. We provide theoretical and experimental analyses to show that foundation model pretraining loss and its descent velocity are both convex and share the same optimal learning rate. Relying solely on training loss dynamics, AdaLRS involves few extra computations to guide the search process, and its convergence is guaranteed via theoretical analysis. Experiments on both LLM and VLM pretraining show that AdaLRS adjusts suboptimal learning rates to the neighborhood of the optimum with marked efficiency and effectiveness, with model performance improved accordingly. We also show the robust generalizability of AdaLRS across varying training scenarios, such as different model sizes, training paradigms, base learning rate scheduler choices, and hyperparameter settings.

View full details

Poster

Navigating the MIL Trade-Off: Flexible Pooling for Whole Slide Image Classification

Hossein Jafarinia · Danial Hamdi · Amirhossein Alamdar · Elahe Zahiri · Soroush Vafaie Tabar · Alireza Alipanah · Nahal Mirzaie · Saeed Razavi · Amir Najafi · Mohammad Hossein Rohban

Dec 5, 11:00 AM - 2:00 PM Don Alberto 4

Multiple Instance Learning (MIL) is a standard weakly supervised approach for Whole Slide Image (WSI) classification, where performance hinges on both feature representation and MIL pooling strategies. Recent research has predominantly focused on Transformer-based architectures adapted for WSIs. However, we argue that this trend faces a fundamental limitation: data scarcity. In typical settings, Transformer models yield only marginal gains without access to large-scale datasets, resources that are virtually inaccessible to all but a few well-funded research labs. Motivated by this, we revisit simple, non-attention MIL with unsupervised slide features and analyze temperature-$\beta$-controlled log-sum-exp (LSE) pooling. For slides partitioned into $N$ patches, we theoretically show that LSE has a smooth transition at a critical $\beta_{\mathrm{crit}}=\mathcal{O}(\log N)$ threshold, interpolating between mean-like aggregation (stable, better generalization but less sensitive) and max-like aggregation (more sensitive but looser generalization bounds). Grounded in this analysis, we introduce Maxsoft, a novel MIL pooling function that enables flexible control over this trade-off, allowing adaptation to specific tasks and datasets. To further tackle real-world deployment challenges such as specimen heterogeneity, we propose PerPatch augmentation, a simple yet effective technique that enhances model robustness. Empirically, Maxsoft achieves state-of-the-art performance in low-data regimes across four major benchmarks (CAMELYON16, CAMELYON17, TCGA-Lung, and SICAP-MIL), often matching or surpassing large-scale foundation models. When combined with PerPatch augmentation, this performance is further improved through increased robustness. Code is available at https://github.com/jafarinia/maxsoft.
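To make the mean-to-max interpolation concrete, here is a minimal sketch of temperature-controlled log-sum-exp pooling over patch scores. It assumes the common normalized form $\frac{1}{\beta}\log\big(\frac{1}{N}\sum_i e^{\beta s_i}\big)$ with $\beta \ge 0$; the abstract does not state the exact form used in the paper, and this is not the Maxsoft function itself. The name `lse_pool` and the example scores are illustrative only.

```python
import math


def lse_pool(scores, beta):
    """Normalized log-sum-exp pooling: (1/beta) * log(mean(exp(beta * s))).

    Assumed form for illustration; beta is expected to be non-negative.
    """
    if beta == 0:
        # The limit as beta -> 0 is the arithmetic mean.
        return sum(scores) / len(scores)
    m = max(scores)
    # Subtract the maximum before exponentiating for numerical stability.
    total = sum(math.exp(beta * (s - m)) for s in scores)
    return m + math.log(total / len(scores)) / beta


if __name__ == "__main__":
    patch_scores = [0.1, 0.2, 0.9, 0.3]  # hypothetical per-patch scores
    for beta in (0.0, 1.0, 10.0, 100.0):
        print(beta, round(lse_pool(patch_scores, beta), 4))
```

Small $\beta$ gives a value close to the mean of the scores, while large $\beta$ approaches the maximum; with this normalization the gap to the maximum is at most $\log(N)/\beta$, which fits the abstract's threshold of order $\log N$.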

View full details

Poster

Towards foundational LiDAR world models with efficient latent flow matching

Tianran Liu · Shengwen Zhao · Nicholas Rhinehart

Dec 5, 11:00 AM - 2:00 PM Don Alberto 4

LiDAR-based world models offer more structured and geometry-aware representations than their image-based counterparts. However, existing LiDAR world models are narrowly trained; each model excels only in the domain for which it was built. This raises a critical question: can we develop LiDAR world models that exhibit strong transferability across multiple domains? To answer this, we conduct the first systematic domain transfer study across three demanding scenarios: (i) outdoor to indoor generalization, (ii) sparse- to dense-beam adaptation, and (iii) non-semantic to semantic transfer. Given different amounts of fine-tuning data, our experiments show that a single pretrained model can achieve up to 11% absolute improvement (83% relative) over training from scratch and outperforms training from scratch in 30/36 of our comparisons. This transferability significantly reduces the reliance on manually annotated data for semantic occupancy forecasting: our method exceeds previous baselines with only 5% of the labeled training data of prior work. We also observed inefficiencies of current generative-model-based LiDAR world models, mainly through their under-compression of LiDAR data and inefficient training objectives. To address these issues, we propose a latent conditional flow matching (CFM)-based framework that achieves state-of-the-art reconstruction accuracy using only half the training data and a compression ratio 6 times higher than that of prior methods. Our model also achieves SOTA performance on semantic occupancy forecasting while being 1.98x-23x more computationally efficient (a 1.1x-3.9x FPS speedup) than previous methods.

View full details

Poster

Tight Bounds on the Distortion of Randomized and Deterministic Distributed Voting

Mohammad Abam · Davoud Kareshki · Marzieh Nilipour · MohammadHossein Paydar · Masoud Seddighin

Dec 5, 11:00 AM - 2:00 PM Don Alberto 4

We study metric distortion in distributed voting, where $n$ voters are partitioned into $k$ groups, each selecting a local representative, and a final winner is chosen from these representatives (or from the entire set of candidates). This setting models systems like U.S. presidential elections, where state-level decisions determine the national outcome. We focus on four cost objectives from Anshelevich et al. (2022): avg-avg, avg-max, max-avg, and max-max. We present improved distortion bounds for both deterministic and randomized mechanisms, offering a near-complete characterization of distortion in this model. For deterministic mechanisms, we reduce the upper bound for avg-max from $11$ to $7$, establish a tight lower bound of $5$ for max-avg (improving on $2+\sqrt{5}$), and tighten the upper bound for max-max from $5$ to $3$. For randomized mechanisms, we consider two settings: (i) only the second stage is randomized, and (ii) both stages may be randomized. In case (i), we prove tight bounds: $5\!-\!2/k$ for avg-avg, $3$ for avg-max and max-max, and $5$ for max-avg. In case (ii), we show tight bounds of $3$ for max-avg and max-max, and nearly tight bounds for avg-avg and avg-max within $[3\!-\!2/n,\ 3\!-\!2/(kn^*)]$ and $[3\!-\!2/n,\ 3]$, respectively, where $n^*$ denotes the largest group size.
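As a rough illustration of how distortion is measured in this setting, the sketch below computes two-level cost objectives for voters and candidates placed on a line and compares a chosen winner against the best candidate. The helper names, the one-dimensional metric, and the toy instance are all assumptions made for illustration; in particular, which aggregator is applied within groups and which across groups is passed explicitly, because the abstract does not spell out the naming convention of the four objectives.

```python
def avg(values):
    """Arithmetic mean of a non-empty list."""
    return sum(values) / len(values)


def cost(groups, candidate, outer, inner):
    """Aggregate distances with `inner` inside each group, then `outer` across groups."""
    per_group = [inner([abs(v - candidate) for v in group]) for group in groups]
    return outer(per_group)


def distortion(groups, candidates, winner, outer, inner):
    """Ratio of the winner's cost to the lowest achievable cost.

    Assumes the lowest cost is strictly positive.
    """
    best = min(cost(groups, c, outer, inner) for c in candidates)
    return cost(groups, winner, outer, inner) / best


if __name__ == "__main__":
    # Hypothetical instance: three groups of voters, candidates at 0 and 1.
    groups = [[0.0, 0.0, 1.0], [0.0, 0.0, 1.0], [1.0, 1.0, 1.0]]
    candidates = [0.0, 1.0]
    # Two of three groups prefer candidate 0 by majority, so suppose 0 wins.
    print(distortion(groups, candidates, 0.0, outer=avg, inner=avg))
    print(distortion(groups, candidates, 0.0, outer=max, inner=max))
```

In this toy instance the group-level majority winner is not the candidate with the lowest average cost, which is the kind of loss that distortion bounds quantify in the worst case.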

View full details

Poster

User-Instructed Disparity-aware Defocus Control

Yudong Han · Yan Yang · Hao Yang · Liyuan Pan

Dec 5, 11:00 AM - 2:00 PM Don Alberto 4

In photography, an All-in-Focus (AiF) image may not always effectively convey the creator’s intent. Professional photographers manipulate Depth of Field (DoF) to control which regions appear sharp or blurred, achieving compelling artistic effects. For general users, the ability to flexibly adjust DoF enhances creative expression and image quality. In this paper, we propose UiD, a User-Instructed DoF control framework that allows users to specify refocusing regions using text, box, or point prompts and automatically simulates in-focus and out-of-focus (OoF) regions in the given images. However, controlling defocus blur in a single-lens camera remains challenging due to the difficulty in estimating depth-aware aberrations and the suboptimal quality of reconstructed AiF images. To address this, we leverage dual-pixel (DP) sensors, commonly found in DSLR-style and mobile cameras. DP sensors provide a small-baseline stereo pair in a single snapshot, enabling depth-aware aberration estimation. Our approach first establishes an invertible mapping between OoF and AiF images to learn spatially varying defocus kernels and the disparity features. These depth-aware kernels enable bidirectional image transformation, deblurring OoF images into AiF representations and conversely reblurring AiF images into OoF outputs, by seamlessly switching between the kernel and its inverse form. For user-guided refocusing, we first generate masks based on user prompts using SAM, which modulates disparity features in closed form, allowing dynamic kernel re-estimation for reblurring. This achieves user-controlled refocusing effects. Extensive experiments on both common datasets and the self-collected dataset demonstrate that UiD offers superior flexibility and quality in DoF manipulation imaging.

View full details

Poster

SEMPO: Lightweight Foundation Models for Time Series Forecasting

Hui He · Kun Yi · Yuanchi Ma · Qi Zhang · Zhendong Niu · Guansong Pang

Dec 5, 4:30 PM - 7:30 PM Don Alberto 4

The recent boom of large pre-trained models has brought remarkable success in developing foundation models (FMs) for time series forecasting. Despite impressive performance across diverse downstream forecasting tasks, existing time series FMs possess massive network architectures and require substantial pre-training on large-scale datasets, which significantly hinders their deployment in resource-constrained environments. In response to this growing tension between versatility and affordability, we propose SEMPO, a novel lightweight foundation model that requires pretraining on relatively small-scale data, yet exhibits strong general time series forecasting ability. Concretely, SEMPO comprises two key modules: 1) an energy-aware SpEctral decomposition module that substantially improves the utilization of pre-training data by modeling not only the high-energy frequency signals but also the low-energy yet informative frequency signals that are ignored in current methods; and 2) a Mixture-of-PrOmpts enabled Transformer that learns heterogeneous temporal patterns through small dataset-specific prompts and adaptively routes time series tokens to prompt-based experts for parameter-efficient model adaptation across different datasets and domains. Equipped with these modules, SEMPO significantly reduces both pre-training data scale and model size, while achieving strong generalization. Extensive experiments on two large-scale benchmarks covering 16 datasets demonstrate the superior performance of SEMPO in both zero-shot and few-shot forecasting scenarios compared with state-of-the-art methods. Code and data are available at https://github.com/mala-lab/SEMPO.

View full details

Poster

Aligning What Matters: Masked Latent Adaptation for Text-to-Audio-Video Generation

Jiyang Zheng · Siqi Pan · Yu Yao · Zhaoqing Wang · Dadong Wang · Tongliang Liu

Dec 5, 4:30 PM - 7:30 PM Don Alberto 4

Text-to-Audio-Video (T2AV) generation aims to produce temporally and semantically aligned visual and auditory content from natural language descriptions. While recent progress in text-to-audio and text-to-video models has improved generation quality within each modality, jointly modeling them remains challenging due to incomplete and asymmetric correspondence: audio often reflects only a subset of the visual scene, and vice versa. Naively enforcing full alignment introduces semantic noise and temporal mismatches. To address this, we propose a novel framework that performs selective cross-modal alignment through a learnable masking mechanism, enabling the model to isolate and align only the shared latent components relevant to both modalities. This mechanism is integrated into an adaptation module that interfaces with pretrained encoders and decoders from latent video and audio diffusion models, preserving their generative capacity with reduced training overhead. Theoretically, we show that our masked objective provably recovers the minimal set of shared latent variables across modalities. Empirically, our method achieves state-of-the-art performance on standard T2AV benchmarks, demonstrating significant improvements in audiovisual synchronization and semantic consistency.

View full details

Poster

GraphKeeper: Graph Domain-Incremental Learning via Knowledge Disentanglement and Preservation

Zihao Guo · Qingyun Sun · Ziwei Zhang · Haonan Yuan · HUIPING ZHUANG · Xingcheng Fu · Jianxin Li

Dec 5, 4:30 PM - 7:30 PM Don Alberto 4

Graph incremental learning (GIL), which continuously updates graph models by sequential knowledge acquisition, has garnered significant interest recently. However, existing GIL approaches focus on task-incremental and class-incremental scenarios within a single domain. Graph domain-incremental learning (Domain-IL), aiming at updating models across multiple graph domains, has become critical with the development of graph foundation models (GFMs), but remains unexplored in the literature. In this paper, we propose Graph Domain-Incremental Learning via Knowledge Disentanglement and Preservation (GraphKeeper) to address catastrophic forgetting in the Domain-IL scenario from the perspectives of embedding shifts and decision boundary deviations. Specifically, to prevent embedding shifts and confusion across incremental graph domains, we first propose domain-specific parameter-efficient fine-tuning together with intra- and inter-domain disentanglement objectives. Then, to maintain a stable decision boundary, we introduce deviation-free knowledge preservation to continuously fit incremental domains. Additionally, for graphs with unobservable domains, we perform domain-aware distribution discrimination to obtain precise embeddings. Extensive experiments demonstrate that the proposed GraphKeeper achieves state-of-the-art results with a 6.5% to 16.6% improvement over the runner-up and negligible forgetting. Moreover, we show GraphKeeper can be seamlessly integrated with various representative GFMs, highlighting its broad applicative potential.

View full details

Poster

Integral Imprecise Probability Metrics

Siu Lun (Alan) Chau · Michele Caprio · Krikamol Muandet

Dec 5, 4:30 PM - 7:30 PM Don Alberto 4

Quantifying differences between probability distributions is fundamental to statistics and machine learning, primarily for comparing statistical uncertainty. In contrast, epistemic uncertainty, which is due to incomplete knowledge, requires richer representations than those offered by classical probability. Imprecise probability (IP) theory offers such models, capturing ambiguity and partial belief. This has driven growing interest in imprecise probabilistic machine learning (IPML), where inference and decision-making rely on broader uncertainty models, highlighting the need for metrics beyond classical probability. This work introduces the Integral Imprecise Probability Metric (IIPM) framework, a Choquet integral-based generalisation of classical Integral Probability Metric to the setting of capacities, a broad class of IP models encompassing many existing ones, including lower probabilities, probability intervals, belief functions, and more. Theoretically, we establish conditions under which IIPM serves as a valid metric and metrises a form of weak convergence of capacities. Practically, IIPM not only enables comparison across different IP models but also supports the quantification of epistemic uncertainty (EU) within a single IP model. In particular, by comparing an IP model with its conjugate, IIPM gives rise to a new class of epistemic uncertainty measures, Maximum Mean Imprecision (MMI), which satisfy key axiomatic properties proposed in the Uncertainty Quantification literature. We validate MMI through selective classification experiments, demonstrating strong empirical performance against established EU measures, and outperforming them when classical methods struggle to scale to a large number of classes. Our work advances both theory and practice in Imprecise Probabilistic Machine Learning, offering a principled framework for comparing and quantifying epistemic uncertainty under imprecision.
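For orientation, the standard objects the abstract builds on can be written out as follows. The first two displays are textbook definitions (the classical integral probability metric over a function class $\mathcal{F}$, and the Choquet integral of a non-negative function with respect to a capacity $\nu$); the third is the usual conjugate of a capacity. The last display is only a plausible reading of "Choquet integral-based generalisation" and is an assumption here, since the abstract does not give the paper's exact definition.

```latex
% Classical integral probability metric between distributions P and Q
d_{\mathcal{F}}(P, Q) \;=\; \sup_{f \in \mathcal{F}}
  \left| \int f \, dP \;-\; \int f \, dQ \right|

% Choquet integral of a non-negative function f with respect to a capacity \nu
\int f \, d\nu \;=\; \int_{0}^{\infty} \nu\big(\{x : f(x) \ge t\}\big) \, dt

% Conjugate capacity
\bar{\nu}(A) \;=\; 1 - \nu(A^{c})

% Assumed form of the generalisation: expectations replaced by Choquet integrals
\mathrm{IIPM}_{\mathcal{F}}(\nu, \mu) \;=\; \sup_{f \in \mathcal{F}}
  \left| \int f \, d\nu \;-\; \int f \, d\mu \right|
```

When $\nu$ is an ordinary probability measure, the Choquet integral reduces to the usual expectation and $\bar{\nu} = \nu$, so under this reading the distance between a model and its conjugate would vanish exactly when there is no imprecision.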

View full details

Poster

YEAST: Yet Another Sequential Test

Alexey Kurennoy · Majed Dodin · Tural Gurbanov · Ana Peleteiro Ramallo

Dec 5, 4:30 PM - 7:30 PM Don Alberto 4

The online evaluation of machine learning models is typically conducted through A/B experiments. Sequential statistical tests are valuable tools for analysing these experiments, as they enable researchers to stop data collection early without increasing the risk of false discoveries. However, existing sequential tests either limit the number of interim analyses or suffer from low statistical power. In this paper, we introduce a novel sequential test designed for the continuous monitoring of A/B experiments. We validate our method using semi-synthetic simulations and demonstrate that it outperforms current state-of-the-art sequential testing approaches. Our method is derived using a new technique that inverts a bound on the probability of threshold crossing, based on a classical maximal inequality.
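The problem that sequential tests address can be seen in a small simulation: under a true null effect, checking a fixed-sample z-test at every interim look inflates the false-positive rate well beyond the nominal level. The sketch below is not the YEAST procedure; it assumes unit-variance Gaussian per-user differences, and the function name and parameter values are illustrative.

```python
import random


def false_positive_rates(n_sims=2000, n_looks=20, batch=20, z_crit=1.96, seed=0):
    """Simulate A/B experiments with no true effect.

    Returns (rate when rejecting at any interim look, rate when testing once at the end).
    """
    rng = random.Random(seed)
    any_look_hits = 0
    final_look_hits = 0
    for _sim in range(n_sims):
        total, n = 0.0, 0
        crossed = False
        z = 0.0
        for _look in range(n_looks):
            # Collect one more batch of unit-variance differences with mean zero.
            for _obs in range(batch):
                total += rng.gauss(0.0, 1.0)
            n += batch
            z = total / n ** 0.5  # z-statistic of the running sum
            if abs(z) > z_crit:
                crossed = True
        any_look_hits += crossed
        final_look_hits += abs(z) > z_crit
    return any_look_hits / n_sims, final_look_hits / n_sims


if __name__ == "__main__":
    peeking, single = false_positive_rates()
    print("reject at any look:", peeking)
    print("single final test: ", single)
```

The single final test stays near the nominal 5% level, while rejecting at any look is much more permissive; a valid sequential test has to raise its thresholds enough to remove that inflation without giving up too much power.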

View full details

Poster

Diversity Is All You Need for Contrastive Learning: Spectral Bounds on Gradient Magnitudes

Peter Ochieng

Dec 5, 4:30 PM - 7:30 PM Don Alberto 4

Contrastive learning thrives or fails based on how we construct positive and negative pairs. In the absence of explicit labels, models must infer semantic structure from these proxy signals. Early work on Siamese networks (Chopra et al., 2005; Hadsell et al., 2006) already showed that pair construction directly shapes learned representations. In modern contrastive frameworks, poor pair selection remains a primary failure mode: it either causes collapse, where all embeddings converge to a point, or wastes the representational capacity of the space (Chen et al., 2020; Tian et al., 2020; Khosla et al., 2020). Contemporary methods typically generate positives via semantic-preserving augmentations (crop, jitter, view transform), while negatives are drawn from other elements in the mini-batch under the assumption that different images are semantically dissimilar. But this assumption breaks down in fine-grained, low-diversity, or high-resolution settings (Kalantidis et al., 2020; Robinson et al., 2020; Chuang et al., 2020), motivating techniques such as hard-negative mining and debiased losses (Bachman et al., 2019; Tian et al., 2020). Beyond pairs: batch-level diversity. While most prior work focuses on which individual negatives to select, we study the geometry of the entire batch. Our central observation is this: the overall diversity of the batch embedding space strongly governs both training dynamics and representational quality. If diversity is too low, the model sees nearly identical negatives and gradients vanish, leading to collapse. If diversity is too high, negatives become almost orthogonal, but the resulting gradients shrink in magnitude, and learning slows. Optimal training thus occurs within a moderate diversity window: high enough to avoid collapse, low enough to preserve update strength.

View full details

Poster

LoRA-EnVar: Parameter-Efficient Hybrid Ensemble Variational Assimilation for Weather Forecasting

Yi Xiao · Hang Fan · Kun Chen · Ye Cao · Ben Fei · Wei Xue · LEI BAI

Dec 5, 4:30 PM - 7:30 PM Don Alberto 4

Accurate estimation of background error (i.e., forecast error) distribution is critical for effective data assimilation (DA) in numerical weather prediction (NWP). In state-of-the-art operational DA systems, it is common to account for the temporal evolution of background errors by employing hybrid methods, which blend a static climatological covariance with a flow-dependent ensemble-derived component. While effective to some extent, these methods typically assume Gaussian-distributed errors and rely heavily on hand-crafted covariance structures and domain expertise, limiting their ability to capture the complex, non-Gaussian nature of atmospheric dynamics. In this work, we propose LoRA-EnVar, a novel hybrid ensemble variational DA algorithm that integrates low-rank adaptation (LoRA) into a deep generative modeling framework. We first learn a climatological background error distribution using a variational autoencoder (VAE) trained on historical data. To incorporate flow-dependent uncertainty, we introduce LoRA modules that efficiently adapt the learned distribution in response to flow-dependent ensemble perturbations. Our approach supports online finetuning, enabling dynamic updates of the background error distribution without catastrophic forgetting. We validate LoRA-EnVar in high-resolution assimilation settings using the FengWu forecast model and simulated observations from ERA5 reanalysis. Experimental results show that LoRA-EnVar significantly improves assimilation accuracy over models assuming static background error distribution and achieves comparable or better performance than full finetuning while reducing the number of trainable parameters by three orders of magnitude. This demonstrates the potential of parameter-efficient adaptation for scalable, non-Gaussian DA in operational meteorology.
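For readers less familiar with hybrid methods, the conventional Gaussian formulation that the abstract describes as the operational baseline can be summarised as below. These are the standard textbook forms, not the LoRA-EnVar formulation; the blending weight $\beta$ and the symbols are notation assumed here, and operational systems often use separate weights for the two terms.

```latex
% Ensemble-derived, flow-dependent covariance from K members x_1, ..., x_K
\mathbf{B}_{\mathrm{ens}} \;=\; \frac{1}{K-1} \sum_{k=1}^{K}
  (\mathbf{x}_k - \bar{\mathbf{x}})(\mathbf{x}_k - \bar{\mathbf{x}})^{\top}

% Hybrid covariance: blend of static climatological and ensemble components
\mathbf{B}_{\mathrm{hyb}} \;=\; (1-\beta)\,\mathbf{B}_{\mathrm{clim}}
  \;+\; \beta\,\mathbf{B}_{\mathrm{ens}}, \qquad 0 \le \beta \le 1

% Variational cost with background x_b, observations y, observation operator H,
% and observation-error covariance R
J(\mathbf{x}) \;=\; \tfrac{1}{2}\,(\mathbf{x}-\mathbf{x}_b)^{\top}
  \mathbf{B}_{\mathrm{hyb}}^{-1}(\mathbf{x}-\mathbf{x}_b)
  \;+\; \tfrac{1}{2}\,(\mathbf{y}-H(\mathbf{x}))^{\top}
  \mathbf{R}^{-1}(\mathbf{y}-H(\mathbf{x}))
```

The quadratic background term is where the Gaussian assumption enters: it is the negative log-density of a Gaussian prior on the background error, which is the piece the abstract proposes to replace with a learned generative model.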

View full details

Poster

EAReranker: Efficient Embedding Adequacy Assessment for Retrieval Augmented Generation

Dongyang Zeng · Yaping Liu · Wei Zhang · Shuo Zhang · Xinwang Liu · Binxing Fang

Dec 5, 4:30 PM - 7:30 PM Don Alberto 4

With the increasing adoption of Retrieval-Augmented Generation (RAG) systems for knowledge-intensive tasks, ensuring the adequacy of retrieved documents has become critically important for generation quality. Traditional reranking approaches face three significant challenges: substantial computational overhead that scales with document length, dependency on plain text that limits application in sensitive scenarios, and insufficient assessment of document value beyond simple relevance metrics. We propose EAReranker, an efficient embedding-based adequacy assessment framework that evaluates document utility for RAG systems without requiring access to original text content. The framework quantifies document adequacy through a comprehensive scoring methodology considering verifiability, coverage, completeness and structural aspects, providing interpretable adequacy classifications for downstream applications. EAReranker employs a Decoder-Only Transformer architecture that introduces an embedding dimension expansion method and a bin-aware weighted loss, designed specifically to predict adequacy directly from embedding vectors. Our comprehensive evaluation across four public benchmarks demonstrates that EAReranker achieves competitive performance with state-of-the-art plaintext rerankers while maintaining constant memory usage ($\sim$550 MB) regardless of input length and processing 2-3x faster than traditional approaches. The semantic bin adequacy prediction accuracy of 92.85% LACC@10 and 86.12% LACC@25 demonstrates its capability to effectively filter out inadequate documents that could potentially mislead or adversely impact RAG system performance, thereby ensuring only high-utility information serves as generation context. These results establish EAReranker as an efficient and practical solution for enhancing RAG system performance through improved context selection while addressing the computational and privacy challenges of existing methods.

View full details

Poster

Afterburner: Reinforcement Learning Facilitates Self-Improving Code Efficiency Optimization

Mingzhe Du · Anh Tuan Luu · Yue Liu · Yuhao Qing · Dong HUANG · Xinyi He · Qian Liu · Zejun MA · See-Kiong Ng

Dec 5, 4:30 PM - 7:30 PM Don Alberto 4

Large Language Models (LLMs) generate functionally correct solutions but often fall short in code efficiency, a critical bottleneck for real-world deployment. In this paper, we introduce a novel test-time iterative optimization framework to address this, employing a closed-loop system where LLMs iteratively refine code based on empirical performance feedback from an execution sandbox. We explore three training strategies: Supervised Fine-Tuning (SFT), Direct Preference Optimization (DPO), and Group Relative Policy Optimization (GRPO). Experiments on our Venus dataset and the APPS benchmark show that SFT and DPO rapidly saturate in efficiency gains. In contrast, GRPO, using reinforcement learning (RL) with execution feedback, continuously optimizes code performance, significantly boosting both pass@1 (from 47% to 62%) and the likelihood of outperforming human submissions in efficiency (from 31% to 45%). Our work demonstrates effective test-time code efficiency improvement and critically reveals the power of RL in teaching LLMs to truly self-improve code efficiency. We released our code and data at https://github.com/Elfsong/Afterburner.

View full details

Poster

More Than Just Functional: LLM-as-a-Critique for Efficient Code Generation

Derui Zhu · Dingfan Chen · jinfu chen · Jens Grossklags · Alexander Pretschner · Weiyi Shang

Dec 5, 4:30 PM - 7:30 PM Don Alberto 4

Large language models (LLMs) have demonstrated remarkable progress in generating functional code, leading to numerous AI-based coding tools. However, their reliance on the perplexity objective during both training and inference primarily emphasizes functionality, often at the expense of efficiency, an essential consideration for real-world coding tasks. Interestingly, we observe that well-trained LLMs inherently possess knowledge about code efficiency, but this potential remains underutilized with standard decoding approaches. To address this, we design strategic prompts to activate the model’s embedded efficiency understanding, effectively using LLMs as efficiency critiques to guide code generation toward higher efficiency without sacrificing (and sometimes even improving) functionality, all without the need for costly real code execution. Extensive experiments on benchmark datasets (EffiBench, HumanEval+) across multiple representative code models demonstrate up to a 70.6% reduction in average execution time and a 13.6% decrease in maximum memory usage, highlighting the computational efficiency and practicality of our approach compared to existing alternatives.

View full details

Poster

Linguini: A benchmark for language-agnostic linguistic reasoning

Eduardo Sánchez · Belen Alastruey · Christophe Ropers · Arina Turkatenko · Pontus Lars Erik Saito Stenetorp · Mikel Artetxe · Marta Ruiz Costa-jussà

Dec 5, 4:30 PM - 7:30 PM Don Alberto 4

We propose a new benchmark to measure a language model's linguistic reasoning skills without relying on pre-existing language-specific knowledge. The test covers 894 questions grouped in 160 problems across 75 (mostly) extremely low-resource languages, extracted from the International Linguistic Olympiad corpus. To attain high accuracy on this benchmark, models don't need previous knowledge of the tested language, as all the information needed to solve the linguistic puzzle is presented in the context. We find that, while all analyzed models rank below 25% accuracy, there is a significant gap between open and closed models, with the best-performing proprietary model scoring 24.05% and the best-performing open model 8.84%.

View full details

Poster

OCRBench v2: An Improved Benchmark for Evaluating Large Multimodal Models on Visual Text Localization and Reasoning

Ling Fu · Zhebin Kuang · Jiajun Song · Mingxin Huang · Biao Yang · Yuzhe Li · Linghao Zhu · Qidi Luo · Xinyu Wang · Hao Lu · Zhang Li · Guozhi Tang · Bin Shan · Chunhui Lin · Qi Liu · Binghong Wu · Hao Feng · Hao Liu · Can Huang · Jingqun Tang · Wei Chen · Lianwen Jin · Yuliang Liu · Xiang Bai

Dec 5, 4:30 PM - 7:30 PM Don Alberto 4

Scoring the Optical Character Recognition (OCR) capabilities of Large Multimodal Models (LMMs) has witnessed growing interest. Existing benchmarks have highlighted the impressive performance of LMMs in text recognition; however, their abilities in certain challenging tasks, such as text localization, handwritten content extraction, and logical reasoning, remain underexplored. To bridge this gap, we introduce OCRBench v2, a large-scale bilingual text-centric benchmark with currently the most comprehensive set of tasks ($4\times$ more tasks than the previous multi-scene benchmark OCRBench), the widest coverage of scenarios ($31$ diverse scenarios), and thorough evaluation metrics, with $10,000$ human-verified question-answering pairs and a high proportion of difficult samples. Moreover, we construct a private test set with $1,500$ manually annotated images. The consistent evaluation trends observed across both public and private test sets validate OCRBench v2's reliability. After carefully benchmarking state-of-the-art LMMs, we find that most LMMs score below $50$ (out of $100$) and suffer from five types of limitations: less frequently encountered text recognition, fine-grained perception, layout perception, complex element parsing, and logical reasoning. The benchmark and evaluation scripts are available at https://github.com/Yuliang-Liu/MultimodalOCR.

View full details

Poster

F-Adapter: Frequency-Adaptive Parameter-Efficient Fine-Tuning in Scientific Machine Learning

Hangwei Zhang · Chun Kang · Yan Wang · Difan Zou

Dec 5, 4:30 PM - 7:30 PM Don Alberto 4

Parameter-efficient fine-tuning (PEFT) of powerful pre-trained models for complex downstream tasks has proven effective in vision and language processing, yet this paradigm remains unexplored in scientific machine learning, where the objective is to model complex physical systems. We conduct the first systematic study of PEFT for pre-trained Large Operator Models (LOMs) obtained by scaling variants of Fourier Neural Operator. We observe that the widely used Low-Rank Adaptation (LoRA) yields markedly poorer performance on LOMs than Adapter tuning. We further theoretically establish that stacked LoRA incurs a depth-amplified lower bound on approximation error within Fourier layers, whereas adapters retain universal approximation capacity and, by concentrating parameters on energy-dominant low-frequency modes, attain exponentially decaying error with bottleneck width in the Fourier domain. Motivated by the robust empirical gains of adapters and by our theoretical characterization of PDE solutions as spectrally sparse, we introduce Frequency-Adaptive Adapter (F-Adapter). F-Adapter allocates adapter capacity based on spectral complexity, assigning higher-dimension modules to low-frequency components and lower-dimension modules to high-frequency components. Our F-Adapters establish state-of-the-art results on multiple challenging 3D Navier–Stokes benchmarks, markedly enhancing both generalization and spectral fidelity over LoRA and other PEFT techniques commonly used in LLMs. To the best of our knowledge, this work is the first to explore PEFT for scientific machine learning and establishes F-Adapter as an effective paradigm for this domain. The code will be made publicly available upon acceptance.
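For reference, the LoRA baseline discussed above adds a trainable low-rank product to a frozen weight matrix. The sketch below shows the generic update $W' = W + \frac{\alpha}{r} B A$ on plain Python lists together with the resulting trainable-parameter count. It is the standard LoRA parameterisation rather than anything specific to Fourier layers or to F-Adapter, and all names and values are illustrative.

```python
def matmul(X, Y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]


def apply_lora(W, B, A, alpha):
    """Return W + (alpha / r) * B @ A, where r is the rank (number of rows of A).

    W is d_out x d_in (frozen), B is d_out x r and A is r x d_in (trainable).
    """
    r = len(A)
    scale = alpha / r
    delta = matmul(B, A)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(W, delta)]


def lora_trainable_params(d_out, d_in, r):
    """Number of trainable entries in B and A combined."""
    return r * (d_out + d_in)


if __name__ == "__main__":
    W = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]  # frozen 2 x 3 weight
    B = [[1.0], [2.0]]                        # 2 x 1
    A = [[0.5, 0.0, 1.0]]                     # 1 x 3
    print(apply_lora(W, B, A, alpha=2.0))
    # Rank-8 update of a 1024 x 1024 layer versus full fine-tuning.
    print(lora_trainable_params(1024, 1024, 8), "vs", 1024 * 1024)
```

Because the update is confined to rank $r$, every adapted layer can only move within a low-dimensional subspace, which is the kind of restriction the abstract's lower bound for stacked LoRA concerns.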

View full details

Poster

Storyboard-guided Alignment for Fine-grained Video Action Recognition

Enqi Liu · Liyuan Pan · Yan Yang · Yiran Zhong · Zhijing Wu · Xinxiao Wu · Liu Liu

Dec 5, 4:30 PM - 7:30 PM Don Alberto 4

Fine-grained video action recognition can be formulated as a video–text matching problem. Previous approaches primarily rely on global video semantics to consolidate video embeddings, often leading to misaligned video–text pairs due to inaccurate atomic-level action understanding. This inaccuracy arises because i) videos with distinct global semantics may share similar atomic actions or visual appearances, and ii) atomic actions can be momentary, gradual, or not directly aligned with overarching video semantics. Inspired by storyboarding, where a script is segmented into individual shots, we propose a multi-granularity framework, SFAR. SFAR generates fine-grained descriptions of common atomic actions for each global semantic using a large language model. Unlike existing works that refine global semantics with auxiliary video frames, SFAR introduces a filtering metric to ensure correspondence between the descriptions and the global semantics, eliminating the need for direct video involvement and thereby enabling more nuanced recognition of subtle actions. By leveraging both global semantics and fine-grained descriptions, SFAR effectively identifies prominent frames within videos, thereby improving the accuracy of embedding aggregation. Extensive experiments on various video action recognition datasets demonstrate the competitive performance of SFAR in supervised, few-shot, and zero-shot settings.

View full details

Poster

Turning the Tables: Enabling Backward Transfer via Causal-Aware LoRA in Continual Learning

Chaoyang Li · Runze Ye · Jianyang Qin · Jinhao Cui · Lingzhi Wang · Ning Hu · Qing Liao

Dec 5, 4:30 PM - 7:30 PM Don Alberto 4

Current parameter-efficient fine-tuning (PEFT) methods have shown superior performance in continual learning. However, most existing PEFT-based methods focus on mitigating catastrophic forgetting by limiting modifications to the old task model caused by new tasks. This hinders backward knowledge transfer: when new tasks have a strong positive correlation with old tasks, appropriately training on new tasks can transfer beneficial knowledge to old tasks. Critically, achieving backward knowledge transfer faces two fundamental challenges: (1) some parameters may have little effect on task performance, which constrains the task solution space and model capacity; (2) since old task data are inaccessible, modeling task correlation via shared data is infeasible. To address these challenges, we propose CaLoRA, a novel causal-aware low-rank adaptation framework that is the first PEFT-based continual learning work with backward knowledge transfer. Specifically, we first propose parameter-level counterfactual attribution (PaCA), which estimates the causal effect of LoRA parameters via counterfactual reasoning, identifying effective parameters from a causal view. Second, we propose cross-task gradient adaptation (CaGA) to quantify task correlation by gradient projection and evaluate task affinity based on gradient similarity. By incorporating causal effect, task correlation, and affinity, CaGA adaptively adjusts task gradients, facilitating backward knowledge transfer without relying on data replay. Extensive experiments across multiple benchmarks and continual learning settings show that CaLoRA outperforms state-of-the-art methods. In particular, CaLoRA better mitigates catastrophic forgetting by enabling positive backward knowledge transfer.

View full details

Poster

Semi-supervised Graph Anomaly Detection via Robust Homophily Learning

GUOGUO AI · Hezhe Qiao · Hui Yan · Guansong Pang

Dec 5, 4:30 PM - 7:30 PM Don Alberto 4

Current semi-supervised graph anomaly detection (GAD) methods utilize a small set of labeled normal nodes to identify abnormal nodes from a large set of unlabeled nodes in a graph. These methods posit that 1) normal nodes share a similar level of homophily and 2) the labeled normal nodes can well represent the homophily patterns in the entire normal class. However, these assumptions often do not hold, since normal nodes in a graph can exhibit diverse homophily in real-world GAD datasets. In this paper, we propose RHO, namely Robust Homophily Learning, to adaptively learn such homophily patterns. RHO consists of two novel modules, adaptive frequency response filters (AdaFreq) and graph normality alignment (GNA). AdaFreq learns a set of adaptive spectral filters that capture different frequency components of the labeled normal nodes with varying homophily in the channel-wise and cross-channel views of node attributes. GNA is introduced to enforce consistency between the channel-wise and cross-channel homophily representations to robustify the normality learned by the filters in the two views. Experiments on eight real-world GAD datasets show that RHO can effectively learn varying, often under-represented, homophily in the small labeled node set and substantially outperforms state-of-the-art competing methods. Code is available at https://github.com/mala-lab/RHO.

View full details

Poster

The Quotient Bayesian Learning Rule

Mykola Lukashchuk · Raphaël Trésor · Wouter Nuijten · Ismail Senoz · Bert Vries

Dec 5, 4:30 PM - 7:30 PM Don Alberto 4

This paper introduces the Quotient Bayesian Learning Rule, an extension of natural-gradient Bayesian updates to probability models that fall outside the exponential family. Building on the observation that many heavy-tailed and otherwise non-exponential distributions arise as marginals of minimal exponential families, we prove that such marginals inherit a unique Fisher–Rao information geometry via the quotient-manifold construction. Exploiting this geometry, we derive the Quotient Natural Gradient algorithm, which takes steepest-descent steps in the well-structured covering space, thereby guaranteeing parameterization-invariant optimization in the target space. Empirical results on the Student-$t$ distribution confirm that our method converges more rapidly and attains higher-quality solutions than previous variants of the Bayesian Learning Rule. These findings position quotient geometry as a unifying tool for efficient and principled inference across a broad class of latent-variable models.

View full details

Poster

AffordBot: 3D Fine-grained Embodied Reasoning via Multimodal Large Language Models

Xinyi Wang · Xun Yang · Yanlong Xu · Yuchen Wu · Zhen Li · Na Zhao

Dec 5, 4:30 PM - 7:30 PM Don Alberto 4

Effective human-agent collaboration in physical environments requires understanding not only what to act upon, but also where the actionable elements are and how to interact with them. Existing approaches often operate at the object level or disjointedly handle fine-grained affordance reasoning, lacking coherent, instruction-driven grounding and reasoning. In this work, we introduce a new task: Fine-grained 3D Embodied Reasoning, which requires an agent to predict, for each referenced affordance element in a 3D scene, a structured triplet comprising its spatial location, motion type, and motion axis, based on a task instruction. To solve this task, we propose AffordBot, a novel framework that integrates Multimodal Large Language Models (MLLMs) with a tailored chain-of-thought (CoT) reasoning paradigm. To bridge the gap between 3D input and 2D-compatible MLLMs, we render surround-view images of the scene and project 3D element candidates into these views, forming a rich visual representation aligned with the scene geometry. Our CoT pipeline begins with an active perception stage, prompting the MLLM to select the most informative viewpoint based on the instruction, before proceeding with step-by-step reasoning to localize affordance elements and infer plausible interaction motions. Evaluated on the SceneFun3D dataset, AffordBot achieves state-of-the-art performance, demonstrating strong generalization and physically grounded reasoning with only 3D point cloud input and MLLMs.

View full details

Poster

UFO-RL: Uncertainty-Focused Optimization for Efficient Reinforcement Learning Data Selection

Yang Zhao · Kai Xiong · Xiao Ding · Li Du · Yangou Ouyang · Zhouhao Sun · Jiannan Guan · Wenbin Zhang · Bin Liu · Dong Hu · Bing Qin · Ting Liu

Dec 5, 4:30 PM - 7:30 PM Don Alberto 4

A primary impediment to scaling reinforcement learning (RL) for large language model (LLM) training is the substantial computational cost, predominantly arising from the necessity of multi-sampling for policy optimization and evaluation. This underscores the critical yet challenging nature of efficient training data selection. Drawing inspiration from the Zone of Proximal Development (ZPD) theory, which posits that learners acquire knowledge more effectively from tasks of intermediate difficulty, we hypothesize that LLMs exhibit optimal learning from data they have not yet mastered but demonstrate the potential to comprehend. Conventional methodologies for assessing data difficulty or informativeness typically rely on computationally intensive multi-sampling or iterative procedures. To address this limitation, we introduce UFO-RL (Uncertainty-Focused Optimization for Reinforcement Learning), a novel framework that employs a computationally efficient single-pass uncertainty estimation technique to identify informative training instances. This method, requiring only a single forward pass and obviating the need for iterative next-token computation, achieves a significant acceleration (up to 185$\times$) in data evaluation compared to multi-sampling approaches. UFO-RL leverages this efficient metric to select data within the model's estimated ZPD for training. Extensive experimentation across diverse LLMs and mathematical benchmarks demonstrates that training with a mere 10% of the data, carefully selected by UFO-RL, yields performance comparable to or even surpassing that of full-data training. Furthermore, this targeted data selection results in up to a 16$\times$ reduction in overall training time, concurrently enhancing training stability and improving generalization capabilities. Thus, UFO-RL presents a practical and highly efficient strategy for scaling RL fine-tuning of LLMs by focusing learning efforts on the most informative and valuable data, thereby mitigating the computational bottlenecks associated with traditional RL training.

View full details

Poster

RankSEG-RMA: An Efficient Segmentation Algorithm via Reciprocal Moment Approximation

Zixun Wang · Ben Dai

Dec 5, 4:30 PM - 7:30 PM Don Alberto 4

Semantic segmentation labels each pixel in an image with its corresponding class, and is typically evaluated using the Intersection over Union (IoU) and Dice metrics to quantify the overlap between predicted and ground-truth segmentation masks. In the literature, most existing methods estimate pixel-wise class probabilities, then apply argmax or thresholding to obtain the final prediction. These methods have been shown to generally lead to inconsistent or suboptimal results, as they do not directly maximize segmentation metrics. To address this issue, a novel consistent segmentation framework, RankSEG, has been proposed, which includes RankDice and RankIoU, specifically designed to optimize the Dice and IoU metrics, respectively. Although RankSEG almost guarantees improved performance, it suffers from two major drawbacks. First, it is computationally expensive: RankDice has a complexity of $\mathcal{O}(d \log d)$ with a substantial constant factor (where $d$ represents the number of pixels), while RankIoU exhibits an even higher complexity of $\mathcal{O}(d^2)$, thus limiting its practical application. For instance, in LiTS, prediction with RankSEG takes 16.33 seconds compared to just 0.01 seconds with the argmax rule. Second, RankSEG is only applicable to overlapping segmentation settings, where multiple classes can occupy the same pixel, which contrasts with standard benchmarks that typically assume non-overlapping segmentation. In this paper, we overcome these two drawbacks via a reciprocal moment approximation (RMA) of RankSEG with the following contributions: (i) we improve RankSEG using RMA; the resulting method, RankSEG-RMA, reduces the complexity of both algorithms to $\mathcal{O}(d)$ while maintaining comparable performance; (ii) inspired by RMA, we develop a pixel-wise score function that allows efficient implementation for non-overlapping segmentation settings. We illustrate the effectiveness of our method across various datasets and state-of-the-art models. The code of our method is available at https://github.com/ZixunWang/RankSEG-RMA.
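For readers unfamiliar with the two metrics named in this abstract, the following is a minimal sketch of Dice and IoU for a single binary mask, together with the plain thresholding rule the abstract contrasts with RankSEG. It is not the authors' code and does not implement RankSEG or RMA; the function names `dice_and_iou` and `threshold_mask`, the default threshold of 0.5, and the convention of returning 1.0 for two empty masks are assumptions made for illustration.

```python
import numpy as np


def dice_and_iou(pred, target):
    """Return (Dice, IoU) for two binary masks of the same shape."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    inter = np.logical_and(pred, target).sum()
    union = np.logical_or(pred, target).sum()
    total = pred.sum() + target.sum()
    # Assumed convention: two empty masks count as a perfect match.
    dice = 2.0 * inter / total if total > 0 else 1.0
    iou = inter / union if union > 0 else 1.0
    return float(dice), float(iou)


def threshold_mask(probs, tau=0.5):
    """Baseline rule: predict a pixel as foreground when its probability >= tau."""
    return np.asarray(probs) >= tau


# Toy 2x2 example with hypothetical probabilities.
probs = [[0.9, 0.6], [0.4, 0.1]]
target = [[1, 0], [1, 0]]
dice, iou = dice_and_iou(threshold_mask(probs), target)
```

In this toy case the thresholded prediction shares one pixel with the target, and each mask has two foreground pixels, so Dice is 2·1/4 and IoU is 1/3.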

View full details

Poster

A Dynamic Learning Strategy for Dempster-Shafer Theory with Applications in Classification and Enhancement

Linlin Fan · Xingyu Liu · Mingliang Zhou · Xuekai Wei · Weizhi Xian · Jielu Yan · Weijia Jia

Dec 5, 4:30 PM - 7:30 PM Don Alberto 4

Effective modelling of uncertain information is crucial for quantifying uncertainty. Dempster–Shafer evidence (DSE) theory is a widely recognized approach for handling uncertain information. However, current methods often neglect the inherent a priori information within data during modelling, and imbalanced data lead to insufficient attention to key information in the model. To address these limitations, this paper presents a dynamic learning strategy based on a nonuniform splitting mechanism and Hilbert space mapping. First, the framework uses the nonuniform splitting mechanism to dynamically adjust the weights of data subsets and combines it with a diffusion factor to effectively incorporate the a priori information in the data, thereby flexibly addressing uncertainty and conflict. Second, the conflict in the information fusion process is reduced by Hilbert space mapping. Experimental results on multiple tasks show that the proposed method significantly outperforms state-of-the-art methods and effectively improves the performance of classification and low-light image enhancement (LLIE) tasks. The code is available at https://anonymous.4open.science/r/Third-ED16.

View full details

Poster

Cognitive Predictive Processing: A Human-inspired Framework for Adaptive Exploration in Open-World Reinforcement Learning

boheng liu · Ziyu Li · Chenghua Duan · YuTian Liu · Zhuo Wang · Xiuxing Li · Qing Li · Xia Wu

Dec 5, 4:30 PM - 7:30 PM Don Alberto 4

Open-world reinforcement learning challenges agents to develop intelligent behavior in vast exploration spaces. Recent approaches like LS-Imagine have advanced the field by extending imagination horizons through jumpy state transitions, yet remain limited by fixed exploration mechanisms and static jump thresholds that cannot adapt across changing task phases, resulting in inefficient exploration and lower completion rates. Humans demonstrate remarkable capabilities in open-world decision-making through a chain-like process of task decomposition, selective memory utilization, and adaptive uncertainty regulation. Inspired by human decision-making processes, we present Cognitive Predictive Processing (CPP), a novel framework that integrates three neurologically-inspired systems: a phase-adaptive cognitive controller that dynamically decomposes tasks into exploration, approach, and completion phases with adaptive parameters; a dual-memory integration system implementing dual-modal memory that balances immediate context with selective long-term storage; and an uncertainty-modulated prediction regulator that continuously updates environmental predictions to modulate exploration behavior. Comprehensive experiments in MineDojo demonstrate that these human-inspired decision-making strategies enhance performance over recent techniques, with success rates improving by an average of 4.6% across resource collection tasks while reducing task completion steps by an average of 7.1%. Our approach bridges cognitive neuroscience and reinforcement learning, excelling in complex scenarios that require sustained exploration and strategic adaptation while demonstrating how neural-inspired models can solve key challenges in open-world AI systems.

View full details

Poster

SPARTAN: A Sparse Transformer World Model Attending to What Matters

Anson Lei · Bernhard Schölkopf · Ingmar Posner

Dec 5, 4:30 PM - 7:30 PM Don Alberto 4

Capturing the interactions between entities in a structured way plays a central role in world models that flexibly adapt to changes in the environment. Recent works motivate the benefits of models that explicitly represent the structure of interactions and formulate the problem as discovering local causal structures. In this work, we demonstrate that reliably capturing these relationships in complex settings remains challenging. To remedy this shortcoming, we postulate that sparsity is a critical ingredient for the discovery of such local structures. To this end, we present the SPARse TrANsformer World model (SPARTAN), a Transformer-based world model that learns context-dependent interaction structures between entities in a scene. By applying sparsity regularisation on the attention patterns between object-factored tokens, SPARTAN learns sparse, context-dependent interaction graphs that accurately predict future object states. We further extend our model to adapt to sparse interventions with unknown targets on the dynamics of the environment. This results in a highly interpretable world model that can efficiently adapt to changes. Empirically, we evaluate SPARTAN against the current state-of-the-art in object-centric world models on observation-based environments and demonstrate that our model can learn local causal graphs that accurately reflect the underlying interactions between objects and achieve significantly improved few-shot adaptation to dynamics changes as well as robustness against distractors.

View full details

Poster

Prompt Tuning Transformers for Data Memorization

Haiyu Wang · Yuanyuan Lin

Dec 5, 4:30 PM - 7:30 PM Don Alberto 4

Prompt tuning has emerged as a powerful parameter-efficient fine-tuning technique, allowing large pretrained Transformers to adapt to downstream tasks by optimizing a small set of prompt embeddings. Despite its empirical success, the extent to which prompt tuning can memorize data remains poorly understood. In this paper, we provide both theoretical and empirical analyses of the data memorization ability of prompt-tuned Transformers. Building on recent theoretical frameworks, we derive an upper bound on the required prompt length for exact memorization of finite datasets and establish a trade-off between prompt length and the number of autoregressive generation steps. Specifically, we show that a constant-size Transformer can memorize $n$ input-output pairs with prompts of length $\tilde{O}(\sqrt{nN})$, where $N$ denotes the sequence length. Empirical results further demonstrate that prompt-tuned, randomly initialized Transformers are able to effectively memorize finite datasets. These models also capture the intrinsic low-rank structure of the data, leading to a reduction in the required prompt length. Finally, we analyze how the initialization of the Transformer backbone affects the performance of prompt tuning. Our findings provide new insights into the expressivity, efficiency, and underlying mechanisms of prompt tuning, bridging theoretical memorization limits with observed empirical behaviors.

View full details

Poster

Neural Attention Search

Difan Deng · Marius Lindauer

Dec 5, 4:30 PM - 7:30 PM Don Alberto 4

We present Neural Attention Search (NAtS), an end-to-end learnable sparse transformer that automatically evaluates the importance of each token within a sequence and determines whether the corresponding token can be dropped after several steps. To this end, we design a search space that contains three token types: (i) Global Tokens are preserved and queried by all subsequent tokens. (ii) Local Tokens survive until the next global token appears. (iii) Sliding Window Tokens influence the inference of only a fixed number of subsequent tokens. Similar to the One-Shot Neural Architecture Search approach, this token-type information can be learned jointly with the architecture weights via a learnable attention mask. Experiments on both training a new transformer from scratch and fine-tuning existing large language models show that NAtS can efficiently reduce the KV cache and the inference costs of transformer-based models while maintaining the models' performance.
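The three token types described above can be pictured as a causal attention mask. The sketch below builds such a mask from hand-assigned token types; it is only an illustration, not the NAtS implementation, which learns the token types jointly with the model weights. The function name `token_type_mask`, the labels `"G"`, `"L"`, `"W"`, the `window` parameter, and the choice that a local token is still visible to the global token that ends its lifetime are all assumptions.

```python
import numpy as np


def token_type_mask(types, window):
    """Build a boolean causal mask: mask[i, j] is True if query i may attend to key j.

    types  : sequence of "G" (global), "L" (local), or "W" (sliding window)
    window : number of positions a sliding-window token stays visible (assumed)
    """
    n = len(types)
    mask = np.zeros((n, n), dtype=bool)
    for i in range(n):
        for j in range(i + 1):
            if j == i or types[j] == "G":
                # Every token sees itself; global tokens are kept for all later queries.
                mask[i, j] = True
            elif types[j] == "L":
                # Local tokens are dropped once a global token has appeared strictly
                # between key j and query i (assumed boundary convention).
                mask[i, j] = "G" not in types[j + 1:i]
            else:
                # Sliding-window tokens are visible only to nearby later queries.
                mask[i, j] = (i - j) < window
    return mask


# Hypothetical hand-labelled sequence of five tokens.
mask = token_type_mask(["G", "L", "W", "G", "W"], window=2)
kept_for_last_query = int(mask[-1].sum())
```

Under these assumptions, the last query attends only to the two global tokens and itself, so three of the five cached entries are needed at that step.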

View full details

Poster

UniGTE: Unified Graph–Text Encoding for Zero-Shot Generalization across Graph Tasks and Domains

Duo Wang · Yuan Zuo · Guangyue Lu · Junjie Wu

Don Alberto 4

Generalizing to unseen graph tasks without task-specific supervision is challenging: conventional graph neural networks are typically tied to a fixed label space, while large language models (LLMs) struggle to capture graph structure. We introduce UniGTE, an instruction-tuned encoder–decoder framework that unifies structural and semantic reasoning. The encoder augments a pretrained autoregressive LLM with learnable alignment tokens and a structure-aware graph–text attention mechanism, enabling it to attend jointly to a tokenized graph and a natural-language task prompt while remaining permutation-invariant to node order. This yields compact, task-aware graph representations. Conditioned solely on these representations, a frozen LLM decoder predicts and reconstructs: it outputs the task answer and simultaneously paraphrases the input graph in natural language. The reconstruction objective regularizes the encoder to preserve structural cues. UniGTE is instruction-tuned on five datasets spanning node-, edge-, and graph-level tasks across diverse domains, yet requires no fine-tuning at inference. It achieves new state-of-the-art zero-shot results on node classification, link prediction, graph classification and graph regression under cross-task and cross-domain settings, demonstrating that tight integration of graph structure with LLM semantics enables robust, transferable graph reasoning.

View full details