Poster Session
Mexico City Poster Session 5
Don Alberto 4
AngleRoCL: Angle-Robust Concept Learning for Physically View-Invariant Adversarial Patches
Wenjun Ji · Yuxiang Fu · Luyang Ying · Deng-Ping Fan · Yuyi Wang · Ming-Ming Cheng · Ivor Tsang · Qing Guo
Cutting-edge works have demonstrated that text-to-image (T2I) diffusion models can generate adversarial patches that mislead state-of-the-art object detectors in the physical world, revealing detectors' vulnerabilities and risks. However, these methods neglect the T2I patches' attack effectiveness when observed from different views in the physical world (i.e., the angle robustness of T2I adversarial patches). In this paper, we study the angle robustness of T2I adversarial patches comprehensively, revealing their angle-robustness issues: the text prompt significantly affects the angle robustness of generated patches, and task-specific linguistic instructions fail to enhance it. Motivated by these studies, we introduce Angle-Robust Concept Learning (AngleRoCL), a simple and flexible approach that learns a generalizable concept (i.e., text embeddings in implementation) representing the capability of generating angle-robust patches. The learned concept can be incorporated into textual prompts and guides T2I models to generate patches whose attack effectiveness is inherently resistant to viewpoint variations. Through extensive simulation and physical-world experiments on five SOTA detectors across multiple views, we demonstrate that AngleRoCL significantly enhances the angle robustness of T2I adversarial patches compared to baseline methods. Our patches maintain high attack success rates even under challenging viewing conditions, with over 50% average relative improvement in attack effectiveness across multiple angles. This research advances the understanding of physically angle-robust patches and provides insights into the relationship between textual concepts and physical properties in T2I-generated content. Our code is released at https://github.com/tsingqguo/anglerocl.
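As described, the learned concept is a text embedding optimized so that patches generated from prompts containing it stay effective across viewpoints. Below is a heavily simplified sketch of such a textual-inversion-style loop; `generate_patch` and `detector_loss` are toy stand-ins for the paper's T2I generator and detection objective, and the loop structure is our assumption (the released repository is authoritative).

```python
import torch

# Hypothetical stand-ins for the paper's T2I pipeline and detector; the
# real implementation lives at https://github.com/tsingqguo/anglerocl.
def generate_patch(concept_embed):
    # Toy "generator": maps the learnable concept embedding to a patch.
    return torch.tanh(concept_embed.mean() + 0.01 * torch.randn(3, 64, 64))

def detector_loss(patch, angle_rad):
    # Toy detection confidence of the patched object seen at `angle_rad`.
    return (patch.mean() * torch.cos(torch.tensor(angle_rad))) ** 2

concept = torch.randn(1, 768, requires_grad=True)  # the learnable text embedding
opt = torch.optim.Adam([concept], lr=1e-2)
angles = [-60.0, -30.0, 0.0, 30.0, 60.0]  # viewpoints (degrees) to be robust to

for step in range(200):
    patch = generate_patch(concept)
    # Average the attack objective over viewpoints so the concept encodes
    # angle robustness rather than a single-view attack.
    loss = torch.stack(
        [detector_loss(patch, torch.pi * a / 180) for a in angles]).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
```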
User-Instructed Disparity-aware Defocus Control
Yudong Han · Yan Yang · Hao Yang · Liyuan Pan
In photography, an All-in-Focus (AiF) image may not always effectively convey the creator’s intent. Professional photographers manipulate Depth of Field (DoF) to control which regions appear sharp or blurred, achieving compelling artistic effects. For general users, the ability to flexibly adjust DoF enhances creative expression and image quality. In this paper, we propose UiD, a User-Instructed DoF control framework that allows users to specify refocusing regions using text, box, or point prompts; UiD then automatically simulates in-focus and out-of-focus (OoF) regions in the given images. However, controlling defocus blur in a single-lens camera remains challenging due to the difficulty of estimating depth-aware aberrations and the suboptimal quality of reconstructed AiF images. To address this, we leverage dual-pixel (DP) sensors, commonly found in DSLR-style and mobile cameras. DP sensors provide a small-baseline stereo pair in a single snapshot, enabling depth-aware aberration estimation. Our approach first establishes an invertible mapping between OoF and AiF images to learn spatially varying defocus kernels and disparity features. These depth-aware kernels enable bidirectional image transformation, deblurring OoF images into AiF representations and conversely reblurring AiF images into OoF outputs, by seamlessly switching between the kernel and its inverse form. For user-guided refocusing, we generate masks from user prompts using SAM; the masks modulate the disparity features in closed form, allowing dynamic kernel re-estimation for reblurring and thus user-controlled refocusing. Extensive experiments on both public datasets and a self-collected dataset demonstrate that UiD offers superior flexibility and quality in DoF manipulation.
HypoBootstrap: A Bootstrapping Framework for Inductive Reasoning
Si Chen · Yifei Li · Richong Zhang
Inductive reasoning infers general rules from observed evidence and is one of the most critical abilities of intelligence. Previous works have succeeded in formal languages but suffer from onerous and error-prone conversions between a particular formal language and the working language. As large language models (LLMs) have emerged, direct reasoning in various kinds of languages, especially natural languages, without formal-language involvement has become feasible. However, existing LLM-based inductive reasoning usually relies on the LLM's intrinsic generation ability, which is prone to hallucination and lacks systematic guidance grounded in the nature of inductive reasoning. To this end, we propose HypoBootstrap, an integrated framework for inductive reasoning that both generates and confirms hypotheses in a bootstrapping manner. For hypothesis generation, we propose a novel bootstrapping strategy that builds object hypotheses, relational hypotheses, and functional hypotheses successively, guiding the LLM from trivial to non-trivial patterns in the evidence. For hypothesis confirmation, we utilize Glymour's theory of bootstrap confirmation, a hypothesis-confirmation theory from the philosophy of science that can confirm a set of hypotheses, and apply its principles to confirm the object, relational, and functional hypotheses. Empirical studies on four inductive reasoning scenarios of different natures, covering causal induction, concept learning, grammar learning, and abstract reasoning, demonstrate that HypoBootstrap significantly outperforms existing methods.
When Less Language is More: Language-Reasoning Disentanglement Makes LLMs Better Multilingual Reasoners
Weixiang Zhao · Jiahe Guo · Yang Deng · Tongtong Wu · Wenxuan Zhang · Yulin Hu · Xingyu Sui · Yanyan Zhao · Wanxiang Che · Bing Qin · Tat-Seng Chua · Ting Liu
Multilingual reasoning remains a significant challenge for large language models (LLMs), with performance disproportionately favoring high-resource languages. Drawing inspiration from cognitive neuroscience, which suggests that human reasoning functions largely independently of language processing, we hypothesize that LLMs similarly encode reasoning and language as separable components that can be disentangled to enhance multilingual reasoning. To evaluate this, we perform a causal intervention by ablating language-specific representations at inference time. Experiments on 10 open-weight LLMs spanning 11 typologically diverse languages show that this language-specific ablation consistently boosts multilingual reasoning performance. Layer-wise analyses further confirm that language and reasoning representations can be effectively disentangled throughout the model, yielding improved multilingual reasoning capabilities, while preserving top-layer language features remains essential for maintaining linguistic fidelity. Compared to post-training methods such as supervised fine-tuning or reinforcement learning, our training-free language-reasoning disentanglement achieves comparable or superior results with minimal computational overhead. These findings shed light on the internal mechanisms underlying multilingual reasoning in LLMs and suggest a lightweight and interpretable strategy for improving cross-lingual generalization.
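A minimal sketch of the kind of inference-time intervention described here, assuming the language-specific component is summarized by a single direction in activation space (the paper's ablation may operate on richer subspaces). The mean-difference estimator below is our illustrative assumption, not necessarily the paper's:

```python
import torch

def ablate_language_direction(hidden: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    """Remove the component of `hidden` along the unit direction `v`.

    hidden: (batch, seq, d) residual-stream activations at one layer.
    v:      (d,) estimated language-specific direction.
    """
    v = v / v.norm()
    return hidden - (hidden @ v).unsqueeze(-1) * v

# One common way to estimate such a direction (an assumption here): the
# difference of mean activations between two languages on parallel prompts.
acts_lang_a = torch.randn(100, 4096)  # layer activations on language A
acts_lang_b = torch.randn(100, 4096)  # layer activations on language B
v = acts_lang_a.mean(0) - acts_lang_b.mean(0)

h = torch.randn(2, 16, 4096)
h_ablated = ablate_language_direction(h, v)
```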
Teaching Language Models to Evolve with Users: Dynamic Profile Modeling for Personalized Alignment
Weixiang Zhao · Xingyu Sui · Yulin Hu · Jiahe Guo · Haixiao Liu · Biye Li · Yanyan Zhao · Bing Qin · Ting Liu
Personalized alignment is essential for enabling large language models (LLMs) to engage effectively in user-centric dialogue. While recent prompt-based and offline optimization methods offer preliminary solutions, they fall short in cold-start scenarios and long-term personalization due to their inherently static and shallow designs. In this work, we introduce the Reinforcement Learning for Personalized Alignment (RLPA) framework, in which an LLM interacts with a simulated user model to iteratively infer and refine user profiles through dialogue. The training process is guided by a dual-level reward structure: the Profile Reward encourages accurate construction of user representations, while the Response Reward incentivizes generation of responses consistent with the inferred profile. We instantiate RLPA by fine-tuning Qwen-2.5-3B-Instruct, resulting in Qwen-RLPA, which achieves state-of-the-art performance in personalized dialogue. Empirical evaluations demonstrate that Qwen-RLPA consistently outperforms prompting and offline fine-tuning baselines, and even surpasses advanced commercial models such as Claude-3.5 and GPT-4o. Further analysis highlights Qwen-RLPA's robustness in reconciling conflicting user preferences, sustaining long-term personalization and delivering more efficient inference compared to recent reasoning-focused LLMs. These results emphasize the potential of dynamic profile inference as a more effective paradigm for building personalized dialogue systems.
SHAP values via sparse Fourier representation
Ali Gorji · Andisheh Amrollahi · Andreas Krause
SHAP (SHapley Additive exPlanations) values are a widely used method for local feature attribution in interpretable and explainable AI. We propose an efficient two-stage algorithm for computing SHAP values in both the black-box setting and for tree-based models. Motivated by spectral bias in real-world predictors, we first approximate models using compact Fourier representations, exactly for trees and approximately for black-box models. In the second stage, we introduce a closed-form formula for {\em exactly} computing SHAP values using the Fourier representation, which ``linearizes'' the computation into a simple summation and is amenable to parallelization. As the Fourier approximation is computed only once, our method enables amortized SHAP value computation, achieving significant speedups over existing methods and a tunable trade-off between efficiency and precision.
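The ``linearization'' rests on a standard property: Shapley values are linear in the model. Once $f$ is written (sparsely) in the Fourier basis $\{\chi_S\}$, the SHAP value of $f$ decomposes into a sum over retained frequencies; the paper's contribution is the closed form for each per-frequency term $\phi_i(\chi_S)$. Schematically:

```latex
f(x) \;=\; \sum_{S \in \mathcal{S}} \hat{f}(S)\,\chi_S(x)
\quad\Longrightarrow\quad
\phi_i(f, x) \;=\; \sum_{S \in \mathcal{S}} \hat{f}(S)\,\phi_i(\chi_S, x),
```

where $\mathcal{S}$ is the sparse support. Each summand is independent of the others, which is what makes the computation a simple, parallelizable summation that can be amortized across queries.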
Reward-Aware Proto-Representations in Reinforcement Learning
Hon Tik Tse · Siddarth Chandrasekar · Marlos C. Machado
In recent years, the successor representation (SR) has attracted increasing attention in reinforcement learning (RL), where it has been used to address some of the field's key challenges, such as exploration, credit assignment, and generalization. The SR can be seen as representing the underlying credit assignment structure of the environment by implicitly encoding its induced transition dynamics. However, the SR is reward-agnostic. In this paper, we discuss a similar representation that also takes into account the reward dynamics of the problem. We study the default representation (DR), a recently proposed representation that has so far received limited theoretical (and empirical) analysis. Here, we lay some of the theoretical foundation underlying the DR in the tabular case by deriving (1) dynamic programming and (2) temporal-difference methods to learn the DR, (3) characterizing the basis for the vector space of the DR, and (4) formally extending the DR to the function approximation case through default features. Empirically, we analyze the benefits of the DR in many of the settings in which the SR has been applied, including (1) reward shaping, (2) option discovery, (3) exploration, and (4) transfer learning. Our results show that, compared to the SR, the DR gives rise to qualitatively different, reward-aware behaviour and quantitatively better performance in several settings.
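For background, the tabular SR obeys a simple temporal-difference update, and per the abstract the paper derives the analogous dynamic-programming and TD methods for the DR. The sketch below shows only the standard SR update (not the paper's reward-aware DR update):

```python
import numpy as np

n_states, gamma, alpha = 5, 0.95, 0.1
M = np.zeros((n_states, n_states))  # successor representation M(s, s')

def sr_td_update(s: int, s_next: int) -> None:
    # Standard tabular TD update for the SR; the paper derives the
    # analogous (reward-aware) update for the default representation.
    one_hot = np.eye(n_states)[s]
    M[s] += alpha * (one_hot + gamma * M[s_next] - M[s])

# Example transition stream from some behavior policy.
for s, s_next in [(0, 1), (1, 2), (2, 3), (3, 4), (4, 0)] * 200:
    sr_td_update(s, s_next)
```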
SSIMBaD: Sigma Scaling with SSIM-Guided Balanced Diffusion for AnimeFace Colorization
Junpyo Seo · Hanbin Koo · Jieun Yook · Byung-Ro Moon
We propose a novel diffusion-based framework for automatic colorization of Anime-style facial sketches, which preserves the structural fidelity of the input sketch while effectively transferring stylistic attributes from a reference image. Our approach builds upon recent continuous-time diffusion models, but departs from traditional methods that rely on predefined noise schedules, which often fail to maintain perceptual consistency across the generative trajectory. To address this, we introduce SSIMBaD (Sigma Scaling with SSIM-Guided Balanced Diffusion), a sigma-space transformation that ensures linear alignment of perceptual degradation, as measured by structural similarity. This perceptual scaling enforces uniform visual difficulty across timesteps, enabling more balanced and faithful reconstructions. On a large-scale Anime face dataset, SSIMBaD attains state-of-the-art structural fidelity and strong perceptual quality, with robust generalization to diverse styles and structural variations.
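The core calibration idea, measuring perceptual degradation with SSIM and choosing noise levels so that it declines linearly over the trajectory, can be sketched numerically. The interpolation-based inversion below is our toy illustration, not the paper's exact sigma-space transformation:

```python
import numpy as np
from skimage.metrics import structural_similarity as ssim

rng = np.random.default_rng(0)
x = rng.random((64, 64))            # toy "clean" image in [0, 1]
eps = rng.standard_normal(x.shape)  # fixed noise direction for the probe

# Measure SSIM(x, noisy(x, sigma)) over a sigma grid; SSIM decreases
# (roughly monotonically) as sigma grows.
sigmas = np.linspace(1e-3, 2.0, 64)
ssims = np.array(
    [ssim(x, np.clip(x + s * eps, 0, 1), data_range=1.0) for s in sigmas])

# Invert the curve: pick sigma(t) so that SSIM decays *linearly* in the
# normalized timestep t -- the balancing idea the method formalizes.
T = 32
targets = np.linspace(ssims[0], ssims[-1], T)  # linear SSIM decay targets
sigma_schedule = np.interp(targets, ssims[::-1], sigmas[::-1])
```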
Connectome-Based Modelling Reveals Orientation Maps in the Drosophila Optic Lobe
Jia Nuo Liew · Shenghan Lin · Bowen Chen · Wei Zhang · Xiaowei Zhu · Wei Zhang · Xiaolin Hu
The ability to extract oriented edges from visual input is a core computation across animal vision systems. Orientation maps, long associated with the layered architecture of the mammalian visual cortex, systematically organise neurons by their preferred edge orientation. Despite lacking cortical structures, the Drosophila melanogaster brain contains feature-selective neurons and exhibits complex visual detection capabilities, raising the question of whether map-like visual representations can emerge without cortical infrastructure. We integrate a complete fruit fly brain connectome with biologically grounded spiking neuron models to simulate neural processing in the fly visual system. By driving the network with oriented stimuli and analysing downstream responses, we show that coherent orientation maps can emerge from purely connectome-constrained dynamics. These results suggest that species of independent evolutionary origin can converge on similar visual structures.
Aligning Evaluation with Clinical Priorities: Calibration, Label Shift, and Error Costs
Gerardo Flores · Alyssa H. Smith · Julia Fukuyama · Ashia Wilson
Machine learning-based decision support systems are increasingly deployed in clinical settings, where probabilistic scoring functions are used to inform and prioritize patient management decisions. However, widely used evaluation metrics, such as accuracy and AUC-ROC, fail to adequately reflect key clinical priorities, including calibration, robustness to distributional shifts, and sensitivity to asymmetric error costs. In this work, we propose a principled yet practical evaluation framework for selecting calibrated thresholded classifiers that explicitly accounts for uncertainty in class prevalences and domain-specific cost asymmetries. Building on the theory of proper scoring rules, particularly the Schervish representation, we derive an adjusted variant of cross-entropy (the log score) that averages cost-weighted performance over clinically relevant ranges of class balance. The resulting evaluation is simple to apply, sensitive to clinical deployment conditions, and designed to prioritize models that are both calibrated and robust to real-world variations.
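Concretely, a Schervish-style construction expresses a proper score as a mixture of cost-weighted threshold losses; a schematic rendering (the paper's exact weighting and prevalence averaging may differ):

```latex
S_{\omega}(p, y) \;=\; \int_0^1 \ell_c(p, y)\, \omega(c)\, \mathrm{d}c,
\qquad
\ell_c(p, y) \;=\; c\,(1-y)\,\mathbf{1}\{p \ge c\} \;+\; (1-c)\,y\,\mathbf{1}\{p < c\}.
```

Choosing $\omega(c) \propto 1/\big(c(1-c)\big)$ recovers the usual log score; the adjusted variant instead concentrates $\omega$ on the clinically relevant range of cost thresholds and averages over plausible class prevalences.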
3D Gaussian Flats: Hybrid 2D/3D Photometric Scene Reconstruction
Maria Taktasheva · Lily Goli · Alessandro Fiorini · Zhen Li · Daniel Rebain · Andrea Tagliasacchi
Recent advances in radiance fields and novel view synthesis enable the creation of realistic digital twins from photographs. However, current methods struggle with flat, texture-less surfaces, creating uneven and semi-transparent reconstructions due to an ill-conditioned photometric reconstruction objective. Surface reconstruction methods solve this issue but sacrifice visual quality. We propose a novel hybrid 2D/3D representation that jointly optimizes constrained planar (2D) Gaussians for modeling flat surfaces and freeform (3D) Gaussians for the rest of the scene. Our end-to-end approach dynamically detects and refines planar regions, improving both visual fidelity and geometric accuracy. It achieves state-of-the-art depth estimation on ScanNet++ and ScanNetv2, and excels at mesh extraction without overfitting to a specific camera model, showing its effectiveness in producing high-quality reconstructions of indoor scenes.
A Closer Look at Graph Transformers: Cross-Aggregation and Beyond
Jiaming Zhuo · Ziyi Ma · Yintong Lu · Yuwei Liu · Kun Fu · Di Jin · Chuan Wang · Wu Wenning · Zhen Wang · Xiaochun Cao · Liang Yang
Graph Transformers (GTs), which effectively capture long-range dependencies and structural biases simultaneously, have recently emerged as promising alternatives to traditional Graph Neural Networks (GNNs). Advanced GT approaches leverage topology information by integrating GNN modules or by modulating node attributes with positional encodings. Unfortunately, the underlying mechanism driving their effectiveness remains insufficiently understood. In this paper, we revisit these strategies and uncover a shared underlying mechanism, Cross Aggregation, that effectively captures the interaction between graph topology and node attributes. Building on this insight, we propose the Universal Graph Cross-attention Transformer (UGCFormer), a universal GT framework with linear computational complexity. The idea is to interactively learn the representations of graph topology and node attributes through a linearized Dual Cross-attention (DCA) module. In theory, this module can adaptively capture interactions between these two types of graph information, thereby achieving effective aggregation. To alleviate overfitting arising from the dual-channel design, we introduce a consistency constraint that enforces representational alignment. Extensive evaluations on multiple benchmark datasets demonstrate the effectiveness and efficiency of UGCFormer.
AdaLRS: Loss-Guided Adaptive Learning Rate Search for Efficient Foundation Model Pretraining
Hongyuan Dong · Dingkang Yang · Xiao Liang · Chao Feng · Ran Jiao
The learning rate is widely regarded as crucial for effective foundation model pretraining. Recent research explores and demonstrates the transferability of learning rate configurations across varying model and dataset sizes. Nevertheless, these approaches are constrained to specific training scenarios and typically necessitate extensive hyperparameter tuning on proxy models. In this work, we propose \textbf{AdaLRS}, a plug-and-play adaptive learning rate search algorithm that conducts online optimal learning rate search by optimizing the loss descent velocity. We provide theoretical and experimental analyses showing that foundation model pretraining loss and its descent velocity are both convex in the learning rate and share the same optimum. Relying solely on training loss dynamics, AdaLRS requires little extra computation to guide the search process, and its convergence is guaranteed via theoretical analysis. Experiments on both LLM and VLM pretraining show that AdaLRS adjusts suboptimal learning rates to the neighborhood of the optimum with marked efficiency and effectiveness, improving model performance accordingly. We also show the robust generalizability of AdaLRS across varying training scenarios, such as different model sizes, training paradigms, base learning rate scheduler choices, and hyperparameter settings.
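A toy rendering of the loss-descent-velocity idea on a one-dimensional quadratic. The candidate grid, probe length, and velocity estimator below are illustrative assumptions; AdaLRS's actual search rule and guarantees are in the paper:

```python
import numpy as np

def descent_velocity(losses):
    """Negative slope of a least-squares fit: how fast the loss is falling."""
    t = np.arange(len(losses))
    return -np.polyfit(t, np.asarray(losses), 1)[0]

# Toy objective f(w) = w^2 (gradient 2w). Periodically probe scaled
# learning rates for a few steps and keep whichever yields the highest
# loss-descent velocity.
def probe(w, lr, steps=20):
    losses = []
    for _ in range(steps):
        w -= lr * 2 * w
        losses.append(w * w)
    return descent_velocity(losses)

w, lr = 5.0, 1e-3
for _ in range(10):
    candidates = [lr / 2, lr, lr * 2]       # search grid is an assumption
    lr = max(candidates, key=lambda c: probe(w, c))
    for _ in range(50):                      # train with the selected rate
        w -= lr * 2 * w
```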
AI Progress Should Be Measured by Capability-Per-Resource, Not Scale Alone: A Framework for Gradient-Guided Resource Allocation in LLMs
David McCoy · Yulun Wu · Zachary Butzin-Dozier
This position paper challenges the "scaling fundamentalism" dominating AI research, where unbounded growth in model size and computation has led to unsustainable environmental impacts and widening resource inequality. We argue that LLM development should be fundamentally reoriented toward capability-per-resource rather than capability alone. We present a theoretical framework demonstrating that resource-allocation decisions guided by gradient influence patterns can dramatically improve efficiency throughout the AI lifecycle. Our analysis shows that in transformer-based models, where a small fraction of parameters exert outsized influence (following heavy-tailed distributions), three critical insights emerge: (1) updating only high-influence parameters strictly outperforms full-parameter tuning on a performance-per-resource basis; (2) simple gradient norms provide computationally efficient proxies for identifying these high-influence components; and (3) coordinated parameter and data selection yields multiplicative efficiency gains, potentially reducing resource requirements by orders of magnitude. Building on these theoretical foundations, we propose a two-stage paradigm—marginal-return pretraining for foundation developers and influence-guided adaptation for downstream users—bridged by gradient blueprints, metadata describing which parameters matter most for various tasks. This capability-per-resource perspective transforms what were once considered pragmatic hardware workarounds into theoretically optimal strategies, democratizing access to cutting-edge AI capabilities while significantly reducing environmental impact. By embedding resource consciousness into how we develop, adapt, and evaluate models, we can reshape AI progress toward a more sustainable and equitable future.
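The paper's insight (2), gradient norms as cheap proxies for parameter influence, is straightforward to sketch. The tensor-level granularity and the 25% budget below are illustrative choices, not the paper's prescription:

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 2))
x, y = torch.randn(128, 32), torch.randint(0, 2, (128,))
loss = nn.functional.cross_entropy(model(x), y)
loss.backward()

# Rank parameter tensors by gradient norm (the cheap influence proxy the
# position paper advocates) and freeze everything outside the top fraction.
budget = 0.25  # update only the top 25% of tensors; fraction is illustrative
norms = {n: p.grad.norm().item() for n, p in model.named_parameters()}
k = max(1, int(budget * len(norms)))
keep = set(sorted(norms, key=norms.get, reverse=True)[:k])
for n, p in model.named_parameters():
    p.requires_grad_(n in keep)
```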
Bridging Time and Linguistics: LLMs as Time Series Analyzer through Symbolization and Segmentation
Jianyang Qin · Chaoyang Li · Jinhao Cui · Lingzhi Wang · Zhao Liu · Qing Liao
Recent studies reveal that Large Language Models (LLMs) exhibit strong sequential reasoning capabilities, allowing them to replace specialized time-series models and serve as foundation models for complex time-series analysis. To activate the capabilities of LLMs for time-series tasks, numerous studies have attempted to bridge the gap between time series and linguistics by aligning textual representations with time-series patterns. However, losslessly capturing the infinite time-domain variability in natural language is a non-trivial endeavor, leading to suboptimal alignment performance. Beyond representation, contextual differences (in time series, semantics are conveyed by runs of consecutive points, whereas in text they are carried by individual tokens) are often overlooked by existing methods. To address these issues, we propose S$^2$TS-LLM, a simple yet effective framework to repurpose LLMs for universal time series analysis through two main paradigms: (i) a spectral symbolization paradigm transforms time series into frequency-domain representations characterized by a fixed number of components and prominent amplitudes, which enables a limited set of symbols to effectively abstract key frequency features; (ii) a contextual segmentation paradigm partitions the sequence into blocks based on temporal patterns and reassigns positional encodings accordingly, thereby mitigating the structural mismatch between time series and natural language. Together, these paradigms bootstrap the LLMs' perception of temporal patterns and structures, effectively bridging time series and linguistics. Extensive experiments show that S$^2$TS-LLM can serve as a powerful time series analyzer, outperforming state-of-the-art methods across time series tasks.
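Paradigm (i) can be sketched with a plain FFT: keep a fixed number of dominant frequency components and map their amplitudes to a small symbol alphabet. The component count and quantization scheme below are illustrative assumptions:

```python
import numpy as np

def spectral_symbolize(series: np.ndarray, k: int = 8, n_bins: int = 16):
    """Map a time series to (frequency index, amplitude symbol) pairs.

    Keeps the k most prominent rFFT components and quantizes their
    amplitudes into a small alphabet -- a schematic version of a limited
    symbol set abstracting key frequency features.
    """
    spec = np.fft.rfft(series - series.mean())
    amps = np.abs(spec)
    top = np.argsort(amps)[-k:][::-1]                  # dominant frequencies
    bins = np.linspace(0, amps[top].max() + 1e-9, n_bins + 1)
    symbols = np.digitize(amps[top], bins) - 1         # amplitude tokens
    return list(zip(top.tolist(), symbols.tolist()))

t = np.linspace(0, 1, 256, endpoint=False)
x = np.sin(2 * np.pi * 5 * t) + 0.5 * np.sin(2 * np.pi * 20 * t)
print(spectral_symbolize(x))
```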
COALA: Numerically Stable and Efficient Framework for Context-Aware Low-Rank Approximation
Uliana Parkina · Maxim Rakhuba
Recent studies suggest that context-aware low-rank approximation is a useful tool for the compression and fine-tuning of modern large-scale neural networks. In this type of approximation, the norm is weighted by a matrix of input activations, significantly improving metrics over the unweighted case. Nevertheless, existing methods for neural networks suffer from numerical instabilities due to their reliance on classical formulas that explicitly compute the Gram matrix and then invert it. We demonstrate that this can degrade the approximation quality or produce numerically singular matrices. To address these limitations, we propose a novel inversion-free regularized framework that is based entirely on stable decompositions and overcomes the numerical pitfalls of prior art. Our method can handle all possible challenging scenarios: (1) when calibration matrices exceed GPU memory capacity, (2) when input activation matrices are nearly singular, and even (3) when insufficient data prevents a unique approximation. For the latter, we prove that our solution converges to a desired approximation and derive explicit error bounds.
Compressed and Smooth Latent Space for Text Diffusion Modeling
Viacheslav Meshchaninov · Egor Chimbulatov · Alexander Shabalin · Aleksandr Abramov · Dmitry Vetrov
Autoregressive language models dominate modern text generation, yet their sequential nature introduces fundamental limitations: decoding is slow, and maintaining global coherence remains challenging. Diffusion models offer a promising alternative by enabling parallel generation and flexible control; however, their application to text generation is hindered by the high dimensionality of token-level representations. We introduce Cosmos, a novel approach to text generation that operates entirely in a compressed, smooth latent space tailored specifically for diffusion. This space is learned using an autoencoder trained simultaneously for token-level reconstruction and alignment with frozen activations from a pretrained language encoder, providing robust semantic grounding and enabling effective perturbation-based augmentations. Empirically, we demonstrate that text representations can be compressed up to $8\times$ while maintaining generation quality comparable to token-level diffusion models. Furthermore, increasing the latent sequence length allows Cosmos to surpass both diffusion-based and autoregressive baselines. We evaluate Cosmos on four diverse generative tasks (story generation, question generation, summarization, and detoxification) and compare it with various generative paradigms. Cosmos achieves comparable or superior generation quality while offering more than $2\times$ faster inference. Code is released at https://github.com/MeshchaninovViacheslav/cosmos.
CymbaDiff: Structured Spatial Diffusion for Sketch-based 3D Semantic Urban Scene Generation
Li Liang · Bo Miao · Xinyu Wang · Naveed Akhtar · Jordan Vice · Ajmal Mian
Outdoor 3D semantic scene generation produces realistic and semantically rich environments for applications such as urban simulation and autonomous driving. However, advances in this direction are constrained by the absence of publicly available, well-annotated datasets. We introduce SketchSem3D, the first large-scale benchmark for generating 3D outdoor semantic scenes from abstract freehand sketches and pseudo-labeled annotations of satellite images. SketchSem3D includes two subsets, Sketch-based SemanticKITTI and Sketch-based KITTI-360 (containing LiDAR voxels along with their corresponding sketches and annotated satellite images), to enable standardized, rigorous, and diverse evaluations. We also propose Cylinder Mamba Diffusion (CymbaDiff), which significantly enhances spatial coherence in outdoor 3D scene generation. CymbaDiff imposes structured spatial ordering, explicitly captures cylindrical continuity and vertical hierarchy, and preserves both physical neighborhood relationships and global context within the generated scenes. Extensive experiments on SketchSem3D demonstrate that CymbaDiff achieves superior semantic consistency, spatial realism, and cross-dataset generalization. The code and dataset will be made available.
DrivingRecon: Large 4D Gaussian Reconstruction Model For Autonomous Driving
Hao Lu · Tianshuo Xu · Wenzhao Zheng · Yunpeng Zhang · Wei Zhan · Dalong Du · Masayoshi Tomizuka · Kurt Keutzer · Yingcong Chen
Large reconstruction models have made remarkable progress: they can directly predict 3D or 4D representations for unseen scenes and objects. However, current work has not systematically explored the potential of large reconstruction models in the field of autonomous driving. To this end, we introduce the Large 4D Gaussian Reconstruction Model (DrivingRecon). With an elaborate yet simple framework design, DrivingRecon not only ensures efficient and high-quality reconstruction but also provides potential for downstream tasks. There are two core contributions. First, the Prune and Dilate Block (PD-Block) prunes redundant and overlapping Gaussian points and dilates Gaussian points for complex objects. Second, dynamic and static decoupling is tailored to better learn temporally consistent geometry across time. Experimental results demonstrate that DrivingRecon significantly improves scene reconstruction quality compared to existing methods. Furthermore, we explore applications of DrivingRecon in model pre-training, vehicle type adaptation, and scene editing. Our code will be made available.
GLNCD: Graph-Level Novel Category Discovery
Bowen Deng · Lele Fu · Sheng Huang · Tianchi Liao · Jialong Chen · Zhang Tao · Chuan Chen
Graph classification has long assumed a closed-world setting, limiting its applicability to real-world scenarios where new categories often emerge. To address this limitation, we introduce Graph-Level Novel Category Discovery (GLNCD), a new task aimed at identifying unseen graph categories without supervision from novel classes. We first adapt classical Novel Category Discovery (NCD) methods for images to the graph domain and evaluate these baseline methods on four diverse graph datasets curated for the GLNCD task. Our analysis reveals that these methods suffer a notable performance degradation compared to their image-based counterparts, due to two key challenges: (1) insufficient utilization of structural information in graph self-supervised learning (SSL), and (2) ineffective pseudo-labeling strategies based on ranking statistics (RS) that neglect graph structure. To alleviate these issues, we propose ProtoFGW-NCD, a framework consisting of two core components: ProtoFGW-CL, a novel graph SSL framework, and FGW-RS, a structure-aware pseudo-labeling method. Both components employ a differentiable Fused Gromov-Wasserstein (FGW) distance to effectively compare graphs by incorporating structural information. These components are built upon learnable prototype graphs, which enable efficient, parallel FGW-based graph comparisons and capture representative patterns within graph datasets. Experiments on four GLNCD benchmark datasets demonstrate the effectiveness of ProtoFGW-NCD.
Light-Weight Diffusion Multiplier and Uncertainty Quantification for Fourier Neural Operators
Albert Matveev · Sanmitra Ghosh · Aamal Hussain · James-Michael Leahy · Michalis Michaelides
Operator learning is a powerful paradigm for solving partial differential equations, with Fourier Neural Operators (FNOs) serving as a widely adopted foundation. However, FNOs face significant scalability challenges due to overparameterization and offer no native uncertainty quantification (UQ) -- a key requirement for reliable scientific and engineering applications. Instead, neural operators rely on post hoc UQ methods that ignore geometric inductive biases. In this work, we introduce DINOZAUR: a diffusion-based neural operator parametrization with uncertainty quantification. Inspired by the structure of the heat kernel, DINOZAUR replaces the dense tensor multiplier in FNOs with a dimensionality-independent diffusion multiplier that has a single learnable time parameter per channel, drastically reducing the parameter count and memory footprint without compromising predictive performance. By defining priors over those time parameters, we cast DINOZAUR as a Bayesian neural operator that yields spatially correlated outputs and calibrated uncertainty estimates. Our method achieves competitive or superior performance across several PDE benchmarks while providing efficient uncertainty quantification.
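The heat-kernel intuition is concrete: in Fourier space, diffusion for time $t$ multiplies mode $\xi$ by $\exp(-t\lVert\xi\rVert^2)$. The sketch below applies such a multiplier with one learnable time per channel in place of a dense spectral weight tensor; the exact parametrization (and the Bayesian priors over the times) in DINOZAUR may differ:

```python
import torch

def diffusion_spectral_layer(u: torch.Tensor, log_t: torch.Tensor) -> torch.Tensor:
    """Heat-kernel-style spectral multiplier, one learnable time per channel.

    u:     (batch, channels, h, w) input fields.
    log_t: (channels,) learnable log-diffusion-times.
    In Fourier space the heat kernel acts as exp(-t * |xi|^2), so each
    channel costs a single scalar instead of a dense tensor of weights.
    """
    b, c, h, w = u.shape
    ky = torch.fft.fftfreq(h).view(-1, 1)
    kx = torch.fft.rfftfreq(w).view(1, -1)
    k2 = kx ** 2 + ky ** 2                                   # |xi|^2 on the rfft grid
    mult = torch.exp(-log_t.exp().view(1, c, 1, 1) * k2)     # (1, c, h, w//2+1)
    return torch.fft.irfft2(torch.fft.rfft2(u) * mult, s=(h, w))

u = torch.randn(2, 4, 32, 32)
out = diffusion_spectral_layer(u, log_t=torch.zeros(4, requires_grad=True))
```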
Memory-Augmented Potential Field Theory: A Framework for Adaptive Control in Non-Convex Domains
Dongzhe Zheng · Wenjie Mei
Stochastic optimal control methods often struggle in complex non-convex landscapes, frequently becoming trapped in local optima due to their inability to learn from historical trajectory data. This paper introduces Memory-Augmented Potential Field Theory, a unified mathematical framework that integrates historical experience into stochastic optimal control. Our approach dynamically constructs memory-based potential fields that identify and encode key topological features of the state space, enabling controllers to automatically learn from past experiences and adapt their optimization strategy. We provide a theoretical analysis showing that memory-augmented potential fields possess non-convex escape properties, asymptotic convergence characteristics, and computational efficiency. We implement this theoretical framework in a Memory-Augmented Model Predictive Path Integral (MPPI) controller that demonstrates significantly improved performance in challenging non-convex environments. The framework represents a generalizable approach to experience-based learning within control systems (especially robotic dynamics), enhancing their ability to navigate complex state spaces without requiring specialized domain knowledge or extensive offline training.
Mesh Interpolation Graph Network for Dynamic and Spatially Irregular Global Weather Forecasting
Zinan Zheng · Yang Liu · Jia Li
Graph neural networks have shown promising results in weather forecasting, which is critical for human activities such as agricultural planning and extreme-weather preparation. However, most studies focus on finite, local areas for training, overlooking the influence of broader areas and limiting their ability to generalize effectively. Thus, in this work, we study global weather forecasting over station networks that are irregularly distributed and dynamically varying in practice, which requires the model to generalize to unobserved locations. To address these challenges, we propose a general Mesh Interpolation Graph Network (MIGN) that models forecasting over irregular weather stations, consisting of two key designs: (1) learning spatially irregular data with a regular mesh interpolation network to align the data; (2) leveraging a parametric spherical harmonics location embedding to further enhance spatial generalization. Extensive experiments on an up-to-date observation dataset show that MIGN significantly outperforms existing data-driven models. We further show that MIGN has spatial generalization ability and is capable of generalizing to previously unseen stations.
More Than Generation: Unifying Generation and Depth Estimation via Text-to-Image Diffusion Models
Hongkai Lin · Dingkang Liang · Mingyang Du · Xin Zhou · Xiang Bai
Generative depth estimation methods leverage the rich visual priors stored in pretrained text-to-image diffusion models, demonstrating astonishing zero-shot capability. However, parameter updates during training lead to catastrophic degradation of the pretrained model's image generation capability. We introduce MERGE, a unified model for image generation and depth estimation that starts from a fixed-parameter pretrained text-to-image model. MERGE demonstrates that a pretrained text-to-image model can do more than image generation: it can be extended to depth estimation effortlessly. Specifically, MERGE introduces a plug-and-play framework that enables seamless switching between image generation and depth estimation modes through simple and pluggable converters. Meanwhile, we propose a Group Reuse Mechanism to encourage parameter reuse and improve the utilization of the additional learnable parameters. MERGE unleashes the powerful depth estimation capability of the pretrained text-to-image model while preserving its original image generation ability. Compared to other unified models for image generation and depth estimation, MERGE achieves state-of-the-art performance across multiple depth estimation benchmarks. The code and model will be made available.
Mysteries of the Deep: Role of Intermediate Representations in Out of Distribution Detection
Ignacio Meza De la Jara · Cristian Rodriguez-Opazo · Damien Teney · Damith Ranasinghe · Ehsan Abbasnejad
Out-of-distribution (OOD) detection is essential for reliably deploying machine learning models in the wild. Yet, most methods treat large pre-trained models as monolithic encoders and rely solely on their final-layer representations for detection. We challenge this convention: we reveal that the intermediate layers of pre-trained models, shaped by residual connections that subtly transform input projections, can encode surprisingly rich and diverse signals for detecting distributional shifts. Importantly, to exploit latent representation diversity across layers, we introduce an entropy-based criterion that automatically identifies the layers offering the most complementary information in a training-free setting, without access to OOD data. We show that selectively incorporating these intermediate representations can increase OOD detection accuracy by up to $10\%$ on far-OOD and over $7\%$ on near-OOD benchmarks compared to state-of-the-art training-free methods across various model architectures and training objectives. Our findings reveal a new avenue for OOD detection research and uncover the impact of various training objectives and model architectures on confidence-based OOD detection methods.
Navigating the MIL Trade-Off: Flexible Pooling for Whole Slide Image Classification
Hossein Jafarinia · Danial Hamdi · Amirhossein Alamdar · Elahe Zahiri · Soroush Vafaie Tabar · Alireza Alipanah · Nahal Mirzaie · Saeed Razavi · Amir Najafi · Mohammad Hossein Rohban
Multiple Instance Learning (MIL) is a standard weakly supervised approach for Whole Slide Image (WSI) classification, where performance hinges on both feature representation and MIL pooling strategies. Recent research has predominantly focused on Transformer-based architectures adapted for WSIs. However, we argue that this trend faces a fundamental limitation: data scarcity. In typical settings, Transformer models yield only marginal gains without access to large-scale datasets—resources that are virtually inaccessible to all but a few well-funded research labs. Motivated by this, we revisit simple, non-attention MIL with unsupervised slide features and analyze temperature-$\beta$-controlled log-sum-exp (LSE) pooling. For slides partitioned into $N$ patches, we theoretically show that LSE has a smooth transition at a critical $\beta_{\mathrm{crit}}=\mathcal{O}(\log N)$ threshold, interpolating between mean-like aggregation (stable, better generalization but less sensitive) and max-like aggregation (more sensitive but looser generalization bounds). Grounded in this analysis, we introduce Maxsoft—a novel MIL pooling function that enables flexible control over this trade-off, allowing adaptation to specific tasks and datasets. To further tackle real-world deployment challenges such as specimen heterogeneity, we propose PerPatch augmentation—a simple yet effective technique that enhances model robustness. Empirically, Maxsoft achieves state-of-the-art performance in low-data regimes across four major benchmarks (CAMELYON16, CAMELYON17, TCGA-Lung, and SICAP-MIL), often matching or surpassing large-scale foundation models. When combined with PerPatch augmentation, this performance is further improved through increased robustness. Code is available at \href{https://github.com/jafarinia/maxsoft}{\texttt{https://github.com/jafarinia/maxsoft}}
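The analyzed LSE pooling has a simple closed form whose limits bracket the trade-off the abstract describes (Maxsoft itself is the paper's new pooling function and is not reproduced here). A quick numerical check of the mean/max interpolation:

```python
import numpy as np
from scipy.special import logsumexp

def lse_pool(scores: np.ndarray, beta: float) -> float:
    """Temperature-controlled log-sum-exp pooling over patch scores.

    Interpolates between mean pooling (beta -> 0) and max pooling
    (beta -> inf); the paper's analysis places the smooth transition
    around beta_crit = O(log N) for N patches.
    """
    n = scores.shape[0]
    return float((logsumexp(beta * scores) - np.log(n)) / beta)

scores = np.random.default_rng(0).normal(size=1000)
print(lse_pool(scores, 1e-4), scores.mean())   # ~ mean aggregation
print(lse_pool(scores, 1e4), scores.max())     # ~ max aggregation
```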
Nearly-Linear Time Private Hypothesis Selection with the Optimal Approximation Factor
Maryam Aliakbarpour · Zhan Shi · Ria Stevens · Vincent Wang
Estimating the density of a distribution from its samples is a fundamental problem in statistics. \emph{Hypothesis selection} addresses the setting where, in addition to a sample set, we are given $n$ candidate distributions---referred to as \emph{hypotheses}---and the goal is to determine which one best describes the underlying data distribution. This problem is known to be solvable very efficiently, requiring roughly $O(\log n)$ samples and running in $\tilde{O}(n)$ time. The quality of the output is measured via the total variation distance to the unknown distribution, and the approximation factor of the algorithm determines how large this distance is compared to the optimal distance achieved by the best candidate hypothesis. It is known that $\alpha = 3$ is the optimal approximation factor for this problem. We study hypothesis selection under the constraint of \emph{differential privacy}. We propose a differentially private algorithm in the central model that runs in nearly linear time with respect to the number of hypotheses, achieves the optimal approximation factor, and incurs only a modest increase in sample complexity, which remains polylogarithmic in $n$. This resolves an open question posed by [Bun, Kamath, Steinke, Wu, NeurIPS 2019]. Prior to our work, existing upper bounds required quadratic time.
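Schematically, the guarantee described here has the following shape (our rendering; constants and the precise sample complexity are in the paper): the algorithm is differentially private, runs in $\tilde{O}(n)$ time, and returns $\hat{h}$ with

```latex
\Pr\Big[\, d_{\mathrm{TV}}(\hat{h}, p) \;\le\; 3 \cdot \min_{j \in [n]} d_{\mathrm{TV}}(h_j, p) \;+\; \varepsilon \,\Big] \;\ge\; 1 - \delta,
```

using a number of samples that remains polylogarithmic in $n$, while matching the optimal approximation factor $\alpha = 3$.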
Rectifying Soft-Label Entangled Bias in Long-Tailed Dataset Distillation
Chenyang Jiang · Hang Zhao · Xinyu Zhang · Zhengcen Li · Qiben Shan · Shaocong Wu · Jingyong Su
Dataset distillation compresses large-scale datasets into compact, highly informative synthetic data, significantly reducing storage and training costs. However, existing research primarily focuses on balanced datasets and struggles to perform under real-world long-tailed distributions. In this work, we emphasize the critical role of soft labels in long-tailed dataset distillation and uncover the underlying mechanisms contributing to performance degradation. Specifically, we derive an imbalance-aware generalization bound for models trained on distilled datasets. We then identify two primary sources of soft-label bias, originating from the distillation model and the distilled images, through systematic perturbation of the data imbalance levels. To address this, we propose ADSA, an Adaptive Soft-label Alignment module that calibrates the entangled biases. This lightweight module integrates seamlessly into existing distillation pipelines and consistently improves performance. On ImageNet-1k-LT with EDC and IPC=50, ADSA improves tail-class accuracy by up to 11.8\% and raises overall accuracy to 41.4\%. Extensive experiments demonstrate that ADSA provides a robust and generalizable solution under limited label budgets and across a range of distillation techniques.
Reduction-based Pseudo-label Generation for Instance-dependent Partial Label Learning
Congyu Qiao · Ning Xu · Yihao Hu · Xin Geng
Instance-dependent Partial Label Learning (ID-PLL) aims to learn a multi-class predictive model from training instances annotated with feature-related candidate labels, among which the correct labels are hidden: fixed but unknown. Previous works leverage the identification capability of the training model itself to iteratively refine the supervision information. However, these methods overlook a critical aspect of ID-PLL: within the original label space, the model may fail to distinguish incorrect candidate labels that are strongly correlated with features from the correct labels. This leads to poor-quality supervision signals and creates a bottleneck in training. In this paper, we propose to leverage reduction-based pseudo-labels to alleviate the influence of incorrect candidate labels and train our predictive model past this bottleneck. Specifically, reduction-based pseudo-labels are generated by performing weighted aggregation on the outputs of a multi-branch auxiliary model, with each branch trained in a label subspace that excludes certain labels. This design ensures that each branch explicitly avoids the disturbance of the excluded labels, allowing the pseudo-labels for instances troubled by those labels to benefit from the unaffected branches. Theoretically, we demonstrate that reduction-based pseudo-labels exhibit greater consistency with the Bayes optimal classifier than pseudo-labels generated directly from the training predictive model.
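A schematic version of the aggregation step, assuming fixed branch weights (the paper's weighting is learned) and hypothetical branch outputs:

```python
import numpy as np

def reduction_pseudo_label(branch_logits, excluded, candidates, weights, n_classes):
    """Aggregate branch predictions into a pseudo-label over candidate labels.

    branch_logits: list of per-branch logits, branch b scoring only the
                   labels it keeps (all classes except `excluded[b]`).
    This weighted-aggregation scheme is a sketch of the idea, not the
    paper's exact procedure.
    """
    probs = np.zeros(n_classes)
    for logits, excl, w in zip(branch_logits, excluded, weights):
        kept = [c for c in range(n_classes) if c not in excl]
        p = np.exp(logits - logits.max())
        p /= p.sum()                                   # softmax over kept labels
        for c, pc in zip(kept, p):
            probs[c] += w * pc
    mask = np.zeros(n_classes)
    mask[list(candidates)] = 1.0
    probs *= mask                                      # restrict to candidate set
    return probs / probs.sum()

pl = reduction_pseudo_label(
    branch_logits=[np.array([2.0, 0.5, 0.1]), np.array([1.5, 1.2, 0.3])],
    excluded=[{3}, {0}], candidates={0, 1, 3},
    weights=[0.6, 0.4], n_classes=4)
```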
Resource-Constrained Federated Continual Learning: What Does Matter?
Yichen Li · Yuying Wang · Jiahua Dong · Haozhao Wang · Yining Qi · Rui Zhang · Ruixuan Li
Federated Continual Learning (FCL) aims to enable sequential, privacy-preserving model training on streams of incoming data that vary across edge devices, preserving previous knowledge while adapting to new data. The current FCL literature focuses on restricted data privacy and access to previously seen data while imposing no constraints on the training overhead. This is unrealistic for FCL applications in real-world scenarios, where edge devices are primarily constrained by resources such as storage, computational budget, and label rate. We revisit this problem with a large-scale benchmark and analyze the performance of state-of-the-art FCL approaches under different resource-constrained settings. Various typical FCL techniques and six datasets in two incremental learning scenarios (Class-IL and Domain-IL) are involved in our experiments. Through extensive experiments amounting to over 1,000 GPU hours, we find that, under limited resource-constrained settings, existing FCL approaches, without exception, fail to achieve the expected performance. Our conclusions are consistent under sensitivity analysis. This suggests that most existing FCL methods are too resource-dependent for real-world deployment. Moreover, we study the performance of typical FCL techniques under resource constraints and shed light on future research directions in FCL.
Sample-Efficient Tabular Self-Play for Offline Robust Reinforcement Learning
Na Li · Zewu Zheng · Wei Ni · Hangguan Shan · Wenjie Zhang · Xinyu Li
Multi-agent reinforcement learning (MARL), as a thriving field, explores how multiple agents independently make decisions in a shared dynamic environment. Due to environmental uncertainties, policies in MARL must remain robust to tackle the sim-to-real gap. We focus on robust two-player zero-sum Markov games (TZMGs) in offline settings, specifically on tabular robust TZMGs (RTZMGs). We propose a model-based algorithm (RTZ-VI-LCB) for offline RTZMGs, which combines optimistic robust value iteration with a data-driven Bernstein-style penalty term for robust value estimation. By accounting for distribution shifts in the historical dataset, the proposed algorithm establishes near-optimal sample complexity guarantees under partial coverage and environmental uncertainty. An information-theoretic lower bound is developed to confirm the tightness of our algorithm's sample complexity, which is optimal with respect to both the state and action spaces. To the best of our knowledge, RTZ-VI-LCB is the first algorithm to attain this optimality, setting a new benchmark for offline RTZMGs; we also validate it experimentally.
Second-order Optimization under Heavy-Tailed Noise: Hessian Clipping and Sample Complexity Limits
Abdurakhmon Sadiev · Peter Richtarik · Ilyas Fatkhullin
Heavy-tailed noise is pervasive in modern machine learning applications, arising from data heterogeneity, outliers, and non-stationary stochastic environments. While second-order methods can significantly accelerate convergence in light-tailed or bounded-noise settings, such algorithms are often brittle and lack guarantees under heavy-tailed noise—precisely the regimes where robustness is most critical. In this work, we take a first step toward a theoretical understanding of second-order optimization under heavy-tailed noise. We consider a setting where stochastic gradients and Hessians have only bounded $p$-th moments, for some $p\in (1,2]$, and establish tight lower bounds on the sample complexity of any second-order method. We then develop a variant of normalized stochastic gradient descent that leverages second-order information and provably matches these lower bounds. To address the instability caused by large deviations, we introduce a novel algorithm based on gradient and Hessian clipping, and prove high-probability upper bounds that nearly match the fundamental limits. Our results provide the first comprehensive sample complexity characterization for second-order optimization under heavy-tailed noise. This positions Hessian clipping as a robust and theoretically sound strategy for second-order algorithm design in heavy-tailed regimes.
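The clipping operator at the heart of such algorithms has a standard form; a schematic version for the gradient and Hessian estimates (the paper's precise update rule and norm choices are in the text):

```latex
\mathrm{clip}_{\lambda}(v) \;=\; \min\!\Big(1,\; \frac{\lambda}{\lVert v \rVert}\Big)\, v,
\qquad
\widetilde{g}_t = \mathrm{clip}_{\lambda_g}\big(\widehat{\nabla f}(x_t)\big),
\quad
\widetilde{H}_t = \mathrm{clip}_{\lambda_H}\big(\widehat{\nabla^2 f}(x_t)\big),
```

with $\lVert\cdot\rVert$ the Euclidean norm for gradient estimates and a matrix norm (e.g., spectral) for Hessian estimates. The clipped quantities then drive a normalized second-order step, with the thresholds $\lambda_g, \lambda_H$ chosen in terms of the bounded $p$-th moments.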
THD-BAR: Topology Hierarchical Derived Brain Autoregressive Modeling for EEG Generic Representations
Wenchao Yang · Weidong Yan · Wenkang Liu · Yulan Ma · Yang Li
Large-scale pre-trained models hold significant potential for learning universal EEG representations. However, most existing methods, particularly autoregressive (AR) frameworks, primarily rely on straightforward temporal sequencing of multi-channel EEG data, which fails to capture the rich physiological characteristics inherent to EEG signals. Moreover, their time-centered modeling approach also limits the effective representation of the dynamic spatial topology of brain activity. To address these challenges and fully exploit the potential of large-scale EEG models, we propose a novel Topology Hierarchical Derived Brain Autoregressive Modeling (THD-BAR) for EEG generic representations. The core innovation of THD-BAR lies in the introduction of the Brain Topology Hierarchy (BTH), which establishes a multi-scale spatial order for EEG channels. This hierarchical structure enables a redefinition of autoregressive learning as a "next-scale-time prediction" problem, effectively capturing both spatial and temporal dynamics. Based on BTH, we design a Topology-Hierarchical Vector Quantized-Variational Autoencoder (THVQ-VAE) for multi-scale tokenization and develop an enhanced Brain Autoregressive (BAR) module with specialized masking strategies for prediction. Through extensive large-scale pre-training on 17 datasets, followed by rigorous validation on 10 downstream datasets spanning 5 distinct tasks, THD-BAR consistently outperforms existing methods. These results highlight the superior generalization and modeling capabilities of our proposed approach.
The Hawthorne Effect in Reasoning Models: Evaluating and Steering Test Awareness
Sahar Abdelnabi · Ahmed Salem
Reasoning-focused LLMs sometimes alter their behavior when they detect that they are being evaluated—which can lead them to optimize for test-passing performance or to comply more readily with harmful prompts if real-world consequences appear absent. We present the first quantitative study of how such “test awareness” impacts model behavior, particularly its performance on safety-related tasks. We introduce a white-box probing framework that (i) linearly identifies awareness-related activations and (ii) steers models toward or away from test awareness while monitoring downstream performance. We apply our method to different state-of-the-art open-weight reasoning LLMs across both realistic and hypothetical tasks (denoting tests or simulations). Our results demonstrate that test awareness significantly impacts safety alignment (such as compliance with harmful requests and conforming to stereotypes) with effects varying in both magnitude and direction across models. By providing control over this latent effect, our work aims to provide a stress-test mechanism and increase trust in how we perform safety evaluations.
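The probing-and-steering recipe described here follows a common white-box pattern: fit a linear probe on layer activations to find an "awareness" direction, then shift activations along it at inference. A minimal, hypothetical sketch (the layer choice, probe training, and strength alpha are all assumptions):

```python
import torch
import torch.nn as nn

# Toy stand-in for a transformer block's residual stream; in practice the
# hook would be registered on a chosen decoder layer of the LLM.
layer = nn.Linear(512, 512)

# `v_awareness` would come from a linear probe trained to separate
# "evaluation" vs "real-world" prompts on this layer's activations.
v_awareness = torch.randn(512)
v_awareness = v_awareness / v_awareness.norm()
alpha = 4.0  # steering strength; sign steers toward (+) or away (-) from awareness

def steer(module, inputs, output):
    # Shift every activation along the probe direction.
    return output + alpha * v_awareness

handle = layer.register_forward_hook(steer)
out = layer(torch.randn(2, 512))  # activations shifted along the probe direction
handle.remove()
```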
The Quest for Universal Master Key Filters in DS-CNNs
Zahra Babaiee · Peyman M. Kiasari · Daniela Rus · Radu Grosu
A recent study has proposed the ``Master Key Filters Hypothesis'' for convolutional neural network filters. This paper extends that hypothesis by radically constraining its scope to a single set of just 8 universal filters to which depthwise separable convolutional networks (DS-CNNs) inherently converge. While conventional DS-CNNs employ thousands of distinct trained filters, our analysis reveals that these filters are predominantly linear shifts (ax+b) of our discovered universal set. Through systematic unsupervised search, we extracted these fundamental patterns across different architectures and datasets. Remarkably, networks initialized with these 8 unique frozen filters achieve over 80\% ImageNet accuracy, and even outperform models with thousands of trainable parameters when applied to smaller datasets. The identified master key filters closely match Difference of Gaussians (DoGs), Gaussians, and their derivatives, structures that are not only fundamental to classical image processing but also strikingly similar to receptive fields in mammalian visual systems. Our findings provide compelling evidence that depthwise convolutional layers naturally gravitate toward this fundamental set of spatial operators regardless of task or architecture. This work offers new insights for understanding generalization and transfer learning through the universal language of these master key filters.
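The filter family named here is classical and easy to construct; the kernels below are standard DoG/Gaussian-derivative examples of that family, not the paper's exact learned set:

```python
import numpy as np

def gaussian2d(size=7, sigma=1.0):
    """Normalized 2-D Gaussian kernel."""
    ax = np.arange(size) - size // 2
    xx, yy = np.meshgrid(ax, ax)
    g = np.exp(-(xx**2 + yy**2) / (2 * sigma**2))
    return g / g.sum()

g_small, g_large = gaussian2d(7, 1.0), gaussian2d(7, 2.0)
dog = g_small - g_large            # Difference of Gaussians (center-surround)
dx = np.gradient(g_small, axis=1)  # first derivative of Gaussian (edge, x)
dy = np.gradient(g_small, axis=0)  # first derivative of Gaussian (edge, y)
# The paper reports that trained depthwise filters are predominantly linear
# shifts a*f + b of a small universal set resembling such patterns.
```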
Tight Bounds on the Distortion of Randomized and Deterministic Distributed Voting
Mohammad Abam · Davoud Kareshki · Marzieh Nilipour · MohammadHossein Paydar · Masoud Seddighin
We study metric distortion in distributed voting, where $n$ voters are partitioned into $k$ groups, each selecting a local representative, and a final winner is chosen from these representatives (or from the entire set of candidates). This setting models systems like U.S. presidential elections, where state-level decisions determine the national outcome. We focus on four cost objectives from Anshelevich et al. [2022]: Avg-Avg, Avg-Max, Max-Avg, and Max-Max. We present improved distortion bounds for both deterministic and randomized mechanisms, offering a near-complete characterization of distortion in this model. For deterministic mechanisms, we reduce the upper bound for Avg-Max from $11$ to $7$, establish a tight lower bound of $5$ for Max-Avg (improving on $2+\sqrt{5}$), and tighten the upper bound for Max-Max from $5$ to $3$. For randomized mechanisms, we consider two settings: (i) only the second stage is randomized, and (ii) both stages may be randomized. In case (i), we prove tight bounds: $5-2/k$ for Avg-Avg, $3$ for Avg-Max and Max-Max, and $5$ for Max-Avg. In case (ii), we show tight bounds of $3$ for Max-Avg and Max-Max, and nearly tight bounds for Avg-Avg and Avg-Max within $[3-2/n,\ 3-2/(kn^*)]$ and $[3-2/n,\ 3]$, respectively, where $n^*$ denotes the largest group size.
Towards foundational LiDAR world models with efficient latent flow matching
Tianran Liu · Shengwen Zhao · Nicholas Rhinehart
LiDAR-based world models offer more structured and geometry-aware representations than their image-based counterparts. However, existing LiDAR world models are narrowly trained; each model excels only in the domain for which it was built. This raises a critical question: can we develop LiDAR world models that exhibit strong transferability across multiple domains? To answer this, we conduct the first systematic domain transfer study across three demanding scenarios: (i) outdoor-to-indoor generalization, (ii) sparse-to-dense-beam adaptation, and (iii) non-semantic-to-semantic transfer. Across different amounts of fine-tuning data, our experiments show that a single pretrained model can achieve up to an 11\% absolute improvement (83\% relative) over training from scratch and outperforms training from scratch in 30 of 36 comparisons. This transferability significantly reduces the reliance on manually annotated data for semantic occupancy forecasting: our method exceeds previous baselines with only 5\% of the labeled training data used by prior work. We also observe inefficiencies in current generative LiDAR world models, mainly their under-compression of LiDAR data and inefficient training objectives. To address these issues, we propose a latent conditional flow matching (CFM)-based framework that achieves state-of-the-art reconstruction accuracy using only half the training data and a compression ratio 6 times higher than that of prior methods. Our model also achieves SOTA performance on semantic occupancy forecasting while being 1.98x-23x more computationally efficient (a 1.1x-3.9x FPS speedup) than previous methods.
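For reference, the generic (latent) conditional flow matching objective with a linear probability path is compact enough to sketch; the latent dimension, network, and unconditional form below are illustrative simplifications of what such a framework trains:

```python
import torch
import torch.nn as nn

velocity_net = nn.Sequential(nn.Linear(64 + 1, 256), nn.SiLU(), nn.Linear(256, 64))

def cfm_loss(z1: torch.Tensor) -> torch.Tensor:
    """Conditional flow matching with a linear path in latent space.

    z1: (batch, 64) latents of real LiDAR frames (from the model's encoder).
    For the path z_t = (1 - t) z0 + t z1 with z0 ~ N(0, I), the target
    velocity is z1 - z0; conditioning inputs are omitted for brevity.
    """
    z0 = torch.randn_like(z1)
    t = torch.rand(z1.shape[0], 1)
    zt = (1 - t) * z0 + t * z1
    pred = velocity_net(torch.cat([zt, t], dim=-1))
    return ((pred - (z1 - z0)) ** 2).mean()

loss = cfm_loss(torch.randn(8, 64))
loss.backward()
```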
Tru-POMDP: Task Planning Under Uncertainty via Tree of Hypotheses and Open-Ended POMDPs
Wenjing Tang · Xinyu He · Yongxi Huang · Yunxiao Xiao · Cewu Lu · Panpan Cai
Task planning under uncertainty is essential for home-service robots operating in the real world. Tasks involve ambiguous human instructions, hidden or unknown object locations, and open-vocabulary object types, leading to significant open-ended uncertainty and a boundlessly large planning space. To address these challenges, we propose Tru-POMDP, a planner that combines structured belief generation using Large Language Models (LLMs) with principled POMDP planning. Tru-POMDP introduces a hierarchical Tree of Hypotheses (TOH), which systematically queries an LLM to construct high-quality particle beliefs over possible world states and human goals. We further formulate an open-ended POMDP model that enables rigorous Bayesian belief tracking and efficient belief-space planning over these LLM-generated hypotheses. Experiments on complex object rearrangement tasks across diverse kitchen environments show that Tru-POMDP significantly outperforms state-of-the-art LLM-based and LLM-tree-search hybrid planners, achieving higher success rates with significantly better plans, stronger robustness to ambiguity and occlusion, and greater planning efficiency.
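The belief-tracking half of the pipeline follows the standard particle-filter pattern: LLM-generated hypotheses act as particles and observations reweight them by likelihood. A generic sketch (the likelihood model and hypothesis strings are invented for illustration; Tru-POMDP's TOH construction and planner are not shown):

```python
import numpy as np

def belief_update(particles, weights, likelihood, observation):
    """One Bayesian belief-tracking step over hypothesis particles.

    particles: hypothesized world states / goals (e.g., produced by the
               Tree of Hypotheses LLM queries).
    likelihood(obs, particle) -> P(obs | particle).
    """
    w = np.array([w_i * likelihood(observation, p)
                  for p, w_i in zip(particles, weights)])
    if w.sum() == 0:              # all hypotheses inconsistent: fall back to uniform
        w = np.ones(len(particles))
    return w / w.sum()

particles = ["mug_in_cabinet", "mug_on_counter", "mug_in_sink"]
weights = np.ones(3) / 3
lik = lambda obs, p: 0.9 if obs == p else 0.05
weights = belief_update(particles, weights, lik, "mug_on_counter")
```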
UniDomain: Pretraining a Unified PDDL Domain from Real-World Demonstrations for Generalizable Robot Task Planning
Haoming Ye · Yunxiao Xiao · Cewu Lu · Panpan Cai
Robotic task planning in real-world environments requires reasoning over implicit constraints from language and vision. While LLMs and VLMs offer strong priors, they struggle with long-horizon structure and symbolic grounding. Existing methods that combine LLMs with symbolic planning often rely on handcrafted or narrow domains, limiting generalization. We propose UniDomain, a framework that pre-trains a PDDL domain from robot manipulation demonstrations and applies it to online robotic task planning. It extracts atomic domains from 12,393 manipulation videos to form a unified domain with 3,137 operators, 2,875 predicates, and 16,481 causal edges. Given a target class of tasks, it retrieves relevant atomic domains from the unified domain and systematically fuses them into high-quality meta-domains for zero-shot planning. Experiments on diverse real-world tasks show that UniDomain solves complex, unseen tasks in a zero-shot manner, achieving up to 58% higher task success and a 160% improvement in plan optimality over state-of-the-art LLM and LLM-PDDL baselines.
YEAST: Yet Another Sequential Test
Alexey Kurennoy · Majed Dodin · Tural Gurbanov · Ana Peleteiro Ramallo
The online evaluation of machine learning models is typically conducted through A/B experiments. Sequential statistical tests are valuable tools for analysing these experiments, as they enable researchers to stop data collection early without increasing the risk of false discoveries. However, existing sequential tests either limit the number of interim analyses or suffer from low statistical power. In this paper, we introduce a novel sequential test designed for the continuous monitoring of A/B experiments. We validate our method using semi-synthetic simulations and demonstrate that it outperforms current state-of-the-art sequential testing approaches. Our method is derived using a new technique that inverts a bound on the probability of threshold crossing, based on a classical maximal inequality.
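The continuous-monitoring protocol itself is easy to sketch; what distinguishes YEAST is the threshold sequence, obtained by inverting a maximal-inequality bound on the probability of ever crossing it. In the sketch below, `boundary` is a placeholder for that derived sequence (the flat value shown does not control the false-positive rate and is for illustration only):

```python
import numpy as np

def monitor(stream_a, stream_b, boundary):
    """Continuously monitor an A/B metric difference and stop early.

    boundary(n) stands in for a threshold sequence that keeps the overall
    false-positive rate below alpha under continuous monitoring.
    """
    diffs = stream_a - stream_b
    for n in range(2, len(diffs) + 1):
        z = diffs[:n].mean() / (diffs[:n].std(ddof=1) / np.sqrt(n) + 1e-12)
        if abs(z) > boundary(n):
            return n, z          # stop: significant difference detected
    return None, z               # ran out of data without a rejection

rng = np.random.default_rng(1)
a = rng.normal(0.12, 1, 20000)   # treatment metric with a small true lift
b = rng.normal(0.00, 1, 20000)   # control metric
stop, z = monitor(a, b, boundary=lambda n: 4.0)  # flat boundary, illustration only
```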