

Poster Session

Mexico City Poster Session 3

Don Alberto 4
Thu 4 Dec 11 a.m. PST — 2 p.m. PST


Poster
1
Dynamic Diffusion Schrödinger Bridge in Astrophysical Observational Inversions

Ye Zhu · Duo Xu · Zhiwei Deng · Jonathan Tan · Olga Russakovsky

We study Diffusion Schrödinger Bridge (DSB) models in the context of dynamical astrophysical systems, specifically tackling observational inverse prediction tasks within Giant Molecular Clouds (GMCs) for star formation. We introduce the Astro-DSB model, a variant of DSB with the pairwise domain assumption tailored for astrophysical dynamics. By investigating its learning process and prediction performance in both physically simulated data and real observations (the Taurus B213 data), we present two main takeaways. First, from the astrophysical perspective, our proposed paired DSB method improves interpretability, learning efficiency, and prediction performance over conventional astrostatistical and other machine learning methods. Second, from the generative modeling perspective, probabilistic generative modeling reveals improvements over discriminative pixel-to-pixel modeling in Out-Of-Distribution (OOD) testing cases of physical simulations with unseen initial conditions and different dominant physical processes. Our study expands research into diffusion models beyond the traditional visual synthesis application and provides evidence of the models' learning abilities beyond pure data statistics, paving a path for future physics-aware generative models that can align dynamics between machine learning and real (astro)physical systems.


Spotlight Poster
2
Near-Optimal Experiment Design in Linear non-Gaussian Cyclic Models

Ehsan Sharifian · Saber Salehkaleybar · Negar Kiyavash

We study the problem of causal structure learning from a combination of observational and interventional data generated by a linear non-Gaussian structural equation model that might contain cycles. Recent results show that using mere observational data identifies the causal graph only up to a permutation-equivalence class. We obtain a combinatorial characterization of this class by showing that each equivalence class corresponds to a perfect matching in a bipartite graph. This bipartite representation allows us to analyze how interventions modify or constrain the matchings. Specifically, we show that each atomic intervention reveals one edge of the true matching and eliminates all incompatible causal graphs. Consequently, we formalize the optimal experiment design task as an adaptive stochastic optimization problem over the set of equivalence classes with a natural reward function that quantifies how many graphs are eliminated from the equivalence class by an intervention. We show that this reward function is adaptive submodular and provide a greedy policy with a provable near-optimal performance guarantee. A key technical challenge is to efficiently estimate the reward function without having to explicitly enumerate all the graphs in the equivalence class. We propose a sampling-based estimator using random matchings and analyze its bias and concentration behavior. Our simulation results show that performing a small number of interventions guided by our stochastic optimization framework recovers the true underlying causal structure.
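
The greedy policy described above is easy to sketch. The following schematic (our illustration; `estimate_reward` and `perform_intervention` are hypothetical placeholders, and the paper's sampling-based estimator over random matchings is abstracted away) shows the adaptive loop that the submodularity guarantee applies to:

```python
def greedy_experiment_design(candidates, estimate_reward, perform_intervention, budget):
    """Adaptive greedy policy: each round, run the intervention whose estimated
    reward (expected number of graphs eliminated from the class) is largest."""
    history = []
    for _ in range(budget):
        best = max(candidates, key=lambda v: estimate_reward(v, history))
        outcome = perform_intervention(best)   # reveals one edge of the true matching
        history.append((best, outcome))        # adaptivity: later choices see outcomes
        candidates.remove(best)
    return history
```

For adaptive submodular rewards, greedy policies of this form are known to be near-optimal (in the spirit of Golovin and Krause's $(1-1/e)$-style guarantees), which is the property the paper establishes for its elimination reward.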


Poster
3
Online Multi-Class Selection with Group Fairness Guarantee

Faraz Zargari · Hossein Jazi · Lyndon Hallett · Bo Sun · Xiaoqi Tan

We study the online multi-class selection problem with group fairness guarantees, where limited resources must be allocated to sequentially arriving agents. Our work addresses two key limitations in the existing literature. First, we introduce a novel lossless rounding scheme that ensures the integral algorithm achieves the same expected performance as any fractional solution. Second, we explicitly address the challenges introduced by agents who belong to multiple classes. To this end, we develop a randomized algorithm based on a relax-and-round framework. The algorithm first computes a fractional solution using a resource reservation approach---referred to as the set-aside mechanism---to enforce fairness across classes. The subsequent rounding step preserves these fairness guarantees without degrading performance. Additionally, we propose a learning-augmented variant that incorporates untrusted machine-learned predictions to better balance fairness and efficiency in practical settings.


Poster
4
Test-Time Adaptation of Vision-Language Models for Open-Vocabulary Semantic Segmentation

Mehrdad Noori · David OSOWIECHI · Gustavo Vargas Hakim · Ali Bahri · Moslem Yazdanpanah · Sahar Dastani · Farzad Beizaee · Ismail Ayed · Christian Desrosiers

Recently, test-time adaptation has attracted wide interest in the context of vision-language models for image classification. However, to the best of our knowledge, the problem is completely overlooked in dense prediction tasks such as Open-Vocabulary Semantic Segmentation (OVSS). In response, we propose a novel TTA method tailored to adapting VLMs for segmentation during test time. Unlike TTA methods for image classification, our Multi-Level and Multi-Prompt (MLMP) entropy minimization integrates features from intermediate vision-encoder layers and is performed with different text-prompt templates at both the global CLS token and local pixel-wise levels. Our approach can be used in a plug-and-play manner with any segmentation network, does not require additional training data or labels, and remains effective even with a single test sample. Furthermore, we introduce a comprehensive OVSS TTA benchmark suite, which integrates a rigorous evaluation protocol, nine segmentation datasets, 15 common synthetic corruptions, and additional real and rendered domain shifts, with a total of 87 distinct test scenarios, establishing a standardized and comprehensive testbed for future TTA research in open-vocabulary segmentation. Our experiments on this suite demonstrate that our segmentation-tailored method consistently delivers significant gains over direct adoption of TTA classification baselines. Code and data are available at https://github.com/dosowiechi/MLMP.
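
As a rough illustration of the objective described above, here is a minimal sketch (not the authors' implementation) of entropy minimization averaged over several prediction views, standing in for the multiple prompt templates and encoder levels of MLMP:

```python
import torch

def multi_view_entropy(logits_per_view):
    """Mean Shannon entropy over logits from several views
    (e.g., different prompt templates or encoder levels)."""
    losses = []
    for logits in logits_per_view:              # each: [..., n_classes]
        p = torch.softmax(logits, dim=-1)
        losses.append(-(p * torch.log(p + 1e-8)).sum(-1).mean())
    return torch.stack(losses).mean()           # minimized at test time
```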


Poster
5
When Kernels Multiply, Clusters Unify: Fusing Embeddings with the Kronecker Product

Youqi WU · Jingwei Zhang · Farzan Farnia

State-of-the-art embeddings often capture distinct yet complementary discriminative features: For instance, one image embedding model may excel at distinguishing fine-grained textures, while another focuses on object-level structure. Motivated by this observation, we propose a principled approach to fuse such complementary representations through kernel multiplication. Multiplying the kernel similarity functions of two embeddings allows their discriminative structures to interact, producing a fused representation whose kernel encodes the union of the clusters identified by each parent embedding. This formulation also provides a natural way to construct joint kernels for paired multi-modal data (e.g., image–text tuples), where the product of modality-specific kernels inherits structure from both domains. We highlight that this kernel product is mathematically realized via the Kronecker product of the embedding feature maps, yielding our proposed KrossFuse framework for embedding fusion. To address the computational cost of the resulting high-dimensional Kronecker space, we further develop RP-KrossFuse, a scalable variant that leverages random projections for efficient approximation. As a key application, we use this framework to bridge the performance gap between cross-modal embeddings (e.g., CLIP, BLIP) and unimodal experts (e.g., DINOv2, E5). Experiments show that RP-KrossFuse effectively integrates these models, enhancing modality-specific performance while preserving cross-modal alignment. The project code is available at https://github.com/yokiwuuu/KrossFuse.
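
The kernel-product identity at the heart of this construction is easy to verify numerically. A minimal NumPy check for linear kernels (our illustration; the framework applies the same identity to the embeddings' feature maps):

```python
import numpy as np

rng = np.random.default_rng(0)
phi_x, psi_x = rng.normal(size=4), rng.normal(size=3)  # two embeddings of sample x
phi_y, psi_y = rng.normal(size=4), rng.normal(size=3)  # two embeddings of sample y

# Product of the two kernel similarities ...
k_product = (phi_x @ phi_y) * (psi_x @ psi_y)
# ... equals one inner product in the Kronecker feature space.
k_kron = np.kron(phi_x, psi_x) @ np.kron(phi_y, psi_y)
assert np.isclose(k_product, k_kron)
```

RP-KrossFuse then replaces the explicit Kronecker features (dimension $d_1 d_2$) with random projections to keep the fused representation tractable.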


Spotlight Poster
6
StelLA: Subspace Learning in Low-rank Adaptation using Stiefel Manifold

Zhizhong Li · Sina Sajadmanesh · Jingtao Li · Lingjuan Lyu

Low-rank adaptation (LoRA) has been widely adopted as a parameter-efficient technique for fine-tuning large-scale pre-trained models. However, it still lags behind full fine-tuning in performance, partly due to its insufficient exploitation of the geometric structure underlying low-rank manifolds. In this paper, we propose a geometry-aware extension of LoRA that uses a three-factor decomposition $USV^\top$. Analogous to the structure of singular value decomposition (SVD), it separates the adapter's input and output subspaces, $V$ and $U$, from the scaling factor $S$. Our method constrains $U$ and $V$ to lie on the Stiefel manifold, ensuring their orthonormality throughout the training. To optimize on the Stiefel manifold, we employ a flexible and modular geometric optimization design that converts any Euclidean optimizer to a Riemannian one. It enables efficient subspace learning while remaining compatible with existing fine-tuning pipelines. Empirical results across a wide range of downstream tasks, including commonsense reasoning, math and code generation, image classification, and image generation, demonstrate the superior performance of our approach against the recent state-of-the-art variants of LoRA. Code is available at https://github.com/SonyResearch/stella.
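
For intuition, the standard recipe for converting a Euclidean update into a Riemannian one on the Stiefel manifold is tangent-space projection followed by a retraction. A minimal sketch of one such step (a textbook construction under our assumptions, not the paper's optimizer wrapper):

```python
import torch

def stiefel_step(U, grad, lr=1e-2):
    """One Riemannian SGD step keeping U (n x r) orthonormal.
    Projection: G - U sym(U^T G); retraction: QR decomposition."""
    UtG = U.T @ grad
    rgrad = grad - U @ ((UtG + UtG.T) / 2)       # tangent-space projection
    Q, R = torch.linalg.qr(U - lr * rgrad)       # retract back onto the manifold
    return Q * torch.sign(torch.diagonal(R))     # sign fix for a canonical QR

U, _ = torch.linalg.qr(torch.randn(64, 8))       # random orthonormal init
U = stiefel_step(U, torch.randn(64, 8))
assert torch.allclose(U.T @ U, torch.eye(8), atol=1e-4)   # still on the manifold
```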

Task-based functional magnetic resonance imaging (fMRI) provides invaluable insights into human cognition but faces critical hurdles—low signal-to-noise ratio, high dimensionality, limited sample sizes, and costly data acquisition—that are amplified when integrating datasets across subjects or sites. This paper introduces orthogonal contrastive learning (OCL), a unified multi-representation framework for multi-subject fMRI analysis that aligns neural responses without requiring temporal preprocessing or uniform time-series lengths across subjects or sites. OCL employs two identical encoders: an online network trained with a contrastive loss that pulls together same-stimulus responses and pushes apart different-stimulus responses, and a target network whose weights track the online network via exponential moving average to stabilize learning. Each OCL network layer combines QR decomposition for orthogonal feature extraction, locality-sensitive hashing (LSH) to produce compact subject-specific signatures, positional encoding to embed temporal structure alongside spatial features, and a transformer encoder to generate discriminative, stimulus-aligned embeddings. We further enhance OCL with an unsupervised pretraining stage on fMRI-like synthetic data and demonstrate a transfer-learning workflow for multi-site studies. Across extensive experiments on multi-subject and multi-site fMRI benchmarks, OCL consistently outperforms state-of-the-art alignment and analysis methods in both representation quality and downstream classification accuracy.


Poster
8
Hawaii: Hierarchical Visual Knowledge Transfer for Efficient Vision-Language Models

Yimu Wang · Mozhgan Nasr Azadani · Sean Sedwards · Krzysztof Czarnecki

Improving the visual understanding ability of vision-language models (VLMs) is crucial for enhancing their performance across various tasks. While using multiple pretrained visual experts has shown great promise, it often incurs significant computational costs during training and inference. To address this challenge, we propose HAWAII, a novel framework that distills knowledge from multiple visual experts into a single vision encoder, enabling it to inherit the complementary strengths of several experts with minimal computational overhead. To mitigate conflicts among different teachers and switch between different teacher-specific knowledge, instead of using a fixed set of adapters for multiple teachers, we propose to use teacher-specific Low-Rank Adaptation (LoRA) adapters with a corresponding router. Each adapter is aligned with a specific teacher, avoiding noisy guidance during distillation. To enable efficient knowledge distillation, we propose fine-grained and coarse-grained distillation. At the fine-grained level, token importance scores are employed to emphasize the most informative tokens from each teacher adaptively. At the coarse-grained level, we summarize the knowledge from multiple teachers and transfer it to the student using a set of general-knowledge LoRA adapters with a router. Extensive experiments on various vision-language tasks demonstrate the superiority of HAWAII, compared to the popular open-source VLMs.


Spotlight Poster
A machine learning approach that beats Rubik's cubes

Alexander Chervov · Kirill Khoruzhii · Nikita Bukhal · Jalal Naghiyev · Vladislav Zamkovoy · Ivan Koltsov · Lyudmila Cheldieva · Arsenii Sychev · Arsenii Lenin · Mark Obozov · Egor Urvanov · Alexey Romanov

The paper proposes a novel machine learning-based approach to the pathfinding problem on extremely large graphs. This method leverages diffusion distance estimation via a neural network and uses beam search for pathfinding. We demonstrate its efficiency by finding solutions for 4x4x4 and 5x5x5 Rubik's cubes with unprecedentedly short solution lengths, outperforming all available solvers and introducing the first machine learning solver beyond the 3x3x3 case. In particular, it surpasses every single case of the combined best results in the Kaggle Santa 2023 challenge, which involved over 1,000 teams. For the 3x3x3 Rubik's cube, our approach achieves an optimality rate exceeding 98%, matching the performance of task-specific solvers and significantly outperforming prior solutions such as DeepCubeA (60.3%) and EfficientCube (69.6%). Our solution in its current implementation is approximately 25.6 times faster in solving 3x3x3 Rubik's cubes while requiring up to 8.5 times less model training time than the most efficient state-of-the-art competitor. Finally, it is demonstrated that even a single agent trained using a relatively small number of examples can robustly solve a broad range of puzzles represented by Cayley graphs of size up to $10^{145}$, confirming the generality of the proposed method.
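
Schematically, the solver pairs a learned distance-to-solved estimate with beam search. A minimal sketch (our illustration; `neighbors` and `dist_net` are hypothetical callables, and a real implementation batches the network evaluations):

```python
import heapq

def beam_search(start, goal, neighbors, dist_net, width=1024, max_steps=100):
    """Keep the `width` states the network scores closest to solved; expand, repeat."""
    beam = [(dist_net(start), start, [])]
    for _ in range(max_steps):
        candidates = []
        for _, state, path in beam:
            for move, nxt in neighbors(state):
                if nxt == goal:
                    return path + [move]
                candidates.append((dist_net(nxt), nxt, path + [move]))
        beam = heapq.nsmallest(width, candidates, key=lambda c: c[0])
    return None   # no solution found within the step budget
```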


Poster
A unified framework for establishing the universal approximation of transformer-type architectures

Jingpu Cheng · Ting Lin · Zuowei Shen · Qianxiao Li

We investigate the universal approximation property (UAP) of transformer-type architectures, providing a unified theoretical framework that extends prior results on residual networks to models incorporating attention mechanisms. Our work identifies token distinguishability as a fundamental requirement for UAP and introduces a general sufficient condition that applies to a broad class of architectures. Leveraging an analyticity assumption on the attention layer, we can significantly simplify the verification of this condition, providing a non-constructive approach to establishing UAP for such architectures. We demonstrate the applicability of our framework by proving UAP for transformers with various attention mechanisms, including kernel-based and sparse ones. The corollaries of our results either generalize prior works or establish UAP for architectures not previously covered. Furthermore, our framework offers a principled foundation for designing novel transformer architectures with inherent UAP guarantees, including those with specific functional symmetries. We propose examples to illustrate these insights.


Poster
Beyond Pairwise Connections: Extracting High-Order Functional Brain Network Structures under Global Constraints

Ling Zhan · Junjie Huang · Xiaoyao Yu · Wenyu Chen · Tao Jia

Functional brain network (FBN) modeling often relies on local pairwise interactions, whose limitation in capturing high-order dependencies is theoretically analyzed in this paper. Meanwhile, the computational burden and heuristic nature of current hypergraph modeling approaches hinder end-to-end learning of FBN structures directly from data distributions. To address this, we propose to extract high-order FBN structures under global constraints, and implement this as a Global Constraints oriented Multi-resolution (GCM) FBN structure learning framework. It incorporates 4 types of global constraints (signal synchronization, subject identity, expected edge numbers, and data labels) to enable learning FBN structures at 4 distinct levels (sample/subject/group/project) of modeling resolution. Experimental results demonstrate that GCM achieves up to a 30.6% improvement in relative accuracy and a 96.3% reduction in computational time across 5 datasets and 2 task settings, compared to 9 baselines and 10 state-of-the-art methods. Extensive experiments validate the contributions of individual components and highlight the interpretability of GCM. This work offers a novel perspective on FBN structure learning and provides a foundation for interdisciplinary applications in cognitive neuroscience. Code is publicly available at https://github.com/lzhan94swu/GCM.

This paper investigates the computational complexity of reinforcement learning within a novel linear function approximation regime, termed partial $q^{\pi}$-realizability. In this framework, the objective is to learn an $\epsilon$-optimal policy with respect to a predefined policy set $\Pi$, under the assumption that all value functions corresponding to policies in $\Pi$ are linearly realizable. This framework adopts assumptions that are weaker than those in the $q^{\pi}$-realizability setting yet stronger than those in the $q^*$-realizability setup. As a result, it provides a more practical model for reinforcement learning scenarios where function approximation naturally arises. We prove that learning an $\epsilon$-optimal policy in this newly defined setting is computationally hard. More specifically, we establish NP-hardness under a parameterized greedy policy set (i.e., argmax) and, further, show that—unless NP = RP—an exponential lower bound (exponential in the feature vector dimension) holds when the policy set contains softmax policies, under the Randomized Exponential Time Hypothesis. Our hardness results mirror those obtained in the $q^*$-realizability setting, and suggest that computational difficulty persists even when the policy class $\Pi$ is expanded beyond the optimal policy, reinforcing the robustness of the computational hardness result for partial $q^{\pi}$-realizability under two important policy sets. To establish our negative result, our primary technical contribution is a reduction from two complexity problems, $\delta$-Max-3SAT and $\delta$-Max-3SAT($b$), to instances of our problem settings: GLinear-$\kappa$-RL (under the greedy policy set) and SLinear-$\kappa$-RL (under the softmax policy set), respectively. Our findings indicate that positive computational results are generally unattainable in the context of partial $q^{\pi}$-realizability, in sharp contrast to the $q^{\pi}$-realizability setting under a generative access model.


Poster
Doubly Robust Alignment for Large Language Models

Erhan Xu · Kai Ye · Hongyi Zhou · Luhan Zhu · Francesco Quinzan · Chengchun Shi

This paper studies reinforcement learning from human feedback (RLHF) for aligning large language models with human preferences. While RLHF has demonstrated promising results, many algorithms are highly sensitive to misspecifications in the underlying preference model (e.g., the Bradley-Terry model), the reference policy, or the reward function, resulting in undesirable fine-tuning. To address model misspecification, we propose a doubly robust preference optimization algorithm that remains consistent when either the preference model or the reference policy is correctly specified (without requiring both). Our proposal demonstrates superior and more robust performance than state-of-the-art algorithms, both in theory and in practice. The code is available at https://github.com/DRPO4LLM/DRPO4LLM.


Poster
ECO: Evolving Core Knowledge for Efficient Transfer

Fu Feng · Yucheng Xie · Ruixiao Shi · Jianlu Shen · Jingq Wang · Xin Geng

Knowledge in modern neural networks is often entangled and structurally opaque, making current transfer methods—typically based on reusing entire parameter sets—inefficient and inflexible. Efforts to improve flexibility by reusing partial parameters frequently depend on handcrafted heuristics or rigid structural assumptions, which constrain generalization. In contrast, biological evolution enables efficient knowledge transfer by encoding only essential information into genes through iterative refinement under environmental pressure. Inspired by this principle, we propose ECO, a framework that Evolves COre knowledge into modular, reusable neural components—termed learngenes—through similar evolutionary dynamics. To this end, we redefine learngenes as neural circuits and introduce Genetic Transfer Learning (GTL), a biologically inspired paradigm that establishes a genetic mechanism within neural networks in the context of supervised learning. GTL simulates evolutionary processes by generating diverse network populations, selecting high-performing individuals, and transferring their learngenes to subsequent generations. Through iterative refinement, GTL enables learngenes to accumulate transferable common knowledge. Extensive experiments show that ECO achieves efficient initialization and strong generalization across diverse models and tasks, while significantly reducing computational and memory costs compared to conventional methods.


Poster
Enhancing Privacy in Multimodal Federated Learning with Information Theory

Tianzhe Xiao · Yichen Li · Yining Qi · YI LIU · wangshi.ww · Haozhao Wang · Yi Wang · Ruixuan Li

Multimodal federated learning (MMFL) has gained increasing popularity due to its ability to leverage the correlation between various modalities while preserving data privacy for different clients. However, recent studies show that correlation between modalities increases the vulnerability of federated learning to Gradient Inversion Attacks (GIA). The privacy-preservation problem in MMFL is complicated by two factors: 1) different modalities transmit different amounts of information and thus require different protection strengths; 2) correlation between modalities should be taken into account. This paper introduces an information-theoretic perspective to analyze the privacy leaked in the MMFL process, and proposes a protection method, Sec-MMFL, that assesses each modality's information-leakage risk via conditional mutual information and adjusts the corresponding protection strength. Moreover, we use mutual information to reduce cross-modality information leakage in MMFL. Experiments show that our method provides more balanced and comprehensive protection at an acceptable cost.


Poster
Fair Matroid Selection

Kiarash Banihashem · MohammadTaghi Hajiaghayi · Danny Mittal

We investigate the problem of sequentially selecting elements of an unknown matroid in an online manner to form an independent set, with the goal of maximizing the minimum probability of acceptance across all elements, a property we define as $f$-fairness. Under adversarial arrival orders, we design an $\alpha(\ln(k)+1)$-fair algorithm, where $\alpha$ is the arboricity of the matroid and $k$ is the rank, a result that is nearly optimal. For laminar matroids, we develop a $(2\alpha-1)$-fair algorithm, which is optimal up to constant factors, achieved through a novel online coloring scheme. In the random arrival order setting, we achieve a $(4+o(1))\alpha$-fair algorithm for graphic matroids, matching the optimal result up to constant factors, relying on a novel technique for learning a degeneracy ordering using a sampled subset of edges. We further generalize our result to $p$-matchoids, obtaining a $\beta(p\ln k+1)$-fair algorithm for the adversarial arrival model, where $\beta$ is the optimal offline fairness. Notably, all our results can be extended to a setting with no prior knowledge of the matroid with only a logarithmic increase in the fairness factor.


Poster
Fast Rate Bounds for Multi-Task and Meta-Learning with Different Sample Sizes

Hossein Zakerinia · Christoph Lampert

We present new fast-rate PAC-Bayesian generalization bounds for multi-task and meta-learning in the unbalanced setting, i.e. when the tasks have training sets of different sizes, as is typically the case in real-world scenarios. Previously, only standard-rate bounds were known for this situation, while fast-rate bounds were limited to the setting where all training sets are of equal size. Our new bounds are numerically computable as well as interpretable, and we demonstrate their flexibility in handling a number of cases where they give stronger guarantees than previous bounds. Besides the bounds themselves, we also make conceptual contributions: we demonstrate that the unbalanced multi-task setting has different statistical properties than the balanced situation, specifically that proofs from the balanced situation do not carry over to the unbalanced setting. Additionally, we shed light on the fact that the unbalanced situation allows two meaningful definitions of multi-task risk, depending on whether all tasks should be considered equally important or if sample-rich tasks should receive more weight than sample-poor ones.
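
The two risk definitions mentioned at the end can be written out explicitly (notation ours): with $T$ tasks, task $i$ holding $n_i$ samples and incurring risk $R_i$,

$$\mathcal{R}_{\mathrm{equal}} = \frac{1}{T}\sum_{i=1}^{T} R_i \qquad \text{versus} \qquad \mathcal{R}_{\mathrm{weighted}} = \sum_{i=1}^{T} \frac{n_i}{\sum_{j=1}^{T} n_j}\, R_i,$$

and the two coincide exactly when all training sets have equal size, which is why the distinction only surfaces in the unbalanced setting.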


Poster
Global Prompt Refinement with Non-Interfering Attention Masking for One-Shot Federated Learning

Zhuang Qi · Yu Pan · Lei Meng · Sijin Zhou · Han Yu · Xiaoxiao Li · Xiangxu Meng

Federated Prompt Learning (FPL) enables communication-efficient adaptation by tuning lightweight prompts on top of frozen pre-trained models. Existing FPL methods typically rely on global information, which is only available after the second training round, to facilitate collaboration among client models. Therefore, they are inherently dependent on multi-round communication to fully exhibit their strengths. Moreover, existing one-shot federated learning methods typically focus on fitting seen tasks, but lack cross-task generalization. To bridge this gap, we propose the global prompt refinement with non-interfering attention masking (GPR-NIAM) method for one-shot FPL. The core idea is to design a masking mechanism that restricts excessive interaction between the original text embeddings and the learnable prompt embeddings. GPR-NIAM achieves this through the collaboration of two key modules. Firstly, the attention isolation module suppresses attention from the learnable prompt tokens to the original text tokens, and reweights the reverse attention which preserves generalization across tasks. Secondly, the cross-silo collaborative refinement module integrates decentralized visual knowledge into a unified base and calibrates the global prompt through multi-source cross-modal knowledge alignment, further mitigating the inconsistency caused by data heterogeneity. Extensive experiments conducted on ten benchmark datasets under two tasks show that GPR-NIAM outperforms eight state-of-the-art methods in both class-level and domain-level generalization.
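
The attention-isolation idea can be pictured as an additive mask that closes one direction of prompt-text interaction before the softmax. A schematic sketch (our simplification; the reweighting of the reverse direction and all other details follow the paper):

```python
import torch

def isolation_mask(n_text, n_prompt):
    """Additive attention mask: learnable prompt tokens cannot attend
    to the original text tokens; all other directions stay open."""
    n = n_text + n_prompt
    mask = torch.zeros(n, n)
    mask[n_text:, :n_text] = float("-inf")   # prompt -> text suppressed
    return mask                              # added to attention scores
```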

Embodied agents operating in smart homes must understand human behavior through diverse sensory inputs and communicate via natural language. While Vision-Language Models (VLMs) have enabled impressive language-grounded perception, their reliance on visual data limits robustness in real-world scenarios with occlusions, poor lighting, or privacy constraints. In this paper, we introduce HoloLLM, a Multimodal Large Language Model (MLLM) that integrates uncommon but powerful sensing modalities, such as LiDAR, infrared, mmWave radar, and WiFi, to enable seamless human perception and reasoning across heterogeneous environments. We address two key challenges: (1) the scarcity of aligned modality-text data for rare sensors, and (2) the heterogeneity of their physical signal representations. To overcome these, we design a Universal Modality-Injection Projector (UMIP) that enhances pre-aligned modality embeddings with fine-grained, text-aligned features from tailored encoders via coarse-to-fine cross-attention without introducing significant alignment overhead. We further introduce a human-VLM collaborative data curation pipeline to generate paired textual annotations for sensing datasets. Extensive experiments on two newly constructed benchmarks show that HoloLLM significantly outperforms existing MLLMs, improving language-grounded human sensing accuracy by up to 30\%. This work establishes a new foundation for real-world, language-informed multisensory embodied intelligence.


Poster
IMPACT: Irregular Multi-Patch Adversarial Composition Based on Two-Phase Optimization

Zenghui Yang · Xingquan Zuo · Hai Huang · Gang Chen · Xinchao Zhao · Tianle Zhang

Deep neural networks have become foundational in various applications but remain vulnerable to adversarial patch attacks. Crafting effective adversarial patches is inherently challenging due to the combinatorial complexity involved in jointly optimizing critical factors such as patch shape, location, number, and content. Existing approaches often simplify this optimization by addressing each factor independently, which limits their effectiveness. To tackle this significant challenge, we introduce a novel and flexible adversarial attack framework termed IMPACT (Irregular Multi-Patch Adversarial Composition based on Two-phase optimization). IMPACT uniquely enables comprehensive optimization of all essential patch factors using gradient-free methods. Specifically, we propose a novel dimensionality reduction encoding scheme that substantially lowers computational complexity while preserving expressive power. Leveraging this encoding, we further develop a two-phase optimization framework: phase 1 employs differential evolution for joint optimization of patch mask and content, while phase 2 refines patch content using an evolutionary strategy for enhanced precision. Additionally, we introduce a new aggregation algorithm explicitly designed to produce contiguous, irregular patches by merging localized regions, ensuring physical applicability. Extensive experiments demonstrate that our method significantly outperforms several state-of-the-art approaches, highlighting the critical benefit of jointly optimizing all patch factors in adversarial patch attacks.


Poster
Improving the Straight-Through Estimator with Zeroth-Order Information

Ningfeng Yang · Tor Aamodt

We study the problem of training neural networks with quantized parameters. Learning low-precision quantized parameters by enabling computation of gradients via the Straight-Through Estimator (STE) can be challenging. While the STE enables back-propagation, which is a first-order method, recent works have explored the use of zeroth-order (ZO) gradient descent for fine-tuning. We note that the STE provides high-quality biased gradients, and ZO gradients are unbiased but can be expensive. We thus propose First-Order-Guided Zeroth-Order Gradient Descent (FOGZO) that reduces STE bias while reducing computations relative to ZO methods. Empirically, we show FOGZO improves the tradeoff between quality and training time in Quantization-Aware Pre-Training. Specifically, versus STE at the same number of iterations, we show a 1-8% accuracy improvement for DeiT Tiny/Small, 1-2% accuracy improvement on ResNet 18/50, and 1-22 perplexity point improvement for LLaMA models with up to 0.3 billion parameters. For the same loss, FOGZO yields a 796$\times$ reduction in computation versus n-SPSA for a 2-layer MLP on MNIST. Code is available at [https://github.com/1733116199/fogzo](https://github.com/1733116199/fogzo).
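
For background, the two gradient sources being combined look as follows. This is a generic sketch of an STE quantizer and a one-sample SPSA estimator under our assumptions; how FOGZO actually blends them to debias the STE is the paper's contribution and is not reproduced here:

```python
import torch

def ste_quantize(w, scale=0.1):
    """Round-to-nearest quantizer; forward returns q, backward passes
    gradients straight through (the STE)."""
    q = torch.round(w / scale) * scale
    return w + (q - w).detach()

def spsa_grad(loss_fn, w, eps=1e-3):
    """One-sample SPSA: unbiased but noisy zeroth-order gradient estimate."""
    delta = torch.randint(0, 2, w.shape).float() * 2 - 1   # Rademacher direction
    g = (loss_fn(w + eps * delta) - loss_fn(w - eps * delta)) / (2 * eps)
    return g * delta
```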


Poster
Information Theoretic Learning for Diffusion Models with Warm Start

Yirong Shen · Lu GAN · Cong Ling

Generative models that maximize model likelihood have gained traction in many practical settings. Among them, perturbation-based approaches underpin many state-of-the-art likelihood estimation models, yet they often face slow convergence and limited theoretical understanding. In this paper, we derive a tighter likelihood bound for noise-driven models to improve both the accuracy and efficiency of maximum likelihood learning. Our key insight extends the classical Kullback–Leibler (KL) divergence–Fisher information relationship to arbitrary noise perturbations, going beyond the Gaussian assumption and enabling structured noise distributions. This formulation allows flexible use of randomized noise distributions that naturally account for sensor artifacts, quantization effects, and data distribution smoothing, while remaining compatible with standard diffusion training. Treating the diffusion process as a Gaussian channel, we further express the mismatched entropy between data and model, showing that the proposed objective upper-bounds the negative log-likelihood (NLL). In experiments, our models achieve competitive NLL on CIFAR-10 and state-of-the-art results on ImageNet across multiple resolutions, all without data augmentation, and the framework extends naturally to discrete data.
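
The classical Gaussian-case relationship the paper generalizes is the de Bruijn-type identity linking KL divergence along a Gaussian smoothing flow to relative Fisher information (stated here in its standard form; the extension to arbitrary noise perturbations is the paper's result): if $p_t$ and $q_t$ both evolve under the same heat flow $Y_t = X + \sqrt{t}\,Z$ with $Z \sim \mathcal{N}(0, I)$, then

$$\frac{\mathrm{d}}{\mathrm{d}t} D_{\mathrm{KL}}(p_t \,\|\, q_t) = -\frac{1}{2}\, \mathbb{E}_{p_t}\Big[\big\|\nabla \log p_t(Y_t) - \nabla \log q_t(Y_t)\big\|^2\Big],$$

so driving the relative Fisher information to zero tightens the likelihood bound.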


Spotlight Poster
InstructHOI: Context-Aware Instruction for Multi-Modal Reasoning in Human-Object Interaction Detection

Jinguo Luo · Weihong Ren · Quanlong Zheng · Yanhao Zhang · Zhenlong Yuan · Zhiyong Wang · Haonan Lu · Honghai LIU

Recently, Large Foundation Models (LFMs), e.g., CLIP and GPT, have significantly advanced Human-Object Interaction (HOI) detection, due to their superior generalization and transferability. Prior HOI detectors typically employ single- or multi-modal prompts to generate discriminative representations for HOIs from pretrained LFMs. However, such prompt-based approaches focus on transferring HOI-specific knowledge while leaving unexplored the reasoning capabilities of LFMs, which can provide informative context for ambiguous and open-world interaction recognition. In this paper, we propose InstructHOI, a novel method that leverages context-aware instructions to guide multi-modal reasoning for HOI detection. Specifically, to bridge the knowledge gap and enhance reasoning abilities, we first perform HOI-domain fine-tuning on a pretrained multi-modal LFM, using a generated dataset with 140K interaction-reasoning image-text pairs. Then, we develop a Context-aware Instruction Generator (CIG) to guide interaction reasoning. Unlike traditional language-only instructions, CIG first mines visual interactive context at the human-object level, which is then fused with linguistic instructions, forming multi-modal reasoning guidance. Furthermore, an Interest Token Selector (ITS) is adopted to adaptively filter image tokens based on context-aware instructions, thereby aligning the reasoning process with interaction regions. Extensive experiments on two public benchmarks demonstrate that our proposed method outperforms state-of-the-art ones, under both supervised and zero-shot settings.


Poster
Limitations of Normalization in Attention

Timur Mudarisov · Mikhail Burtsev · Tatiana Petrova · Radu State

This paper investigates the limitations of normalization in attention mechanisms. We begin with a theoretical framework that enables the identification of the model's selective ability and the geometric separation involved in token selection. Our analysis includes explicit bounds on distances and separation criteria for token vectors under softmax scaling. Through experiments with a pre-trained GPT-2 model, we empirically validate our theoretical results and analyze key behaviors of the attention mechanism. Notably, we demonstrate that as the number of selected tokens increases, the model's ability to distinguish informative tokens declines, often converging toward a uniform selection pattern. We also show that gradient sensitivity under softmax normalization presents challenges during training, especially at low temperature settings. These findings advance the current understanding of softmax-based attention mechanisms and motivate the need for more robust normalization and selection strategies in future attention architectures.
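
A toy experiment makes the dilution effect concrete (our illustration, not the paper's bounds): with bounded logits, the largest softmax weight shrinks and the normalized entropy approaches 1 (the uniform limit) as more tokens compete:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(1)
for n in (8, 64, 512, 4096):
    w = softmax(rng.normal(size=n))                        # bounded random logits
    print(n, w.max(), -(w * np.log(w)).sum() / np.log(n))  # top weight, normalized entropy
```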


Poster
Multi-Objective Hyperparameter Selection via Hypothesis Testing on Reliability Graphs

Amirmohammad Farzaneh · Osvaldo Simeone

The selection of hyperparameters, such as prompt templates in large language models (LLMs), must often strike a balance between reliability and cost. In many cases, structural relationships between the expected reliability levels of the hyperparameters can be inferred from prior information and held-out data -- e.g., longer prompt templates may be more detailed and thus more reliable. However, existing hyperparameter selection methods either do not provide formal reliability guarantees or are unable to incorporate structured knowledge in the hyperparameter space. This paper introduces reliability graph-based Pareto testing (RG-PT), a novel multi-objective hyperparameter selection framework that maintains formal reliability guarantees in terms of false discovery rate (FDR), while accounting for known relationships among hyperparameters via a directed acyclic graph. Edges in the graph reflect expected reliability and cost trade-offs among hyperparameters, which are inferred via the Bradley-Terry (BT) ranking model from prior information and held-out data. Experimental evaluations demonstrate that RG-PT significantly outperforms existing methods such as learn-then-test (LTT) and Pareto testing (PT) through a more efficient exploration of the hyperparameter space.


Poster
MUVR: A Multi-Modal Untrimmed Video Retrieval Benchmark with Multi-Level Visual Correspondence

Yue Feng · Jinwei Hu · Qijia Lu · Jiawei Niu · Li Tan · Shuo Yuan · Ziyi Yan · Yizhen Jia · Qingzhi He · Shiping Ge · Ethan Chen · Wentong Li · Limin Wang · Jie Qin

We propose the Multi-modal Untrimmed Video Retrieval task, along with a new benchmark (MUVR), to advance video retrieval for long-video platforms. MUVR aims to retrieve untrimmed videos containing relevant segments using multi-modal queries. It has the following features: 1) Practical retrieval paradigm: MUVR supports video-centric multi-modal queries, expressing fine-grained retrieval needs through long text descriptions, video tag prompts, and mask prompts. It adopts a one-to-many retrieval paradigm and focuses on untrimmed videos, tailored for long-video platform applications. 2) Multi-level visual correspondence: To cover common video categories (e.g., news, travel, dance) and precisely define retrieval matching criteria, we construct multi-level visual correspondence based on core video content (e.g., news events, travel locations, dance moves) which users are interested in and want to retrieve. It covers six levels: copy, event, scene, instance, action, and others. 3) Comprehensive evaluation criteria: We develop 3 versions of MUVR (i.e., Base, Filter, QA). MUVR-Base/Filter evaluates retrieval models, while MUVR-QA assesses MLLMs in a question-answering format. We also propose a Reranking Score to evaluate the reranking ability of MLLMs. MUVR consists of 53K untrimmed videos from the video platform Bilibili, with 1,050 multi-modal queries and 84K matches. Extensive evaluations of 3 state-of-the-art video retrieval models, 6 image-based VLMs, and 10 MLLMs are conducted. MUVR reveals the limitations of retrieval methods in processing untrimmed videos and multi-modal queries, as well as of MLLMs in multi-video understanding and reranking. Our code and benchmark are available at https://github.com/debby-0527/MUVR.


Poster
One-Step is Enough: Sparse Autoencoders for Text-to-Image Diffusion Models

Viacheslav Surkov · Chris Wendler · Antonio Mari · Mikhail Terekhov · Justin Deschenaux · Robert West · Caglar Gulcehre · David Bau

For large language models (LLMs), sparse autoencoders (SAEs) have been shown to decompose intermediate representations that are often not directly interpretable into sparse sums of interpretable features, facilitating better control and subsequent analysis. However, similar analyses and approaches have been lacking for text-to-image models. We investigate the possibility of using SAEs to learn interpretable features for SDXL Turbo, a few-step text-to-image diffusion model. To this end, we train SAEs on the updates performed by transformer blocks within SDXL Turbo's denoising U-net in its 1-step setting. Interestingly, we find that they generalize to 4-step SDXL Turbo and even to the multi-step SDXL base model (i.e., a different model) without additional training. In addition, we show that their learned features are interpretable, causally influence the generation process, and reveal specialization among the blocks. We do so by creating RIEBench, a representation-based image editing benchmark, for editing images while they are generated by turning on and off individual SAE features. This allows us to track which transformer blocks' features are the most impactful depending on the edit category. Our work is the first investigation of SAEs for interpretability in text-to-image diffusion models, and our results establish SAEs as a promising approach for understanding and manipulating the internal mechanisms of text-to-image models.
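
For readers new to SAEs, here is a minimal sketch of the kind of sparse autoencoder used in this line of work, trained to reconstruct a model's internal activations through a sparse bottleneck (schematic; the paper's exact architecture and sparsity mechanism may differ):

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """TopK-style SAE: reconstruct activations from a few active features."""
    def __init__(self, d_model, n_features, k=32):
        super().__init__()
        self.enc = nn.Linear(d_model, n_features)
        self.dec = nn.Linear(n_features, d_model)
        self.k = k

    def forward(self, x):
        a = torch.relu(self.enc(x))
        topk = torch.topk(a, self.k, dim=-1)          # keep k largest activations
        codes = torch.zeros_like(a).scatter_(-1, topk.indices, topk.values)
        return self.dec(codes), codes                 # reconstruction + sparse code
```

Editing then amounts to turning individual coordinates of `codes` on or off during generation, which is what RIEBench measures block by block.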

Direct alignment methods typically train large language models (LLMs) by contrasting the likelihoods of preferred and dispreferred responses. While effective for matching relative preferences, these methods have been widely observed to depress the absolute likelihoods of example responses. Consequently, aligned models often exhibit behaviors that deviate from expected patterns, resembling the well-known reward-hacking effect even in the absence of an explicit reward model. This phenomenon exposes a fundamental limitation of contrastive alignment, which we characterize as likelihood underdetermination. In this work, we revisit direct preference optimization (DPO)—the seminal direct alignment method—and show that its loss admits a principled decomposition. The resulting reformulation not only extends naturally to a broader range of feedback types, but also sheds light on the origin of likelihood underdetermination. In particular, we identify that the standard DPO implementation implicitly oversimplifies a regularizer in the reformulated loss, and restoring its full version effectively resolves the underdetermination. Building on these insights, we introduce PRoximalized PReference Optimization (PRO), a unified alignment method that handles diverse feedback types while eliminating likelihood underdetermination through an efficient approximation of the full regularizer. Empirical evaluations demonstrate the consistent superiority of PRO over existing methods across pairwise, binary, and scalar feedback.

Parameter-efficient fine-tuning (PEFT) methods have shown promise in adapting large language models, yet existing approaches exhibit counter-intuitive phenomena: integrating either matrix decomposition or mixture-of-experts (MoE) individually decreases performance across tasks, though decomposition improves results on specific domains despite reducing parameters, while MoE increases the parameter count without a corresponding decrease in training efficiency. Motivated by these observations and the modular nature of prompt tuning (PT), we propose PT-MoE, a novel framework that integrates matrix decomposition with MoE routing for efficient PT. Evaluation results across 17 datasets demonstrate that PT-MoE achieves state-of-the-art performance in both question answering (QA) and mathematical problem-solving tasks, improving F1 score by 1.49 points over PT and 2.13 points over LoRA in QA tasks, while improving mathematical accuracy by 10.75 points over PT and 0.44 points over LoRA, all while using 25% fewer parameters than LoRA. Our analysis reveals that while PT methods generally excel in QA tasks and LoRA-based methods in math datasets, the integration of matrix decomposition and MoE in PT-MoE yields complementary benefits: decomposition enables efficient parameter sharing across experts while MoE provides dynamic adaptation, collectively enabling PT-MoE to demonstrate cross-task consistency and generalization abilities. These findings, along with ablation studies on routing mechanisms and architectural components, provide insights for future PEFT methods.


Poster
Revealing Multimodal Causality with Large Language Models

Jin Li · Shoujin Wang · Qi Zhang · Feng Liu · Tongliang Liu · Longbing Cao · Shui Yu · Fang Chen

Uncovering cause-and-effect mechanisms from data is fundamental to scientific progress. While large language models (LLMs) show promise for enhancing causal discovery (CD) from unstructured data, their application to the increasingly prevalent multimodal setting remains a critical challenge. Even with the advent of multimodal LLMs (MLLMs), their efficacy in multimodal CD is hindered by two primary limitations: (1) difficulty in exploring intra- and inter-modal interactions for comprehensive causal variable identification; and (2) insufficiency to handle structural ambiguities with purely observational data. To address these challenges, we propose MLLM-CD, a novel framework for multimodal causal discovery from unstructured data. It consists of three key components: (1) a novel contrastive factor discovery module to identify genuine multimodal factors based on the interactions explored from contrastive sample pairs; (2) a statistical causal structure discovery module to infer causal relationships among discovered factors; and (3) an iterative multimodal counterfactual reasoning module to refine the discovery outcomes iteratively by incorporating the world knowledge and reasoning capabilities of MLLMs. Extensive experiments on both synthetic and real-world datasets demonstrate the effectiveness of the proposed MLLM-CD in revealing genuine factors and causal relationships among them from multimodal unstructured data. The implementation code and data are available at https://github.com/JinLi-i/MLLM-CD.


Poster
Shapley-Based Data Valuation for Weighted $k$-Nearest Neighbors

Guangyi Zhang · Qiyu Liu · Aristides Gionis

Data valuation quantifies the impact of individual data points on model performance, and Shapley values provide a principled approach to this important task due to their desirable axiomatic properties, albeit with high computational complexity. Recent breakthroughs have enabled fast computation of exact Shapley values for unweighted $k$-nearest neighbor ($k$NN) classifiers. However, extending this to weighted $k$NN models has remained a significant open challenge. The state-of-the-art methods either require quadratic time complexity or resort to approximation via sampling. In this paper, we show that a conceptually simple but overlooked approach --- data duplication --- can be applied to this problem, yielding a natural variant of weighted $k$NN-Shapley. However, a straightforward application of the data-duplication idea leads to increased data size and prohibitive computational and memory costs. We develop an efficient algorithm that avoids materializing the duplicated dataset by exploiting the structural properties of weighted $k$NN models, reducing the complexity to near-linear time in the original data size. Besides, we establish theoretical foundations for this approach through axiomatic characterization of the resulting values, and empirically validate the effectiveness and efficiency of our method.
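
For context, the unweighted recursion that the duplication reduction builds on can be written compactly. A sketch following the well-known exact $k$NN-Shapley formula of Jia et al. (our code; the paper's contribution is evaluating the duplicated instance without materializing it):

```python
import numpy as np

def knn_shapley_unweighted(dist, y_train, y_test, K):
    """Exact Shapley value of each training point for an unweighted KNN
    classifier, for one validation point, via the O(n log n) recursion."""
    n = len(y_train)
    order = np.argsort(dist)                      # nearest first
    match = (y_train[order] == y_test).astype(float)
    s = np.zeros(n)
    s[n - 1] = match[n - 1] / n
    for i in range(n - 2, -1, -1):                # i = 0-based rank
        s[i] = s[i + 1] + (match[i] - match[i + 1]) / K * min(K, i + 1) / (i + 1)
    return s[np.argsort(order)]                   # back to original indexing

# Duplication idea: with integer weights w_i, replicating point i exactly w_i
# times and running the recursion yields a weighted-kNN variant of the value.
```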


Poster
Soft-consensual Federated Learning for Data Heterogeneity via Multiple Paths

Sheng Huang · Lele Fu · Fanghua Ye · Tianchi Liao · Bowen Deng · zhangchuanfu · Chuan Chen

Federated learning enables collaborative training while preserving the privacy of all participants. However, the heterogeneity of data distributions across training nodes poses significant challenges to the construction of federated models. Prior studies sought to mitigate the effects of data heterogeneity by using global information as a blueprint and restricting local model updates to reach a "hard consensus". This practice makes it difficult to balance local and global information, and it overlooks the possibility of an amicable negotiation between local and global models toward mutually agreeable results, which we call a "soft consensus". In this paper, a multiple-path solving method is proposed to balance global and local features and combine these two feature-preference paths to reach a soft consensus. Rather than relying on global information as the sole criterion, a negotiation process is employed to address the same objective by accommodating diverse feature preferences, thereby facilitating the discovery of a more plausible solution through multiple distinct pathways. Considering the overwhelming power of local features during local training, a swapping strategy is applied to weaken them and balance the solution paths. Moreover, to minimize the additional communication cost caused by the introduction of multiple paths, the solution of the task network is converted into data adaptation to reduce the amount of parameter transmission. Extensive experiments are conducted to demonstrate the advantages of the proposed method.


Poster
SSRB: Direct Natural Language Querying to Massive Heterogeneous Semi-Structured Data

Xin Zhang · Mingxin Li · Yanzhao Zhang · Dingkun Long · Yongqi Li · Yinghui Li · Pengjun Xie · Meishan Zhang · Wenjie Li · Min Zhang · Philip S Yu

Searching over semi-structured data with natural language (NL) queries has attracted sustained attention, enabling broader audiences to access information easily. As more applications, such as LLM agents and RAG systems, emerge to search and interact with semi-structured data, two major challenges have become evident: (1) the increasing diversity of domains and schema variations, which makes domain-customized solutions prohibitively costly; (2) the growing complexity of NL queries, which combine both exact field-matching conditions and fuzzy semantic requirements, often involving multiple fields and implicit reasoning. These challenges make formal-language querying or keyword-based search insufficient. In this work, we explore neural retrievers as a unified non-formal querying solution that directly indexes semi-structured collections and understands NL queries. We employ LLM-based automatic evaluation and build a large-scale semi-structured retrieval benchmark (SSRB) using LLM generation and filtering, containing 14M semi-structured objects from 99 different schemas across 6 domains, along with 8,485 test queries that combine both exact and fuzzy matching conditions. Our systematic evaluation of popular retrievers shows that current state-of-the-art models achieve acceptable performance yet still lack a precise understanding of matching constraints, while in-domain training of dense retrievers improves performance significantly. We believe that SSRB can serve as a valuable resource for future research in this area, and we hope it inspires further exploration of semi-structured retrieval with complex queries.


Poster
Statistical inference for Linear Stochastic Approximation with Markovian Noise

Sergey Samsonov · Marina Sheshukova · Eric Moulines · Alexey Naumov

In this paper we derive non-asymptotic Berry–Esseen bounds for Polyak–Ruppert averaged iterates of the Linear Stochastic Approximation (LSA) algorithm driven by Markovian noise. Our analysis yields $O(n^{-1/4})$ convergence rates to the Gaussian limit in the Kolmogorov distance. We further establish the non-asymptotic validity of a multiplier block bootstrap procedure for constructing confidence intervals, guaranteeing consistent inference under Markovian sampling. Our work provides the first non-asymptotic guarantees on the rate of convergence of bootstrap-based confidence intervals for stochastic approximation with Markov noise. Moreover, we recover the classical rate of order $O(n^{-1/8})$, up to logarithmic factors, for estimating the asymptotic variance of the iterates of the LSA algorithm.
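
For orientation, in standard LSA notation (ours) the iterates and their Polyak–Ruppert average are

$$\theta_{k+1} = \theta_k - \gamma\left(\mathbf{A}(Z_{k+1})\,\theta_k - \mathbf{b}(Z_{k+1})\right), \qquad \bar{\theta}_n = \frac{1}{n}\sum_{k=1}^{n}\theta_k,$$

where $(Z_k)$ is the underlying Markov chain and $\theta^\star$ solves $\bar{\mathbf{A}}\theta^\star = \bar{\mathbf{b}}$ for the stationary means $\bar{\mathbf{A}}, \bar{\mathbf{b}}$; the Berry–Esseen result bounds the Kolmogorov distance between the law of $\sqrt{n}(\bar{\theta}_n - \theta^\star)$ and its Gaussian limit at the stated $O(n^{-1/4})$ rate.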


Poster
The Atlas of In-Context Learning: How Attention Heads Shape In-Context Retrieval Augmentation

Patrick Kahardipraja · Reduan Achtibat · Thomas Wiegand · Wojciech Samek · Sebastian Lapuschkin

Large language models are able to exploit in-context learning to access external knowledge beyond their training data through retrieval-augmentation. While promising, its inner workings remain unclear. In this work, we shed light on the mechanism of in-context retrieval augmentation for question answering by viewing a prompt as a composition of informational components. We propose an attribution-based method to identify specialized attention heads, revealing in-context heads that comprehend instructions and retrieve relevant contextual information, and parametric heads that store entities' relational knowledge. To better understand their roles, we extract function vectors and modify their attention weights to show how they can influence the answer generation process. Finally, we leverage the gained insights to trace the sources of knowledge used during inference, paving the way towards more safe and transparent language models.


Poster
Toward a Unified Geometry Understanding: Riemannian Diffusion Framework for Graph Generation and Prediction

Yisen Gao · Xingcheng Fu · Qingyun Sun · Jianxin Li · Xianxian LI

Graph diffusion models have made significant progress in learning structured graph data and have demonstrated strong potential for predictive tasks. Existing approaches typically embed node, edge, and graph-level features into a unified latent space, modeling prediction tasks including classification and regression as a form of conditional generation. However, due to the non-Euclidean nature of graph data, features of different curvatures are entangled in the same latent space without releasing their geometric potential. To address this issue, we aim to construct an ideal Riemannian diffusion model to capture the distinct manifold signatures of complex graph data and learn their distribution. This goal faces two challenges: numerical instability caused by exponential mapping during the encoding process and manifold deviation during diffusion generation. To address these challenges, we propose GeoMancer: a novel Riemannian graph diffusion framework for both generation and prediction tasks. To mitigate numerical instability, we replace exponential mapping with an isometric-invariant Riemannian gyrokernel approach and decouple multi-level features onto their respective task-specific manifolds to learn optimal representations. To address manifold deviation, we introduce a manifold-constrained diffusion method and a self-guided strategy for unconditional generation, ensuring that the generated data remains aligned with the manifold signature. Extensive experiments validate the effectiveness of our approach, demonstrating superior performance across a variety of tasks.


Poster
Train with Perturbation, Infer after Merging: A Two-Stage Framework for Continual Learning

Haomiao Qiu · Miao Zhang · Ziyue Qiao · Liqiang Nie

Continual Learning (CL) aims to enable models to continuously acquire new knowledge from a sequence of tasks while avoiding forgetting previously learned information. However, existing CL methods only rely on the parameters of the most recent task for inference, which makes them susceptible to catastrophic forgetting. Inspired by the recent success of model merging techniques, we propose Perturb-and-Merge (P\&M), a novel continual learning framework that integrates model merging into the CL paradigm to mitigate forgetting. Specifically, after training on each task, P\&M constructs a new model by forming a convex combination of the previous model and the newly trained task-specific model. Through theoretical analysis, we minimize the total loss increase across all tasks and derive a closed-form solution for the merging coefficient under mild assumptions. To further improve the performance of the merged model, we observe that the degradation introduced during merging can be alleviated by a regularization term composed of the task vector and the Hessian matrix of the loss function. Interestingly, we show that this term can be efficiently approximated using second-order symmetric finite differences, and a stochastic perturbation strategy along the task vector direction is accordingly devised, which incurs no additional forward or backward passes while providing an effective approximation of the regularization term. Finally, we combine P\&M with LoRA, a parameter-efficient fine-tuning method, to reduce memory overhead. Our proposed approach achieves state-of-the-art performance on several continual learning benchmark datasets. The code is available at \url{https://github.com/qhmiao/P-M-for-Continual-Learning}.
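
The merging step itself is a one-liner per parameter. A minimal sketch (schematic; the closed-form merging coefficient and the perturbation-based regularization are the paper's contributions and are not reproduced here):

```python
import torch

@torch.no_grad()
def convex_merge(theta_prev, theta_new, lam):
    """P&M-style merge of two state dicts: lam * prev + (1 - lam) * new."""
    return {k: lam * theta_prev[k] + (1.0 - lam) * theta_new[k] for k in theta_new}

# usage: model.load_state_dict(convex_merge(prev_sd, new_sd, lam=0.4))
```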


Poster
Trust Region Reward Optimization and Proximal Inverse Reward Optimization Algorithm

Yang Chen · Menglin Zou · Jiaqi Zhang · Yitan Zhang · Junyi Yang · Gaël Gendron · Libo Zhang · Jiamou Liu · Michael Witbrock

Inverse Reinforcement Learning (IRL) learns a reward function to explain expert demonstrations. Modern IRL methods often use the adversarial (minimax) formulation that alternates between reward and policy optimization, which often leads to unstable training. Recent non-adversarial IRL approaches improve stability by jointly learning reward and policy via energy-based formulations but lack formal guarantees. This work bridges this gap. We first present a unified view showing that canonical non-adversarial methods explicitly or implicitly maximize the likelihood of expert behavior, which is equivalent to minimizing the expected return gap. This insight leads to our main contribution: Trust Region Reward Optimization (TRRO), a framework that guarantees monotonic improvement in this likelihood via a Minorization-Maximization process. We instantiate TRRO into Proximal Inverse Reward Optimization (PIRO), a practical and stable IRL algorithm. Theoretically, TRRO provides the IRL counterpart to the stability guarantees of Trust Region Policy Optimization (TRPO) in forward RL. Empirically, PIRO matches or surpasses state-of-the-art baselines in reward recovery and policy imitation with high sample efficiency on MuJoCo and Gym-Robotics benchmarks and a real-world animal behavior modeling task.


Poster
Unifying and Enhancing Graph Transformers via a Hierarchical Mask Framework

Yujie Xing · Xiao Wang · Bin Wu · Hai Huang · Chuan Shi

Graph Transformers (GTs) have emerged as a powerful paradigm for graph representation learning due to their ability to model diverse node interactions. However, existing GTs often rely on intricate architectural designs tailored to specific interactions, limiting their flexibility. To address this, we propose a unified hierarchical mask framework that reveals an underlying equivalence between model architecture and attention mask construction. This framework enables a consistent modeling paradigm by capturing diverse interactions through carefully designed attention masks. Theoretical analysis under this framework demonstrates that the probability of correct classification positively correlates with the receptive field size and label consistency, leading to a fundamental design principle: an effective attention mask should ensure both a sufficiently large receptive field and a high level of label consistency. While no single existing mask satisfies this principle across all scenarios, our analysis reveals that hierarchical masks offer complementary strengths—motivating their effective integration. We then introduce M$^3$Dphormer, a Mixture-of-Experts based Graph Transformer with Multi-Level Masking and Dual Attention Computation. M$^3$Dphormer incorporates three theoretically grounded hierarchical masks and employs a bi-level expert routing mechanism to adaptively integrate multi-level interaction information. To ensure scalability, we further introduce a dual attention computation scheme that dynamically switches between dense and sparse modes based on local mask sparsity. Extensive experiments across multiple benchmarks demonstrate that M$^3$Dphormer achieves state-of-the-art performance, validating the effectiveness of our unified framework and model design.


Poster
V2V: Scaling Event-Based Vision through Efficient Video-to-Voxel Simulation

Hanyue Lou · Jinxiu Liang · Minggui Teng · Yi Wang · Boxin Shi

Event-based cameras offer unique advantages such as high temporal resolution, high dynamic range, and low power consumption. However, the massive storage requirements and I/O burdens of existing synthetic data generation pipelines and the scarcity of real data prevent event-based training datasets from scaling up, limiting the development and generalization capabilities of event vision models. To address this challenge, we introduce Video-to-Voxel (V2V), an approach that directly converts conventional video frames into event-based voxel grid representations, bypassing the storage-intensive event stream generation entirely. V2V enables a 150× reduction in storage requirements while supporting on-the-fly parameter randomization for enhanced model robustness. Leveraging this efficiency, we train several video reconstruction and optical flow estimation model architectures on 10,000 diverse videos totaling 52 hours—an order of magnitude larger than existing event datasets, yielding substantial improvements.
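
To make the idea concrete, here is one plausible reading of a video-to-voxel conversion as a NumPy sketch (our illustration only; the thresholds, bin assignment, and randomized parameters of the actual pipeline follow the paper):

```python
import numpy as np

def frames_to_voxels(frames, bins=5, thresh=0.1):
    """Convert a (T, H, W) video into a temporal voxel grid of brightness
    changes, skipping the intermediate event stream entirely (schematic)."""
    logs = np.log(frames.astype(np.float32) + 1e-3)   # log intensity, as in event cameras
    diffs = np.diff(logs, axis=0)                     # per-frame brightness change
    vox = np.zeros((bins, *diffs.shape[1:]), np.float32)
    for t in range(diffs.shape[0]):
        b = min(t * bins // diffs.shape[0], bins - 1) # temporal bin for this frame
        vox[b] += np.clip(diffs[t] / thresh, -5, 5)   # polarity-weighted accumulation
    return vox
```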