Constrained Optimization for Machine Learning
As AI systems are increasingly deployed in safety-critical domains—including credit scoring, medical diagnosis, and autonomous systems—there is a growing demand to ensure their fairness, safety, robustness, and interpretability, alongside stronger calls for regulation. Constrained optimization offers an accountable framework for enforcing these requirements by embedding them directly into the training process, steering models to satisfy explicit constraints. This framework facilitates compliance with regulatory, industry, or ethical standards, which can be easily verified by checking constraint satisfaction.
This workshop explores constrained optimization as a principled method for enforcing desirable properties in machine learning models. It brings together experts in optimization, machine learning, and trustworthy AI to address the algorithmic and practical challenges of scaling constrained methods to modern deep learning settings, which are often large-scale, non-convex, and stochastic.
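As a purely illustrative sketch (not a method prescribed by the workshop), the snippet below shows one common way constraints enter training: gradient descent-ascent on a Lagrangian. The placeholders `loss_fn` and `constraint_fn` are hypothetical; the latter returns a violation measure g(theta) that should be driven to at most zero.

```python
# Minimal sketch, assuming a hypothetical loss_fn and constraint_fn:
# gradient descent-ascent on L(theta, lambda) = loss + lambda * g(theta), g(theta) <= 0.
import torch

def constrained_training_step(model, batch, loss_fn, constraint_fn,
                              lmbda, opt_theta, opt_lambda):
    # Primal step: minimize loss + lambda * violation w.r.t. the model parameters.
    opt_theta.zero_grad()
    loss = loss_fn(model, batch)
    violation = constraint_fn(model, batch)   # g(theta); feasible when <= 0
    (loss + lmbda.detach() * violation).backward()
    opt_theta.step()

    # Dual step: gradient ascent on lambda (its gradient is the violation itself).
    opt_lambda.zero_grad()
    (-lmbda * violation.detach()).backward()
    opt_lambda.step()
    with torch.no_grad():
        lmbda.clamp_(min=0.0)                 # the multiplier stays non-negative
    return loss.item(), violation.item()

# Typical (hypothetical) setup:
# lmbda = torch.tensor(0.0, requires_grad=True)
# opt_theta  = torch.optim.SGD(model.parameters(), lr=1e-2)
# opt_lambda = torch.optim.SGD([lmbda], lr=1e-2)
```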
Data on the Brain and Mind
Evaluating the Evolving LLM Lifecycle: Benchmarks, Emergent Abilities, and Scaling
2nd Workshop on Multi-modal Foundation Models and Large Language Models for Life Sciences
Foundations of Reasoning in Language Models
Our workshop’s goal is to advance foundational understanding, principled innovations, and rigorous scientific evaluations for reasoning in language models. These advancements are built upon theoretical analyses and controlled empirical studies that illuminate how reasoning emerges, where it fails, and how it can be systematically improved.
We want to foster dialogue between communities with complementary strengths (those building theoretical models of reasoning phenomena, those designing experiments that reveal its emergence or failure in practice, and those proposing algorithmic developments that advance reasoning) around three primary questions:
1. How are language models able to solve complex tasks, and what do they still struggle with?
2. What fundamental challenges stand in the way of advancing reasoning capabilities?
3. What algorithmic innovations can overcome these obstacles?
LAW 2025: Bridging Language, Agent, and World Models for Reasoning and Planning
Workshop on Mechanistic Interpretability
Non-Euclidean Foundation Models and Geometric Learning: Advancing AI Beyond Euclidean Frameworks
In the era of foundation models and Large Language Models (LLMs), Euclidean space is the de facto geometric setting of our machine learning architectures. However, recent literature has demonstrated that this choice comes with fundamental limitations. Non-Euclidean learning is quickly gaining traction. Non-Euclidean spaces, such as hyperbolic, spherical, and mixed-curvature spaces, have been shown to provide more efficient and effective representations for data with intrinsic geometric properties, like hierarchy, symmetry, and heterogeneity.
Integrating foundation models with non-Euclidean spaces has great potential to enhance their ability to capture and model the underlying structures and relationships in complex real-world data, leading to better performance, generalization, and interpretability. This workshop focuses on the intersection of Non-Euclidean representation learning and Foundation Models, exploring its potential benefits, challenges, and future directions.
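For a concrete, self-contained illustration of one such space (not taken from any workshop material), the sketch below computes distances in the Poincaré ball model of hyperbolic space; points near the boundary are far apart hyperbolically even when close in Euclidean terms, which is what makes the geometry well suited to hierarchical data.

```python
# Minimal sketch: distance in the Poincare ball model of hyperbolic space. Illustrative only.
import numpy as np

def poincare_distance(x, y, eps=1e-9):
    """Hyperbolic distance between points x, y with ||x||, ||y|| < 1."""
    sq_norm_x = np.sum(x * x)
    sq_norm_y = np.sum(y * y)
    sq_dist = np.sum((x - y) ** 2)
    denom = max((1.0 - sq_norm_x) * (1.0 - sq_norm_y), eps)
    return np.arccosh(1.0 + 2.0 * sq_dist / denom)

# A point near the boundary is hyperbolically far from one near the origin,
# even though their Euclidean distance is small.
x = np.array([0.0, 0.1])
y = np.array([0.0, 0.95])
print(poincare_distance(x, y))
```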
Symmetry and Geometry in Neural Representations
The fields of biological and artificial intelligence are increasingly converging on a shared principle: the geometry and topology of real-world structure play a central role in building efficient, robust, and interpretable representations. In neuroscience, mounting evidence suggests that neural circuits encode task and environmental structure through low-dimensional manifolds, conserved symmetries, and structured transformations. In deep learning, principles such as sparsity, equivariance, and compositionality are guiding the development of more generalizable and interpretable models, including new approaches to foundation model distillation. The NeurReps workshop brings these threads together, fostering dialogue among machine learning researchers, neuroscientists, and mathematicians to uncover unifying geometric principles of neural representation. Just as geometry and symmetry once unified the models of 20th-century physics, we believe they may now illuminate the computational foundations of intelligence.
New Perspectives in Graph Machine Learning
Frontiers in Probabilistic Inference: Learning meets Sampling
Learning to Sense (L2S)
The workshop explores the joint optimization of sensors and machine learning models, pushing beyond traditional paradigms of data acquisition and processing. We aim to rethink the foundations of how machines sense the world by replacing hand-crafted image signal processors (ISPs), leveraging learnable sensor layouts, and adopting task-driven sensing strategies.
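As a deliberately toy, hypothetical illustration of such joint sensor-model optimization (all names, shapes, and data below are placeholders, not workshop code), the sketch treats a per-channel sensor gain as a learnable parameter and updates it with the same task loss as a small downstream classifier:

```python
# Minimal sketch: a learnable "sensor" parameter trained end-to-end with a tiny classifier.
import torch
import torch.nn as nn

class LearnableSensor(nn.Module):
    def __init__(self, channels=3):
        super().__init__()
        self.gain = nn.Parameter(torch.ones(channels))   # stand-in for a tunable sensor response

    def forward(self, raw):                              # raw: (batch, channels, H, W)
        return torch.clamp(raw * self.gain.view(1, -1, 1, 1), 0.0, 1.0)

sensor = LearnableSensor()
model = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
                      nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(8, 10))
opt = torch.optim.Adam(list(sensor.parameters()) + list(model.parameters()), lr=1e-3)

raw = torch.rand(4, 3, 32, 32)                # fake RAW frames
labels = torch.randint(0, 10, (4,))
loss = nn.functional.cross_entropy(model(sensor(raw)), labels)
opt.zero_grad()
loss.backward()
opt.step()                                    # gradients flow into the sensor parameters too
```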
We welcome original contributions and position papers on the following topics (non-exhaustive):
Sensor optimization, e.g., for computer vision (bit-depth, pixel layouts, color filter design)
RAW-to-task or RAW-to-label approaches for visual tasks
Co-design of neural networks and sensor hardware
Low-bit and energy-efficient sensing for embedded or mobile devices
Benchmarks, datasets, and metrics for evaluating sensor-model pipelines
Generalization and robustness of sensor-model systems in real-world conditions
Failure case studies and negative results in joint optimization pipelines
Join us to engage with cutting-edge research and cross-disciplinary discussions that are shaping the future of sensor systems for real-world deployment across mobile, embedded, and autonomous platforms.
Ariel Data Challenge 2025: Methods to Extract Planetary Signals for the Ariel Space Telescope
This workshop showcases winning approaches from the 2025 Ariel Data Challenge, a Kaggle competition tackling a notoriously difficult signal processing problem: extracting extremely faint exoplanet signatures from complex, non-linear noise in spatiotemporal data. The 2024 challenge drew thousands of competitors worldwide, yet no solution achieved the mission's stringent performance thresholds. The 2025 competition raised the bar with higher-fidelity simulations that closely mirror real observational conditions from the Ariel Space Telescope.
Winners will present novel architectures and algorithms for two core problems: advanced denoising in the presence of structured noise and robust uncertainty quantification under extreme signal-to-noise ratios. These solutions emerged from a realistic constraint environment where both accuracy and calibrated confidence estimates are mission critical.
While framed within an astronomy context, the technical challenges are broadly applicable. Whether you are a researcher working in this domain or simply curious about ML applications in space science, come and join us!
Early Training Scientific Knowledge and Reasoning Evaluation of Small Language Models
Existing benchmarks have proven effective for assessing the performance of fully trained large language models. However, we find striking differences in the early training stages of small models, where benchmarks often fail to provide meaningful or discriminative signals. To explore how these differences arise, this competition tackles the challenge of designing scientific knowledge evaluation tasks specifically tailored for measuring early training progress of language models. Participants are invited to develop novel evaluation methodologies or adapt existing benchmarks to better capture performance differences among language models. To support this effort, we provide three pre-trained small models (0.5B, 1B, and 3B parameters), along with intermediate checkpoints sampled during training up to 200B tokens. All experiments and development work can be run on widely available free cloud-based GPU platforms, making participation accessible to researchers with limited computational resources. Submissions will be evaluated based on three criteria: the quality of the performance signal they produce, the consistency of model rankings at 1 trillion tokens of training, and their relevance to the scientific knowledge domain. By promoting the design of tailored evaluation strategies for early training, this competition aims to attract a broad range of participants from various disciplines, including those who may not be machine learning experts or have access to dedicated GPU resources. Ultimately, this initiative seeks to make foundational LLM research more systematic and benchmark-informed from the earliest phases of model development.
The PokéAgent Challenge: Competitive and Long-Context Learning at Scale
While frontier AI models excel at language understanding, math reasoning, and code generation, they underperform in out-of-distribution generalization, adaptation to strategic opponents, game-theoretic decision-making, and long-context reasoning and planning. To address these gaps, we introduce the PokéAgent Challenge, leveraging Pokémon's rich multi-agent battle system and expansive role-playing game (RPG) environment. The competition features two complementary tracks: the Battling Track evaluates generalization and strategic reasoning under uncertainty in the two-player game of Competitive Pokémon, while the Speedrunning Track targets long-horizon planning and decision-making in the Pokémon RPG. Together, our competition tracks unify recent interests in reinforcement learning (RL) and large language model (LLM) research, encouraging collaboration across communities. Pokémon's popularity and internet presence are a key strength of our competition: Participants will have access to a large dataset of over 3.5 million battles and a knowledge base of reference materials and baseline methods. Recent work led by our competition's organizers will provide varied baselines, including rule-based, RL, and LLM-based agents. Our resources will make the PokéAgent challenge accessible while maintaining the complexity needed to drive fundamental advances in decision-making systems.
Artificial Intelligence for Music: Where Creativity Meets Computation
This workshop explores the dynamic intersection of AI and music, a rapidly evolving field where creativity meets computation. The goal of this workshop is twofold: First, we aim to explore the latest advancements in AI's applications for music, from analysis, creation, performance, production, and retrieval to music education and therapy. Second, we aim to discuss the impacts and implications of AI in music, including its effects on the music industry, the musician community, and music education, as well as the ethical, legal, and societal implications of AI music and what AI means for future musicians.
AI for Science: The Reach and Limits of AI for Scientific Discovery
Through our proposed AI for Science workshop, we will bring together experimentalists, domain scientists, and ML researchers to discuss the reach and limits of AI for scientific discovery. We will center our discussion on three challenges that are essential to progress across scientific domains. (1) LLM reasoning across scientific domains: can present-day LLMs generate rigorously testable hypotheses and reason over experimental results that span scientific domains such as physics, chemistry, and biology? (2) Fidelity of generative and surrogate simulators: in biology, we see a shift towards all-atom models with increasingly powerful capabilities; in chemistry, machine learning force fields are increasing in accuracy and generalizability; and in climate modeling, we can now accurately predict weather 15 days out. How far can we push this limit? What spatial or temporal scales remain intractable? (3) Experimental data scarcity and bias: we see modern examples of large-scale dataset generation such as the Protein Data Bank, Human Cell Atlas, and the Materials Project. Are there other fields where AI can benefit most from consortium efforts to generate large-scale datasets? How far can models trained on limited experimental datasets take us, and where are lab-in-the-loop strategies essential? To address this, we additionally introduce a dataset proposal competition. Our workshop will highlight common bottlenecks in developing AI methods across scientific application domains and delve into solutions that can unlock progress across all of these domains.
AI That Keeps Up: Workshop on Continual and Compatible Foundation Model Updates (CCFM)
Foundation models, despite their impressive capabilities, face a critical challenge: they naturally become outdated. Because they are trained on vast datasets, frequently updating these models is expensive. Crucially, these challenges extend beyond the scope of studies in traditional continual learning, as foundation models require rapid and scalable adaptation to dynamic global changes and the emergence of both generalized and specialized tasks. This workshop addresses the urgent need for up-to-date foundation models. We invite researchers to explore cost-effective methods for frequent updates and adaptation, minimizing forgetting and deterioration, ensuring a consistent user experience, and designing dynamic evaluations that remain relevant as models evolve.
Regulatable ML: Towards Bridging the Gaps between Machine Learning Research and Regulations
GPU-Accelerated and Scalable Optimization (ScaleOpt)
Recent advancements in GPU-based large-scale optimization have been remarkable. Recognizing the revolution in optimizing neural network weights via large-scale GPU-accelerated algorithms, the optimization community has been interested in developing general-purpose GPU-accelerated optimizers for various families of classic optimization problems, including linear programming, general conic optimization, combinatorial optimization, and more specific problem families such as flow optimization and optimal transport. Beyond applying GPUs directly to classical problems, current frontier AI tools—including large language models (LLMs)—are being deployed to solve optimization problems. Various works have used neural networks to solve mixed-integer problems, linear or quadratic programs, general combinatorial optimization problems, and more specific optimization problems such as LASSO and robust PCA. In this workshop, we aim to provide a platform for interested researchers to engage with each other on recent breakthroughs and current bottlenecks in designing large-scale GPU-based optimizers and synergizing AI systems with solving optimization problems.
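For readers less familiar with the classical side, here is a minimal hedged sketch of one such problem: plain ISTA (proximal gradient) for the LASSO, run on a GPU simply by keeping the tensors there via PyTorch. Problem sizes and the regularization weight are arbitrary placeholders, and this is not a solver presented at the workshop.

```python
# Minimal sketch: ISTA (proximal gradient) for the LASSO on a GPU via PyTorch. Illustrative only.
import torch

def ista_lasso(A, b, lam, n_iters=500):
    """Minimize 0.5*||Ax - b||^2 + lam*||x||_1 with proximal gradient descent."""
    x = torch.zeros(A.shape[1], device=A.device)
    step = 1.0 / torch.linalg.matrix_norm(A, ord=2) ** 2   # 1/L with L = ||A||_2^2
    for _ in range(n_iters):
        grad = A.T @ (A @ x - b)                            # gradient of the smooth part
        z = x - step * grad
        x = torch.sign(z) * torch.clamp(z.abs() - step * lam, min=0.0)  # soft-thresholding
    return x

device = "cuda" if torch.cuda.is_available() else "cpu"
A = torch.randn(2000, 5000, device=device)
b = torch.randn(2000, device=device)
x_hat = ista_lasso(A, b, lam=0.1)
```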
Workshop on Scaling Environments for Agents
The development of intelligent agents – particularly those powered by large language models (LLMs) – has emphasized the critical role of environments in shaping agent behavior and capabilities, especially for achieving end-to-end autonomy. Environments are not merely testing grounds; they are dynamic, interactive contexts that serve as the essential "data" for agents to learn adaptive behavior, complex reasoning, and long-term decision-making skills. Just as scaling the model size, dataset size, and training computation has led to emergent capabilities in LLMs, scaling the structure, fidelity, and diversity of environments is one of the crucial dimensions in advancing agent intelligence. Moreover, recent advances in end-to-end reinforcement learning (RL), particularly when paired with LLM-based agents, have made it increasingly viable to train agents through sustained interaction. These agents can now acquire skills, strategies, and planning abilities through environmental feedback, rather than relying solely on imitation learning or static prompt engineering. As we move toward more autonomous, general-purpose agents, the need for scalable, richly interactive, and diverse environments has become both urgent and foundational.
Learning from Time-Series for Health
Time-series data underpin modern healthcare, spanning electronic health records, physiological waveforms, wearables, and population trends, yet their unique characteristics—including uncertain ground truth, quasi-periodic physiological motifs, and non-semantic timepoints—demand specialized machine learning approaches. While recent advances in foundation models, multimodal learning, and generative methods show promise, significant challenges remain in causality, interpretability, and deployment. This workshop unites researchers across health time-series domains (from wearables to clinical systems) to address shared challenges through: (1) cross-domain discussion, (2) diverse industry/academic perspectives (featuring Google, Oura, Apple and 5 institutions), and (3) community engagement via posters, talks, and panels. By fostering cross-domain collaboration on physiological-aware methods, we aim to bridge the gap between cutting-edge ML and real-world healthcare impact.
UrbanAI: Harnessing Artificial Intelligence for Smart Cities
As the world population becomes increasingly urban, incredible innovation is required to sustain such dense populations in a safe and healthy manner. Urban areas are responsible for over 70% of global carbon emissions and energy consumption, driven by infrastructure and transportation systems that often remain inefficient or outdated despite advances in technology. AI and machine learning present immense opportunities to reshape urban environments, optimizing everything from energy use and transportation networks to public health and governance, addressing urgent challenges in areas as diverse as building optimization and sustainability, pollution mitigation, infrastructure maintenance, urban planning, traffic management, and civic life. However, applying AI solutions to these challenges is not easy. Engineers must deal with a wide range of obstacles, including complex and non-standardized management systems, diverse and non-integrated data sources, security and privacy risks, and continuous integration and maintenance. This workshop aims to engage the machine learning community in addressing urban optimization challenges, overcoming real-world obstacles that prevent adoption of solutions, and fostering interdisciplinary collaboration among experts in infrastructure, transportation, and public health. By leveraging cutting-edge ML methodologies, participants can develop scalable solutions to enhance efficiency, sustainability, and quality of life in urban areas. Confirmed speakers from fields including power systems, building optimization, transportation, water, energy systems, climate science, and building control will lead discussions on establishing benchmarks, developing robust methodologies, and creating solutions with measurable real-world impact. This workshop offers the ML community a unique platform to directly contribute to sustainable and intelligent urban development, helping cities globally meet climate goals, improve public services, and enhance overall urban resilience. Unlike existing ML-focused workshops on climate and physical systems, UrbanAI explicitly addresses the multifaceted challenges of urban environments, bringing international experts together for the first time at a major ML conference.
What Can('t) Transformers Do?
With most advances in large foundation models (LFMs) being empirical, our theoretical understanding of what transformers can compute, express, and learn still lags behind. This workshop will convene theorists and empiricists to chart a rigorous agenda for the next generation of LFMs, asking “What can and can’t transformers do?” We welcome both formal analyses and empirically grounded studies that shed light on theoretical questions, aiming to close the gap between proofs and practice while fostering new, interdisciplinary collaborations.
NeurIPS 2025 Workshop on Space in Vision, Language, and Embodied AI
Recent Advances in Time Series Foundation Models: Have We Reached the ‘BERT Moment’?
Foundation models (FMs) have achieved great success in NLP and vision, inspiring over 20 new time series FMs (TSFMs) in the past year. Despite promising results, studies show that carefully designed lightweight supervised baselines often match TSFM performance. Unlike NLP’s “BERT Moment,” TSFMs still require full fine-tuning to be competitive in real-world scenarios. Additionally, some tabular FMs rival TSFMs without being time series-specific. Recent benchmarks also provide mixed evidence: GIFT-Eval favors TSFMs, OpenTS shows statistical models outperforming deep learning on univariate data, and FoundTS finds supervised baselines on par with TSFMs. This workshop aims to bring together researchers to examine the gap between TSFM potential and real-world utility, and to identify benchmarks and applications where TSFMs can truly excel.
The key topics of this workshop include, but are not limited to:
- Benchmarking Foundation Models in Time Series,
- Scaling Laws and Efficiency in Time Series Models,
- Evaluating Transferability and Adaptability of Foundation Models,
- Leveraging Foundation Models of Other Modalities for Time Series,
- Unsupervised Performance Estimation of TSFMs,
- Industrial Benchmarking of Time Series Foundation Models
More details are provided in our Call for Papers.
The MindGames Challenge: Theory-of-Mind and Game Intelligence in LLM Agents
Recent breakthroughs in large language models have revolutionized natural language processing and spawned new classes of multi-agent AI systems. Yet essential gaps remain in such systems' abilities to model beliefs, detect deception, coordinate effectively under uncertainty, and plan in longer-term dynamic environments, collectively known as “theory-of-mind” capacities. The MindGames Challenge seeks to address these gaps by testing and advancing the cooperative intelligence of LLM agents across multiple distinct social-deduction and coordination tasks. Participants will develop agents that (i) communicate via natural language, (ii) reason about hidden states and competing objectives, and (iii) dynamically adapt strategies in repeated and iterative interactions.
Multimodal Algorithmic Reasoning Workshop
Large AI frameworks have been advancing in their data modeling abilities with ever greater vigor in recent times, with compelling applications emerging frequently, many of which may even appear to challenge human intelligence. Yet despite such impressive performance, there remain open questions about whether these models include the foundations of general intelligence, or whether they perform these tasks without human-like understanding. This necessitates the development of better tools for assessing these models in tandem with developing the models themselves. This workshop focuses on the topic of multimodal algorithmic reasoning, where an agent needs to assimilate information from multiple modalities towards deriving reasoning algorithms for complex problem solving. In the last year, we have seen rapid advances in AI capabilities that better bridge across modalities, bringing both optimism about superhuman capabilities and skepticism about the limits of current approaches. Through talks from outstanding researchers and faculty, we hope to dive deep into this exciting topic at the intersection of theory, multimodal learning and cognitive science to understand what we have achieved thus far in machine intelligence and what we are lacking in relation to the human way of thinking, towards finding the missing rungs on the ladder to truly intelligent reasoning.
Tackling Climate Change with Machine Learning
Many in the ML community wish to take action on climate change, but are unsure how to have the most impact. This workshop will highlight work that demonstrates that, while ML is no silver bullet, it can be an invaluable tool in reducing greenhouse gas emissions and in helping society adapt to the effects of climate change.
Climate change is a complex problem for which action takes many forms, from advancing theory to deploying new technology. Many of these actions represent high-impact opportunities for real-world change, and simultaneously pose interesting academic research problems.
The theme of this workshop, “Roots to Routes: A Dialogue on Different Machine Learning Methods for Climate Impact,” invites submissions that explore the strengths of diverse machine learning approaches in climate-related contexts. We particularly encourage work that demonstrates the effectiveness of classical ML methods under real-world constraints, such as limited data availability, privacy concerns, or restricted computational resources. At the same time, we welcome contributions that showcase how scaling up data and computing resources combined with modern tools and techniques can unlock new possibilities for tackling global-scale climate prediction challenges.
This workshop is part of a series that aims to bring together those applying ML to climate change challenges and facilitate cross-pollination between ML researchers and experts in climate-relevant fields.
The main workshop will take place on December 6 or 7, 2025 (exact date TBD).
CogInterp: Interpreting Cognition in Deep Learning Models
Recent innovations in deep learning have produced models with impressive capabilities, achieving or even exceeding human performance in a wide range of domains. A timely and critical challenge in AI is understanding what behaviors these models are actually capable of, and the internal processes which support these behaviors. As interest continues to grow in models’ internal processes, the field of cognitive science is becoming increasingly useful for describing and understanding cognition in deep learning models: cognitive science, which seeks to describe the cognitive processes in human and animal minds, offers a rich body of theories, experiments, and frameworks which may be adopted to understand how deep learning models achieve complex behaviors in domains such as language, vision, and reasoning.
The workshop will focus on Cognitive Interpretability (“CogInterp”), which involves the systematic interpretation of high-level cognition in deep learning models. Similar to how cognitive science describes the intermediate representations and algorithms (or cognition) between behavior and neurons in biological systems, the goal of Cognitive Interpretability is to describe the cognitive processes which lie between the levels of behavioral evaluations and mechanistic interpretability in deep learning models. Practically speaking, this means that Cognitive Interpretability does not just ask whether a model can perform task X or has a certain ability Y, but additionally (or instead) how a model performs X or learns and implements Y. These kinds of inferences—from observable behavior to latent “mental” processes—are the bread and butter of cognitive science, but many of the theoretical and empirical tools developed to tackle these problems have not yet been widely adopted in AI research, in part because of the separation between the fields and communities.
To address the gap above, our goal is to bring together researchers in cognitive science and AI interpretability to discuss new empirical results and theories about the inner workings of deep learning models. We hope to gather perspectives from various disciplines, including machine learning, psychology, linguistics, vision science, neuroscience, philosophy of mind, and law.
Foundation Models for Embodied Agents
This challenge invites participants to enhance Large Language Models (LLMs) for embodied reasoning through our standardized Embodied Agent Interface evaluation protocol. The framework systematically evaluates critical embodied reasoning capabilities: goal interpretation (understanding objectives and grounding to environment states), subgoal decomposition (breaking down complex goals), action sequencing (planning action sequences), and transition/world modeling (modeling world state changes through actions). Despite growing interest in using LLMs for robotics and agent planning, current evaluations lack standardization and fail to pinpoint fine-grained reasoning failures. Our challenge builds a unified evaluation approach to task formulation, input/output structures, and evaluation metrics by utilizing the well-established BEHAVIOR and VirtualHome simulators, enhanced with detailed annotations including Linear Temporal Logic (LTL) goal specifications and comprehensive error analysis. Unlike typical evaluations that just report an overall success rate, leaving us in the dark about which specific abilities LLMs struggle with, our framework uses fine-grained metrics that examine both whether proposed actions could actually work in practice and if they truly accomplish the intended goals. Our evaluation system breaks down each reasoning component into separate modules, giving us a clear picture of exactly where and how these models succeed or fail. Baseline results from state-of-the-art LLMs reveal significant performance gaps and motivate further innovation. The competition aims to advance the understanding of how LLMs reason in embodied environments, promote the development of robust and interpretable LLM AI agents, and foster collaboration between the language modeling and robotics communities. More info at https://neurips25-eai.github.io/.
Mouse vs. AI: A Neuroethological Benchmark for Visual Robustness
Visual robustness under real-world conditions remains a critical bottleneck for modern reinforcement learning agents. In contrast, biological systems like mice display remarkable resilience to environmental changes—maintaining stable performance even under degraded or perturbed visual input with minimal exposure. Inspired by this gap, we introduce a novel Bio-Inspired Visual Robustness Benchmark for testing generalization in reinforcement learning agents trained to navigate a virtual environment toward a visually cued target. Participants train agents to perform a visually guided foraging task in a naturalistic 3D Unity environment and are evaluated on their ability to generalize to unseen, ecologically realistic visual perturbations, having been exposed during training only to a single illustrative example: fog. What sets this challenge apart is its biological grounding: real mice performed the same task, and participants receive both behavioral performance data and large-scale neural recordings (19,000+ neurons across visual cortex) for benchmarking. The competition features two tracks: (1) Robustness, assessing generalization across held-out perturbations; and (2) Neural Alignment, evaluating how well agents' internal representations predict mouse visual cortical activity via a linear readout. We provide the full Unity environment, a fog-perturbed training condition for validation, baseline PPO agents, and a rich multimodal dataset. Track 2 offers the first competition framework for testing whether task-trained agents spontaneously develop brain-like representations—assessed by their ability to predict neural activity recorded from mice during the same behavior. By bridging reinforcement learning, computer vision, and neuroscience through shared, behaviorally grounded tasks, this challenge advances the development of robust, generalizable, and biologically inspired AI.
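As a hedged sketch of what a linear-readout alignment score can look like in general (array names, shapes, and the exact metric below are assumptions, not the competition's official scoring code), ridge regression maps agent activations to recorded responses and is scored by per-neuron correlation on held-out data:

```python
# Minimal sketch of a linear readout from agent features to neural activity. Placeholder data.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
agent_features = rng.normal(size=(5000, 256))    # (timesteps, agent hidden units)
neural_activity = rng.normal(size=(5000, 1000))  # (timesteps, recorded neurons)

X_tr, X_te, Y_tr, Y_te = train_test_split(agent_features, neural_activity,
                                          test_size=0.2, random_state=0)
readout = Ridge(alpha=1.0).fit(X_tr, Y_tr)       # one linear map per neuron
pred = readout.predict(X_te)

# Per-neuron Pearson correlation between predicted and held-out activity.
pred_c = pred - pred.mean(0)
true_c = Y_te - Y_te.mean(0)
corr = (pred_c * true_c).sum(0) / (
    np.linalg.norm(pred_c, axis=0) * np.linalg.norm(true_c, axis=0) + 1e-12)
print("median per-neuron correlation:", np.median(corr))
```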
SLC-PFM: Self-supervised Learning for Cancer Pathology Foundation Models
The emergence of foundation models has revolutionized artificial intelligence (AI) across various applications (Bommasani et al., 2021), with recent advances in computational pathology, for example UNI (Chen et al., 2024), Virchow (Vorontsov et al., 2024), and GigaPath (Xu et al., 2024), demonstrating potential for improving diagnostic capabilities and patient outcomes. The proposed Competition on Self-supervised Learning for Cancer Pathology Foundation Models (SLC-PFM) provides an unprecedented platform for advancing the development of the next generation of pathology foundation models. Central to this competition is MSK-SLCPFM, the largest pathology dataset to date for purposes of a competition, comprising over 300 million images spanning 39 cancer types that will be provided to participants for pre-training their models with self-supervised learning techniques. The competition follows a two-phase structure: foundation model development followed by evaluation across 23 clinically relevant downstream tasks including biomarker prediction, cancer subtyping, and survival prediction. The competition is designed to be inclusive for a diverse audience of machine learning and AI practitioners, computer scientists, engineers, bioinformaticians, and specialists from related disciplines, regardless of their background in pathology or medical image processing. By eliminating the barrier of domain-specific data curation, the competition enables participants to focus on technical innovation. The key highlights of the proposed competition are: Comprehensive Pre-training Data, with access to the largest pathology dataset (300M images) enabling foundation model training at scale; Robust Validation Framework, with multi-institutional evaluation across diverse clinically relevant pathology tasks; and Focus on Technical Innovation, allowing participants to focus on novel architectures and learning approaches without the burden of data curation.
FAIR Universe – handling uncertainties and distribution shifts for precision cosmology
We propose a challenge organised in conjunction with the FAIR Universe project, a collaborative effort funded by the US Department of Energy and involving the Lawrence Berkeley National Laboratory, Université Paris-Saclay, University of Washington, and ChaLearn. This initiative aims to forge an open AI ecosystem for scientific discovery. The challenge will focus on measuring the fundamental properties of the universe from weak gravitational lensing datasets with imperfect simulators and potential distribution shifts. Additionally, the challenge will leverage a large-compute-scale AI platform for sharing datasets, training models, and hosting machine learning competitions. Our challenge will bring together the physics and machine learning communities to advance our understanding and methodologies in handling systematic (otherwise known as epistemic) uncertainties and distribution shifts within AI techniques.
Weather4cast 2025 – Multi-task Challenges for Weather & Pollution Pattern Prediction on the Road to Hi-Res Foundation Models
The competition will advance modern algorithms in AI and machine learning through a highly topical interdisciplinary competition challenge: The prediction of hi-res rain radar movies from multi-band satellite sensors requires data fusion of complementary signal sources, multi-channel video frame prediction, as well as super-resolution techniques. To reward models that extract relevant mechanistic patterns reflecting the underlying complex weather systems, our evaluation incorporates spatio-temporal shifts: Specifically, algorithms need to forecast several hours of ground-based hi-res precipitation radar from lo-res satellite spectral images in a unique cross-sensor prediction challenge. Models are evaluated within and across regions on Earth with diverse climate and different distributions of heavy precipitation events. Conversely, robustness over time is achieved by testing predictions on data one year after the training period. Now, in its fourth year, Weather4cast moves to improve forecasts world-wide on an expansive data set with over an order of magnitude more hi-res rain radar data, allowing a move towards Foundation Models through multi-modality, multi-scale, multi-task challenges. Accurate rain predictions are becoming ever more critical for everyone, with climate change increasing the frequency of extreme precipitation events. Notably, the new models and insights will have a particular impact for the many regions on Earth where costly weather radar data are not available. As a complementary application-specific forecasting endpoint, in 2025, for the first time, we add a pollution forecasting challenge. Join us on https://www.weather4cast.net!
The 2025 Google Code Golf Championship
The Abstraction and Reasoning Corpus remains one of the most interesting and challenging benchmarks available for tracking progress toward achieving Artificial General Intelligence. In contrast to other evaluation datasets designed to capture the extent of an agent's skill or knowledge, the ARC-AGI suite is instead targeted at measuring skill acquisition, a trait that has (so far) evaded even the most sophisticated machine learning systems. A key limitation to date has been the relatively limited number of reference programs designed to transform image grids corresponding to example pairs in the original benchmark suite. In order to enrich the space of available solutions, the 2025 Google Code Golf Championship will present contestants with those same tasks, encouraging them to produce, for each task, a Python program capable of exhibiting the desired behavior. Not only must these programs be functionally correct, but (as an added twist) they should also be as minimal as possible. A set of concise programs emphasizing robustness and simplicity could potentially serve as canonical reference solutions for this seminal dataset and, once open-sourced to the broader research community, might contribute toward the development of more versatile AI systems.
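As a toy, hypothetical illustration of the format (not an actual competition task), a grid-transformation program and a shorter "golfed" equivalent might look like this:

```python
# Toy illustration only: an ARC-style grid transformation written plainly and then golfed.

def transform(grid):
    """Reflect a rectangular grid of ints left-to-right."""
    return [row[::-1] for row in grid]

# A golfed equivalent, as a single short lambda:
p = lambda g: [r[::-1] for r in g]

example = [[0, 1, 2],
           [3, 4, 5]]
assert transform(example) == p(example) == [[2, 1, 0], [5, 4, 3]]
```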
The 2025 PNPL Competition: Speech Detection and Phoneme Classification in the LibriBrain Dataset
The advance of speech decoding from non-invasive brain data holds the potential for profound societal impact. Among its most promising applications is the restoration of communication to paralysed individuals affected by speech deficits such as dysarthria, without the need for high-risk surgical interventions. The ultimate aim of the 2025 PNPL competition is to produce the conditions for an “ImageNet moment” or breakthrough in non-invasive neural decoding, by harnessing the collective power of the machine learning community. To facilitate this vision we present the largest within-subject MEG dataset recorded to date (LibriBrain) together with a user-friendly Python library (pnpl) for easy data access and integration with deep learning frameworks. For the competition we define two foundational tasks (Speech Detection and Phoneme Classification from brain data), complete with standardised data splits and evaluation metrics, illustrative benchmark models, online tutorial code, a community discussion board, and public leaderboard for submissions. To promote accessibility and participation the competition features a Standard track that emphasises algorithmic innovation, as well as an Extended track that is expected to reward larger-scale computing, accelerating progress toward a non-invasive brain-computer interface for speech.
NeurIPS 2025 Competition: MMU-RAGent: Massive Multi-Modal User-Centric Retrieval Augmented Generation Benchmark
We introduce the first competition to evaluate RAG systems on real-user queries and feedback, leverage web-scale corpora, and support both text and video generation. Participants develop systems that respond to real-user queries, which are curated from MS MARCO Web Search and Chatbot Arena Conversations, or collected live via our RAG-Arena platform. To support retrieval at scale, we provide API access to the English subset of ClueWeb22-B and ClueWeb22-A (87M and 800M documents), along with AWS-hosted infrastructure to facilitate system deployment on our RAG-Arena platform. Systems are evaluated using a combination of human Likert-scale ratings, live preference judgments via RAG-Arena, LLM-as-a-Judge, and automatic metrics. To support flexibility in system design, we accept submissions that leverage proprietary search APIs or models, alongside open-source approaches. Participants are encouraged to clearly document system components, and separate leaderboard categories ensure fair and transparent comparison across open- and closed-source systems. By focusing on user needs, large-scale retrieval, and multimodal generation, this competition aims to push academic RAG research toward more scalable and user-aligned settings.