Constrained Optimization for Machine Learning
As AI systems are increasingly deployed in safety-critical domains—including credit scoring, medical diagnosis, and autonomous systems—there is a growing demand to ensure their fairness, safety, robustness, and interpretability, alongside stronger calls for regulation. Constrained optimization offers an accountable framework for enforcing these requirements by embedding them directly into the training process, steering models to satisfy explicit constraints. This framework facilitates compliance with regulatory, industry, or ethical standards, which can be easily verified by checking constraint satisfaction.
This workshop explores constrained optimization as a principled method for enforcing desirable properties in machine learning models. It brings together experts in optimization, machine learning, and trustworthy AI to address the algorithmic and practical challenges of scaling constrained methods to modern deep learning settings, which are often large-scale, non-convex, and stochastic.
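As a purely illustrative sketch (not a method prescribed by the workshop), the snippet below shows one common way constraints enter training: gradient descent-ascent on a Lagrangian. The placeholders `loss_fn` and `constraint_fn` are hypothetical; the latter returns a violation measure g(theta) that should be driven to at most zero.

```python
# Minimal sketch, assuming a hypothetical loss_fn and constraint_fn:
# gradient descent-ascent on L(theta, lambda) = loss + lambda * g(theta), g(theta) <= 0.
import torch

def constrained_training_step(model, batch, loss_fn, constraint_fn,
                              lmbda, opt_theta, opt_lambda):
    # Primal step: minimize loss + lambda * violation w.r.t. the model parameters.
    opt_theta.zero_grad()
    loss = loss_fn(model, batch)
    violation = constraint_fn(model, batch)   # g(theta); feasible when <= 0
    (loss + lmbda.detach() * violation).backward()
    opt_theta.step()

    # Dual step: gradient ascent on lambda (its gradient is the violation itself).
    opt_lambda.zero_grad()
    (-lmbda * violation.detach()).backward()
    opt_lambda.step()
    with torch.no_grad():
        lmbda.clamp_(min=0.0)                 # the multiplier stays non-negative
    return loss.item(), violation.item()

# Typical (hypothetical) setup:
# lmbda = torch.tensor(0.0, requires_grad=True)
# opt_theta  = torch.optim.SGD(model.parameters(), lr=1e-2)
# opt_lambda = torch.optim.SGD([lmbda], lr=1e-2)
```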
Data on the Brain and Mind
Evaluating the Evolving LLM Lifecycle: Benchmarks, Emergent Abilities, and Scaling
2nd Workshop on Multi-modal Foundation Models and Large Language Models for Life Sciences
Foundations of Reasoning in Language Models
Our workshop’s goal is to advance foundational understanding, principled innovations, and rigorous scientific evaluations for reasoning in language models. These advancements are built upon theoretical analyses and controlled empirical studies that illuminate how reasoning emerges, where it fails, and how it can be systematically improved.
We want to foster dialogue between communities with complementary strengths (those building theoretical models of reasoning phenomena, those designing experiments that reveal its emergence or failure in practice, and those proposing algorithmic developments that advance reasoning) around three primary questions:
1. How are language models able to solve complex tasks, and what do they still struggle with?
2. What fundamental challenges stand in the way of advancing reasoning capabilities?
3. What algorithmic innovations can overcome these obstacles?
LAW 2025: Bridging Language, Agent, and World Models for Reasoning and Planning
Workshop on Mechanistic Interpretability
Non-Euclidean Foundation Models and Geometric Learning: Advancing AI Beyond Euclidean Frameworks
In the era of foundation models and Large Language Models (LLMs), Euclidean space is the de facto geometric setting of our machine learning architectures. However, recent literature has demonstrated that this choice comes with fundamental limitations. Non-Euclidean learning is quickly gaining traction. Non-Euclidean spaces, such as hyperbolic, spherical, and mixed-curvature spaces, have been shown to provide more efficient and effective representations for data with intrinsic geometric properties, like hierarchy, symmetry, and heterogeneity.
Integrating foundation models with non-Euclidean spaces has great potential to enhance their ability to capture and model the underlying structures and relationships in complex real-world data, leading to better performance, generalization, and interpretability. This workshop focuses on the intersection of Non-Euclidean representation learning and Foundation Models, exploring its potential benefits, challenges, and future directions.
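For a concrete, self-contained illustration of one such space (not taken from any workshop material), the sketch below computes distances in the Poincaré ball model of hyperbolic space; points near the boundary are far apart hyperbolically even when close in Euclidean terms, which is what makes the geometry well suited to hierarchical data.

```python
# Minimal sketch: distance in the Poincare ball model of hyperbolic space. Illustrative only.
import numpy as np

def poincare_distance(x, y, eps=1e-9):
    """Hyperbolic distance between points x, y with ||x||, ||y|| < 1."""
    sq_norm_x = np.sum(x * x)
    sq_norm_y = np.sum(y * y)
    sq_dist = np.sum((x - y) ** 2)
    denom = max((1.0 - sq_norm_x) * (1.0 - sq_norm_y), eps)
    return np.arccosh(1.0 + 2.0 * sq_dist / denom)

# A point near the boundary is hyperbolically far from one near the origin,
# even though their Euclidean distance is small.
x = np.array([0.0, 0.1])
y = np.array([0.0, 0.95])
print(poincare_distance(x, y))
```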
Symmetry and Geometry in Neural Representations
The fields of biological and artificial intelligence are increasingly converging on a shared principle: the geometry and topology of real-world structure play a central role in building efficient, robust, and interpretable representations. In neuroscience, mounting evidence suggests that neural circuits encode task and environmental structure through low-dimensional manifolds, conserved symmetries, and structured transformations. In deep learning, principles such as sparsity, equivariance, and compositionality are guiding the development of more generalizable and interpretable models, including new approaches to foundation model distillation. The NeurReps workshop brings these threads together, fostering dialogue among machine learning researchers, neuroscientists, and mathematicians to uncover unifying geometric principles of neural representation. Just as geometry and symmetry once unified the models of 20th-century physics, we believe they may now illuminate the computational foundations of intelligence.
New Perspectives in Graph Machine Learning
Frontiers in Probabilistic Inference: Learning meets Sampling
Learning to Sense (L2S)
The workshop explores the joint optimization of sensors and machine learning models, pushing beyond traditional paradigms of data acquisition and processing. We aim to rethink the foundations of how machines sense the world by replacing hand-crafted image signal processors (ISPs), leveraging learnable sensor layouts, and adopting task-driven sensing strategies.
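As a deliberately toy, hypothetical illustration of such joint sensor-model optimization (all names, shapes, and data below are placeholders, not workshop code), the sketch treats a per-channel sensor gain as a learnable parameter and updates it with the same task loss as a small downstream classifier:

```python
# Minimal sketch: a learnable "sensor" parameter trained end-to-end with a tiny classifier.
import torch
import torch.nn as nn

class LearnableSensor(nn.Module):
    def __init__(self, channels=3):
        super().__init__()
        self.gain = nn.Parameter(torch.ones(channels))   # stand-in for a tunable sensor response

    def forward(self, raw):                              # raw: (batch, channels, H, W)
        return torch.clamp(raw * self.gain.view(1, -1, 1, 1), 0.0, 1.0)

sensor = LearnableSensor()
model = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
                      nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(8, 10))
opt = torch.optim.Adam(list(sensor.parameters()) + list(model.parameters()), lr=1e-3)

raw = torch.rand(4, 3, 32, 32)                # fake RAW frames
labels = torch.randint(0, 10, (4,))
loss = nn.functional.cross_entropy(model(sensor(raw)), labels)
opt.zero_grad()
loss.backward()
opt.step()                                    # gradients flow into the sensor parameters too
```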
We welcome original contributions and position papers on the following topics (non-exhaustive):
Sensor optimization, e.g., for computer vision (bit-depth, pixel layouts, color filter design)
RAW-to-task or RAW-to-label approaches for visual tasks
Co-design of neural networks and sensor hardware
Low-bit and energy-efficient sensing for embedded or mobile devices
Benchmarks, datasets, and metrics for evaluating sensor-model pipelines
Generalization and robustness of sensor-model systems in real-world conditions
Failure case studies and negative results in joint optimization pipelines
Join us to engage with cutting-edge research and cross-disciplinary discussions that are shaping the future of sensor systems for real-world deployment across mobile, embedded, and autonomous platforms.
Ariel Data Challenge 2025: Methods to Extract Planetary Signals for the Ariel Space Telescope
This workshop showcases winning approaches from the 2025 Ariel Data Challenge, a Kaggle competition tackling a notoriously difficult signal processing problem: extracting extremely faint exoplanet signatures from complex, non-linear noise in spatiotemporal data. The 2024 challenge drew thousands of competitors worldwide, yet no solution achieved the mission's stringent performance thresholds. The 2025 competition raised the bar with higher-fidelity simulations that closely mirror real observational conditions from the Ariel Space Telescope.
Winners will present novel architectures and algorithms for two core problems: advanced denoising in the presence of structured noise and robust uncertainty quantification under extreme signal-to-noise ratios. These solutions emerged from a realistic constraint environment where both accuracy and calibrated confidence estimates are mission critical.
While framed within an astronomy context, the technical challenges are broadly applicable. Whether you are a researcher working in this domain or simply curious about ML applications in space science, come and join us!
Early Training Scientific Knowledge and Reasoning Evaluation of Small Language Models
Existing benchmarks have proven effective for assessing the performance of fully trained large language models. However, we find striking differences in the early training stages of small models, where benchmarks often fail to provide meaningful or discriminative signals. To explore how these differences arise, this competition tackles the challenge of designing scientific knowledge evaluation tasks specifically tailored for measuring early training progress of language models. Participants are invited to develop novel evaluation methodologies or adapt existing benchmarks to better capture performance differences among language models. To support this effort, we provide three pre-trained small models (0.5B, 1B, and 3B parameters), along with intermediate checkpoints sampled during training up to 200B tokens. All experiments and development work can be run on widely available free cloud-based GPU platforms, making participation accessible to researchers with limited computational resources. Submissions will be evaluated based on three criteria: the quality of the performance signal they produce, the consistency of model rankings at 1 trillion tokens of training, and their relevance to the scientific knowledge domain. By promoting the design of tailored evaluation strategies for early training, this competition aims to attract a broad range of participants from various disciplines, including those who may not be machine learning experts or have access to dedicated GPU resources. Ultimately, this initiative seeks to make foundational LLM research more systematic and benchmark-informed from the earliest phases of model development.
The PokéAgent Challenge: Competitive and Long-Context Learning at Scale
While frontier AI models excel at language understanding, math reasoning, and code generation, they underperform in out-of-distribution generalization, adaptation to strategic opponents, game-theoretic decision-making, and long-context reasoning and planning. To address these gaps, we introduce the PokéAgent Challenge, leveraging Pokémon's rich multi-agent battle system and expansive role-playing game (RPG) environment. The competition features two complementary tracks: the Battling Track evaluates generalization and strategic reasoning under uncertainty in the two-player game of Competitive Pokémon, while the Speedrunning Track targets long-horizon planning and decision-making in the Pokémon RPG. Together, our competition tracks unify recent interests in reinforcement learning (RL) and large language model (LLM) research, encouraging collaboration across communities. Pokémon's popularity and internet presence are a key strength of our competition: Participants will have access to a large dataset of over 3.5 million battles and a knowledge base of reference materials and baseline methods. Recent work led by our competition's organizers will provide varied baselines, including rule-based, RL, and LLM-based agents. Our resources will make the PokéAgent challenge accessible while maintaining the complexity needed to drive fundamental advances in decision-making systems.
Artificial Intelligence for Music: Where Creativity Meets Computation
This workshop explores the dynamic intersection of AI and music, a rapidly evolving field where creativity meets computation. The goal of this workshop is twofold: First, we aim to explore the latest advancements in AI's applications for music, from analysis, creation, performance, production, and retrieval to music education and therapy. Second, we aim to discuss the impacts and implications of AI in music, including its effects on the music industry, the musician community, and music education, as well as the ethical, legal, and societal implications of AI music and what AI means for future musicians.
AI for Science: The Reach and Limits of AI for Scientific Discovery
Through our proposed AI for Science workshop, we will bring together experimentalists, domain scientists, and ML researchers to discuss the reach and limits of AI for scientific discovery. We will center our discussion on three challenges that are essential to progress across scientific domains. (1) LLM reasoning across scientific domains: can present-day LLMs generate rigorously testable hypotheses and reason over experimental results that span scientific domains such as physics, chemistry, and biology? (2) Fidelity of generative and surrogate simulators: in biology, we see a shift towards all-atom models with increasingly powerful capabilities; in chemistry, machine learning force fields are increasing in accuracy and generalizability; and in climate modeling, we can now accurately predict weather 15 days out. How far can we push this limit? What spatial or temporal scales remain intractable? (3) Experimental data scarcity and bias: we see modern examples of large-scale dataset generation such as the Protein Data Bank, Human Cell Atlas, and the Materials Project. Are there other fields where AI can benefit most from consortium efforts to generate large-scale datasets? How far can models trained on limited experimental datasets take us, and where are lab-in-the-loop strategies essential? To address this, we additionally introduce a dataset proposal competition. Our workshop will highlight common bottlenecks in developing AI methods across scientific application domains and delve into solutions that can unlock progress across all of these domains.
AI That Keeps Up: Workshop on Continual and Compatible Foundation Model Updates (CCFM)
Foundation models, despite their impressive capabilities, face a critical challenge: they naturally become outdated. Because they are trained on vast datasets, frequently updating these models is expensive. Crucially, these challenges extend beyond the scope of studies in traditional continual learning, as foundation models require rapid and scalable adaptation to dynamic global changes and the emergence of both generalized and specialized tasks. This workshop addresses the urgent need for up-to-date foundation models. We invite researchers to explore cost-effective methods for frequent updates and adaptation, minimizing forgetting and deterioration, ensuring a consistent user experience, and designing dynamic evaluations that remain relevant as models evolve.
Regulatable ML: Towards Bridging the Gaps between Machine Learning Research and Regulations
GPU-Accelerated and Scalable Optimization (ScaleOpt)
Recent advancements in GPU-based large-scale optimization have been remarkable. Recognizing the revolution in optimizing neural network weights via large-scale GPU-accelerated algorithms, the optimization community has been interested in developing general-purpose GPU-accelerated optimizers for various families of classic optimization problems, including linear programming, general conic optimization, combinatorial optimization, and more specific problem families such as flow optimization and optimal transport. Beyond applying GPUs directly to classical problems, current frontier AI tools—including large language models (LLMs)—are being deployed to solve optimization problems. Various works have used neural networks to solve mixed-integer problems, linear or quadratic programs, general combinatorial optimization problems, and more specific optimization problems such as LASSO and robust PCA. In this workshop, we aim to provide a platform for interested researchers to engage with each other on recent breakthroughs and current bottlenecks in designing large-scale GPU-based optimizers and synergizing AI systems with solving optimization problems.
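For readers less familiar with the classical side, here is a minimal hedged sketch of one such problem: plain ISTA (proximal gradient) for the LASSO, run on a GPU simply by keeping the tensors there via PyTorch. Problem sizes and the regularization weight are arbitrary placeholders, and this is not a solver presented at the workshop.

```python
# Minimal sketch: ISTA (proximal gradient) for the LASSO on a GPU via PyTorch. Illustrative only.
import torch

def ista_lasso(A, b, lam, n_iters=500):
    """Minimize 0.5*||Ax - b||^2 + lam*||x||_1 with proximal gradient descent."""
    x = torch.zeros(A.shape[1], device=A.device)
    step = 1.0 / torch.linalg.matrix_norm(A, ord=2) ** 2   # 1/L with L = ||A||_2^2
    for _ in range(n_iters):
        grad = A.T @ (A @ x - b)                            # gradient of the smooth part
        z = x - step * grad
        x = torch.sign(z) * torch.clamp(z.abs() - step * lam, min=0.0)  # soft-thresholding
    return x

device = "cuda" if torch.cuda.is_available() else "cpu"
A = torch.randn(2000, 5000, device=device)
b = torch.randn(2000, device=device)
x_hat = ista_lasso(A, b, lam=0.1)
```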
Workshop on Scaling Environments for Agents
The development of intelligent agents – particularly those powered by large language models (LLMs) – has emphasized the critical role of environments in shaping agent behavior and capabilities, especially for achieving end-to-end autonomy. Environments are not merely testing grounds; they are dynamic, interactive contexts that serve as the essential "data" for agents to learn adaptive behavior, complex reasoning, and long-term decision-making skills. Just as scaling the model size, dataset size, and training computation has led to emergent capabilities in LLMs, scaling the structure, fidelity, and diversity of environments is one of the crucial dimensions in advancing agent intelligence. Moreover, recent advances in end-to-end reinforcement learning (RL), particularly when paired with LLM-based agents, have made it increasingly viable to train agents through sustained interaction. These agents can now acquire skills, strategies, and planning abilities through environmental feedback, rather than relying solely on imitation learning or static prompt engineering. As we move toward more autonomous, general-purpose agents, the need for scalable, richly interactive, and diverse environments has become both urgent and foundational.
Learning from Time-Series for Health
Time-series data underpin modern healthcare, spanning electronic health records, physiological waveforms, wearables, and population trends, yet their unique characteristics—including uncertain ground truth, quasi-periodic physiological motifs, and non-semantic timepoints—demand specialized machine learning approaches. While recent advances in foundation models, multimodal learning, and generative methods show promise, significant challenges remain in causality, interpretability, and deployment. This workshop unites researchers across health time-series domains (from wearables to clinical systems) to address shared challenges through: (1) cross-domain discussion, (2) diverse industry/academic perspectives (featuring Google, Oura, Apple and 5 institutions), and (3) community engagement via posters, talks, and panels. By fostering cross-domain collaboration on physiological-aware methods, we aim to bridge the gap between cutting-edge ML and real-world healthcare impact.
UrbanAI: Harnessing Artificial Intelligence for Smart Cities
As the world population becomes increasingly urban, incredible innovation is required to sustain such dense populations in a safe and healthy manner. Urban areas are responsible for over 70% of global carbon emissions and energy consumption, driven by infrastructure and transportation systems that often remain inefficient or outdated despite advances in technology. AI and machine learning present immense opportunities to reshape urban environments, optimizing everything from energy use and transportation networks to public health and governance, addressing urgent challenges in areas as diverse as building optimization and sustainability, pollution mitigation, infrastructure maintenance, urban planning, traffic management, and civic life. However, applying AI solutions to these challenges is not easy. Engineers must deal with a wide range of obstacles, including complex and non-standardized management systems, diverse and non-integrated data sources, security and privacy risks, and continuous integration and maintenance. This workshop aims to engage the machine learning community in addressing urban optimization challenges, overcoming real-world obstacles that prevent adoption of solutions, and fostering interdisciplinary collaboration among experts in infrastructure, transportation, and public health. By leveraging cutting-edge ML methodologies, participants can develop scalable solutions to enhance efficiency, sustainability, and quality of life in urban areas. Confirmed speakers from fields including power systems, building optimization, transportation, water, energy systems, climate science, and building control will lead discussions on establishing benchmarks, developing robust methodologies, and creating solutions with measurable real-world impact. This workshop offers the ML community a unique platform to directly contribute to sustainable and intelligent urban development, helping cities globally meet climate goals, improve public services, and enhance overall urban resilience. Unlike existing ML-focused workshops on climate and physical systems, UrbanAI explicitly addresses the multifaceted challenges of urban environments, bringing international experts together for the first time at a major ML conference.
What Can('t) Transformers Do?
With most advances in large foundation models (LFMs) being empirical, our theoretical understanding of what transformers can compute, express, and learn still lags behind. This workshop will convene theorists and empiricists to chart a rigorous agenda for the next generation of LFMs, asking “What can and can’t transformers do?” We welcome both formal analyses and empirically grounded studies that shed light on theoretical questions, aiming to close the gap between proofs and practice while fostering new, interdisciplinary collaborations.
NeurIPS 2025 Workshop on Space in Vision, Language, and Embodied AI
Recent Advances in Time Series Foundation Models: Have We Reached the ‘BERT Moment’?
Foundation models (FMs) have achieved great success in NLP and vision, inspiring over 20 new time series FMs (TSFMs) in the past year. Despite promising results, studies show that carefully designed lightweight supervised baselines often match TSFM performance. Unlike NLP’s “BERT Moment,” TSFMs still require full fine-tuning to be competitive in real-world scenarios. Additionally, some tabular FMs rival TSFMs without being time series-specific. Recent benchmarks also provide mixed evidence: GIFT-Eval favors TSFMs, OpenTS shows statistical models outperforming deep learning on univariate data, and FoundTS finds supervised baselines on par with TSFMs. This workshop aims to bring together researchers to examine the gap between TSFM potential and real-world utility, and to identify benchmarks and applications where TSFMs can truly excel.
The key topics of this workshop include, but are not limited to:
- Benchmarking Foundation Models in Time Series,
- Scaling Laws and Efficiency in Time Series Models,
- Evaluating Transferability and Adaptability of Foundation Models,
- Leveraging Foundation Models of Other Modalities for Time Series,
- Unsupervised Performance Estimation of TSFMs,
- Industrial Benchmarking of Time Series Foundation Models
More details are provided in our Call for Papers.
The MindGames Challenge: Theory-of-Mind and Game Intelligence in LLM Agents
Recent breakthroughs in large language models have revolutionized natural language processing and spawned new classes of multi-agent AI systems. Yet essential gaps remain in such systems' abilities to model beliefs, detect deception, coordinate effectively under uncertainty, and plan in longer-term dynamic environments, collectively known as “theory-of-mind” capacities. The MindGames Challenge seeks to address these gaps by testing and advancing the cooperative intelligence of LLM agents across multiple distinct social-deduction and coordination tasks. Participants will develop agents that (i) communicate via natural language, (ii) reason about hidden states and competing objectives, and (iii) dynamically adapt strategies in repeated and iterative interactions.
Multimodal Algorithmic Reasoning Workshop
Large AI frameworks have been advancing in their data modeling abilities with ever greater vigor in recent times, with compelling applications emerging frequently, many of which may even appear to challenge human intelligence. Yet despite such impressive performance, there remain open questions about whether these models include the foundations of general intelligence, or whether they perform these tasks without human-like understanding. This necessitates the development of better tools for assessing these models in tandem with developing the models themselves. This workshop focuses on the topic of multimodal algorithmic reasoning, where an agent needs to assimilate information from multiple modalities towards deriving reasoning algorithms for complex problem solving. In the last year, we have seen rapid advances in AI capabilities that better bridge across modalities, bringing both optimism about superhuman capabilities and skepticism about the limits of current approaches. Through talks from outstanding researchers and faculty, we hope to dive deep into this exciting topic at the intersection of theory, multimodal learning and cognitive science to understand what we have achieved thus far in machine intelligence and what we are lacking in relation to the human way of thinking, towards finding the missing rungs on the ladder to truly intelligent reasoning.
Tackling Climate Change with Machine Learning
Many in the ML community wish to take action on climate change, but are unsure how to have the most impact. This workshop will highlight work that demonstrates that, while ML is no silver bullet, it can be an invaluable tool in reducing greenhouse gas emissions and in helping society adapt to the effects of climate change.
Climate change is a complex problem for which action takes many forms, from advancing theory to deploying new technology. Many of these actions represent high-impact opportunities for real-world change, and simultaneously pose interesting academic research problems.
The theme of this workshop, “Roots to Routes: A Dialogue on Different Machine Learning Methods for Climate Impact,” invites submissions that explore the strengths of diverse machine learning approaches in climate-related contexts. We particularly encourage work that demonstrates the effectiveness of classical ML methods under real-world constraints, such as limited data availability, privacy concerns, or restricted computational resources. At the same time, we welcome contributions that showcase how scaling up data and computing resources combined with modern tools and techniques can unlock new possibilities for tackling global-scale climate prediction challenges.
This workshop is part of a series that aims to bring together those applying ML to climate change challenges and facilitate cross-pollination between ML researchers and experts in climate-relevant fields.
The main workshop will take place on December 6 or 7, 2025 (exact date TBD).
CogInterp: Interpreting Cognition in Deep Learning Models
Recent innovations in deep learning have produced models with impressive capabilities, achieving or even exceeding human performance in a wide range of domains. A timely and critical challenge in AI is understanding what behaviors these models are actually capable of, and the internal processes which support these behaviors. As interest continues to grow in models’ internal processes, the field of cognitive science is becoming increasingly useful for describing and understanding cognition in deep learning models: cognitive science, which seeks to describe the cognitive processes in human and animal minds, offers a rich body of theories, experiments, and frameworks which may be adopted to understand how deep learning models achieve complex behaviors in domains such as language, vision, and reasoning.
The workshop will focus on Cognitive Interpretability (“CogInterp”), which involves the systematic interpretation of high-level cognition in deep learning models. Similar to how cognitive science describes the intermediate representations and algorithms (or cognition) between behavior and neurons in biological systems, the goal of Cognitive Interpretability is to describe the cognitive processes which lie between the levels of behavioral evaluations and mechanistic interpretability in deep learning models. Practically speaking, this means that Cognitive Interpretability does not just ask whether a model can perform task X or has a certain ability Y, but additionally (or instead) how a model performs X or learns and implements Y. These kinds of inferences—from observable behavior to latent “mental” processes—are the bread and butter of cognitive science, but many of the theoretical and empirical tools developed to tackle these problems have not yet been widely adopted in AI research, in part because of the separation between the fields and communities.
To address the gap above, our goal is to bring together researchers in cognitive science and AI interpretability to discuss new empirical results and theories about the inner workings of deep learning models. We hope to gather perspectives from various disciplines, including machine learning, psychology, linguistics, vision science, neuroscience, philosophy of mind, and law.
Foundation Models for Embodied Agents
This challenge invites participants to enhance Large Language Models (LLMs) for embodied reasoning through our standardized Embodied Agent Interface evaluation protocol. The framework systematically evaluates critical embodied reasoning capabilities: goal interpretation (understanding objectives and grounding to environment states), subgoal decomposition (breaking down complex goals), action sequencing (planning action sequences), and transition/world modeling (modeling world state changes through actions). Despite growing interest in using LLMs for robotics and agent planning, current evaluations lack standardization and fail to pinpoint fine-grained reasoning failures. Our challenge builds a unified evaluation approach to task formulation, input/output structures, and evaluation metrics by utilizing the well-established BEHAVIOR and VirtualHome simulators, enhanced with detailed annotations including Linear Temporal Logic (LTL) goal specifications and comprehensive error analysis. Unlike typical evaluations that just report an overall success rate, leaving us in the dark about which specific abilities LLMs struggle with, our framework uses fine-grained metrics that examine both whether proposed actions could actually work in practice and if they truly accomplish the intended goals. Our evaluation system breaks down each reasoning component into separate modules, giving us a clear picture of exactly where and how these models succeed or fail. Baseline results from state-of-the-art LLMs reveal significant performance gaps and motivate further innovation. The competition aims to advance the understanding of how LLMs reason in embodied environments, promote the development of robust and interpretable LLM AI agents, and foster collaboration between the language modeling and robotics communities. More info at https://neurips25-eai.github.io/.
Mouse vs. AI: A Neuroethological Benchmark for Visual Robustness
Visual robustness under real-world conditions remains a critical bottleneck for modern reinforcement learning agents. In contrast, biological systems like mice display remarkable resilience to environmental changes—maintaining stable performance even under degraded or perturbed visual input with minimal exposure. Inspired by this gap, we introduce a novel Bio-Inspired Visual Robustness Benchmark for testing generalization in reinforcement learning agents trained to navigate a virtual environment toward a visually cued target. Participants train agents to perform a visually guided foraging task in a naturalistic 3D Unity environment and are evaluated on their ability to generalize to unseen, ecologically realistic visual perturbations, having been exposed during training only to a single illustrative example: fog. What sets this challenge apart is its biological grounding: real mice performed the same task, and participants receive both behavioral performance data and large-scale neural recordings (19,000+ neurons across visual cortex) for benchmarking. The competition features two tracks: (1) Robustness, assessing generalization across held-out perturbations; and (2) Neural Alignment, evaluating how well agents' internal representations predict mouse visual cortical activity via a linear readout. We provide the full Unity environment, a fog-perturbed training condition for validation, baseline PPO agents, and a rich multimodal dataset. Track 2 offers the first competition framework for testing whether task-trained agents spontaneously develop brain-like representations—assessed by their ability to predict neural activity recorded from mice during the same behavior. By bridging reinforcement learning, computer vision, and neuroscience through shared, behaviorally grounded tasks, this challenge advances the development of robust, generalizable, and biologically inspired AI.
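As a hedged sketch of what a linear-readout alignment score can look like in general (array names, shapes, and the exact metric below are assumptions, not the competition's official scoring code), ridge regression maps agent activations to recorded responses and is scored by per-neuron correlation on held-out data:

```python
# Minimal sketch of a linear readout from agent features to neural activity. Placeholder data.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
agent_features = rng.normal(size=(5000, 256))    # (timesteps, agent hidden units)
neural_activity = rng.normal(size=(5000, 1000))  # (timesteps, recorded neurons)

X_tr, X_te, Y_tr, Y_te = train_test_split(agent_features, neural_activity,
                                          test_size=0.2, random_state=0)
readout = Ridge(alpha=1.0).fit(X_tr, Y_tr)       # one linear map per neuron
pred = readout.predict(X_te)

# Per-neuron Pearson correlation between predicted and held-out activity.
pred_c = pred - pred.mean(0)
true_c = Y_te - Y_te.mean(0)
corr = (pred_c * true_c).sum(0) / (
    np.linalg.norm(pred_c, axis=0) * np.linalg.norm(true_c, axis=0) + 1e-12)
print("median per-neuron correlation:", np.median(corr))
```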
SLC-PFM: Self-supervised Learning for Cancer Pathology Foundation Models
The emergence of foundation models has revolutionized artificial intelligence (AI) across various applications (Bommasani et al., 2021), with recent advances in computational pathology, for example UNI (Chen et al., 2024), Virchow (Vorontsov et al., 2024), and GigaPath (Xu et al., 2024), demonstrating potential for improving diagnostic capabilities and patient outcomes. The proposed Competition on Self-supervised Learning for Cancer Pathology Foundation Models (SLC-PFM) provides an unprecedented platform for advancing the development of the next generation of pathology foundation models. Central to this competition is MSK-SLCPFM, the largest pathology dataset to date for purposes of a competition, comprising over 300 million images spanning 39 cancer types that will be provided to participants for pre-training their models with self-supervised learning techniques. The competition follows a two-phase structure: foundation model development followed by evaluation across 23 clinically relevant downstream tasks including biomarker prediction, cancer subtyping, and survival prediction. The competition is designed to be inclusive for a diverse audience of machine learning and AI practitioners, computer scientists, engineers, bioinformaticians, and specialists from related disciplines, regardless of their background in pathology or medical image processing. By eliminating the barrier of domain-specific data curation, the competition enables participants to focus on technical innovation. The key highlights of the proposed competition are: Comprehensive Pre-training Data, with access to the largest pathology dataset (300M images) enabling foundation model training at scale; Robust Validation Framework, with multi-institutional evaluation across diverse clinically relevant pathology tasks; and Focus on Technical Innovation, allowing participants to focus on novel architectures and learning approaches without the burden of data curation.
FAIR Universe – handling uncertainties and distribution shifts for precision cosmology
We propose a challenge organised in conjunction with the FAIR Universe project, a collaborative effort funded by the US Department of Energy and involving the Lawrence Berkeley National Laboratory, Université Paris-Saclay, University of Washington, and ChaLearn. This initiative aims to forge an open AI ecosystem for scientific discovery. The challenge will focus on measuring the fundamental properties of the universe from weak gravitational lensing datasets with imperfect simulators and potential distribution shifts. Additionally, the challenge will leverage a large-compute-scale AI platform for sharing datasets, training models, and hosting machine learning competitions. Our challenge will bring together the physics and machine learning communities to advance our understanding and methodologies in handling systematic (otherwise known as epistemic) uncertainties and distribution shifts within AI techniques.
Weather4cast 2025 – Multi-task Challenges for Weather & Pollution Pattern Prediction on the Road to Hi-Res Foundation Models
The competition will advance modern algorithms in AI and machine learning through a highly topical interdisciplinary competition challenge: The prediction of hi-res rain radar movies from multi-band satellite sensors requires data fusion of complementary signal sources, multi-channel video frame prediction, as well as super-resolution techniques. To reward models that extract relevant mechanistic patterns reflecting the underlying complex weather systems, our evaluation incorporates spatio-temporal shifts: Specifically, algorithms need to forecast several hours of ground-based hi-res precipitation radar from lo-res satellite spectral images in a unique cross-sensor prediction challenge. Models are evaluated within and across regions on Earth with diverse climate and different distributions of heavy precipitation events. Conversely, robustness over time is achieved by testing predictions on data one year after the training period. Now, in its fourth year, Weather4cast moves to improve forecasts world-wide on an expansive data set with over an order of magnitude more hi-res rain radar data, allowing a move towards Foundation Models through multi-modality, multi-scale, multi-task challenges. Accurate rain predictions are becoming ever more critical for everyone, with climate change increasing the frequency of extreme precipitation events. Notably, the new models and insights will have a particular impact for the many regions on Earth where costly weather radar data are not available. As a complementary application-specific forecasting endpoint, in 2025, for the first time, we add a pollution forecasting challenge. Join us on https://www.weather4cast.net!
The 2025 Google Code Golf Championship
The Abstraction and Reasoning Corpus remains one of the most interesting and challenging benchmarks available for tracking progress toward achieving Artificial General Intelligence. In contrast to other evaluation datasets designed to capture the extent of an agent's skill or knowledge, the ARC-AGI suite is instead targeted at measuring skill acquisition, a trait that has (so far) evaded even the most sophisticated machine learning systems. A key limitation to date has been the relatively limited number of reference programs designed to transform image grids corresponding to example pairs in the original benchmark suite. In order to enrich the space of available solutions, the 2025 Google Code Golf Championship will present contestants with those same tasks, encouraging them to produce, for each task, a Python program capable of exhibiting the desired behavior. Not only must these programs be functionally correct, but (as an added twist) they should also be as minimal as possible. A set of concise programs emphasizing robustness and simplicity could potentially serve as canonical reference solutions for this seminal dataset and, once open-sourced to the broader research community, might contribute toward the development of more versatile AI systems.
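As a toy, hypothetical illustration of the format (not an actual competition task), a grid-transformation program and a shorter "golfed" equivalent might look like this:

```python
# Toy illustration only: an ARC-style grid transformation written plainly and then golfed.

def transform(grid):
    """Reflect a rectangular grid of ints left-to-right."""
    return [row[::-1] for row in grid]

# A golfed equivalent, as a single short lambda:
p = lambda g: [r[::-1] for r in g]

example = [[0, 1, 2],
           [3, 4, 5]]
assert transform(example) == p(example) == [[2, 1, 0], [5, 4, 3]]
```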
The 2025 PNPL Competition: Speech Detection and Phoneme Classification in the LibriBrain Dataset
The advance of speech decoding from non-invasive brain data holds the potential for profound societal impact. Among its most promising applications is the restoration of communication to paralysed individuals affected by speech deficits such as dysarthria, without the need for high-risk surgical interventions. The ultimate aim of the 2025 PNPL competition is to produce the conditions for an “ImageNet moment” or breakthrough in non-invasive neural decoding, by harnessing the collective power of the machine learning community. To facilitate this vision we present the largest within-subject MEG dataset recorded to date (LibriBrain) together with a user-friendly Python library (pnpl) for easy data access and integration with deep learning frameworks. For the competition we define two foundational tasks (Speech Detection and Phoneme Classification from brain data), complete with standardised data splits and evaluation metrics, illustrative benchmark models, online tutorial code, a community discussion board, and public leaderboard for submissions. To promote accessibility and participation the competition features a Standard track that emphasises algorithmic innovation, as well as an Extended track that is expected to reward larger-scale computing, accelerating progress toward a non-invasive brain-computer interface for speech.
NeurIPS 2025 Competition: MMU-RAGent: Massive Multi-Modal User-Centric Retrieval Augmented Generation Benchmark
We introduce the first competition to evaluate RAG systems on real-user queries and feedback, leverage web-scale corpora, and support both text and video generation. Participants develop systems that respond to real-user queries, which are curated from MS MARCO Web Search and Chatbot Arena Conversations, or collected live via our RAG-Arena platform. To support retrieval at scale, we provide API access to the English subset of ClueWeb22-B and ClueWeb22-A (87M and 800M documents), along with AWS-hosted infrastructure to facilitate system deployment on our RAG-Arena platform. Systems are evaluated using a combination of human Likert-scale ratings, live preference judgments via RAG-Arena, LLM-as-a-Judge, and automatic metrics. To support flexibility in system design, we accept submissions that leverage proprietary search APIs or models, alongside open-source approaches. Participants are encouraged to clearly document system components, and separate leaderboard categories ensure fair and transparent comparison across open- and closed-source systems. By focusing on user needs, large-scale retrieval, and multimodal generation, this competition aims to push academic RAG research toward more scalable and user-aligned settings.