

Competition

The Competition of Fairness in AI Face Detection

Shu Hu · Xin Wang · Daniel Schiff · Sachi Mohanty · Ryan Ofman · Wenbin Zhang · Baoyuan Wu · Cristian Ferrer · Xiaoming Liu · Luisa Verdoliva · Siwei Lyu
Dec 6, 8:00 AM - 10:45 AM Mezzanine Room 15AB

This competition focuses on advancing fairness-aware detection of AI-generated (deepfake) faces and promoting new methodological innovations, addressing a critical gap: fairness methods developed in machine learning have been largely overlooked in deepfake detection. Participants will work with two large-scale datasets provided by the organizers: AI-Face (CVPR 2025), a million-scale, demographically annotated dataset for training and validation, and PDID (AAAI 2024), a newly curated dataset of real-world deepfake incidents reserved for testing. Participants are tasked with developing models that achieve strong utility (e.g., AUC) while ensuring fairness generalization under real-world deployment conditions. The baseline method, PG-FDD (published at CVPR 2024 by the organizers' group), which demonstrates state-of-the-art fairness generalization for AI face detection, will be provided to support participation. The competition's potential impact includes fostering the development of robust, fair, and generalizable deepfake detectors, raising awareness of fairness challenges in combating AI-generated fakes, and promoting responsible AI and machine learning deployment in societal applications such as media forensics and digital identity verification. The competition is generously sponsored by Deep Media AI and Originality.AI. The challenge link is https://sites.google.com/view/aifacedetection/home.
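To make the evaluation target concrete, the sketch below shows one way overall utility (AUC) and a simple per-group fairness gap could be computed. It is illustrative only, not the official scoring code; the binary label encoding and the group annotation format are assumptions.

```python
# Minimal sketch (not the official scoring code): overall AUC plus a simple
# fairness gap, assuming binary labels (1 = AI-generated), model scores, and a
# demographic group annotation per sample.
import numpy as np
from sklearn.metrics import roc_auc_score

def utility_and_fairness_gap(y_true, y_score, groups):
    """Overall AUC and the max-min AUC gap across demographic groups."""
    overall_auc = roc_auc_score(y_true, y_score)
    group_aucs = {}
    for g in np.unique(groups):
        mask = groups == g
        # Skip groups containing only one class; AUC is undefined there.
        if len(np.unique(y_true[mask])) == 2:
            group_aucs[g] = roc_auc_score(y_true[mask], y_score[mask])
    fairness_gap = max(group_aucs.values()) - min(group_aucs.values())
    return overall_auc, fairness_gap, group_aucs

# Toy usage with random data.
rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=1000)
scores = rng.random(1000)
groups = rng.choice(["A", "B", "C"], size=1000)
print(utility_and_fairness_gap(y, scores, groups))
```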

Competition

MyoChallenge 2025: Towards Human Athletic Intelligence

Vittorio Caggiano · Huiyi Wang · Chun Kwang Tan · Balint Hodossy · Shirui Lyu · Massimo Sartori · Seungmoon Song · Letizia Gionfrida · Guillaume Durandau · Vikash Kumar
Dec 6, 8:00 AM - 10:45 AM Upper Level Ballroom 6DE

Athletic performance represents the pinnacle of human decision-making. It demands rapid choices, precise motor control, agility, and coordinated physical execution. Such a combination of capabilities remains elusive in current artificial intelligence and robotic systems. Building on the momentum of the MyoChallenge at NeurIPS 2022, 2023, and 2024, the 4th edition of our MyoChallenge series -- Towards Human Athletic Intelligence -- moves toward capturing the full expressivity and agility of human athletic performance. Participants will develop behaviors for physiologically realistic musculoskeletal models performing fast-paced, high-skill athletic tasks. The challenge will feature two tracks. First, a Soccer Shootout: a full-body musculoskeletal model must dynamically approach and shoot a ball past a moving goalkeeper. Success requires balance, foot targeting, force generation, and rapid whole-body coordination. Second, a Table Tennis competition: a musculoskeletal model of the upper body (arm and trunk) must track, strike, and return balls in a fast-paced table tennis rally against an AI opponent. These challenges go far beyond static or repetitive motions. They demand generalization via reactive and adaptive embodied behavior grounded in the physics of muscle, tendon, and joint dynamics, with real-time perception-action loops capable of agile motor control. The challenge will be staged in the widely used MyoSuite framework, which offers physiologically accurate, state-of-the-art musculoskeletal models and an intuitive interface to scalable reinforcement learning and control libraries. The framework also enables easy onboarding via extensive tutorials and getting-started materials, and provides access to the multiple baseline libraries needed for the challenge. The competition aims to engage diverse research communities: biomechanics, motor neuroscience, reinforcement learning, control theory, and more. As in previous years, it will prioritize scalability, reproducibility, and generalization, and will be open-sourced following best engineering and academic practices to advance physiological control and bring us closer to replicating human athletic intelligence.
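As a starting point, the sketch below shows a generic interaction loop for a MyoSuite-style environment. The environment id is a placeholder (the official 2025 track ids ship with the challenge starter kit), and the reset/step signatures assume the classic Gym API that MyoSuite has historically exposed; this is not the official starter code.

```python
# Minimal interaction-loop sketch for a MyoSuite musculoskeletal environment.
# The environment id is a PLACEHOLDER and the reset/step signatures assume the
# classic Gym API; check the official starter kit for the real ids and API.
import gym
import myosuite  # noqa: F401 (importing registers the Myo* environments)

env = gym.make("myoChallengeSoccer-v0")  # placeholder id
obs = env.reset()
episode_return = 0.0
for _ in range(500):
    action = env.action_space.sample()          # random muscle activations
    obs, reward, done, info = env.step(action)
    episode_return += reward
    if done:
        break
print("return of a random policy:", episode_return)
```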

Competition

EEG Foundation Challenge: From Cross-Task to Cross-Subject EEG Decoding

Bruno Aristimunha · Dung Truong · Pierre Guetschel · Seyed (Yahya) Shirazi · Isabelle Guyon · Alexandre Franco · Michael Milham · Aviv Dotan · Scott Makeig · Alex Gramfort · Jean-Remi King · Marie-Constance Corsi · Pedro Valdés-Sosa · Amitava Majumdar · Alan Evans · Terrence Sejnowski · Oren Shriki · Sylvain Chevallier · Arnaud Delorme
Dec 6, 11:00 AM - 1:45 PM Mezzanine Room 15AB

Current electroencephalogram (EEG) decoding models are typically trained on specific subjects and specific tasks. Here, we introduce a large-scale, code-submission-based competition to move beyond this approach through two challenges. First, the transfer challenge consists of building a model that can zero-shot decode new tasks and new subjects from their EEG. Second, the psychopathology factor prediction challenge consists of predicting measures of mental health from EEG data. For this, we use an unprecedented, multi-terabyte dataset of high-density EEG signals (128 channels) recorded from over 3,000 subjects engaged in multiple active and passive tasks. We provide several tunable neural network baselines for each of these two challenges, including a simple network and demographic-based regression models. Developing models that generalize across tasks and individuals will pave the way for EEG architectures that adapt readily to new tasks and new individuals. Similarly, predicting mental health dimensions from EEG will be essential for systematically identifying objective biomarkers for clinical diagnosis and personalized treatment. Ultimately, the advances spurred by this challenge are poised to shape the future of neurotechnology and computational psychiatry, catalyzing breakthroughs in both fundamental neuroscience and applied clinical research.
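For orientation, here is a minimal sketch of a demographics-only regression baseline of the kind mentioned above. It uses synthetic data and illustrative feature names, not the released baseline code or the actual phenotype targets.

```python
# Minimal sketch of a demographic-based regression baseline for the
# psychopathology-factor track: predict the target from age/sex alone,
# ignoring the EEG entirely. Features and targets below are synthetic.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n = 3000
age = rng.uniform(5, 21, size=n)                 # years (illustrative range)
sex = rng.integers(0, 2, size=n)                 # 0/1 encoding
X = np.column_stack([age, sex])
y = 0.05 * age + 0.1 * sex + rng.normal(size=n)  # synthetic factor scores

model = Ridge(alpha=1.0)
scores = cross_val_score(model, X, y, cv=5, scoring="r2")
print("5-fold R^2 of the demographics-only baseline:", scores.mean())
```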

Competition

DCVLR: Data Curation for Vision Language Reasoning

Benjamin Feuer · Rohun Tripathi · Oussama Elachqar · Yuhui Zhang · Neha Hulkund · Thao Nguyen · Vishaal Udandarao · Xiaohan Wang · Sara Beery · Georgia Gkioxari · Emmanouil Koukoumidis · Paul Liang · Ludwig Schmidt · Saining Xie · Serena Yeung-Levy
Dec 6, 11:00 AM - 1:45 PM Upper Level Ballroom 6DE

We propose a new data-centric competition that aims to advance the visual reasoning capabilities of vision-language models (VLMs) through instruction-tuning dataset curation. Participants are provided with a pool of 1 million image-text pairs and tasked with generating a small (1K) or large (10K) instruction-tuning dataset using any method of their choice. Submissions will be evaluated by fine-tuning a fixed VLM (Molmo) on the curated data and measuring performance on VMCBench, a newly released benchmark composed of multiple-choice visual reasoning questions spanning six diverse datasets. The competition provides all necessary resources, including the image-text pool, fine-tuning scripts, evaluation code, and baselines generated using GPT-4o and Claude, as well as 400 USD of GPU compute from Lambda Labs. The evaluation metric is accuracy, and all training and evaluation will be reproduced by the organizers on standardized infrastructure. This challenge reframes data curation as the primary variable for scientific investigation, with implications for adapting foundation models to real-world domains such as education, biomedicine, and scientific reasoning. We aim to foster broad participation across academia and industry, democratizing model adaptation by focusing on data quality rather than computational scale.
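As an illustration of the task setup, the sketch below selects a 1K subset from a candidate pool by ranking examples with a placeholder scoring heuristic. The scoring function, record fields, and output format are assumptions, not the official submission specification; choosing a good scoring strategy is the whole point of the competition.

```python
# Minimal curation sketch: score every candidate image-text pair with some
# heuristic and keep the top-k (k = 1_000 or 10_000 for the two sizes).
import json
import numpy as np

def score_example(example: dict) -> float:
    """Placeholder heuristic: prefer longer, question-like instructions."""
    text = example.get("text", "")
    return len(text.split()) + 5.0 * ("?" in text)

def curate(pool: list[dict], k: int = 1000) -> list[dict]:
    scores = np.array([score_example(ex) for ex in pool])
    top_idx = np.argsort(-scores)[:k]
    return [pool[i] for i in top_idx]

# Toy usage with a synthetic pool and an assumed JSONL output format.
pool = [{"image": f"img_{i}.jpg", "text": f"What is shown in region {i}?"}
        for i in range(10_000)]
subset = curate(pool, k=1000)
with open("curated_1k.jsonl", "w") as f:
    for ex in subset:
        f.write(json.dumps(ex) + "\n")
```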

Competition

CURE-Bench: Competition on Reasoning Models for Drug Decision-Making in Precision Therapeutics

Shanghua Gao · Richard Zhu · Zhenglun Kong · Xiaorui Su · Curtis Ginder · Sufian Aldogom · Ishita Das · Taylor Evans · Theodoros Tsiligkaridis · Marinka Zitnik
Dec 6, 2:00 PM - 4:45 PM Upper Level Ballroom 6DE

Precision therapeutics require models that can reason over complex relationships between patients, diseases, and drugs. Large language models and large reasoning models, especially when combined with external tool use and multi-agent coordination, have demonstrated the potential to perform structured, multi-step reasoning in clinical settings. However, existing benchmarks (mostly QA benchmarks) do not evaluate these capabilities in the context of real-world therapeutic decision-making. We present CURE-Bench, a competition and benchmark for evaluating AI models in drug decision-making and treatment planning. CURE-Bench includes clinically grounded tasks such as recommending treatments, assessing drug safety and efficacy, designing treatment plans, and identifying repurposing opportunities for diseases with limited therapeutic options. The competition has two tracks: one for models that reason using internal knowledge, and another for agentic reasoning that integrates external tools and real-time information. Evaluation data are generated using a validated multi-agent pipeline that produces realistic questions, reasoning traces, and tool-based solutions. Participants will have access to baselines spanning both open-weight and API-based models, along with standardized metrics for correctness, factuality, interpretability, and robustness. Human expert evaluation provides an additional layer of validation. CURE-Bench provides a rigorous, reproducible competition framework for assessing the performance, robustness, and interpretability of reasoning models in high-stakes clinical applications. It will accelerate the development of therapeutic AI and foster collaboration between the AI and therapeutics communities.
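To illustrate the structural difference between the two tracks, here is a minimal sketch of a closed-book call versus a tool-calling agent loop. The prompt format, tool registry, and `generate` callable are hypothetical placeholders, not part of the official CURE-Bench harness.

```python
# Minimal sketch: internal-knowledge track vs. agentic (tool-using) track.
# Tool names, prompt conventions, and `generate` are illustrative assumptions.
def closed_book_answer(question: str, generate) -> str:
    """Track 1 style: rely on the model's internal knowledge only."""
    return generate(f"Answer this therapeutic question:\n{question}")

def agentic_answer(question: str, generate, tools: dict, max_steps: int = 5) -> str:
    """Track 2 style: the model may call tools, e.g. {'drug_lookup': fn, ...}."""
    transcript = question
    for _ in range(max_steps):
        step = generate(
            f"{transcript}\n\nNext action (TOOL:<name> <query> or FINAL:<answer>):"
        )
        if step.startswith("FINAL:"):
            return step[len("FINAL:"):].strip()
        if step.startswith("TOOL:"):
            name, _, query = step[len("TOOL:"):].strip().partition(" ")
            observation = tools.get(name, lambda q: "unknown tool")(query)
            transcript += f"\n{step}\nObservation: {observation}"
    return generate(f"{transcript}\n\nGive your final answer:")
```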

Competition

Open Polymer Challenge: Leveraging Machine Learning for Polymer Informatics

Gang Liu · Sobin Alosious · Yuhan Liu · Eric Inae · Yihan Zhu · Renzheng Zhang · Jiaxin Xu · Addison Howard · Ying Li · Tengfei Luo · Meng Jiang
Dec 6, 2:00 PM - 4:45 PM Mezzanine Room 15AB

Machine learning (ML) holds immense potential for discovering sustainable polymer materials, yet progress is hindered by the lack of high-quality open data. We provide an open-sourced dataset that is ten times larger than existing ones, along with competitive ML baselines and evaluation pipelines. This challenge targets multi-task polymer property prediction, which is crucial for virtual screening of polymers. Participants are asked to develop accurate prediction models, with a focus on material properties. A variety of approaches can be leveraged: ML techniques such as data augmentation and imbalanced learning, sophisticated learning paradigms like transfer learning and self-supervised learning, and novel model architectures with good inductive biases for polymers. The competition results will directly accelerate the discovery of novel polymers for sustainable and energy-saving materials.
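As one concrete way to handle sparsely labelled multi-task data, the sketch below trains a small regressor with a masked MSE loss, so a single model can learn from polymers that are annotated for only a subset of properties. The feature dimensionality and number of property targets are placeholders, not the challenge's data format.

```python
# Minimal sketch: multi-task polymer property regression with missing labels,
# using a masked MSE loss. Feature extraction and property names are omitted.
import torch
import torch.nn as nn

N_FEATURES, N_TASKS = 2048, 5   # e.g. fingerprint bits, number of properties

model = nn.Sequential(
    nn.Linear(N_FEATURES, 512), nn.ReLU(),
    nn.Linear(512, N_TASKS),
)

def masked_mse(pred, target, mask):
    """MSE averaged only over (polymer, property) pairs that have labels."""
    diff2 = (pred - target) ** 2 * mask
    return diff2.sum() / mask.sum().clamp(min=1)

x = torch.randn(32, N_FEATURES)                 # batch of polymer features
y = torch.randn(32, N_TASKS)                    # property values (where known)
mask = (torch.rand(32, N_TASKS) > 0.6).float()  # 1 where a label exists

loss = masked_mse(model(x), y, mask)
loss.backward()
print("masked multi-task loss:", loss.item())
```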

Competition

Ariel Data Challenge 2025: Methods to Extract Planetary Signals for the Ariel Space Telescope

Kai Hou Yip · Lorenzo Mugnai · Andrea Bocchieri · Giuseppe Morello · Orphée Faucoz · Angèle Syty · Tara Tahseen · Luís F. Simões · Ingo Waldmann
Dec 7, 8:00 AM - 10:45 AM Exhibit Hall G,H

This workshop showcases winning approaches from the 2025 Ariel Data Challenge, a Kaggle competition tackling a notoriously difficult signal processing problem: extracting extremely faint exoplanet signatures from complex, non-linear noise in spatiotemporal data. The 2024 challenge drew thousands of competitors worldwide, yet no solution achieved the mission's stringent performance thresholds. The 2025 competition raised the bar with higher-fidelity simulations that closely mirror real observational conditions from the Ariel Space Telescope.

Winners will present novel architectures and algorithms for two core problems: advanced denoising in the presence of structured noise and robust uncertainty quantification under extreme signal-to-noise ratios. These solutions emerged from a realistic constraint environment where both accuracy and calibrated confidence estimates are mission critical.
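As a minimal illustration of calibrated uncertainty in this setting, the sketch below trains a mean-and-variance head with a Gaussian negative log-likelihood, which penalises over-confident predictions. The number of spectral bins, the backbone features, and the tiny head are illustrative assumptions, not the challenge's data format.

```python
# Minimal sketch: predict a mean and variance per wavelength bin and train
# with a Gaussian negative log-likelihood for calibrated uncertainty.
import torch
import torch.nn as nn

N_BINS = 283  # illustrative number of spectral bins

class MeanVarHead(nn.Module):
    def __init__(self, d_in=512):
        super().__init__()
        self.mean = nn.Linear(d_in, N_BINS)
        self.log_var = nn.Linear(d_in, N_BINS)

    def forward(self, h):
        return self.mean(h), torch.exp(self.log_var(h))  # variance > 0

head = MeanVarHead()
nll = nn.GaussianNLLLoss()

h = torch.randn(16, 512)          # features from some denoising backbone
target = torch.randn(16, N_BINS)  # stand-in for target spectra
mu, var = head(h)
loss = nll(mu, target, var)
loss.backward()
print("Gaussian NLL:", loss.item())
```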

While framed within an astronomy context, the technical challenges are broadly applicable. Whether you are a researcher working in this domain or simply interested in ML applications in space science, come and join us!

Competition

Early Training Scientific Knowledge and Reasoning Evaluation of Small Language Models

Mouadh Yagoubi · Yasser Abdelaziz Dahou Djilali · Billel Mokeddem · Younes Belkada · Phúc Lê Khắc · Basma Boussaha · Reda Alami · Jingwei Zuo · Damiano Marsili · Mugariya Farooq · Mounia Lalmas · Georgia Gkioxari · Patrick Gallinari · Philip Torr · Hakim Hacid
Dec 7, 8:00 AM - 10:45 AM Upper Level Ballroom 6B

Existing benchmarks have proven effective for assessing the performance of fully trained large language models. However, we find striking differences in the early training stages of small models, where benchmarks often fail to provide meaningful or discriminative signals. To explore how these differences arise, this competition tackles the challenge of designing scientific knowledge evaluation tasks specifically tailored to measuring early training progress of language models. Participants are invited to develop novel evaluation methodologies or adapt existing benchmarks to better capture performance differences among language models. To support this effort, we provide three pre-trained small models (0.5B, 1B, and 3B parameters), along with intermediate checkpoints sampled during training up to 200B tokens. All experiments and development work can be run on widely available free cloud-based GPU platforms, making participation accessible to researchers with limited computational resources. Submissions will be evaluated on three criteria: the quality of the performance signal they produce, the consistency of model rankings at 1 trillion tokens of training, and their relevance to the scientific knowledge domain. By promoting the design of tailored evaluation strategies for early training, this competition aims to attract a broad range of participants from various disciplines, including those who may not be machine learning experts or have access to dedicated GPU resources. Ultimately, this initiative seeks to make foundational LLM research more systematic and benchmark-informed from the earliest phases of model development.
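To make the ranking-consistency criterion concrete, here is a minimal sketch that compares how a candidate benchmark ranks the models at an early checkpoint against their ranking after full training, using Spearman correlation. The scores are synthetic placeholders, and the official criterion may be computed differently.

```python
# Minimal sketch: does the benchmark's early-checkpoint ranking of models
# agree with their ranking at 1T tokens? Scores below are synthetic.
from scipy.stats import spearmanr

models = ["0.5B", "1B", "3B"]
early_scores = {"0.5B": 0.31, "1B": 0.38, "3B": 0.44}   # at ~200B tokens
final_scores = {"0.5B": 0.42, "1B": 0.55, "3B": 0.63}   # at 1T tokens

rho, p = spearmanr([early_scores[m] for m in models],
                   [final_scores[m] for m in models])
print(f"ranking consistency (Spearman rho): {rho:.2f}")
```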

Competition

The PokéAgent Challenge: Competitive and Long-Context Learning at Scale

Seth Karten · Jake Grigsby · Stephanie Milani · Kiran Vodrahalli · Amy Zhang · Fei Fang · Yuke Zhu · Chi Jin
Dec 7, 8:00 AM - 10:45 AM Mezzanine Room 15AB

While frontier AI models excel at language understanding, math reasoning, and code generation, they underperform in out-of-distribution generalization, adaptation to strategic opponents, game-theoretic decision-making, and long-context reasoning and planning. To address these gaps, we introduce the PokéAgent Challenge, leveraging Pokémon's rich multi-agent battle system and expansive role-playing game (RPG) environment. The competition features two complementary tracks: the Battling Track evaluates generalization and strategic reasoning under uncertainty in the two-player game of Competitive Pokémon, while the Speedrunning Track targets long-horizon planning and decision-making in the Pokémon RPG. Together, our competition tracks unify recent interests in reinforcement learning (RL) and large language model (LLM) research, encouraging collaboration across communities. Pokémon's popularity and internet presence are a key strength of our competition: participants will have access to a large dataset of over 3.5 million battles and a knowledge base of reference materials and baseline methods. Recent work led by our competition's organizers will provide varied baselines, including rule-based, RL, and LLM-based agents. Our resources will make the PokéAgent Challenge accessible while maintaining the complexity needed to drive fundamental advances in decision-making systems.
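For the Battling Track, a scripted baseline might look like the sketch below, written against poke-env, a widely used open-source interface to Pokémon Showdown battles. Whether the official starter kit builds on poke-env is an assumption here, not something stated in the abstract.

```python
# Minimal sketch of a rule-based battling baseline using poke-env (assumed
# tooling, not necessarily the official starter kit).
from poke_env.player import Player

class MaxDamagePlayer(Player):
    """Always pick the available move with the highest base power."""
    def choose_move(self, battle):
        if battle.available_moves:
            best = max(battle.available_moves, key=lambda m: m.base_power)
            return self.create_order(best)
        # No attacking move available (e.g. forced switch): fall back to random.
        return self.choose_random_move(battle)
```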

Competition

The MindGames Challenge: Theory-of-Mind and Game Intelligence in LLM Agents

Kevin Wang · Jianzhu Yao · Yihan Jiang · Benjamin Finch · Viraj Nadkarni · Benjamin Kempinski · Anna C. M. Thöni · Mathieu Lauriere · Maria Polukarov · Pramod Viswanath · Tal Kachman · Yoram Bachrach · Zhangyang "Atlas" Wang
Dec 7, 8:00 AM - 10:45 AM Upper Level Ballroom 6CF

Recent breakthroughs in large language models have revolutionized natural language processing and spawned new classes of multi-agent AI systems. Yet essential gaps remain in such systems' abilities to model beliefs, detect deception, coordinate effectively under uncertainty, and plan in longer-term dynamic environments, capacities collectively known as "theory of mind." The MindGames Challenge seeks to address these gaps by testing and advancing the cooperative intelligence of LLM agents across multiple distinct social-deduction and coordination tasks. Participants will develop agents that (i) communicate via natural language, (ii) reason about hidden states and competing objectives, and (iii) dynamically adapt strategies in repeated and iterative interactions.

Competition

Mouse vs. AI: A Neuroethological Benchmark for Visual Robustness

Marius Schneider · Joe Canzano · Jing Peng · Yuchen Hou · Spencer Smith · Michael Beyeler
Dec 7, 11:00 AM - 1:45 PM Upper Level Ballroom 6CF

Visual robustness under real-world conditions remains a critical bottleneck for modern reinforcement learning agents. In contrast, biological systems like mice display remarkable resilience to environmental changes, maintaining stable performance even under degraded or perturbed visual input with minimal exposure. Inspired by this gap, we introduce a novel Bio-Inspired Visual Robustness Benchmark for testing generalization in reinforcement learning agents trained to navigate a virtual environment toward a visually cued target. Participants train agents to perform a visually guided foraging task in a naturalistic 3D Unity environment and are evaluated on their ability to generalize to unseen, ecologically realistic visual perturbations, having been exposed during training to only a single illustrative example: fog. What sets this challenge apart is its biological grounding: real mice performed the same task, and participants receive both behavioral performance data and large-scale neural recordings (19,000+ neurons across visual cortex) for benchmarking. The competition features two tracks: (1) Robustness, assessing generalization across held-out perturbations; and (2) Neural Alignment, evaluating how well agents' internal representations predict mouse visual cortical activity via a linear readout. We provide the full Unity environment, a fog-perturbed training condition for validation, baseline PPO agents, and a rich multimodal dataset. Track 2 offers the first competition framework for testing whether task-trained agents spontaneously develop brain-like representations, assessed by their ability to predict neural activity recorded from mice during the same behavior. By bridging reinforcement learning, computer vision, and neuroscience through shared, behaviorally grounded tasks, this challenge advances the development of robust, generalizable, and biologically inspired AI.
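To make the Track 2 evaluation concrete, the sketch below fits a ridge-regression linear readout from agent activations to (synthetic) neural activity and reports held-out R^2. The array shapes and the choice of ridge regression are illustrative, not the official scoring pipeline.

```python
# Minimal sketch: linear readout from agent features to neural activity,
# scored by held-out explained variance. Data below are synthetic.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)
T, D, N = 5000, 256, 1000          # timesteps, agent features, neurons
agent_feats = rng.normal(size=(T, D))
neural = agent_feats @ rng.normal(size=(D, N)) * 0.1 + rng.normal(size=(T, N))

X_tr, X_te, Y_tr, Y_te = train_test_split(agent_feats, neural,
                                          test_size=0.2, random_state=0)
readout = Ridge(alpha=10.0).fit(X_tr, Y_tr)
print("held-out R^2 (mean over neurons):",
      r2_score(Y_te, readout.predict(X_te), multioutput="uniform_average"))
```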

Competition

Foundation Models for Embodied Agents

Manling Li · Tianwei Bao · Qineng Wang · Yu Zhou · Shiyu Zhao · Kangrui Wang · Jiayuan Mao · Ruohan Zhang · Weiyu Liu · Tony Lee · Erran Li · Yejin Choi · Percy Liang · Fei-Fei Li · Jiajun Wu
Dec 7, 11:00 AM - 1:45 PM Mezzanine Room 15AB

This challenge invites participants to enhance Large Language Models (LLMs) for embodied reasoning through our standardized Embodied Agent Interface evaluation protocol. The framework systematically evaluates critical embodied reasoning capabilities: goal interpretation (understanding objectives and grounding them to environment states), subgoal decomposition (breaking down complex goals), action sequencing (planning action sequences), and transition/world modeling (modeling world state changes through actions). Despite growing interest in using LLMs for robotics and agent planning, current evaluations lack standardization and fail to pinpoint fine-grained reasoning failures. Our challenge builds a unified evaluation approach to task formulation, input/output structures, and evaluation metrics by utilizing the well-established BEHAVIOR and VirtualHome simulators, enhanced with detailed annotations including Linear Temporal Logic (LTL) goal specifications and comprehensive error analysis. Unlike typical evaluations that report only an overall success rate, leaving us in the dark about which specific abilities LLMs struggle with, our framework uses fine-grained metrics that examine both whether proposed actions could actually work in practice and whether they truly accomplish the intended goals. Our evaluation system breaks down each reasoning component into separate modules, giving a clear picture of exactly where and how these models succeed or fail. Baseline results from state-of-the-art LLMs reveal significant performance gaps and motivate further innovation. The competition aims to advance the understanding of how LLMs reason in embodied environments, promote the development of robust and interpretable LLM agents, and foster collaboration between the language modeling and robotics communities. More info at https://neurips25-eai.github.io/.
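As a toy illustration of the kind of fine-grained check such an interface can perform, the sketch below executes a proposed action sequence against symbolic preconditions and effects and tests whether the goal is reached, distinguishing executability failures from unmet goals. The domain encoding is a stand-in, not the BEHAVIOR/VirtualHome representation.

```python
# Minimal sketch: check a proposed plan against a toy symbolic domain,
# separating "precondition failure" from "goal not satisfied".
ACTIONS = {
    "open_fridge":  {"pre": set(),           "add": {"fridge_open"},  "del": set()},
    "grab_milk":    {"pre": {"fridge_open"}, "add": {"holding_milk"}, "del": set()},
    "close_fridge": {"pre": {"fridge_open"}, "add": set(),            "del": {"fridge_open"}},
}

def execute(plan, state, goal):
    state = set(state)
    for a in plan:
        spec = ACTIONS[a]
        if not spec["pre"] <= state:                 # executability check
            return False, f"precondition failure at '{a}'"
        state = (state | spec["add"]) - spec["del"]  # apply effects
    return goal <= state, "ok" if goal <= state else "goal not satisfied"

print(execute(["open_fridge", "grab_milk", "close_fridge"],
              state=set(), goal={"holding_milk"}))
print(execute(["grab_milk"], state=set(), goal={"holding_milk"}))
```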

Competition

FAIR Universe – handling uncertainties and distribution shifts for precision cosmology

Biwei Dai · Po-Wen Chang · Wahid Bhimji · Paolo Calafiura · Ragansu Chakkappai · Yuan-Tang Chou · Sascha Diefenbacher · Steven Farrell · Isabelle Guyon · Chris Harris · Elham E Khoda · Benjamin Nachman · David Rousseau · Uros Seljak · Ihsan Ullah · Yulei Zhang · Jordan Dudley
Dec 7, 11:00 AM - 1:45 PM Exhibit Hall G,H

We propose a challenge organised in conjunction with the FAIR Universe project, a collaborative effort funded by the US Department of Energy and involving the Lawrence Berkeley National Laboratory, Université Paris-Saclay, University of Washington, and ChaLearn. This initiative aims to forge an open AI ecosystem for scientific discovery. The challenge will focus on measuring the fundamental properties of the universe from weak gravitational lensing datasets with imperfect simulators and potential distribution shifts. Additionally, the challenge will leverage a large-compute-scale AI platform for sharing datasets, training models, and hosting machine learning competitions. Our challenge will bring together the physics and machine learning communities to advance our understanding and methodologies in handling systematic (otherwise known as epistemic) uncertainties and distribution shifts within AI techniques.

Competition

SLC-PFM: Self-supervised Learning for Cancer Pathology Foundation Models

Neeraj Kumar · Ruchika Verma · Gabriele Campanella · Jia Wu · Jianjun Zhang · Luisa Soto · Rukhmini Bandyopadhyay · Muhammad Waqas · Hamid Tizhoosh · Joel Saltz · Jakub Roman Kaczmarzyk · Melissa Troester · Katherine Hoadley · Chad Vanderbilt · Kostas Triaridis · Saghir Alfasly
Dec 7, 11:00 AM - 1:45 PM Upper Level Ballroom 6B

The emergence of foundation models has revolutionized artificial intelligence (AI) across various applications (Bommasani et al., 2021), with recent advances in computational pathology (for example, UNI (Chen et al., 2024), Virchow (Vorontsov et al., 2024), and GigaPath (Xu et al., 2024)) demonstrating potential for improving diagnostic capabilities and patient outcomes. The proposed Competition on Self-supervised Learning for Cancer Pathology Foundation Models (SLC-PFM) provides an unprecedented platform for advancing the development of the next generation of pathology foundation models. Central to this competition is MSK-SLCPFM, the largest pathology dataset assembled for a competition to date, comprising over 300 million images spanning 39 cancer types, which will be provided to participants for pre-training their models with self-supervised learning techniques. The competition follows a two-phase structure: foundation model development followed by evaluation across 23 clinically relevant downstream tasks including biomarker prediction, cancer subtyping, and survival prediction. The competition is designed to be inclusive for a diverse audience of machine learning and AI practitioners, computer scientists, engineers, bioinformaticians, and specialists from related disciplines, regardless of their background in pathology or medical image processing. By eliminating the barrier of domain-specific data curation, the competition enables participants to focus on technical innovation. The key highlights of the proposed competition are: (1) Comprehensive Pre-training Data: access to the largest pathology dataset, with 300M images, enabling foundation model training at scale; (2) Robust Validation Framework: multi-institutional evaluation across diverse, clinically relevant pathology tasks; and (3) Focus on Technical Innovation: participants can focus on novel architectures and learning approaches without the burden of data curation.
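As one example of the self-supervised objectives participants might bring to the pre-training phase, here is a minimal SimCLR-style NT-Xent contrastive loss over two augmented views of a batch of patches. The encoder, augmentations, and embedding size are placeholders; the competition does not prescribe a particular objective.

```python
# Minimal sketch of a SimCLR-style NT-Xent loss for patch pre-training.
import torch
import torch.nn.functional as F

def nt_xent(z1, z2, temperature=0.1):
    """Contrastive loss between two augmented views of the same patches."""
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    z = torch.cat([z1, z2], dim=0)                # (2B, d)
    sim = z @ z.t() / temperature                 # cosine similarities
    sim.fill_diagonal_(float("-inf"))             # exclude self-pairs
    B = z1.size(0)
    targets = torch.cat([torch.arange(B, 2 * B), torch.arange(0, B)])
    return F.cross_entropy(sim, targets)

# Toy usage: embeddings of two views of 32 patches from some encoder.
z1, z2 = torch.randn(32, 128), torch.randn(32, 128)
print("NT-Xent loss:", nt_xent(z1, z2).item())
```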

Competition

Weather4cast 2025 – Multi-task Challenges for Weather & Pollution Pattern Prediction on the Road to Hi-Res Foundation Models

Aleksandra Gruca · Pilar Rípodas · Xavier Calbet · Llorenç Lliso Valverde · Bertrand Saux · David Kreil · Sepp Hochreiter
Dec 7, 2:00 PM - 4:45 PM Exhibit Hall G,H

The competition will advance modern algorithms in AI and machine learning through a highly topical interdisciplinary challenge: the prediction of hi-res rain radar movies from multi-band satellite sensors requires data fusion of complementary signal sources, multi-channel video frame prediction, and super-resolution techniques. To reward models that extract relevant mechanistic patterns reflecting the underlying complex weather systems, our evaluation incorporates spatio-temporal shifts: specifically, algorithms need to forecast several hours of ground-based hi-res precipitation radar from lo-res satellite spectral images in a unique cross-sensor prediction challenge. Models are evaluated within and across regions on Earth with diverse climates and different distributions of heavy precipitation events. Conversely, robustness over time is assessed by testing predictions on data one year after the training period. Now in its fourth year, Weather4cast moves to improve forecasts worldwide on an expansive dataset with over an order of magnitude more hi-res rain radar data, allowing a move towards foundation models through multi-modality, multi-scale, multi-task challenges. Accurate rain predictions are becoming ever more critical for everyone, with climate change increasing the frequency of extreme precipitation events. Notably, the new models and insights will have a particular impact for the many regions on Earth where costly weather radar data are not available. As a complementary, application-specific forecasting endpoint, in 2025 we add a pollution forecasting challenge for the first time. Join us at https://www.weather4cast.net!
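For orientation, the sketch below computes the Critical Success Index, a threshold-based verification score commonly used for precipitation nowcasting. The exact metric, thresholds, and tensor shapes used by Weather4cast 2025 are defined in the official starter kit, so treat this as illustrative only.

```python
# Minimal sketch: Critical Success Index (CSI) at a rain-rate threshold.
# CSI = hits / (hits + misses + false alarms).
import numpy as np

def csi(pred_mmh, obs_mmh, threshold=0.2):
    pred_event = pred_mmh >= threshold
    obs_event = obs_mmh >= threshold
    hits = np.logical_and(pred_event, obs_event).sum()
    misses = np.logical_and(~pred_event, obs_event).sum()
    false_alarms = np.logical_and(pred_event, ~obs_event).sum()
    denom = hits + misses + false_alarms
    return hits / denom if denom > 0 else np.nan

# Toy usage with random "rain rates" and illustrative spatial shapes.
pred = np.random.rand(32, 252, 252)   # predicted mm/h
obs = np.random.rand(32, 252, 252)    # observed mm/h
print("CSI @ 0.2 mm/h:", csi(pred, obs))
```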

Competition

The 2025 PNPL Competition: Speech Detection and Phoneme Classification in the LibriBrain Dataset

Oiwi Parker Jones · Gilad Landau · Miran Özdogan · Gereon Elvers · Francesco Mantegna · Pratik Somaiya · Dulhan Jayalath · Brendan Shillingford · Greg Farquhar · Minqi Jiang · Karim Jerbi · Hamza Abdelhedi · Caglar Gulcehre · Mark Woolrich
Dec 7, 2:00 PM - 4:45 PM Upper Level Ballroom 6CF

The advance of speech decoding from non-invasive brain data holds the potential for profound societal impact. Among its most promising applications is the restoration of communication to paralysed individuals affected by speech deficits such as dysarthria, without the need for high-risk surgical interventions. The ultimate aim of the 2025 PNPL competition is to produce the conditions for an “ImageNet moment” or breakthrough in non-invasive neural decoding, by harnessing the collective power of the machine learning community. To facilitate this vision we present the largest within-subject MEG dataset recorded to date (LibriBrain) together with a user-friendly Python library (pnpl) for easy data access and integration with deep learning frameworks. For the competition we define two foundational tasks (Speech Detection and Phoneme Classification from brain data), complete with standardised data splits and evaluation metrics, illustrative benchmark models, online tutorial code, a community discussion board, and a public leaderboard for submissions. To promote accessibility and participation, the competition features a Standard track that emphasises algorithmic innovation, as well as an Extended track that is expected to reward larger-scale computing, accelerating progress toward a non-invasive brain-computer interface for speech.
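As a minimal illustration of the Speech Detection task, the sketch below trains one step of a small 1D-CNN binary classifier on MEG-shaped windows. Data loading via the official pnpl library is omitted because its API is not reproduced in this listing, and the channel/time dimensions below are assumptions, not the dataset's actual format.

```python
# Minimal sketch: binary speech detection over MEG windows of shape
# (channels, time), using synthetic stand-in tensors.
import torch
import torch.nn as nn

N_CHANNELS, N_TIMES = 306, 250   # illustrative MEG sensor/time dimensions

model = nn.Sequential(
    nn.Conv1d(N_CHANNELS, 64, kernel_size=7, padding=3), nn.ReLU(),
    nn.AdaptiveAvgPool1d(1), nn.Flatten(),
    nn.Linear(64, 1),            # logit for "speech present"
)

x = torch.randn(8, N_CHANNELS, N_TIMES)       # batch of MEG windows
y = torch.randint(0, 2, (8, 1)).float()       # 1 = speech, 0 = silence
loss = nn.BCEWithLogitsLoss()(model(x), y)
loss.backward()
print("speech-detection loss:", loss.item())
```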

Competition

NeurIPS 2025 Competition: MMU-RAGent: Massive Multi-Modal User-Centric Retrieval Augmented Generation Benchmark

Luo Chan · Tevin Wang · Shuting Wang · Zhihan Zhang · Alfredo Gomez · Prahaladh Chandrahasan · Andy Tang · Lan Yan · Zimeng Qiu · Morteza Ziyadi · Sherry Wu · Mona Diab · Akari Asai · Chenyan Xiong
Dec 7, 2:00 PM - 4:45 PM Upper Level Ballroom 6B

We introduce the first competition to evaluate RAG systems on real-user queries and feedback, leverage web-scale corpora, and support both text and video generation. Participants develop systems that respond to real-user queries, which are curated from MS MARCO Web Search and Chatbot Arena Conversations or collected live via our RAG-Arena platform. To support retrieval at scale, we provide API access to the English subsets of ClueWeb22-B and ClueWeb22-A (87M and 800M documents), along with AWS-hosted infrastructure to facilitate system deployment on our RAG-Arena platform. Systems are evaluated using a combination of human Likert-scale ratings, live preference judgments via RAG-Arena, LLM-as-a-Judge, and automatic metrics. To support flexibility in system design, we accept submissions that leverage proprietary search APIs or models alongside open-source approaches. Participants are encouraged to clearly document system components, and separate leaderboard categories ensure fair and transparent comparison across open- and closed-source systems. By focusing on user needs, large-scale retrieval, and multimodal generation, this competition aims to push academic RAG research toward more scalable and user-aligned settings.
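Structurally, a submission is a retrieve-then-generate pipeline. The sketch below shows that shape with a placeholder retrieval endpoint, an assumed response schema, and an injected `generate` callable, since the actual ClueWeb22 API and serving interface are specified in the official starter kit.

```python
# Minimal retrieve-then-generate sketch. The endpoint URL and response schema
# are PLACEHOLDERS, not the competition's real retrieval API.
import requests

RETRIEVAL_ENDPOINT = "https://example.org/clueweb22/search"  # placeholder URL

def retrieve(query: str, k: int = 5) -> list[str]:
    resp = requests.get(RETRIEVAL_ENDPOINT, params={"q": query, "k": k})
    resp.raise_for_status()
    return [doc["text"] for doc in resp.json()["results"]]  # assumed schema

def answer(query: str, generate) -> str:
    """`generate` is any text-generation callable (open-source or API-based)."""
    docs = retrieve(query)
    context = "\n\n".join(f"[{i + 1}] {d}" for i, d in enumerate(docs))
    prompt = (f"Answer the question using the sources below and cite them.\n\n"
              f"{context}\n\nQuestion: {query}\nAnswer:")
    return generate(prompt)
```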

Competition

The 2025 Google Code Golf Championship

Michael D. Moffitt · Divy Thakkar · Ryan Burnell · Orhan Firat · Walter Reade · Sohier Dane · Addison Howard
Dec 7, 2:00 PM - 4:45 PM Mezzanine Room 15AB

The Abstraction and Reasoning Corpus remains one of the most interesting and challenging benchmarks available for tracking progress toward Artificial General Intelligence. In contrast to other evaluation datasets designed to capture the extent of an agent's skill or knowledge, the ARC-AGI suite instead targets skill acquisition, a trait that has (so far) evaded even the most sophisticated machine learning systems. A key limitation to date has been the relatively limited number of reference programs designed to transform the image grids corresponding to example pairs in the original benchmark suite. To enrich the space of available solutions, the 2025 Google Code Golf Championship will present contestants with those same tasks, encouraging them to produce, for each task, a Python program capable of exhibiting the desired behavior. Not only must these programs be functionally correct, but (as an added twist) they should also be as minimal as possible. A set of concise programs emphasizing robustness and simplicity could serve as canonical reference solutions for this seminal dataset and, once open-sourced to the broader research community, might contribute toward the development of more versatile AI systems.
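To see what "minimal" means in practice, here is a toy example (not an actual ARC-AGI task) of a grid transform written readably and then golfed, together with the byte count that submissions aim to minimise; the specific scoring rules are defined by the competition.

```python
# Toy illustration of code golf on a grid transform (mirror left-to-right).
def transform_readable(grid):
    """Mirror the grid left-to-right."""
    return [row[::-1] for row in grid]

golfed = "lambda g:[r[::-1]for r in g]"   # same transform, golfed as a string
transform_golfed = eval(golfed)

grid = [[1, 0, 2],
        [0, 3, 0]]
assert transform_golfed(grid) == transform_readable(grid)
print("golfed solution length in bytes:", len(golfed.encode()))
```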
