NeurIPS 2025 Mexico City Datasets & Benchmarks

Poster

Care-PD: A Multi-Site Anonymized Clinical Dataset for Parkinson’s Disease Gait Assessment

Vida Adeli · Ivan Klabučar · Javad Rajabi · Benjamin Filtjens · Soroush Mehraban · Diwei Wang · Trung Hieu Hoang · Minh Do · Hyewon Seo · Candice Muller · Daniel Coelho · Claudia de Oliveira · Pieter Ginis · Moran Gilat · Alice Nieuwboer · Joke Spildooren · J. Mckay · Hyeokhyen Kwon · Gari Clifford · Christine Esper · Stewart Factor · Imari Genias · Amirhossein Dadashzadeh · Leia Shum · Alan Whone · Majid Mirmehdi · Andrea Iaboni · Babak Taati

Dec 3, 11:00 AM - 2:00 PM Don Alberto 4

Objective gait assessment in Parkinson’s Disease (PD) is limited by the absence of large, diverse, and clinically annotated motion datasets. We introduce Care-PD, the largest publicly available archive of 3D mesh gait data for PD, and the first multi-site collection spanning 9 cohorts from 8 clinical centers. All recordings (RGB video or motion capture) are converted into anonymized SMPL meshes via a harmonized preprocessing pipeline. Care-PD supports two key benchmarks: supervised clinical score prediction (estimating Unified Parkinson’s Disease Rating Scale, UPDRS, gait scores) and unsupervised motion pretext tasks (2D-to-3D keypoint lifting and full-body 3D reconstruction). Clinical prediction is evaluated under four generalization protocols: within-dataset, cross-dataset, leave-one-dataset-out, and multi-dataset in-domain adaptation. To assess clinical relevance, we compare state-of-the-art motion encoders with a traditional gait-feature baseline, finding that encoders consistently outperform handcrafted features. Pretraining on Care-PD reduces MPJPE (from 60.8 mm to 7.5 mm) and boosts PD severity macro-F1 by 17%, underscoring the value of clinically curated, diverse training data. Care-PD and all benchmark code are released for non-commercial research (Code, Data).
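The two metrics quoted in the abstract above, MPJPE and macro-F1, have standard definitions that can be sketched in a few lines. The following is an illustrative sketch only, not the Care-PD benchmark code: the function names `mpjpe` and `macro_f1` are hypothetical, joints are assumed to be (x, y, z) tuples in millimetres, and no root alignment or Procrustes alignment is applied, although the benchmark's own protocol may use one.

```python
import math
from typing import Hashable, Sequence


def mpjpe(pred: Sequence[Sequence[float]], target: Sequence[Sequence[float]]) -> float:
    """Mean per-joint position error: the average Euclidean distance
    between each predicted joint and its ground-truth counterpart.
    Assumes both inputs list the same joints in the same order."""
    if len(pred) != len(target) or len(pred) == 0:
        raise ValueError("pred and target must be non-empty and equally long")
    return sum(math.dist(p, t) for p, t in zip(pred, target)) / len(pred)


def macro_f1(y_true: Sequence[Hashable], y_pred: Sequence[Hashable]) -> float:
    """Unweighted mean of per-class F1 scores, so that rare severity
    classes count as much as common ones."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("y_true and y_pred must be non-empty and equally long")
    labels = set(y_true) | set(y_pred)
    scores = []
    for label in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


if __name__ == "__main__":
    # Toy data, not taken from the dataset.
    print(mpjpe([(0, 0, 0), (0, 0, 0)], [(3, 4, 0), (0, 0, 0)]))
    print(macro_f1([0, 0, 1, 1], [0, 1, 1, 1]))
```

Macro averaging is the natural choice when severity classes are imbalanced, since a model that ignores the rarest class is penalised as heavily as one that ignores the most common class.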

Poster

ChemX: A Collection of Chemistry Datasets for Benchmarking Automated Information Extraction

Anastasia Vepreva · Julia Razlivina · Mariia Eremeyeva · Nina Gubina · Anastasia Orlova · Aleksei Dmitrenko · Kapranova Xenia · Susan Jyakhwo · Nikita Vasilev · Arsen Sarkisyan · Ivan Chernyshov · Vladimir Vinogradov · Andrei Dmitrenko

Dec 3, 11:00 AM - 2:00 PM Don Alberto 4

Despite recent advances in machine learning, many scientific discoveries in chemistry still rely on manually curated datasets extracted from the scientific literature. Automation of information extraction in specialized chemistry domains has the potential to scale up machine learning applications and improve the quality of predictions, enabling data-driven scientific discoveries at a faster pace. In this paper, we present ChemX, a collection of 10 benchmarking datasets across several domains of chemistry providing a reliable basis for evaluating and fine-tuning automated information extraction methods. The datasets, encompassing various properties of small molecules and nanomaterials, have been manually extracted from peer-reviewed publications and systematically validated by domain experts through a cross-verification procedure that allows errors to be identified and corrected at the source. To demonstrate the utility of the resulting datasets, we evaluate the extraction performance of state-of-the-art large language models (LLMs). Moreover, we design our own agentic approach to take full control of document preprocessing before LLM-based information extraction. We also apply recently emerged multi-agent systems specialized in chemistry to compare their performance against these strong baselines. Our empirical results highlight persistent challenges in chemical information extraction, particularly in handling domain-specific terminology, complex tabular and schematic formats, and context-dependent ambiguities. We discuss the importance of expert data validation, the nuances of the evaluation pipeline, and the prospects of automated information extraction in chemistry. Finally, we provide open documentation including standardized schemas and provenance metadata, as well as the code and other materials to ensure reproducibility. ChemX is poised to advance automatic information extraction in chemistry by challenging the quality and generalization capabilities of existing methods, as well as providing insights into evaluation strategies.

Poster

PolypSense3D: A Multi-Source Benchmark Dataset for Depth-Aware Polyp Size Measurement in Endoscopy

Ruyu Liu · Lin Wang · Zhou Mingming · Jianhua Zhang · Zhang Haoyu · Xiufeng Liu · Xu Cheng · Sixian Chan · Shen Yanbin · Dai Sheng · Yuping Yan · Yaochu Jin · Lingjuan Lyu

Dec 3, 4:30 PM - 7:30 PM Don Alberto 4

Accurate polyp sizing during endoscopy is crucial for cancer risk assessment but is hindered by subjective methods and inadequate datasets lacking integrated 2D appearance, 3D structure, and real-world size information. We introduce PolypSense3D, the first multi-source benchmark dataset specifically targeting depth-aware polyp size measurement. It uniquely integrates over 43,000 frames from virtual simulations, physical phantoms, and clinical sequences, providing synchronized RGB, dense/sparse depth, segmentation masks, camera parameters, and millimeter-scale size labels derived via a novel forceps-assisted in-vivo annotation technique. To establish its value, we benchmark state-of-the-art segmentation and depth estimation models. Results quantify significant domain gaps between simulated/phantom and clinical data and reveal substantial error propagation from perception stages to final size estimation, with the best fully automated pipelines achieving an average Mean Absolute Error (MAE) of 0.95 mm on the clinical data subset. Publicly released under CC BY-SA 4.0 with code and evaluation protocols, PolypSense3D offers a standardized platform to accelerate research in robust, clinically relevant quantitative endoscopic vision. The benchmark dataset and code are available at: https://github.com/HNUicda/PolypSense3D and https://doi.org/10.7910/DVN/K13H89.

Poster

PolyMath: Evaluating Mathematical Reasoning in Multilingual Contexts

Yiming Wang · Pei Zhang · Jialong Tang · Hao-Ran Wei · Baosong Yang · Rui Wang · Chenshu Sun · Feitong Sun · Jiran Zhang · Junxuan Wu · Qiqian Cang · Yichang Zhang · Fei Huang · Junyang Lin · Fei Huang · Jingren Zhou

Dec 3, 4:30 PM - 7:30 PM Don Alberto 4

In this paper, we introduce PolyMath, a multilingual mathematical reasoning benchmark covering 18 languages and 4 easy-to-hard difficulty levels. Our benchmark ensures difficulty comprehensiveness, language diversity, and high-quality translation, making it a highly discriminative multilingual mathematical benchmark in the era of reasoning LLMs. We conduct a comprehensive evaluation of advanced LLMs and find that even Qwen-3-235B-A22B-Thinking and Gemini-2.5-pro achieve only 54.6 and 52.2 benchmark scores, with about 40% accuracy under the highest level. From a language perspective, our benchmark reveals several key challenges of LLMs in multilingual reasoning: (1) Reasoning performance varies widely across languages for current LLMs; (2) Input-output language consistency is low in reasoning LLMs and may be correlated with performance; (3) The thinking length differs significantly by language for current LLMs. Additionally, we demonstrate that controlling the output language in the instructions has the potential to affect reasoning performance, especially for some low-resource languages, suggesting a promising direction for improving multilingual capabilities in LLMs.

Poster

OpenGU: A Comprehensive Benchmark for Graph Unlearning

Bowen Fan · Yuming Ai · Xunkai Li · Zhilin Guo · Lei Zhu · Guang Zeng · Rong-Hua Li · Guoren Wang

Dec 3, 4:30 PM - 7:30 PM Don Alberto 4

Graph Machine Learning is essential for understanding and analyzing relational data. However, privacy-sensitive applications demand the ability to efficiently remove sensitive information from trained graph neural networks (GNNs), avoiding the unnecessary time and space overhead caused by retraining models from scratch. To address this issue, Graph Unlearning (GU) has emerged as a critical solution to support dynamic graph updates while ensuring privacy compliance. Unlike machine unlearning in computer vision or other fields, GU faces unique difficulties due to the non-Euclidean nature of graph data and the recursive message-passing mechanism of GNNs. Additionally, the diversity of downstream tasks and the complexity of unlearning requests further amplify these challenges. Despite the proliferation of diverse GU strategies, the absence of a benchmark providing fair comparisons for GU, and the limited flexibility in combining downstream tasks and unlearning requests, have yielded inconsistencies in evaluations, hindering the development of this domain. To fill this gap, we present OpenGU, the first GU benchmark, where 16 SOTA GU algorithms and 37 multi-domain datasets are integrated, enabling various downstream tasks with 13 GNN backbones when responding to flexible unlearning requests. Through extensive experimentation, we have drawn 10 crucial conclusions about existing GU methods, while also gaining valuable insights into their limitations, shedding light on potential avenues for future research. Our code is available at https://github.com/bwfan-bit/OpenGU.

Poster

MUVR: A Multi-Modal Untrimmed Video Retrieval Benchmark with Multi-Level Visual Correspondence

Yue Feng · Jinwei Hu · Qijia Lu · Jiawei Niu · Li Tan · Shuo Yuan · Ziyi Yan · Yizhen Jia · Qingzhi He · Shiping Ge · Ethan Chen · Wentong Li · Limin Wang · Jie Qin

Dec 4, 11:00 AM - 2:00 PM Don Alberto 4

We propose the Multi-modal Untrimmed Video Retrieval task, along with a new benchmark (MUVR) to advance video retrieval for long-video platforms. MUVR aims to retrieve untrimmed videos containing relevant segments using multi-modal queries. It has the following features: 1) Practical retrieval paradigm: MUVR supports video-centric multi-modal queries, expressing fine-grained retrieval needs through long text descriptions, video tag prompts, and mask prompts. It adopts a one-to-many retrieval paradigm and focuses on untrimmed videos, tailored for long-video platform applications. 2) Multi-level visual correspondence: To cover common video categories (e.g., news, travel, dance) and precisely define retrieval matching criteria, we construct multi-level visual correspondence based on core video content (e.g., news events, travel locations, dance moves) which users are interested in and want to retrieve. It covers six levels: copy, event, scene, instance, action, and others. 3) Comprehensive evaluation criteria: We develop 3 versions of MUVR (i.e., Base, Filter, QA). MUVR-Base/Filter evaluates retrieval models, while MUVR-QA assesses MLLMs in a question-answering format. We also propose a Reranking Score to evaluate the reranking ability of MLLMs. MUVR consists of 53K untrimmed videos from the video platform Bilibili, with 1,050 multi-modal queries and 84K matches. Extensive evaluations of 3 state-of-the-art video retrieval models, 6 image-based VLMs, and 10 MLLMs are conducted. MUVR reveals the limitations of retrieval methods in processing untrimmed videos and multi-modal queries, as well as MLLMs in multi-video understanding and reranking. Our code and benchmark are available at https://github.com/debby-0527/MUVR.

Poster

SSRB: Direct Natural Language Querying to Massive Heterogeneous Semi-Structured Data

Xin Zhang · Mingxin Li · Yanzhao Zhang · Dingkun Long · Yongqi Li · Yinghui Li · Pengjun Xie · Meishan Zhang · Wenjie Li · Min Zhang · Philip S Yu

Dec 4, 11:00 AM - 2:00 PM Don Alberto 4

Searching over semi-structured data with natural language (NL) queries has attracted sustained attention, enabling broader audiences to access information easily. As more applications, such as LLM agents and RAG systems, emerge to search and interact with semi-structured data, two major challenges have become evident: (1) the increasing diversity of domains and schema variations, making domain-customized solutions prohibitively costly; (2) the growing complexity of NL queries, which combine both exact field matching conditions and fuzzy semantic requirements, often involving multiple fields and implicit reasoning. These challenges make formal language querying or keyword-based search insufficient. In this work, we explore neural retrievers as a unified non-formal querying solution that directly indexes semi-structured collections and understands NL queries. We employ LLM-based automatic evaluation and build a large-scale semi-structured retrieval benchmark (SSRB) using LLM generation and filtering, containing 14M semi-structured objects from 99 different schemas across 6 domains, along with 8,485 test queries that combine both exact and fuzzy matching conditions. Our systematic evaluation of popular retrievers shows that current state-of-the-art models can achieve acceptable performance, yet they still lack precise understanding of matching constraints. In-domain training of dense retrievers, however, significantly improves performance. We believe that SSRB can serve as a valuable resource for future research in this area, and we hope to inspire further exploration of semi-structured retrieval with complex queries.

Poster

PhysDrive: A Multimodal Remote Physiological Measurement Dataset for In-vehicle Driver Monitoring

Wang · Xiao Yang · Qingyong Hu · Jack Tang · Can Liu · Dengbo He · Yuntao Wang · Yingcong Chen · Kaishun Wu

Dec 4, 4:30 PM - 7:30 PM Don Alberto 4

Robust and unobtrusive in-vehicle physiological monitoring is crucial for ensuring driving safety and user experience. While remote physiological measurement (RPM) offers a promising non-invasive solution, its translation to real-world driving scenarios is critically constrained by the scarcity of comprehensive datasets. Existing resources are often limited in scale, modality diversity, the breadth of biometric annotations, and the range of captured conditions, thereby omitting inherent real-world challenges in driving. Here, we present PhysDrive, the first large-scale multimodal dataset for contactless in-vehicle physiological sensing with dedicated consideration of various modality settings and driving factors. PhysDrive collects data from 48 drivers, including synchronized RGB, near-infrared camera, and raw mmWave radar data, accompanied by six synchronized ground truths (ECG, BVP, Respiration, HR, RR, and SpO2). It covers a wide spectrum of naturalistic driving conditions, including driver motions, dynamic natural light, vehicle types, and road conditions. We extensively evaluate both signal‑processing and deep‑learning methods on PhysDrive, establishing a comprehensive benchmark across all modalities, and release full open‑source code with compatibility for mainstream public toolboxes. We envision PhysDrive will serve as a foundational resource and accelerate research on multimodal driver monitoring and smart‑cockpit systems.

Poster

OCRBench v2: An Improved Benchmark for Evaluating Large Multimodal Models on Visual Text Localization and Reasoning

Ling Fu · Zhebin Kuang · Jiajun Song · Mingxin Huang · Biao Yang · Yuzhe Li · Linghao Zhu · Qidi Luo · Xinyu Wang · Hao Lu · Zhang Li · Guozhi Tang · Bin Shan · Chunhui Lin · Qi Liu · Binghong Wu · Hao Feng · Hao Liu · Can Huang · Jingqun Tang · Wei Chen · Lianwen Jin · Yuliang Liu · Xiang Bai

Dec 5, 4:30 PM - 7:30 PM Don Alberto 4

Scoring the Optical Character Recognition (OCR) capabilities of Large Multimodal Models (LMMs) has attracted growing interest. Existing benchmarks have highlighted the impressive performance of LMMs in text recognition; however, their abilities in certain challenging tasks, such as text localization, handwritten content extraction, and logical reasoning, remain underexplored. To bridge this gap, we introduce OCRBench v2, a large-scale bilingual text-centric benchmark with currently the most comprehensive set of tasks (4× more tasks than the previous multi-scene benchmark OCRBench), the widest coverage of scenarios (31 diverse scenarios), and thorough evaluation metrics, with 10,000 human-verified question-answering pairs and a high proportion of difficult samples. Moreover, we construct a private test set with 1,500 manually annotated images. The consistent evaluation trends observed across both public and private test sets validate the reliability of OCRBench v2. After carefully benchmarking state-of-the-art LMMs, we find that most LMMs score below 50 (out of 100) and suffer from five types of limitations: recognition of less frequently encountered text, fine-grained perception, layout perception, complex element parsing, and logical reasoning. The benchmark and evaluation scripts are available at https://github.com/Yuliang-Liu/MultimodalOCR.

Poster

Linguini: A benchmark for language-agnostic linguistic reasoning

Eduardo Sánchez · Belen Alastruey · Christophe Ropers · Arina Turkatenko · Pontus Lars Erik Saito Stenetorp · Mikel Artetxe · Marta Ruiz Costa-jussà

Dec 5, 4:30 PM - 7:30 PM Don Alberto 4

We propose a new benchmark to measure a language model's linguistic reasoning skills without relying on pre-existing language-specific knowledge. The test covers 894 questions grouped in 160 problems across 75 (mostly) extremely low-resource languages, extracted from the International Linguistic Olympiad corpus. To attain high accuracy on this benchmark, models don't need previous knowledge of the tested language, as all the information needed to solve the linguistic puzzle is presented in the context. We find that, while all analyzed models rank below 25% accuracy, there is a significant gap between open and closed models, with the best-performing proprietary model scoring 24.05% and the best-performing open model 8.84%.
