

NeurIPS 2026 Evaluations & Datasets FAQ

 

Updated 7 April 2026

This FAQ will be continually updated. Please bookmark this page and review it before submitting any questions.

Note: Authors are also advised to consult the NeurIPS Main Track handbook, as general policies apply to ED submissions as well.

General FAQs

 

Will accepted papers in the Evaluations & Datasets Track appear in exactly the same proceedings as the main track papers?

Yes, accepted papers will be published in the NeurIPS proceedings and presented at the conference alongside main track papers.

What is the LaTeX template for the ED track?

It’s the same as the main track template. See “Paper Formatting Instructions” in the NeurIPS Main Track handbook.

Are dataset/code submissions due on May 6 (the full paper deadline)?

Yes. We follow the Main Track timeline, so the full paper — including all required materials — must be submitted by May 6, 2026 (AOE). For the ED track, datasets and code are not considered supplementary materials. If your submission includes data and/or code, they must be submitted in their final form by May 6, 2026 (AOE), together with the full paper.

What is the LaTeX configuration for a single-blind submission? (Updated)

Please use the default double-blind option: "\usepackage[eandd]{neurips_2026}". 

We ask authors of dataset submissions with restrictions that prevent anonymization to indicate that their submission requires single-blind review by choosing "single-blind" in the submission form. In this case, author names should still not appear in the manuscript, but you do not need to rigorously anonymize all references (such as links to project pages or resources).

My contribution includes a dataset, when should I use single-blind review?

We allow single-blind review when anonymization is not possible for scientific or ethical reasons. For instance, the collection of the dataset reveals the institution or the data is hosted on an institutional platform with gated access for ethical reasons. Outside of these cases, submissions should be double blind, using anonymous accounts on data hosting platforms and anonymous code where relevant.

How do I anonymize data and code?

Unless prevented by scientific or ethical reasons on datasets, we expect submissions to be anonymous. On hosting platforms, create an anonymous account (e.g. using the name of your project) and ensure there is no information leaked (including in the Croissant file). For code, a similar strategy can be used or a GitHub repository can be interfaced via anonymous git (e.g., see here).

I have read the different CfPs and my work could fall under different tracks or topics. How do I choose which one is most appropriate?

When reading the CfPs, consider the main contribution of your work to make a decision. For instance, suppose your paper’s main contribution is to highlight that a certain architecture is more performant, with secondary contributions in the form of new evaluations. In this case, the novel architecture is the primary contribution, so the main track is the appropriate choice.

If a case seems truly ambiguous, authors should select their track based on how they would like their paper to be evaluated (see CfPs). The framing of the paper should match that of the track. For example, consider: 

A submission evaluates LLMs for legal tasks such as statutory interpretation, case outcome prediction, and contract analysis using new stress tests and expert lawyer assessments. If the primary contribution is the evaluation framework and empirical insights into LLM reliability in law, the authors may select ED. If these findings instead motivate and validate a novel law-aware LLM adaptation that advances legal reasoning performance, the authors may select main track / use-case inspired. The boundary shifts depending on whether evaluation is the core intellectual contribution or whether it serves as evidence for a domain-driven modeling advance.

Please note that there will be no possibility to switch tracks or types and that papers cannot be submitted to multiple tracks or types simultaneously. Irrelevant or duplicate papers risk desk rejection from all tracks. It is the authors’ responsibility to carefully read the relevant CfPs and identify the most appropriate track. 

We also provide the following examples* – some straightforward, some more ambiguous – to provide authors with more guidance on the different tracks: 

  • ImageNet: A large-scale hierarchical image database (Deng et al., 2009): A dataset for computer vision applications with a demonstration of its value in three tasks. The primary contribution relates to the dataset → ED.

  • Inherent Trade-Offs in the Fair Determination of Risk Scores (Kleinberg et al., 2016): This paper investigates the impossibility of satisfying multiple fairness criteria simultaneously. While it relies heavily on a theoretical framing, its main contribution is a surprising, rigorously proven negative result that is not established via empirical evaluations → main track / negative result.

  • Learning skillful medium-range global weather forecasting (Lam et al., 2023): Using graph neural networks for 10-day weather predictions. This is a novel application of GNNs to a specific domain that includes domain-specific metrics (e.g., prediction of extreme events). The work is neither methodology- nor evaluation-focused (though it includes both aspects) and clearly advances a real-world use case → main track / use-case inspired.

  • The Illusion of Readiness in Health AI (Gu et al., 2025): A focus on evaluations for a use-case-inspired application in healthcare, providing negative results through experimentation. While this is a use-case-inspired negative result, the primary focus of the paper is on experimental evaluations → ED.

  • Fairness Through Awareness (Dwork et al., 2011): The definition and implementation of individual fairness, with a secondary contribution providing an algorithm to improve on this metric. The paper relies heavily on a theoretical framework. While the authors could consider other options (main / general due to the algorithm development, or main / theory), we believe the ED track is most appropriate, given that the main contribution is the definition of a new fairness metric → ED.

*Please note this is not an endorsement or assessment of the quality of the paper’s contribution.

The main contribution of my paper is use-case inspired and includes evaluations. Should I choose the main track (use-inspired) or ED?

If the paper’s primary contribution is to define new methodologies for evaluations of this use case or highlight surprising negative results obtained from empirical evaluations, ED would be a suitable track. On the other hand, if evaluations are part of the work but not the primary focus (e.g. a novel method has been defined for the use case and is thoroughly evaluated), the main track might be more suitable. See the examples in the question above.

In all cases, the authors should select their track based on how they would like their paper to be evaluated (see CfPs). The framing of the paper should match that of the track.

Please note that there will be no possibility to switch tracks or types and that papers cannot be submitted to multiple tracks or types simultaneously. Irrelevant or duplicate papers risk desk rejection from all tracks. It is the authors’ responsibility to carefully read the relevant CfPs and identify the most appropriate track. 

Are there guidelines for submissions which are from the 2024/2025 Competitions track, e.g., reporting on competition results?

No, there are no special guidelines. Please follow the ED CfP and data hosting guidelines. Your submission will be reviewed according to the same standards alongside all other ED track submissions. We suggest you review the revised scope of the ED track carefully when framing your work.

My paper highlights a negative result. Is ED a suitable track?

Negative results are welcome in the ED track, as long as they bring new insights and are thoroughly demonstrated via empirical evaluations. A non-exhaustive list includes failure modes of current benchmarks and failure modes of AI systems in deployment and/or in human-computer interactions. If the main contribution is a theoretical demonstration of a negative result (e.g., an impossibility theorem or counterexamples), authors can consider the main track / negative result topic instead. Please see “How do I choose which one is most appropriate?” above for guidance on ambiguous cases.

My main contribution is a training dataset. Does it still fit the scope of ED?

Yes. Training datasets are welcome as long as the work clearly demonstrates their value in improving (downstream) evaluations, e.g., task performance, robustness, fairness, privacy, and alignment. The metric(s) and task(s) the dataset is designed to improve upon should be clearly stated, along with any assumptions and limitations. Submissions that propose a dataset with the “potential” for machine learning or task improvement without this demonstration are not in scope.

How should I include code in my submission?

You will be asked to provide a URL to a hosting platform (e.g., GitHub, Bitbucket). All code should be documented and executable. If your submission is double-blind, you can use an anonymization service or another method to submit your code anonymously.

My submission is a benchmark consisting of an environment for evaluation only/audits an existing benchmark using publicly available data/is a theoretical framework for comparing evaluation designs. Do I need to follow the data-hosting guidelines?

No. If your submission does not introduce new data, you do not need to follow data-hosting guidelines. You do need to follow code-hosting guidelines if your submission includes new code or tools. The dataset-hosting and Croissant requirements apply only to submissions that introduce new datasets.  

Dataset hosting FAQs

The Croissant format can’t handle the file type(s) in my dataset submission. What should I do?

You should still submit a Croissant file. You can choose to provide only dataset-level metadata and a description of the resources in the dataset (FileObject and FileSet). You can omit RecordSets in this scenario. The recommended Croissant-compatible data hosting platforms and the Croissant Baker tool (see the guidelines on data hosting) should handle this gracefully for you, but you may need to address it manually if you self-host your dataset.
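As a sketch of what such a metadata-only Croissant file might look like, the snippet below builds one with dataset-level metadata and a FileObject description but no RecordSets. The field names follow the Croissant 1.0 specification; the "@context" is abbreviated (copy the full JSON-LD context from the spec), and all names, URLs, and the checksum are placeholders for your own dataset.

```python
import json

# Minimal Croissant file: dataset-level metadata plus a FileObject,
# with RecordSets omitted because the file type cannot be described
# at the record level. All values below are placeholders.
croissant = {
    "@context": {"@vocab": "https://schema.org/",
                 "cr": "http://mlcommons.org/croissant/"},  # abbreviated
    "@type": "sc:Dataset",
    "conformsTo": "http://mlcommons.org/croissant/1.0",
    "name": "my-dataset",
    "description": "Raw sensor recordings in a custom binary format.",
    "license": "https://creativecommons.org/licenses/by/4.0/",
    "url": "https://example.org/my-dataset",
    "distribution": [{
        "@type": "cr:FileObject",
        "@id": "recordings",
        "name": "recordings",
        "contentUrl": "https://example.org/my-dataset/recordings.bin",
        "encodingFormat": "application/octet-stream",
        "sha256": "0000000000000000000000000000000000000000000000000000000000000000",
    }],
    # No "recordSet" key: record-level metadata is intentionally omitted.
}

with open("croissant.json", "w") as f:
    json.dump(croissant, f, indent=2)
```

Even with RecordSets omitted, run the resulting file through the validator to confirm it is well formed.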

I have a submission consisting of multiple datasets. How do I submit the Croissant files?

You should submit a Croissant file for every dataset (and check whether they are all valid). You can combine the .json files into a .zip folder and upload that during submission. In addition, the dataset URL entered in OpenReview should refer to a webpage that documents the collection of datasets as a whole. The URLs for the individual datasets must be included in the individual Croissant files.
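The bundling step can be sketched with the standard library, assuming one Croissant .json file per dataset; the file names below are placeholders, and each file should already have passed validation.

```python
import json
import zipfile
from pathlib import Path

# One Croissant file per dataset; names are placeholders.
croissant_files = ["croissant_train.json", "croissant_eval.json"]

# For illustration only: write two minimal stand-in Croissant files.
for name in croissant_files:
    Path(name).write_text(json.dumps({"@type": "sc:Dataset",
                                      "name": Path(name).stem}))

# Combine the .json files into a single .zip for the submission form.
with zipfile.ZipFile("croissant_files.zip", "w") as zf:
    for name in croissant_files:
        json.loads(Path(name).read_text())  # fail early on malformed JSON
        zf.write(name)
```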

How do we handle a submission that includes a private hold-out set we wish to keep unreleased, e.g., to avoid potential contamination?

You should mention that you have a private hold-out set and describe it in your paper, but the main contribution of your paper should be the publicly released portion of your dataset. The publicly released portion of your dataset needs to conform to the data hosting guidelines. It may also contain a public validation and test set collected using the same protocol as the private one.

My submission includes a synthetic dataset. Does it need to be documented and hosted in the same way?

Yes. All data hosting guidelines apply to synthetic datasets as well.

I don’t want to make my dataset publicly accessible at the time of submission. What are my options?

Both the Harvard Dataverse and Kaggle platforms offer private URL preview link sharing. This means your dataset is accessible only to those who have the special URL, e.g., reviewers. Note that you will be required to make your dataset public by the camera-ready deadline. Failure to do so may result in removal from the conference and proceedings.

Can I make changes to my dataset after I have made my submission to OpenReview?

You can make changes until the submission deadline. After the submission deadline, we will perform automated verification checks of your dataset to help streamline and standardize reviews. If the dataset changes in a way that invalidates the original reviews at any time between the submission deadline and the camera-ready deadline (or the publication of the proceedings), we reserve the right to remove it from the conference or proceedings.

I am experiencing problems with the platform I am using to release my dataset. What should I do?

We have worked with maintainers of the dataset hosting platforms to identify the appropriate contact information for authors to use for support in case of issues or help with workarounds for storage quotas, etc. You can find this contact information in the ED data hosting guidelines.

I need to require credentialized (i.e., gated) access to my dataset. Is this possible?

This will be possible on the condition that credentialization is necessary for the public good (e.g., because of ethically sensitive medical data), and that an established credentialization procedure is in place that is 1) open to a large section of the public; 2) provides rapid response and access to the data; and 3) is guaranteed to be maintained for many years. A good example is PhysioNet Credentialing, where users must first demonstrate that they understand how to handle data involving human subjects, yet access remains open to anyone who has learned and agrees to the rules.
This should be seen as an exceptional measure, and NOT as a way to limit access to data for other reasons (e.g., to shield data behind a Data Transfer Agreement). Misuse would be grounds for desk rejection. During submission, you can indicate that your dataset involves open credentialized access, in which case the necessity, openness, and efficiency of the credentialization process itself will also be checked.

Our dataset requires credentialized access. How do we preserve single-blind review, i.e., ensure the identities of reviewers aren’t shared with authors?

If it’s possible to share a private preview link rather than requiring credentials, you may try that. Or, you can make an account, give it view access to the dataset, and share login details with reviewers. After submission, you can send a private message visible only to reviewers on OpenReview.

I have an extremely large dataset. How do I allow reviewers to properly evaluate it?

Please make sure that the full dataset is available at submission time. You can *in addition* provide ways to help reviewers explore your dataset. This could be a notebook that downloads a portion of the data and helps reviewers explore it, or a bespoke solution appropriate for your dataset.
We also generally require large datasets (> 4 GB) to provide a smaller data sample (ideally hosted in the same way). If you provide a sample, also explain how you created it.
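One way to make the sample easy to document is to draw it with a fixed random seed, so the procedure is fully reproducible. The sketch below does this for a line-oriented dataset file; the file names, sampling rate, and seed are placeholders, not a required procedure.

```python
import random

def write_sample(src, dst, rate=0.01, seed=0):
    """Write a reproducible random sample of a line-oriented dataset file.

    Fixing the seed means rerunning with the same arguments yields the
    same sample, which makes "how the sample was created" easy to state.
    Returns the number of lines kept.
    """
    rng = random.Random(seed)  # fixed seed => deterministic sample
    kept = 0
    with open(src) as fin, open(dst, "w") as fout:
        for line in fin:
            if rng.random() < rate:
                fout.write(line)
                kept += 1
    return kept
```

Alongside the sample, record the source file, rate, and seed used.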

Our submission involves using existing public datasets. Do we need to host these in accordance with the data hosting guidelines?

No, but you should make any code used to modify or otherwise use the public datasets, e.g., for a new benchmark that you are submitting, accessible and executable (meaning you will need to provide publicly accessible links to the data sources used). You also should not claim the existing public datasets as part of your submission.

 

What are the data hosting requirements for benchmarks that use existing datasets?

If the datasets themselves are not part of your contribution (e.g., they are only slightly processed for the benchmark), then there is no need to host them or provide Croissant files for them. You can reference the original public datasets by their source URLs and provide your preprocessing code. However, if you do significantly process or curate the datasets (i.e., this is a key part of your contribution), then the data hosting guidelines do apply.

The online app for checking the validity of Croissant files runs for a long time and times out. What should I do?

This can happen when you have a dataset on Hugging Face. The app may be rate-limited, which causes an error and automatic restarts. If this happens, we recommend validating your Croissant file locally. You can click the three dots at the top right of the app to get code to run it locally, or clone the repository and run it in your own HF Space. Please see the bottom of the online app for detailed instructions.
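While waiting for the full validator, a quick stdlib-only pre-check can catch trivial problems early. The sketch below is NOT a substitute for the official validator; it only confirms the file parses as JSON and carries the top-level keys a Croissant dataset is expected to have (the REQUIRED list is our assumption based on the Croissant spec).

```python
import json

# Top-level keys a Croissant dataset file is expected to carry
# (an assumption based on the Croissant spec, not an official list).
REQUIRED = ("@context", "@type", "name", "description", "distribution")

def quick_check(path):
    """Return the list of required top-level keys missing from the file.

    Raises json.JSONDecodeError if the file is not valid JSON.
    """
    with open(path) as f:
        data = json.load(f)
    return [key for key in REQUIRED if key not in data]
```

An empty return value means the basics are present; always follow up with the official validator before submission.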