

NeurIPS Evaluations & Datasets 2026 Reviewing Guidelines

The Evaluations and Datasets (E&D) track follows the same timeline and overall reviewing process as the main track.

Timeline

Review period: May 29 - June 25, 2026
Emergency review period: July 6 - July 20
AC initial meta-review period: July 13 - July 22
Paper Reviews released: July 22
Authors rebuttal: July 22 - July 27
Author + Reviewer + AC discussion period: July 27 - August 03
Reviewer + AC discussion period: August 03 - August 10
Meta-Review period: August 10 - August 17
Paper Decision notification: September 24

 

More information on each of these steps, as well as FAQs, can be found in the extended reviewing guidelines.



Use of Agents and Large Language Models  

In agreement with the Main Track, our reviewing policy reflects:

(a) the importance of protecting the confidentiality of submissions, which are shared in trust with reviewers, ACs, SACs, and PCs solely for the purpose of reviewing (see “Confidentiality of Submissions” in the Main Track Handbook); and

(b) the fact that the community is still developing best practices around the use of agents and large language models (LLMs) in peer review workflows.

Reviewers must not upload, share, or expose confidential submission content to external or hosted LLM services or autonomous agents. This includes any systems that may retain, train on, or otherwise store submission data outside approved review infrastructure. The Main Track is conducting a separate experiment involving sanctioned LLM support through OpenReview (see Main Track Handbook Section “Reviewer Use of Agents and Large Language Models”); no such experiment applies to the E&D Track.

The use of locally run/offline models (e.g. LLMs operating entirely on the reviewer’s own hardware without transmitting data externally) may be permitted for limited assistance tasks such as rewriting reviews for clarity or summarizing reviewer notes, provided that:

  • submission confidentiality is strictly preserved;
  • the reviewer remains fully responsible for the content, accuracy, and originality of the review;
  • the use of such tools does not reduce review quality or depth of engagement with the paper; and
  • generated or edited content may be subject to screening for low-quality or inappropriate LLM-assisted reviewing practices.

 

Reviews identified as low quality may be further investigated for inappropriate LLM use, and serious violations may result in escalation to the ACs/PCs and possible desk rejection of the reviewers’ associated submissions.

Use of LLMs and Agents by authors.

Following the Main Track policy (see Main Track Handbook) and scientific transparency standards, authors should clearly document any important, original, or non-standard use of agents or LLMs in their methodology (e.g., using an LLM as part of the proposed method). Routine uses such as spell checking, grammar editing, or basic coding assistance do not need to be disclosed. Authors remain fully responsible for all paper content, including text, figures, and references, and must ensure AI-assisted outputs are correct, original, and scientifically responsible. This includes verifying citations and avoiding hallucinated or misleading content, in accordance with the NeurIPS Code of Conduct. Agents and LLMs cannot be listed as authors. Attempts to manipulate the reviewing process, including prompt injection attacks, are strictly prohibited under NeurIPS policy.

If reviewers identify prompt injections, hidden instructions, or other attempts to manipulate the review process, they should immediately report them to the ACs/PCs.

 

Reviewing Guidelines  

For the 2026 E&D Track, reviewers are asked to evaluate submissions along the core NeurIPS dimensions — Quality, Clarity, Significance, and Originality — while interpreting these criteria in the context of the specific E&D contribution type identified by the authors (see below). These dimensions should inform the assessment of both Strengths/Contributions and Limitations/Weaknesses in the reviewer form. In addition, the E&D Track places particular emphasis on responsible data practices, transparency, and reproducibility. Submissions involving datasets, benchmarks, or reusable executable artifacts are subject to additional track-specific policies regarding anonymization, dataset metadata and documentation, and code availability.

The E&D Track therefore introduces:

  1. Track-wide policies emphasizing responsible data practices, transparency, and reproducibility, particularly for dataset and executable-artifact submissions.
  2. Contribution-specific reviewing guidance to contextualize the general NeurIPS criteria.

 

 

Track-wide Policies  

 
Anonymity

Double-blind review is the default for all ED submissions. Single-blind review is only permitted for dataset-centered submissions where anonymizing the data release is unfeasible due to scientific or ethical concerns. This also applies to dataset and code URLs. If a tools/methods paper is not anonymized, flag this to your Area Chair. Reviewers should use their best judgment when assessing whether partial de-anonymization is justified by the nature of the contribution, in line with the CFP.

 

Submissions with a Dataset contribution

If a submission introduces a new dataset, regardless of the contribution type, please evaluate it using both the provided automated tools and your expert judgment.

Metadata requirements: All submissions that contribute datasets are required to provide a valid Croissant file. Following the full paper submission deadline, we conducted compliance checks and desk-rejected submissions that clearly failed to meet this requirement. However, some non-compliant cases may have gone undetected. If you encounter a submission that is missing a Croissant file or provides an invalid or placeholder one, please flag it to your Area Chair.

Dataset Accessibility: Datasets should be available to all reviewers, ACs, and SACs at the time of submission, without a personal request to the PI. We automatically check whether the dataset can be accessed based on the Croissant metadata, but a dataset that fails this check may still be accessible to you manually. For example, authors of gated datasets may have provided a secret link outside of the metadata, or may have sent you access instructions via a message on OpenReview. If the dataset is not easily accessible or otherwise cannot be reviewed, please flag it to your Area Chair.
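As an illustration of the manual accessibility check, the following minimal Python sketch lists the links declared in a Croissant file so that a reviewer can decide which ones to open by hand. It assumes the metadata uses the common Croissant keys `url`, `distribution`, and `contentUrl`; the function name `list_distribution_urls` and the example values are hypothetical, and this is not the automated checker used by the track. The sketch deliberately does not fetch anything.

```python
from urllib.parse import urlparse


def list_distribution_urls(croissant: dict) -> list[str]:
    """Collect the dataset `url` and every `contentUrl` under `distribution`.

    Assumes the usual Croissant layout; entries without a `contentUrl`
    (e.g. file sets described by patterns) are skipped.
    """
    candidates = []
    if isinstance(croissant.get("url"), str):
        candidates.append(croissant["url"])

    distribution = croissant.get("distribution", [])
    if isinstance(distribution, dict):  # tolerate a single, unwrapped entry
        distribution = [distribution]
    for entry in distribution:
        if isinstance(entry, dict) and isinstance(entry.get("contentUrl"), str):
            candidates.append(entry["contentUrl"])

    # Keep only http(s) links, drop duplicates, preserve the original order.
    seen = set()
    urls = []
    for url in candidates:
        if urlparse(url).scheme in ("http", "https") and url not in seen:
            seen.add(url)
            urls.append(url)
    return urls


# Hypothetical metadata, for illustration only.
example = {
    "url": "https://example.org/dataset",
    "distribution": [
        {"@type": "cr:FileObject", "contentUrl": "https://example.org/data.csv"},
        {"@type": "cr:FileSet", "includes": "*.csv"},
    ],
}
for url in list_distribution_urls(example):
    print(url)
```

In practice the dictionary would come from `json.load` on the submitted Croissant file. Listing the links first, rather than opening them automatically, leaves it to the reviewer to decide which ones to visit and from which environment.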

Metadata completeness: Authors were asked to include a minimal set of Croissant and Responsible AI (RAI) metadata as described in the E&D data hosting guidelines. To facilitate the review process, you will be provided with an automated dataset reviewer report generated from the submission's Croissant file, which will be posted on the paper’s OpenReview page. This report checks technical compliance, such as URL accessibility, hosting platform validity, and the presence of core and RAI metadata.

Apply human expert judgment: The automated report is a helpful checklist, but it does not replace your professional judgment. Check the metadata to ensure that the authors adequately documented the data, including its Responsible AI aspects such as data limitations, known biases, personal/sensitive information, intended use cases, and social impact. When metadata is missing or incomplete, apply proportional judgment: a minor format violation or unclear text can be addressed during the rebuttal, while more serious omissions should be flagged to your AC. Empty or placeholder text should be flagged to your AC, and the submission may be desk-rejected.
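To make the distinction between missing and placeholder metadata concrete, here is a minimal Python sketch of a first-pass check over a loaded Croissant file. The field lists and the placeholder vocabulary are assumptions chosen for illustration (the authoritative required set is the one in the E&D data hosting guidelines), and the helper names are hypothetical. It is not the automated reviewer report, and it cannot judge whether the text that is present is adequate.

```python
# Assumed field names, for illustration only; consult the E&D data hosting
# guidelines for the set that authors were actually asked to provide.
CORE_FIELDS = ["name", "description", "license", "url", "distribution"]
RAI_FIELDS = [
    "rai:dataLimitations",
    "rai:dataBiases",
    "rai:personalSensitiveInformation",
    "rai:dataUseCases",
    "rai:dataSocialImpact",
]
# Assumed placeholder vocabulary; extend as needed.
PLACEHOLDERS = {"", "n/a", "na", "none", "tbd", "todo", "placeholder"}


def is_placeholder(value) -> bool:
    """Return True for empty values and common filler strings."""
    if value is None:
        return True
    if isinstance(value, str):
        return value.strip().lower() in PLACEHOLDERS
    if isinstance(value, (list, dict)):
        return len(value) == 0
    return False


def find_gaps(croissant: dict) -> dict[str, list[str]]:
    """Split the expected fields into absent ones and placeholder ones."""
    gaps = {"missing": [], "placeholder": []}
    for field in CORE_FIELDS + RAI_FIELDS:
        if field not in croissant:
            gaps["missing"].append(field)
        elif is_placeholder(croissant[field]):
            gaps["placeholder"].append(field)
    return gaps


# Hypothetical metadata, for illustration only.
metadata = {
    "name": "Demo dataset",
    "description": "TBD",
    "license": "https://creativecommons.org/licenses/by/4.0/",
    "url": "https://example.org/dataset",
    "distribution": [],
    "rai:dataBiases": "Over-represents English-language sources.",
}
gaps = find_gaps(metadata)
print("missing:", gaps["missing"])
print("placeholder:", gaps["placeholder"])
```

A field that passes this check may still be inadequate. The sketch only narrows down where to look before applying the expert judgment described above.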

Verify evaluative purpose: Datasets should not be treated merely as endpoints without a clear explanation of how they can be used in downstream evaluations. Check whether the authors clearly articulated what specific evaluative claims their dataset supports, under what assumptions those claims hold, and what limitations constrain them.

 

Submissions with a code or executable artifact contribution

Enforce the mandatory code policy for artifacts: If the primary contribution is a reusable executable artifact (such as an evaluation toolkit, benchmarking platform, or data generator), code release is mandatory at the time of submission. Verify that the provided code_URL is accessible, executable, properly anonymized, and sufficiently documented. If you encounter a submission missing a code URL or providing an invalid/placeholder one, please flag it to your Area Chair.
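One way to support the anonymization part of this check is a simple text scan over the downloaded code. The Python sketch below is a rough heuristic under stated assumptions: the two regular expressions (email addresses and GitHub account URLs) and the function names are illustrative choices, not track tooling, and they will produce both false positives and false negatives. Any hit should be flagged to your Area Chair rather than followed up.

```python
import re
from pathlib import Path

# Heuristic patterns; both are assumptions chosen for illustration.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
GITHUB_ACCOUNT = re.compile(r"github\.com/([A-Za-z0-9-]+)")


def find_identity_hints(text: str) -> list[str]:
    """Return email addresses and GitHub account names found in `text`."""
    hints = [f"email: {match}" for match in EMAIL.findall(text)]
    hints += [f"github account: {match}" for match in GITHUB_ACCOUNT.findall(text)]
    return hints


def scan_tree(root: str) -> dict[str, list[str]]:
    """Scan every file under `root` without executing anything."""
    findings = {}
    for path in Path(root).rglob("*"):
        if path.is_file():
            # Undecodable bytes are ignored so binary files do not abort the scan.
            text = path.read_text(encoding="utf-8", errors="ignore")
            hints = find_identity_hints(text)
            if hints:
                findings[str(path)] = hints
    return findings


# Hypothetical README line, for illustration only.
sample = "Contact jane.doe@example.edu or see https://github.com/janedoe/toolkit"
print(find_identity_hints(sample))
```

Because the scan only reads files, it can be run before deciding whether and how to execute the submitted code.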

Assess the "Code Submission Justification" for analytical work: For submissions that are analytical, conceptual, or methodological, code release is highly encouraged but not strictly mandatory. If authors did not provide code, they were required to fill out a code_submission_justification field. You must evaluate this justification: does the paper provide enough detail for you to meaningfully assess the validity of the scientific claims without access to the code? If not, the submission may be subject to rejection. These considerations should be part of your assessment and any concerns should be reported in the ‘reproducibility’ part of the review form.

 

Flagging ethics concerns

We provide two options to flag ethics concerns:

  1. The reviewer believes that a direct discussion with the authors (e.g. during the rebuttal/discussion period) is sufficient to clarify or address concerns.
  2. The reviewer recommends adding an additional reviewer familiar with ethics issues in machine learning.

The latter option should be used when reviewers are uncertain about certain ethical aspects of a submission and an expert opinion is required. Reviewers may contact their AC for guidance when in doubt.

Examples requiring an ethics review:

  • The reviewer is uncertain about whether the collection of a dataset respects privacy and consent.
  • A dataset for evaluating child-safety guardrails is released without appropriate gated access.

Examples not requiring an ethics review:

  • A contribution highlighting discrimination in a model’s output and proposing metrics or mitigation strategies.
  • A benchmark to assess failure modes in AI agents that includes clear access rules where necessary.

 

Guidelines by E&D contribution types  

The general NeurIPS reviewing guidelines define four criteria:

  • Quality: Is the submission technically sound, and are the claims well-supported by empirical, analytical, or conceptual arguments? Is this a complete piece of work or work in progress? Are the authors careful and honest about evaluating both the strengths and weaknesses of their work?
  • Clarity: Is the submission clearly written, well-organized, and detailed enough to enable reproducibility?
  • Significance: Are the results impactful for the community? Does it advance our understanding of AI evaluation, provide unique data, or address a difficult task better than prior work? Are others (researchers or practitioners) likely to use the ideas or build on them? 
  • Originality: Does the work provide new insights, deepen understanding, or highlight important properties of existing methods? Originality does not necessarily require introducing an entirely new method. Providing novel insights, exposing failure modes, evaluating existing methods, or framing new metrics is equally valuable.

However, these criteria may be interpreted differently for different contribution types. This year, E&D authors were asked to indicate the primary contribution type of their paper. The purpose of these contribution types is to help contextualize the review process across the broad range of submissions represented in the E&D Track. We provide the following guidelines for each of the possible contribution types:

Evaluations

Datasets

 

At the same time, many submissions may naturally span multiple categories. The selected contribution type should therefore be viewed as a guide to the primary framing of the work rather than a rigid label. Reviewers are encouraged to evaluate submissions holistically and use the category-specific guidance as interpretive support rather than as a separate evaluation rubric. For every submission, we also ask that all reviewers reflect on the evaluative claims made by the authors, given the focus on evaluations as a scientific object of interest.

 

Contribution type: Datasets and Data Resources 

New datasets, dataset collections, dataset generators, curated evaluation datasets, reinforcement learning environments

Quality: Scrutinize data collection and curation practices. Are all claims well-supported? 

Clarity: See the Track-wide Policies on required documentation above. 

Significance: Datasets should not be treated merely as endpoints. Evaluate the dataset's evaluative role: what specific claims it supports, under what assumptions, and what limitations apply.

Originality: Novel data domain, task coverage, or data collection methodology that enables new evaluative claims. Datasets-as-endpoints don't meet the bar on their own.

Mandatory checks: 

  • Is a valid Croissant file present?
  • Is the RAI metadata complete (biases, limitations, intended use, sensitive information)?
  • Is the dataset accessible without a request to the PI?
  • Is the paper double-blind, or is single-blind review justified?
  • Check the automated reviewer report.

 

Contribution type: Benchmark Design and Benchmark Analysis 

New benchmarks, benchmark redesign, benchmarking methodologies, benchmark saturation or overfitting studies, analyses of benchmark limitations or failure modes.

Quality: Evaluate the rigor of task design, item selection, and evaluation setup. Assess whether benchmark saturation or overfitting is adequately addressed. 

Clarity: Benchmark construction, task design choices, and evaluation protocols are clearly described and reproducible. Could another team re-run the benchmark from this paper?

Significance: Does the benchmark address a meaningful gap or enable reliable comparison across methods? Does it change how we measure progress in this area?

Originality: Originality may be achieved through novel task design, evaluation setup, or analysis that reveals properties of existing benchmarks. Beating a baseline is not required.

Mandatory checks: 

  • Must always be double-blind
  • Providing code to reproduce the benchmark is strongly encouraged
  • If no code is provided: evaluate whether the code justification field in OpenReview provides a convincing explanation.
  • If introducing a new dataset, does the work comply with the requirements for the “Datasets and Data Resources” type discussed above?

 

Contribution type: Evaluation Methodology and Metrics 

Evaluation protocols, metric design and validation, experimental methodology, statistical evaluation methods, comparisons of evaluation designs, refining evaluation setups.

Quality: Assess the statistical soundness, rigor, and validity of proposed evaluation protocols or metrics. 

Clarity: Methods, assumptions, and evaluation protocols are clearly described. Is the description sufficiently detailed for independent replication?

Significance: Assess whether the proposed methodology or metric meaningfully advances evaluation practice beyond existing approaches. Does it shift conclusions if applied to existing benchmarks? 

Originality: Originality may be achieved by novel metrics or by showing how different evaluation assumptions lead to different scientific conclusions. New framing is sufficient; there is no need to beat a baseline.

Mandatory checks:

  • Must always be double-blind
  • Including code for metric implementation is encouraged
  • If no code is provided: evaluate whether the code justification field in OpenReview provides a convincing explanation.

 

Contribution type: Evaluation Tools, Frameworks, and Infrastructure 

Benchmarking platforms, evaluation toolkits, metrics libraries, evaluation pipelines, visualization or analysis tools supporting evaluation

Quality: Usability, design, and documentation of the tool. Is it robust and well-engineered, not just a demo?

Clarity: Documentation is thorough enough for users to install, run, and extend the tool independently.

Significance: The tool addresses a genuine gap in the evaluation ecosystem and has potential for broad adoption.

Originality: Novel tool design, workflow, or infrastructure approach that meaningfully goes beyond existing solutions.

Mandatory checks:

  • Must always be double-blind
  • In most cases, code release is mandatory; the code URL should be accessible and the code executable
  • The code is properly anonymized
  • If no code is provided, recommend rejection unless a convincing justification is given

 

Contribution type: Dataset Documentation, Auditing, and Responsible Data 

Dataset audits, dataset bias or quality analysis, responsible dataset development frameworks, documentation methodologies (e.g., Data Cards).

Quality: Assess the rigor and coverage of the documentation framework, auditing methodology, or responsible data practices.

Clarity: The documentation or audit approach is clearly described and transferable. Could the reader apply it to another dataset?

Significance: Assess whether the contribution meaningfully advances standards for responsible data practices or reveals important limitations of existing datasets.

Originality: Novel documentation standards, auditing protocols, or responsible data frameworks that go beyond existing approaches (e.g., datasheets, model cards).

Mandatory checks:

  • Must be double-blind (or justified single-blind)
  • Code/tooling provided if the audit is automated
  • If no code is provided, evaluate the code justification field

 

Contribution type: Reproducibility, Auditing, and Stress-Testing of Evaluations 

Replication studies, auditing prior evaluations, stress-testing evaluation pipelines, robustness analyses of evaluation claims, meta-analysis of benchmarks.

Quality: Findings must be grounded in rigorous, systematic analysis, not superficial observations. If the paper reports a negative result, is the analysis deep and carefully controlled?

Clarity: Sufficient methodological detail to assess the validity of claims independently, especially when code is not provided. 

Significance: The work yields meaningful insights about the robustness or limitations of existing evaluations. Negative results are valuable when rigorously supported. 

Originality: New insights from stress-testing or auditing existing evaluations, exposing failure modes, or demonstrating the limits of established evaluation practices. 

Mandatory checks:

  • Must be double-blind 
  • Providing code to reproduce or audit the work is encouraged
  • If no code is provided: evaluate the code justification carefully. Claims must be independently verifiable.

 

Contribution type: Data-Centric Methods and Empirical Analyses 

Data-centric AI methods, systematic analyses of systems on datasets, analyses of ML competitions, empirical studies on novel datasets, negative or critical empirical results.

Quality: Empirical analyses must be grounded in rigorous methodology and careful experimentation.

Clarity: Sufficient methodological detail to assess the validity of empirical claims, especially when code is not provided. Can the reader evaluate the claims without running the code?

Significance: The work yields meaningful insights about data properties or evaluation practice through empirical analysis. Negative findings are valued when well-supported.

Originality: Novel empirical methodology or new insights from data-centric analysis that advance understanding beyond existing work.

Mandatory checks:

  • Must be double-blind 
  • Providing code for experiments is encouraged
  • If no code is provided: evaluate the code justification carefully. Claims must be independently verifiable.

 

Contribution type: Human-Centered and Interaction-Based Evaluation 

User studies, human-in-the-loop evaluation, red-teaming, human-AI interaction evaluation

Quality: Assess rigor of user studies, human-in-the-loop protocols, or red-teaming methodologies. Verify that human subject research protocols were followed (e.g., fair wages, IRB accreditation).

Clarity: Study design, participant selection, task protocols, and evaluation criteria are clearly described and reproducible.

Significance: Findings meaningfully advance understanding of human-AI interaction, safety, or human evaluation methodology.

Originality: Novel evaluation methodology for human-AI interaction, or new insights into the limits of human evaluator judgment.

Mandatory checks:

  • IRB accreditation or equivalent institutional documentation is provided
  • Confirmation of fair compensation for all participants is provided

 

 

Flow of Tasks and Expectations 

Your assignments and tasks will appear in the reviewer console in OpenReview.

Preparation

Read and agree to abide by the NeurIPS code of conduct. Read the policies pertaining to everyone (e.g. around conflicts of interest, setting up an OpenReview profile) and authors (e.g. dual submission policy, double blind reviewing) in the handbook. Read about the 2026 experiment to better understand how reviewers use LLMs and how that impacts reviews.

Bid on papers

Your bids are an important input to the paper matching process. Please be cognizant of our anti-collusion policies (above).

Check paper assignments

As soon as you are notified of papers to review, you are expected to log in to OpenReview to check for conflicts and to check that papers fall within your area of expertise. If you don’t feel qualified to review a paper that was assigned to you, please communicate this to your AC right away. These assignments may change during the first week, as some reviewers and ACs request re-assignments. Please watch for notification emails from OpenReview.

Write thoughtful reviews

Be fair and precise. Your review should focus on scientific content and clarity. Do not let personal feelings affect your review, and please make your review as informative and substantiated as possible. Superficial, uninformed reviews without evidence are worse than no review, as they may contribute noise to the review process. For example, if you argue that the work lacks novelty, please provide appropriate references and point to the existing mechanisms within them; vague statements are unfairly difficult for authors to address. A good review is useful to all parties involved: authors, other reviewers, and ACs/SACs. Try to keep your feedback constructive when possible. Finally, please be sure to comment thoroughly on the technical aspects of the work rather than focusing only on the paper’s organisation or grammar.

Feel free to use the NeurIPS paper checklist included in each paper as a tool when preparing your review. Remember that answering “no” to some questions is typically not grounds for rejection. In general, authors should be rewarded rather than punished for being up front about the limitations of their work and any potential negative societal impact. You are encouraged to think through whether any critical points are missing and provide these as feedback for the authors.

Finally, be thoughtful. The paper you are reviewing may have been written by a first year graduate student who is submitting to a conference for the first time and you don't want to crush their spirits. In general, avoid wording that may be perceived as rude or offensive. Relatedly, when writing your review, please keep in mind that after decisions have been made, reviews and meta-reviews of accepted papers as well as your discussion with the authors will be made public (but reviewer and SAC/AC identities will remain anonymous); authors of rejected papers will have the option to make this information public for their rejected papers as well.

Reviewing FAQs

Q: What about minor formatting violations? 

A: While filtering for formatting violations has already been applied after submission, some violations may not have been caught. We allow papers to exceed the page limit by up to 5 lines. Per the call for papers, submissions must use the official LaTeX style file (Microsoft Word is not accepted), and modifications to formatting (e.g., margins, font sizes, or page dimensions) are not permitted.

In addition, the main content of the paper must fit within the 9-page limit. The main content includes all core sections (e.g., Introduction through Conclusion, Discussion, Limitations, and Future Work). Appendices may include Broader Impact, Ethical Considerations, or Data and Code Availability statements. Submissions that violate these requirements are subject to desk rejection, and we ask that you report them to your AC.

Q: What if I’ve seen similar work in a NeurIPS/ICML workshop?

A: We allow work that has been submitted to non-archival workshops to be submitted to NeurIPS. To maintain anonymity, do not mention the workshop paper in your review.

Q: Can I recommend ‘accept’ or ‘reject’ all the papers in my stack?

A: Yes. Please accept and reject papers based on their own merits. You do not have to match the conference acceptance rate.

Q: Do I have to read the supplementary material?

A: You are not required to read it, but you are welcome to.

Q: Can I read the previous reviews of a paper if it is a resubmission?

A: You should not actively seek out previous reviews because it could violate anonymity in our double-blind review process, but if you have read them previously, that is okay.

Q: What should I do if I have already reviewed this paper at another venue?

A: Do not assume that the paper hasn’t changed. Read the paper carefully, and make sure you write a high quality review.

Q: Can I invite a sub-reviewer to help with my reviews?

A: No, sub-reviewers are not allowed. Conflicts of interest cannot be properly checked unless reviewers are officially in the system, and sub-reviewers would not be able to participate in the discussion, which is a critical phase of the review process.

Q: Is the use of LLMs or AI agents allowed in preparing a submission (authors)?

A: Yes, but authors must comply with the NeurIPS policy on “Author Use of Agents and Large Language Models” as described in the main handbook (https://neurips.cc/Conferences/2026/MainTrackHandbook). Methodologically significant or non-standard use of LLMs/agents should be disclosed in the paper. Authors remain responsible for all content in the submission, including ensuring factual correctness and avoiding hallucinated citations or results.

 

Read author responses and discuss papers

At the start of the discussion period, please carefully read all other reviews, the meta-review, and the author responses to all reviews for the papers assigned to you.

  • As you read each author’s response, please keep an open mind. The authors may address some points you raised in your review during the discussion period. Make an effort to update your understanding of the paper when new information is presented, and revise your review to reflect this. If the author’s response didn’t change your opinion about the paper, please acknowledge that you have read and considered it.  
  • To minimize the chance of misunderstandings during the reviewing process, we will allow for a rolling discussion with the authors during the discussion period.  If you need to communicate with the authors, you can make a comment visible to them on the paper’s page.
  • Participating in discussions is a critical part of your role as a reviewer.  The discussion period is especially important for borderline papers and papers for which the reviewers’ assessments differ, and we hope that you take discussions seriously.  If your evaluation of the paper has changed, please revise your review and explain the change.
  • When discussing a paper, remember that different people have different backgrounds and different points of view. Reviewer consensus is valuable—only rarely are unanimous assessments overruled—but it is not mandatory.

After the discussion period, ACs will make initial accept/reject decisions with SACs before the Author Notification. Your workload during this period should be light, but if ACs come back to you with additional questions, please respond promptly.

 

Contact for Questions and Concerns

The Area Chair (AC) assigned to a paper should be your first point of contact for that paper. ACs are the principal contact for reviewers during the whole reviewing process. ACs are responsible for recommending reviewers for submissions, ensuring that all submissions receive quality reviews, facilitating discussions among reviewers, writing meta-reviews, evaluating the quality of reviews, and making decision recommendations. In addition to questions about reviewing, you should also contact your AC if you notice any unethical or suspicious behavior, e.g., plagiarism or papers that are not anonymized (note: if someone pressures you to provide a positive or negative review, please escalate that to the scientific integrity chairs right away). Your AC is also your first point of contact if you have an emergency and your reviews are delayed.

Finally, if you have ethics-related concerns regarding the content of the paper, you may flag the submission for additional review by ethics reviewers. The comments from the ethics reviewers will be visible to all reviewers, the AC, and the authors. You may use their comments to inform your deliberations.

You can contact the AC by leaving a comment in OpenReview with the AC as a reader. (SACs, whose job it is to oversee the work of ACs, and Track Chairs, who oversee the entire process, will also be listed as readers, but will not be notified.) If you encounter a situation that you are unable to resolve with your AC, please contact the Track Chairs. Please refrain from writing to the Track Chairs at their own email addresses.

 

Executing Code & Clicking on Links

Please remember that, just like any other untrusted code, submitted code may contain security vulnerabilities. It is not vetted by our submission system, so please run it only in a secure environment. We recommend running source code (1) inside a Docker container, (2) inside a virtual machine (using VirtualBox or VMware), or (3) on a network-isolated cloud instance. You may also wish to be cautious about following web links provided in the paper, as these may contain vulnerabilities or may log visitor IP addresses.
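As one possible way to follow the Docker recommendation, the Python sketch below assembles a `docker run` command that mounts the submitted code read-only and disables networking. The image name, mount point, resource limits, and entry point are assumptions chosen for illustration, and `build_sandbox_command` is a hypothetical helper. A container is not a complete security boundary, so treat this as a starting point rather than a guarantee.

```python
import shlex


def build_sandbox_command(
    code_dir: str,
    image: str = "python:3.12-slim",                     # assumed base image
    entrypoint: tuple[str, ...] = ("python", "main.py"),  # assumed entry point
) -> list[str]:
    """Assemble a restrictive `docker run` invocation for untrusted code."""
    return [
        "docker", "run",
        "--rm",                                    # delete the container on exit
        "--network", "none",                       # no network access at all
        "--read-only",                             # read-only root filesystem
        "--tmpfs", "/tmp",                         # scratch space kept in memory
        "--cap-drop", "ALL",                       # drop all Linux capabilities
        "--security-opt", "no-new-privileges",     # block privilege escalation
        "--pids-limit", "256",                     # cap the number of processes
        "--memory", "4g",                          # cap memory usage
        "-v", f"{code_dir}:/work:ro",              # mount the code read-only
        "-w", "/work",
        image,
        *entrypoint,
    ]


# Hypothetical path, for illustration only.
command = build_sandbox_command("/home/reviewer/submission")
print(shlex.join(command))
```

The list can be passed to `subprocess.run(command)` once it has been inspected. With networking disabled, dependencies cannot be installed inside the container at run time, so they would need to be baked into the image beforehand; that trade-off is deliberate in this sketch.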

 

Double-blind Reviewing: For Reviewers

Please do not attempt to find out the identities of the authors for any of your assigned submissions (e.g., by searching on arXiv).  This would constitute an active violation of the double-blind reviewing policy.

Note that the E&D Track allows single-blind review for dataset-centered submissions where anonymizing the data release is unfeasible due to scientific or ethical concerns. This applies to dataset and code URLs.