Timezone: »

Datasets and Benchmarks
Dataset and Benchmark Poster Session 3
Joaquin Vanschoren · Serena Yeung

Thu Dec 09 08:30 AM -- 10:00 AM (PST) @ None

The Datasets and Benchmarks track serves as a novel venue for high-quality publications, talks, and posters on highly valuable machine learning datasets and benchmarks, as well as a forum for discussions on how to improve dataset development. Datasets and benchmarks are crucial for the development of machine learning methods, but also require their own publishing and reviewing guidelines. For instance, datasets can often not be reviewed in a double-blind fashion, and hence full anonymization will not be required. On the other hand, they do require additional specific checks, such as a proper description of how the data was collected, whether they show intrinsic bias, and whether they will remain accessible.

 - Q-Pain: A Question Answering Dataset to Measure Social Bias in Pain Management (Poster) []  [Paper] [ Visit Poster at Spot C3 in Virtual World ]    Recent advances in Natural Language Processing (NLP), and specifically automated Question Answering (QA) systems, have demonstrated both impressive linguistic fluency and a pernicious tendency to reflect social biases. In this study, we introduce Q-Pain, a dataset for assessing bias in medical QA in the context of pain management, one of the most challenging forms of clinical decision-making. Along with the dataset, we propose a new, rigorous framework, including a sample experimental design, to measure the potential biases present when making treatment decisions. We demonstrate its use by assessing two reference Question-Answering systems, GPT-2 and GPT-3, and find statistically significant differences in treatment between intersectional race-gender subgroups, thus reaffirming the risks posed by AI in medical settings, and the need for datasets like ours to ensure safety before medical AI applications are deployed. Cécile Logé · Emily Ross · David Dadey · Saahil Jain · Adriel Saporta · Andrew Ng · Pranav Rajpurkar 🔗 - Modeling Worlds in Text (Poster) []  [Paper] [ Visit Poster at Spot F3 in Virtual World ]    We provide a dataset that enables the creation of learning agents that can build knowledge graph-based world models of interactive narratives. Interactive narratives---or text-adventure games---are partially observable environments structured as long puzzles or quests in which an agent perceives and interacts with the world purely through textual natural language. Each individual game typically contains hundreds of locations, characters, and objects---each with their own unique descriptions---providing an opportunity to study the problem of giving language-based agents the structured memory necessary to operate in such worlds. Our dataset provides 24198 mappings between rich natural language observations and: (1) knowledge graphs that reflect the world state in the form of a map; (2) natural language actions that are guaranteed to cause a change in that particular world state. The training data is collected across 27 games in multiple genres and contains a further 7836 heldout instances over 9 additional games in the test set. We further provide baseline models using rules-based, question-answering, and sequence learning approaches in addition to an analysis of the data and corresponding learning tasks. Prithviraj Ammanabrolu · Mark Riedl 🔗 - OmniPrint: A Configurable Printed Character Synthesizer (Poster) []  [Paper] [ Visit Poster at Spot F1 in Virtual World ]    We introduce OmniPrint, a synthetic data generator of isolated printed characters, geared toward machine learning research. It draws inspiration from famous datasets such as MNIST, SVHN and Omniglot, but offers the capability of generating a wide variety of printed characters from various languages, fonts and styles, with customized distortions. We include 935 fonts from 27 scripts and many types of distortions. As a proof of concept, we show various use cases, including an example of meta-learning dataset designed for the upcoming MetaDL NeurIPS 2021 competition. OmniPrint is available at https://github.com/SunHaozhe/OmniPrint. Haozhe Sun · Wei-Wei Tu · Isabelle Guyon 🔗 - Benchmarking Bias Mitigation Algorithms in Representation Learning through Fairness Metrics (Poster) []  [Paper] [ Visit Poster at Spot A3 in Virtual World ]    With the recent expanding attention of machine learning researchers and practitioners to fairness, there is a void of a common framework to analyze and compare the capabilities of proposed models in deep representation learning. In this paper, we evaluate different fairness methods trained with deep neural networks on a common synthetic dataset and a real-world dataset to obtain better insights on how these methods work. In particular, we train about 3000 different models in various setups, including imbalanced and correlated data configurations, to verify the limits of the current models and better understand in which setups they are subject to failure. Our results show that the bias of models increase as datasets become more imbalanced or datasets attributes become more correlated, the level of dominance of correlated sensitive dataset features impact bias, and the sensitive information remains in the latent representation even when bias-mitigation algorithms are applied. Overall, we present a dataset, propose various challenging evaluation setups, and rigorously evaluate recent promising bias-mitigation algorithms in a common framework and publicly release this benchmark, hoping the research community would take it as a common entry point for fair deep learning. Charan Reddy · Deepak Sharma · Soroush Mehri · Adriana Romero Soriano · Samira Shabanian · Sina Honari 🔗 - An Extensible Benchmark Suite for Learning to Simulate Physical Systems (Poster) []  [Paper] [ Visit Poster at Spot A1 in Virtual World ]    Simulating physical systems is a core component of scientific computing, encompassing a wide range of physical domains and applications. Recently, there has been a surge in data-driven methods to complement traditional numerical simulations methods, motivated by the opportunity to reduce computational costs and/or learn new physical models leveraging access to large collections of data. However, the diversity of problem settings and applications has led to a plethora of approaches, each one evaluated on a different setup and with different evaluation metrics. We introduce a set of benchmark problems to take a step towards unified benchmarks and evaluation protocols. We propose four representative physical systems, as well as a collection of both widely used classical time integrators and representative data-driven methods (kernel-based, MLP, CNN, Nearest-Neighbors). Our framework allows to evaluate objectively and systematically the stability, accuracy, and computational efficiency of data-driven methods. Additionally, it is configurable to permit adjustments for accommodating other learning tasks and for establishing a foundation for future developments in machine learning for scientific computing. Karl Otness · Arvi Gjoka · Joan Bruna · Daniele Panozzo · Benjamin Peherstorfer · Teseo Schneider · Denis Zorin 🔗 - The Multi-Agent Behavior Dataset: Mouse Dyadic Social Interactions (Poster) []  [Paper] [ Visit Poster at Spot F0 in Virtual World ]    Multi-agent behavior modeling aims to understand the interactions that occur between agents. We present a multi-agent dataset from behavioral neuroscience, the Caltech Mouse Social Interactions (CalMS21) Dataset. Our dataset consists of trajectory data of social interactions, recorded from videos of freely behaving mice in a standard resident-intruder assay. To help accelerate behavioral studies, the CalMS21 dataset provides benchmarks to evaluate the performance of automated behavior classification methods in three settings: (1) for training on large behavioral datasets all annotated by a single annotator, (2) for style transfer to learn inter-annotator differences in behavior definitions, and (3) for learning of new behaviors of interest given limited training data. The dataset consists of 6 million frames of unlabeled tracked poses of interacting mice, as well as over 1 million frames with tracked poses and corresponding frame-level behavior annotations. The challenge of our dataset is to be able to classify behaviors accurately using both labeled and unlabeled tracking data, as well as being able to generalize to new settings. Jennifer J Sun · Tomomi Karigo · Dipam Chakraborty · Sharada Mohanty · Benjamin Wild · Quan Sun · Chen Chen · David Anderson · Pietro Perona · Yisong Yue · Ann Kennedy 🔗 - Reinforcement Learning Benchmarks for Traffic Signal Control (Poster) []  [Paper] [ Visit Poster at Spot E1 in Virtual World ]    We propose a toolkit for developing and comparing reinforcement learning (RL)-based traffic signal controllers. The toolkit includes implementation of state-of-the-art deep-RL algorithms for signal control along with benchmark control problems that are based on realistic traffic scenarios. Importantly, the toolkit allows a first-of-its-kind comparison between state-of-the-art RL-based signal controllers while providing benchmarks for future comparisons. Consequently, we compare and report the relative performance of current RL algorithms. The experimental results suggest that previous algorithms are not robust to varying sensing assumptions and non-stylized intersection layouts. When more realistic signal layouts and advanced sensing capabilities are assumed, a distributed deep-Q learning approach is shown to outperform previously reported state-of-the-art algorithms in many cases. James Ault · Guni Sharon 🔗 - MiniHack the Planet: A Sandbox for Open-Ended Reinforcement Learning Research (Poster) []  [Paper] [ Visit Poster at Spot B3 in Virtual World ]    Progress in deep reinforcement learning (RL) is heavily driven by the availability of challenging benchmarks used for training agents. However, benchmarks that are widely adopted by the community are not explicitly designed for evaluating specific capabilities of RL methods. While there exist environments for assessing particular open problems in RL (such as exploration, transfer learning, unsupervised environment design, or even language-assisted RL), it is generally difficult to extend these to richer, more complex environments once research goes beyond proof-of-concept results. We present MiniHack, a powerful sandbox framework for easily designing novel RL environments. MiniHack is a one-stop shop for RL experiments with environments ranging from small rooms to complex, procedurally generated worlds. By leveraging the full set of entities and environment dynamics from NetHack, one of the richest grid-based video games, MiniHack allows designing custom RL testbeds that are fast and convenient to use. With this sandbox framework, novel environments can be designed easily, either using a human-readable description language or a simple Python interface. In addition to a variety of RL tasks and baselines, MiniHack can wrap existing RL benchmarks and provide ways to seamlessly add additional complexity. Mikayel Samvelyan · Rob Kirk · Vitaly Kurin · Jack Parker-Holder · Minqi Jiang · Eric Hambro · Fabio Petroni · Heiner Kuttler · Edward Grefenstette · Tim Rocktäschel 🔗 - Benchmarking Multi-Agent Deep Reinforcement Learning Algorithms in Cooperative Tasks (Poster) []  [Paper] [ Visit Poster at Spot B0 in Virtual World ]    Multi-agent deep reinforcement learning (MARL) suffers from a lack of commonly-used evaluation tasks and criteria, making comparisons between approaches difficult. In this work, we provide a systematic evaluation and comparison of three different classes of MARL algorithms (independent learning, centralised multi-agent policy gradient, value decomposition) in a diverse range of cooperative multi-agent learning tasks. Our experiments serve as a reference for the expected performance of algorithms across different learning tasks, and we provide insights regarding the effectiveness of different learning approaches. We open-source EPyMARL, which extends the PyMARL codebase to include additional algorithms and allow for flexible configuration of algorithm implementation details such as parameter sharing. Finally, we open-source two environments for multi-agent research which focus on coordination under sparse rewards. Georgios Papoudakis · Filippos Christianos · Lukas Schäfer · Stefano Albrecht 🔗 - Which priors matter? Benchmarking models for learning latent dynamics (Poster) []  [Paper] [ Visit Poster at Spot F1 in Virtual World ]    Learning dynamics is at the heart of many important applications of machine learning (ML), such as robotics and autonomous driving. In these settings, ML algorithms typically need to reason about a physical system using high dimensional observations, such as images, without access to the underlying state. Recently, several methods have proposed to integrate priors from classical mechanics into ML models to address the challenge of physical reasoning from images. In this work, we take a sober look at the current capabilities of these models. To this end, we introduce a suite consisting of 17 datasets with visual observations based on physical systems exhibiting a wide range of dynamics. We conduct a thorough and detailed comparison of the major classes of physically inspired methods alongside several strong baselines. While models that incorporate physical priors can often learn latent spaces with desirable properties, our results demonstrate that these methods fail to significantly improve upon standard techniques. Nonetheless, we find that the use of continuous and time-reversible dynamics benefits models of all classes. Alex Botev · Drew Jaegle · Peter Wirnsberger · Daniel Hennes · Irina Higgins 🔗 - The Neural MMO Platform for Massively Multiagent Research (Poster) []  [Paper] [ Visit Poster at Spot A3 in Virtual World ]    Neural MMO is a computationally accessible research platform that combines large agent populations, long time horizons, open-ended tasks, and modular game systems. Existing environments feature subsets of these properties, but Neural MMO is the first to combine them all. We present Neural MMO as free and open source software with active support, ongoing development, documentation, and additional training, logging, and visualization tools to help users adapt to this new setting. Initial baselines on the platform demonstrate that agents trained in large populations explore more and learn a progression of skills. We raise other more difficult problems such as many-team cooperation as open research questions which Neural MMO is well-suited to answer. Finally, we discuss current limitations of the platform, potential mitigations, and plans for continued development. Joseph Suarez · Yilun Du · Clare Zhu · Igor Mordatch · Phillip Isola 🔗 - A Procedural World Generation Framework for Systematic Evaluation of Continual Learning (Poster) []  [Paper] [ Visit Poster at Spot A2 in Virtual World ]    Several families of continual learning techniques have been proposed to alleviate catastrophic interference in deep neural network training on non-stationary data. However, a comprehensive comparison and analysis of limitations remains largely open due to the inaccessibility to suitable datasets. Empirical examination not only varies immensely between individual works, it further currently relies on contrived composition of benchmarks through subdivision and concatenation of various prevalent static vision datasets. In this work, our goal is to bridge this gap by introducing a computer graphics simulation framework that repeatedly renders only upcoming urban scene fragments in an endless real-time procedural world generation process. At its core lies a modular parametric generative model with adaptable generative factors. The latter can be used to flexibly compose data streams, which significantly facilitates a detailed analysis and allows for effortless investigation of various continual learning schemes. Timm Hess · Martin Mundt · Iuliia Pliushch · Visvanathan Ramesh 🔗 - Brax - A Differentiable Physics Engine for Large Scale Rigid Body Simulation (Poster) []  [Paper] [ Visit Poster at Spot B1 in Virtual World ]    We present Brax, an open source library for \textbf{r}igid \textbf{b}ody simulation with a focus on performance and parallelism on accelerators, written in JAX. We present results on a suite of tasks inspired by the existing reinforcement learning literature, but remade in our engine. Additionally, we provide reimplementations of PPO, SAC, ES, and direct policy optimization in JAX that compile alongside our environments, allowing the learning algorithm and the environment processing to occur on the same device, and to scale seamlessly on accelerators. Finally, we include notebooks that facilitate training of performant policies on common MuJoCo-like tasks in minutes. Daniel Freeman · Erik Frey · Anton Raichuk · Sertan Girgin · Igor Mordatch · Olivier Bachem 🔗 - CCNLab: A Benchmarking Framework for Computational Cognitive Neuroscience (Poster) []  [Paper] [ Visit Poster at Spot E3 in Virtual World ]    CCNLab is a benchmark for evaluating computational cognitive neuroscience models on empirical data. As a starting point, its focus is classical conditioning, which studies how animals predict reward and punishment in the environment. CCNLab includes a collection of simulations of seminal experiments expressed under a common API, as wells as tools for visualizing and comparing simulated data with empirical data. CCNLab is broad, incorporating representative experiments from different categories of phenomena; flexible, allowing the straightforward addition of new experiments; and easy-to-use, so researchers can focus on developing better models. We envision CCNLab as a testbed for unifying computational theories of learning in the brain. We also hope that it can broadly accelerate neuroscience research and facilitate interaction between the fields of neuroscience, psychology, and artificial intelligence. Nikhil Bhattasali · Momchil Tomov · Samuel J Gershman 🔗 - Addressing "Documentation Debt" in Machine Learning: A Retrospective Datasheet for BookCorpus (Poster) []  [Paper] [ Visit Poster at Spot D1 in Virtual World ]    This paper contributes a formal case study in retrospective dataset documentation and pinpoints several problems with the influential BookCorpus dataset. Recent work has underscored the importance of dataset documentation in machine learning research, including by addressing documentation debt'' for datasets that have been used widely but documented sparsely. BookCorpus is one such dataset. Researchers have used BookCorpus to train OpenAI's GPT-N models and Google's BERT models, but little to no documentation exists about the dataset's motivation, composition, collection process, etc. We offer a retrospective datasheet with key context and information about BookCorpus, including several notable deficiencies. In particular, we find evidence that (1) BookCorpus violates copyright restrictions for many books, (2) BookCorpus contains thousands of duplicated books, and (3) BookCorpus exhibits significant skews in genre representation. We also find hints of other potential deficiencies that call for future research, such as lopsided author contributions. While more work remains, this initial effort to provide a datasheet for BookCorpus offers a cautionary case study and adds to growing literature that urges more careful, systematic documentation of machine learning datasets. Jack Bandy · Nick Vincent 🔗 - Generating Datasets of 3D Garments with Sewing Patterns (Poster) []  [Paper] [ Visit Poster at Spot F2 in Virtual World ]    Garments are ubiquitous in both real and many of the virtual worlds. They are highly deformable objects, exhibit an immense variety of designs and shapes, and yet, most garments are created from a set of regularly shaped flat pieces. Exploration of garment structure presents a peculiar case for an object structure estimation task and might prove useful for downstream tasks of neural 3D garment modeling and reconstruction by providing strong prior on garment shapes. To facilitate research in these directions, we propose a method for generating large synthetic datasets of 3D garment designs and their sewing patterns. Our method consists of a flexible description structure for specifying parametric sewing pattern templates and the automatic generation pipeline to produce garment 3D models with little-to-none manual intervention. To add realism, the pipeline additionally creates corrupted versions of the final meshes that imitate artifacts of 3D scanning.With this pipeline, we created the first large-scale synthetic dataset of 3D garment models with their sewing patterns. The dataset contains more than 20000 garment design variations produced from 19 different base types. Seven of these garment types are specifically designed to target evaluation of the generalization across garment sewing pattern topologies. Maria Korosteleva · Sung-Hee Lee 🔗 - Teach Me to Explain: A Review of Datasets for Explainable Natural Language Processing (Poster) []  [Paper] [ Visit Poster at Spot A2 in Virtual World ]    Explainable Natural Language Processing (ExNLP) has increasingly focused on collecting human-annotated textual explanations. These explanations are used downstream in three ways: as data augmentation to improve performance on a predictive task, as supervision to train models to produce explanations for their predictions, and as a ground-truth to evaluate model-generated explanations. In this review, we identify 65 datasets with three predominant classes of textual explanations (highlights, free-text, and structured), organize the literature on annotating each type, identify strengths and shortcomings of existing collection methodologies, and give recommendations for collecting ExNLP datasets in the future. Sarah Wiegreffe · Ana Marasovic 🔗 - B-Pref: Benchmarking Preference-Based Reinforcement Learning (Poster) []  [Paper] [ Visit Poster at Spot E0 in Virtual World ]    Reinforcement learning (RL) requires access to a reward function that incentivizes the right behavior, but these are notoriously hard to specify for complex tasks. Preference-based RL provides an alternative: learning policies using a teacher's preferences without pre-defined rewards, thus overcoming concerns associated with reward engineering. However, it is difficult to quantify the progress in preference-based RL due to the lack of a commonly adopted benchmark. In this paper, we introduce B-Pref: a benchmark specially designed for preference-based RL. A key challenge with such a benchmark is providing the ability to evaluate candidate algorithms quickly, which makes relying on real human input for evaluation prohibitive. At the same time, simulating human input as giving perfect preferences for the ground truth reward function is unrealistic. B-Pref alleviates this by simulating teachers with a wide array of irrationalities, and proposes metrics not solely for performance but also for robustness to these potential irrationalities. We showcase the utility of B-Pref by using it to analyze algorithmic design choices, such as selecting informative queries, for state-of-the-art preference-based RL algorithms. We hope that B-Pref can serve as a common starting point to study preference-based RL more systematically. Source code is available at https://github.com/rll-research/B-Pref. Kimin Lee · Laura Smith · Anca Dragan · Pieter Abbeel 🔗 - Pervasive Label Errors in Test Sets Destabilize Machine Learning Benchmarks (Poster) []  [Paper] [ Visit Poster at Spot D0 in Virtual World ]    We identify label errors in the test sets of 10 of the most commonly-used computer vision, natural language, and audio datasets, and subsequently study the potential for these label errors to affect benchmark results. Errors in test sets are numerous and widespread: we estimate an average of 3.3% errors across the 10 datasets, where for example 2916 label errors comprise 6% of the ImageNet validation set. Putative label errors are identified using confident learning algorithms and then human-validated via crowdsourcing (54% of the algorithmically-flagged candidates are indeed erroneously labeled). Traditionally, machine learning practitioners choose which model to deploy based on test accuracy — our findings advise caution here, proposing that judging models over correctly labeled test sets may be more useful, especially for noisy real-world datasets. Surprisingly, we find that lower capacity models may be practically more useful than higher capacity models in real-world datasets with high proportions of erroneously labeled data. For example, on ImageNet with corrected labels: ResNet-18 outperforms ResNet-50 if the prevalence of originally mislabeled test examples increases by just 6%. On CIFAR-10 with corrected labels: VGG-11 outperforms VGG-19 if the prevalence of originally mislabeled test examples increases by just 5%. Curtis Northcutt · Anish Athalye · Jonas Mueller 🔗 - CommonsenseQA 2.0: Exposing the Limits of AI through Gamification (Poster) []  [Paper] [ Visit Poster at Spot F0 in Virtual World ]    Constructing benchmarks that test the abilities of modern natural language understanding models is difficult - pre-trained language models exploit artifacts in benchmarks to achieve human parity, but still fail on adversarial examples and make errors that demonstrate a lack of common sense. In this work, we propose gamification as a framework for data construction. The goal of players in the game is to compose questions that mislead a rival AI while using specific phrases for extra points. The game environment leads to enhanced user engagement and simultaneously gives the game designer control over the collected data, allowing us to collect high-quality data at scale. Using our method we create CommonsenseQA 2.0, which includes 14,343 yes/no questions, and demonstrate its difficulty for models that are orders-of-magnitude larger than the AI used in the game itself.Our best baseline, the T5-based Unicorn with 11B parameters achieves an accuracy of 70.2%, substantially higher than GPT-3 (52.9%) in a few-shot inference setup. Both score well below human performance which is at 94.1%. Alon Talmor · Ori Yoran · Ronan Le Bras · Chandra Bhagavatula · Yoav Goldberg · Yejin Choi · Jonathan Berant 🔗 - The Caltech Off-Policy Policy Evaluation Benchmarking Suite (Poster) []  [Paper] [ Visit Poster at Spot B2 in Virtual World ] We offer an experimental benchmark for off-policy policy evaluation (OPE) in reinforcement learning, which is a key problem in many safety critical applications. Given the increasing interest in deploying learning-based methods, there has been a flurry of recent proposals for OPE method, leading to a need for standardized empirical analyses. Our work takes a strong focus on diversity of experimental design to enable stress testing of OPE methods. We provide a comprehensive benchmarking suite to study the interplay of different attributes on method performance. We also show how to distill the results into a summarized set of guidelines for OPE in practice. Our software package is open-sourced and we invite interested researchers to further contribute to the benchmark. Cameron Voloshin · Hoang Le · Nan Jiang · Yisong Yue 🔗 - ThreeDWorld: A Platform for Interactive Multi-Modal Physical Simulation (Poster) []  [Paper] [ Visit Poster at Spot A0 in Virtual World ]    We introduce ThreeDWorld (TDW), a platform for interactive multi-modal physical simulation. TDW enables the simulation of high-fidelity sensory data and physical interactions between mobile agents and objects in rich 3D environments. Unique properties include: real-time near-photo-realistic image rendering; a library of objects and environments, and routines for their customization; generative procedures for efficiently building classes of new environments; high-fidelity audio rendering; realistic physical interactions for a variety of material types, including cloths, liquid, and deformable objects; customizable avatars” that embody AI agents; and support for human interactions with VR devices. TDW’s API enables multiple agents to interact within a simulation and returns a range of sensor and physics data representing the state of the world. We present initial experiments enabled by TDW in emerging research directions in computer vision, machine learning, and cognitive science, including multi-modal physical scene understanding, physical dynamics predictions, multi-agent interactions, models that ‘learn like a child’, and attention studies in humans and neural networks. Chuang Gan · Jeremy Schwartz · Seth Alter · Damian Mrowca · Martin Schrimpf · James Traer · Julian De Freitas · Jonas Kubilius · Abhishek Bhandwaldar · Nick Haber · Megumi Sano · Kuno Kim · Elias Wang · Michael Lingelbach · Aidan Curtis · Kevin Feigelis · Daniel Bear · Dan Gutfreund · David Cox · Antonio Torralba · James J DiCarlo · Josh Tenenbaum · Josh McDermott · Dan Yamins 🔗 - Physion: Evaluating Physical Prediction from Vision in Humans and Machines (Poster) []  [Paper] [ Visit Poster at Spot G1 in Virtual World ]    While machine learning algorithms excel at many challenging visual tasks, it is unclear that they can make predictions about commonplace real world physical events. Here, we present a visual and physical prediction benchmark that precisely measures this capability. In realistically simulating a wide variety of physical phenomena – rigid and soft-body collisions, stable multi-object configurations, rolling and sliding, projectile motion – our dataset presents a more comprehensive challenge than existing benchmarks. Moreover, we have collected human responses for our stimuli so that model predictions can be directly compared to human judgements. We compare an array of algorithms – varying in their architecture, learning objective, input-output structure, and training data – on their ability to make diverse physical predictions. We find that graph neural networks with access to the physical state best capture human behavior, whereas among models that receive only visual input, those with object-centric representations or pretraining do best but fall far short of human accuracy. This suggest that extracting physically meaningful representations of scenes is the main bottleneck to achieving human-like visual prediction. We thus demonstrate how our benchmark can identify areas for improvement and measure progress on this key aspect of physical understanding. Daniel Bear · Elias Wang · Damian Mrowca · Felix Binder · Hsiao-Yu Tung · Pramod RT · Cameron Holdaway · Sirui Tao · Kevin Smith · Fan-Yun Sun · Fei-Fei Li · Nancy Kanwisher · Josh Tenenbaum · Dan Yamins · Judith Fan 🔗 - CARLA: A Python Library to Benchmark Algorithmic Recourse and Counterfactual Explanation Algorithms (Poster) []  [Paper] [ Visit Poster at Spot F2 in Virtual World ]    Counterfactual explanations provide means for prescriptive model explanations by suggesting actionable feature changes (e.g., increase income) that allow individuals to achieve favourable outcomes in the future (e.g., insurance approval).Choosing an appropriate method is a crucial aspect for meaningful counterfactual explanations. As documented in recent reviews, there exists a quickly growing literature with available methods. Yet, in the absence of widely available open--source implementations, the decision in favour of certain models is primarily based on what is readily available. Going forward -- to guarantee meaningful comparisons across explanation methods -- we present \texttt{CARLA} (\textbf{C}ounterfactual \textbf{A}nd \textbf{R}ecourse \textbf{L}ibr\textbf{A}ry), a python library for benchmarking counterfactual explanation methods across both different data sets and different machine learning models. In summary, our work provides the following contributions: (i) an extensive benchmark of 11 popular counterfactual explanation methods, (ii) a benchmarking framework for research on future counterfactual explanation methods, and (iii) a standardized set of integrated evaluation measures and data sets for transparent and extensive comparisons of these methods.We have open sourced \texttt{CARLA} and our experimental results on \href{https://github.com/indyfree/CARLA}{Github}, making them available as competitive baselines. We welcome contributions from other research groups and practitioners. Martin Pawelczyk · Sascha Bielawski · Johan Van den Heuvel · Tobias Richter · Gjergji. Kasneci 🔗 - It's COMPASlicated: The Messy Relationship between RAI Datasets and Algorithmic Fairness Benchmarks (Poster) []  []  [Paper] [ Visit Poster at Spot A1 in Virtual World ]    Risk assessment instrument (RAI) datasets, particularly ProPublica's COMPAS dataset, are commonly used in algorithmic fairness papers due to benchmarking practices of comparing algorithms on datasets used in prior work. In many cases, this data is used as a benchmark to demonstrate good performance without accounting for the complexities of criminal justice (CJ) processes. We show that pretrial RAI datasets contain numerous measurement biases and errors inherent to CJ pretrial evidence and due to disparities in discretion and deployment, are limited in making claims about real-world outcomes, making the datasets a poor fit for benchmarking under assumptions of ground truth and real-world impact. Conventional practices of simply replicating previous data experiments may implicitly inherit or edify normative positions without explicitly interrogating assumptions. With context of how interdisciplinary fields have engaged in CJ research, algorithmic fairness practices are misaligned for meaningful contribution in the context of CJ, and would benefit from transparent engagement with normative considerations and values related to fairness, justice, and equality. These factors prompt questions about whether benchmarks for intrinsically socio-technical systems like the CJ system can exist in a beneficial and ethical way. Michelle Bao · Angela Zhou · Samantha Zottola · Brian Brubach · Sarah Desmarais · Aaron Horowitz · Kristian Lum · Suresh Venkatasubramanian 🔗 - Automatic Construction of Evaluation Suites for Natural Language Generation Datasets (Poster) []  [Paper] [ Visit Poster at Spot F3 in Virtual World ]    Machine learning approaches applied to NLP are often evaluated by summarizing their performance in a single number, for example accuracy. Since most test sets are constructed as an i.i.d. sample from the overall data, this approach overly simplifies the complexity of language and encourages overfitting to the head of the data distribution. As such, rare language phenomena or text about underrepresented groups are not equally included in the evaluation. To encourage more in-depth model analyses, researchers have proposed the use of multiple test sets, also called challenge sets, that assess specific capabilities of a model. In this paper, we develop a framework based on this idea which is able to generate controlled perturbations and identify subsets in text-to-scalar, text-to-text, or data-to-text settings. By applying this framework to the GEM generation benchmark, we propose an evaluation suite made of 80 challenge sets, demonstrate the kinds of analyses that it enables and shed light onto the limits of current generation models. Simon Mille · Kaustubh Dhole · Saad Mahamood · Laura Perez-Beltrachini · Varun Prashant Gangal · Mihir Kale · Emiel van Miltenburg · Sebastian Gehrmann 🔗 - Reduced, Reused and Recycled: The Life of a Dataset in Machine Learning Research (Poster) []  [Paper] [ Visit Poster at Spot C1 in Virtual World ]    Benchmark datasets play a central role in the organization of machine learning research. They coordinate researchers around shared research problems and serve as a measure of progress towards shared goals. Despite the foundational role benchmarking practices play in the field, relatively little attention has been paid to the dynamics of benchmark dataset use and resuse within and across machine learning subcommunities. In this work we dig into these dynamics, by studying how dataset usage patterns differ across different machine learning subcommunities and across time from 2015-2020. We find increasing concentration on fewer and fewer datasets within task communities, significant adoption of datasets from other tasks, and concentration across the field on datasets have been introduced by researchers situated within a small number of elite institutions. Our results have implications for scientific evaluation, AI ethics, and equity and access within the field. Bernard Koch · Emily Denton · Alex Hanna · Jacob G Foster 🔗 - Dynamic Environments with Deformable Objects (Poster) []  [Paper] [ Visit Poster at Spot C3 in Virtual World ]    We propose a set of environments with dynamic tasks that involve highly deformable topologically non-trivial objects. These environments facilitate easy experimentation: offer fast runtime, support large-scale parallel data generation, are easy to connect to reinforcement learning frameworks with OpenAI Gym API. We offer several types of benchmark tasks with varying levels of complexity, provide variants with procedurally generated cloth objects and randomized material textures. Moreover, we allow users to customize the tasks: import custom objects and textures, adjust size and material properties of deformable objects.We prioritize dynamic aspects of the tasks: forgoing 2D tabletop manipulation in favor of 3D tasks, with gravity and inertia playing a non-negligible role. Such advanced challenges require insights from multiple fields: machine learning and computer vision to process high-dimensional inputs, methods from computer graphics and topology to inspire structured and interpretable representations, insights from robotics to learn advanced control policies. We aim to help researches from these fields contribute their insights and simplify establishing interdisciplinary collaborations. Rika Antonova · peiyang shi · Hang Yin · Zehang Weng · Danica Kragic 🔗 - An Empirical Investigation of Representation Learning for Imitation (Poster) []  [Paper] [ Visit Poster at Spot B3 in Virtual World ]    Imitation learning often needs a large demonstration set in order to handle the full range of situations that an agent might find itself in during deployment. However, collecting expert demonstrations can be expensive. Recent work in vision, reinforcement learning, and NLP has shown that auxiliary representation learning objectives can reduce the need for large amounts of expensive, task-specific data. This paper investigates whether similar benefits apply to imitation learning. Specifically, we propose a modular framework for constructing representation learning algorithms, then use our framework to evaluate the utility of representation learning for imitation across several environment suites. In the settings we evaluate, we find that existing algorithms for image-based representation learning provide limited value relative to a well-tuned baseline with image augmentations. To explain this result, we investigate differences between imitation learning and other settings where representation learning has provided significant benefit, such as image classification. Finally, we release a well-documented codebase which both replicates our findings and provides a modular framework for creating new representation learning algorithms out of reusable components. Cynthia Chen · Sam Toyer · Cody Wild · Scott Emmons · Ian Fischer · Kuang-Huei Lee · Neel Alex · Steven Wang · Ping Luo · Stuart Russell · Pieter Abbeel · Rohin Shah 🔗 - OpenML Benchmarking Suites (Poster) []  [Paper] [ Visit Poster at Spot B1 in Virtual World ]    Machine learning research depends on objectively interpretable, comparable, and reproducible algorithm benchmarks. We advocate the use of curated, comprehensive suites of machine learning tasks to standardize the setup, execution, and reporting of benchmarks. We enable this through software tools that help to create and leverage these benchmarking suites. These are seamlessly integrated intothe OpenML platform, and accessible through interfaces in Python, Java, and R. OpenML benchmarking suites (a) are easy to use through standardized data formats, APIs, and client libraries; (b) come with extensive meta-information on the included datasets; and (c) allow benchmarks to be shared and reused in future studies. We then present a first, carefully curated and practical benchmarking suite for classification: the OpenML Curated Classification benchmarking suite 2018 (OpenML-CC18). Finally, we discuss use cases and applications which demonstrate the usefulness of OpenML benchmarking suites and the OpenML-CC18 in particular. Bernd Bischl · Giuseppe Casalicchio · Matthias Feurer · Pieter Gijsbers · Frank Hutter · Michel Lang · Rafael Gomes Mantovani · Jan van Rijn · Joaquin Vanschoren 🔗 - Systematic Evaluation of Causal Discovery in Visual Model Based Reinforcement Learning (Poster) []  [Paper] [ Visit Poster at Spot E3 in Virtual World ] Inducing causal relationships from observations is a classic problem in machine learning. Most work in causality starts from the premise that the causal variables themselves are observed. However, for AI agents such as robots trying to make sense of their environment, the only observables are low-level variables like pixels in images. To generalize well, an agent must induce high-level variables, particularly those which are causal or are affected by causal variables. A central goal for AI and causality is thus the joint discovery of abstract representations and causal structure. However, we note that existing environments for studying causal induction are poorly suited for this objective because they have complicated task-specific causal graphs which are impossible to manipulate parametrically (e.g., number of nodes, sparsity, causal chain length, etc.). In this work, our goal is to facilitate research in learning representations of high-level variables as well as causal structures among them. In order to systematically probe the ability of methods to identify these variables and structures, we design a suite of benchmarking RL environments. We evaluate various representation learning algorithms from the literature and find that explicitly incorporating structure and modularity in models can help causal induction in model-based reinforcement learning. Nan Rosemary Ke · Aniket Didolkar · Sarthak Mittal · Anirudh Goyal ALIAS PARTH GOYAL · Guillaume Lajoie · Stefan Bauer · Danilo Jimenez Rezende · Michael Mozer · Yoshua Bengio · Chris Pal 🔗 - RB2: Robotic Manipulation Benchmarking with a Twist (Poster) []  [Paper] [ Visit Poster at Spot D3 in Virtual World ]    Benchmarks offer a scientific way to compare algorithms using objective performance metrics. Good benchmarks have two features: (a) they should be widely useful for many research groups; (b) and they should produce reproducible findings. In robotic manipulation research, there is a trade-off between reproducibility and broad accessibility. If the benchmark is kept restrictive (fixed hardware, objects), the numbers are reproducible but the setup becomes less general. On the other hand, a benchmark could be a loose set of protocols (e.g. YCB object set) but the underlying variation in setups make the results non-reproducible. In this paper, we re-imagine benchmarking for robotic manipulation as state-of-the-art algorithmic implementations, alongside the usual set of tasks and experimental protocols. The added baseline implementations will provide a way to easily recreate SOTA numbers in a new local robotic setup, thus providing credible relative rankings between existing approaches and new work. However, these "local rankings" could vary between different setups. To resolve this issue, we build a mechanism for pooling experimental data between labs, and thus we establish a single global ranking for existing (and proposed) SOTA algorithms. Our benchmark, called Ranking-Based Robotics Benchmark (RB2), is evaluated on tasks that are inspired from clinically validated Southampton Hand Assessment Procedures. Our benchmark was run across two different labs and reveals several surprising findings. For example, extremely simple baselines like open-loop behavior cloning, outperform more complicated models (e.g. closed loop, RNN, Offline-RL, etc.) that are preferred by the field. We hope our fellow researchers will use \name to improve their research's quality and rigor. Sudeep Dasari · Jianren Wang · Joyce Hong · Shikhar Bahl · Yixin Lin · Austin Wang · Abitha Thankaraj · Karanbir Chahal · Berk Calli · Saurabh Gupta · David Held · Lerrel Pinto · Deepak Pathak · Vikash Kumar · Abhinav Gupta 🔗 - Really Doing Great at Estimating CATE? A Critical Look at ML Benchmarking Practices in Treatment Effect Estimation (Poster) []  [Paper] [ Visit Poster at Spot B2 in Virtual World ]    The machine learning (ML) toolbox for estimation of heterogeneous treatment effects from observational data is expanding rapidly, yet many of its algorithms have been evaluated only on a very limited set of semi-synthetic benchmark datasets. In this paper, we investigate current benchmarking practices for ML-based conditional average treatment effect (CATE) estimators, with special focus on empirical evaluation based on the popular semi-synthetic IHDP benchmark. We identify problems with current practice and highlight that semi-synthetic benchmark datasets, which (unlike real-world benchmarks used elsewhere in ML) do not necessarily reflect properties of real data, can systematically favor some algorithms over others -- a fact that is rarely acknowledged but of immense relevance for interpretation of empirical results. Further, we argue that current evaluation metrics evaluate performance only for a small subset of possible use cases of CATE estimators, and discuss alternative metrics relevant for applications in personalized medicine. Additionally, we discuss alternatives for current benchmark datasets, and implications of our findings for benchmarking in CATE estimation. Alicia Curth · David Svensson · Jim Weatherall · Mihaela van der Schaar 🔗 - Chest ImaGenome Dataset for Clinical Reasoning (Poster) []  [Paper] [ Visit Poster at Spot H0 in Virtual World ]    Despite the progress in automatic detection of radiologic findings from chest X-ray (CXR) images in recent years, a quantitative evaluation of the explainability of these models is hampered by the lack of locally labeled datasets for different findings. With the exception of a few expert-labeled small-scale datasets for specific findings, such as pneumonia and pneumothorax, most of the CXR deep learning models to date are trained on global "weak" labels extracted from text reports, or trained via a joint image and unstructured text learning strategy. Inspired by the Visual Genome effort in the computer vision community, we constructed the first Chest ImaGenome dataset with a scene graph data structure to describe 242,072 images. Local annotations are automatically produced using a joint rule-based natural language processing (NLP) and atlas-based bounding box detection pipeline. Through a radiologist constructed CXR ontology, the annotations for each CXR are connected as an anatomy-centered scene graph, useful for image-level reasoning and multimodal fusion applications. Overall, we provide: i) 1,256 combinations of relation annotations between 29 CXR anatomical locations (objects with bounding box coordinates) and their attributes, structured as a scene graph per image, ii) over 670,000 localized comparison relations (for improved, worsened, or no change) between the anatomical locations across sequential exams, as well as ii) a manually annotated gold standard scene graph dataset from 500 unique patients. Joy T Wu · Nkechinyere Agu · Ismini Lourentzou · Arjun Sharma · Joseph Alexander Paguio · Jasper Seth Yao · Edward C Dee · William Mitchell · Satyananda Kashyap · Andrea Giovannini · Leo Anthony Celi · Mehdi Moradi 🔗 - Mitigating dataset harms requires stewardship: Lessons from 1000 papers (Poster) []  [Paper] [ Visit Poster at Spot C0 in Virtual World ]    Machine learning datasets have elicited concerns about privacy, bias, and unethical applications, leading to the retraction of prominent datasets such as DukeMTMC, MS-Celeb-1M, and Tiny Images. In response, the machine learning community has called for higher ethical standards in dataset creation. To help inform these efforts, we studied three influential but ethically problematic face and person recognition datasets—DukeMTMC, MS-Celeb-1M, and Labeled Faces in the Wild (LFW)—by analyzing nearly 1000 papers that cite them. We found that the creation of derivative datasets and models, broader technological and social change, the lack of clarity of licenses, and dataset management practices can introduce a wide range of ethical concerns. We conclude by suggesting a distributed approach to harm mitigation that considers the entire life cycle of a dataset. Kenny Peng · Arunesh Mathur · Arvind Narayanan 🔗 - Artsheets for Art Datasets (Poster) []  [Paper] [ Visit Poster at Spot E1 in Virtual World ]    Machine learning (ML) techniques are increasingly being employed within a variety of creative domains. For example, ML tools are being used to analyze the authenticity of artworks, to simulate artistic styles, and to augment human creative processes. While this progress has opened up new creative avenues, it has also paved the way for adverse downstream effects such as cultural appropriation (e.g., cultural misrepresentation, offense, and undervaluing) and representational harm. Many such concerning issues stem from the training data in ways that diligent evaluation can uncover, prevent, and mitigate. We posit that, when developing an arts-based dataset, it is essential to consider the social factors that influenced the process of conception and design, and the resulting gaps must be examined in order to maximize understanding of the dataset's meaning and future impact. Each dataset creator's decision produces opportunities, but also omissions. Each choice, moreover, builds on preexisting histories of the data's formation and handling across time by prior actors including, but not limited to, art collectors, galleries, libraries, archives, museums, and digital repositories. To illuminate the aforementioned aspects, we provide a checklist of questions customized for use with art datasets in order to help guide assessment of the ways that dataset design may either perpetuate or shift exclusions found in repositories of art data. The checklist is organized to address the dataset creator's motivation together with dataset provenance, composition, collection, pre-processing, cleaning, labeling, use (including data generation), distribution, and maintenance. Two case studies exemplify the value and application of our questionnaire. Ramya Srinivasan · Emily Denton · Jordan Famularo · Negar Rostamzadeh · Fernando Diaz · Beth Coleman 🔗 - An Empirical Study of Graph Contrastive Learning (Poster) []  []  [Paper] [ Visit Poster at Spot E0 in Virtual World ]    Graph Contrastive Learning (GCL) establishes a new paradigm for learning graph representations without human annotations. Although remarkable progress has been witnessed recently, the success behind GCL is still left somewhat mysterious. In this work, we first identify several critical design considerations within a general GCL paradigm, including augmentation functions, contrasting modes, contrastive objectives, and negative mining techniques. Then, to understand the interplay of different GCL components, we conduct extensive, controlled experiments over a set of benchmark tasks on datasets across various domains. Our empirical studies suggest a set of general receipts for effective GCL, e.g., simple topology augmentations that produce sparse graph views bring promising performance improvements; contrasting modes should be aligned with the granularities of end tasks. In addition, to foster future research and ease the implementation of GCL algorithms, we develop an easy-to-use library PyGCL, featuring modularized CL components, standardized evaluation, and experiment management. We envision this work to provide useful empirical evidence of effective GCL algorithms and offer several insights for future research. Yanqiao Zhu · Yichen Xu · Qiang Liu · Shu Wu 🔗 - Monash Time Series Forecasting Archive (Poster) []  [Paper] [ Visit Poster at Spot D3 in Virtual World ]    Many businesses nowadays rely on large quantities of time series data making time series forecasting an important research area. Global forecasting models and multivariate models that are trained across sets of time series have shown huge potential in providing accurate forecasts compared with the traditional univariate forecasting models that work on isolated series. However, there are currently no comprehensive time series forecasting archives that contain datasets of time series from similar sources available for researchers to evaluate the performance of new global or multivariate forecasting algorithms over varied datasets. In this paper, we present such a comprehensive forecasting archive containing 25 publicly available time series datasets from varied domains, with different characteristics in terms of frequency, series lengths, and inclusion of missing values. We also characterise the datasets, and identify similarities and differences among them, by conducting a feature analysis. Furthermore, we present the performance of a set of standard baseline forecasting methods over all datasets across ten error metrics, for the benefit of researchers using the archive to benchmark their forecasting algorithms. Rakshitha W Godahewa · Christoph Bergmeir · Geoffrey Webb · Rob Hyndman · Pablo Montero-Manso 🔗 - Synthetic Benchmarks for Scientific Research in Explainable Machine Learning (Poster) []  [Paper] [ Visit Poster at Spot G0 in Virtual World ]    As machine learning models grow more complex and their applications become more high-stakes, tools for explaining model predictions have become increasingly important. This has spurred a flurry of research in model explainability and has given rise to feature attribution methods such as LIME and SHAP. Despite their widespread use, evaluating and comparing different feature attribution methods remains challenging: evaluations ideally require human studies, and empirical evaluation metrics are often data-intensive or computationally prohibitive on real-world datasets. In this work, we address this issue by releasing XAI-BENCH: a suite of synthetic datasets along with a library for benchmarking feature attribution algorithms. Unlike real-world datasets, synthetic datasets allow the efficient computation of conditional expected values that are needed to evaluate ground-truth Shapley values and other metrics. The synthetic datasets we release offer a wide variety of parameters that can be configured to simulate real-world data. We demonstrate the power of our library by benchmarking popular explainability techniques across several evaluation metrics and across a variety of settings. The versatility and efficiency of our library will help researchers bring their explainability methods from development to deployment. Our code is available at https://github.com/abacusai/xai-bench. Yang Liu · Sujay Khandagale · Colin White · Willie Neiswanger 🔗 - A Toolbox for Construction and Analysis of Speech Datasets (Poster) []  [Paper] [ Visit Poster at Spot B0 in Virtual World ]    Automatic Speech Recognition and Text-to-Speech systems are primarily trained in a supervised fashion and require high-quality, accurately labeled speech datasets. In this work, we examine common problems with speech data and introduce a toolbox for the construction and interactive error analysis of speech datasets. The construction tool is based on K{\"u}rzinger et al. work, and, to the best of our knowledge, the dataset exploration tool is the world's first open-source tool of this kind. We demonstrate how to apply these tools to create a Russian speech dataset and analyze existing speech datasets (Multilingual LibriSpeech, Mozilla Common Voice). The tools are open sourced as a part of the NeMo framework. Evelina Bakhturina · Vitaly Lavrukhin · Boris Ginsburg 🔗 - Evaluating Bayes Error Estimators on Real-World Datasets with FeeBee (Poster) []  [Paper] [ Visit Poster at Spot E2 in Virtual World ]    The Bayes error rate (BER) is a fundamental concept in machine learning that quantifies the best possible accuracy any classifier can achieve on a fixed probability distribution. Despite years of research on building estimators of lower and upper bounds for the BER, these were usually compared only on synthetic datasets with known probability distributions, leaving two key questions unanswered: (1) How well do they perform on realistic, non-synthetic datasets?, and (2) How practical are they? Answering these is not trivial. Apart from the obvious challenge of an unknown BER for real-world datasets, there are two main aspects any BER estimator needs to overcome in order to be applicable in real-world settings: (1) the computational and sample complexity, and (2) the sensitivity and selection of hyper-parameters.In this work, we propose FeeBee, the first principled framework for analyzing and comparing BER estimators on modern real-world datasets with unknown probability distribution. We achieve this by injecting a controlled amount of label noise and performing multiple evaluations on a series of different noise levels, supported by a theoretical result which allows drawing conclusions about the evolution of the BER. By implementing and analyzing 7 multi-class BER estimators on 6 commonly used datasets of the computer vision and NLP domains, FeeBee allows a thorough study of these estimators, clearly identifying strengths and weaknesses of each, whilst being easily deployable on any future BER estimator. Cedric Renggli · Luka Rimanic · Nora Hollenstein · Ce Zhang 🔗 - Alchemy: A benchmark and analysis toolkit for meta-reinforcement learning agents (Poster) []  [Paper] [ Visit Poster at Spot D2 in Virtual World ]    There has been rapidly growing interest in meta-learning as a method for increasing the flexibility and sample efficiency of reinforcement learning. One problem in this area of research, however, has been a scarcity of adequate benchmark tasks. In general, the structure underlying past benchmarks has either been too simple to be inherently interesting, or too ill-defined to support principled analysis. In the present work, we introduce a new benchmark for meta-RL research, emphasizing transparency and potential for in-depth analysis as well as structural richness. Alchemy is a 3D video game, implemented in Unity, which involves a latent causal structure that is resampled procedurally from episode to episode, affording structure learning, online inference, hypothesis testing and action sequencing based on abstract domain knowledge. We evaluate a pair of powerful RL agents on Alchemy and present an in-depth analysis of one of these agents. Results clearly indicate a frank and specific failure of meta-learning, providing validation for Alchemy as a challenging benchmark for meta-RL. Concurrent with this report, we are releasing Alchemy as public resource, together with a suite of analysis tools and sample agent trajectories. Jane Wang · Michael King · Nicolas Porcel · Zeb Kurth-Nelson · Tina Zhu · Charlie Deck · Peter Choy · Mary Cassin · Malcolm Reynolds · Francis Song · Gavin Buttimore · David Reichert · Neil Rabinowitz · Loic Matthey · Demis Hassabis · Alexander Lerchner · Matt Botvinick 🔗 - FFA-IR: Towards an Explainable and Reliable Medical Report Generation Benchmark  (Poster) []  [Paper] [ Visit Poster at Spot G3 in Virtual World ]    The automatic generation of long and coherent medical reports given medical images (e.g. Chest X-ray and Fundus Fluorescein Angiography (FFA)) has great potential to support clinical practice. Researchers have explored advanced methods from computer vision and natural language processing to incorporate medical domain knowledge for the generation of readable medical reports. However, existing medical report generation (MRG) benchmarks lack both explainable annotations and reliable evaluation tools, hindering the current research advances from two aspects: firstly, existing methods can only predict reports without accurate explanation, undermining the trustworthiness of the diagnostic methods; secondly, the comparison among the predicted reports from different MRG methods is unreliable using the evaluation metrics of natural-language generation (NLG). To address these issues, in this paper, we propose an explainable and reliable MRG benchmark based on FFA Images and Reports (FFA-IR). Specifically, FFA-IR is large, with 10,790 reports along with 1,048,584 FFA images from clinical practice; it includes explainable annotations, based on a schema of 46 categories of lesions; and it is bilingual, providing both English and Chinese reports for each case. Besides using the widely used NLG metrics, we propose a set of nine human evaluation criteria to evaluate the generated reports. We envision FFA-IR as a testbed for explainable and reliable medical report generation. We also hope that it can broadly accelerate medical imaging research and facilitate interaction between the fields of medical imaging, computer vision, and natural language processing. Mingjie Li · Wenjia Cai · Rui Liu · Yuetian Weng · Xiaoyun Zhao · Cong Wang · Xin Chen · Zhong Liu · Caineng Pan · Mengke Li · yingfeng zheng · Yizhi Liu · Flora Salim · Karin Verspoor · Xiaodan Liang · Xiaojun Chang 🔗 - An Information Retrieval Approach to Building Datasets for Hate Speech Detection (Poster) []  [Paper] [ Visit Poster at Spot D2 in Virtual World ]    Building a benchmark dataset for hate speech detection presents several challenges. Firstly, because hate speech is relatively rare -- e.g., less than 3% of Twitter posts are hateful -- random sampling of tweets to annotate is inefficient in capturing hate speech. A common practice is to only annotate tweets containing known "hate words", but this risks yielding a biased benchmark that only partially captures the real-world phenomenon of interest. A second challenge is that definitions of hate speech tend to be highly variable and subjective. Annotators having diverse prior notions of hate speech may not only disagree with one another but also struggle to conform to specified labeling guidelines. Our key insight is that the rarity and subjectivity of hate speech are akin to that of relevance in information retrieval (IR). This connection suggests that well-established methodologies for creating IR test collections might also be usefully applied to create better benchmark datasets for hate speech detection. Firstly, in order to intelligently and efficiently select which tweets to annotate, we apply established IR techniques of pooling and active learning. Secondly, to improve both consistency and value of annotations, we apply task decomposition and annotator rationale techniques. Using the above techniques, we share a new benchmark dataset for hate speech detection on Twitter with broader coverage than prior datasets. We also show a dramatic drop in accuracy of existing detection models when tested on these broader forms of hate. Collected annotator rationales not only provide documented support for labeling decisions but also create exciting future work opportunities for dual-supervision and/or explanation generation in modeling. Md Mustafizur Rahman · Dinesh Balakrishnan · Dhiraj Murthy · Mucahid Kutlu · Matt Lease 🔗 - Open Bandit Dataset and Pipeline: Towards Realistic and Reproducible Off-Policy Evaluation (Poster) []  [Paper] [ Visit Poster at Spot C1 in Virtual World ]    \textit{Off-policy evaluation} (OPE) aims to estimate the performance of hypothetical policies using data generated by a different policy. Because of its huge potential impact in practice, there has been growing research interest in this field. There is, however, no real-world public dataset that enables the evaluation of OPE, making its experimental studies unrealistic and irreproducible. With the goal of enabling realistic and reproducible OPE research, we present \textit{Open Bandit Dataset}, a public logged bandit dataset collected on a large-scale fashion e-commerce platform, ZOZOTOWN. Our dataset is unique in that it contains a set of \textit{multiple} logged bandit datasets collected by running different policies on the same platform. This enables experimental comparisons of different OPE estimators for the first time. We also develop Python software called \textit{Open Bandit Pipeline} to streamline and standardize the implementation of batch bandit algorithms and OPE. Our open data and software will contribute to fair and transparent OPE research and help the community identify fruitful research directions. We provide extensive benchmark experiments of existing OPE estimators using our dataset and software. The results open up essential challenges and new avenues for future OPE research. Yuta Saito · Shunsuke Aihara · Megumi Matsutani · Yusuke Narita 🔗 - ManiSkill: Generalizable Manipulation Skill Benchmark with Large-Scale Demonstrations (Poster) []  [Paper] [ Visit Poster at Spot C2 in Virtual World ]    Object manipulation from 3D visual inputs poses many challenges on building generalizable perception and policy models. However, 3D assets in existing benchmarks mostly lack the diversity of 3D shapes that align with real-world intra-class complexity in topology and geometry. Here we propose SAPIEN Manipulation Skill Benchmark (ManiSkill) to benchmark manipulation skills over diverse objects in a full-physics simulator. 3D assets in ManiSkill include large intra-class topological and geometric variations. Tasks are carefully chosen to cover distinct types of manipulation challenges. Latest progress in 3D vision also makes us believe that we should customize the benchmark so that the challenge is inviting to researchers working on 3D deep learning. To this end, we simulate a moving panoramic camera that returns ego-centric point clouds or RGB-D images. In addition, we would like ManiSkill to serve a broad set of researchers interested in manipulation research. Besides supporting the learning of policies from interactions, we also support learning-from-demonstrations (LfD) methods, by providing a large number of high-quality demonstrations (~36,000 successful trajectories, ~1.5M point cloud/RGB-D frames in total). We provide baselines using 3D deep learning and LfD algorithms. All code of our benchmark (simulator, environment, SDK, and baselines) is open-sourced (\href{https://github.com/haosulab/ManiSkill}{Github repo}), and a challenge facing interdisciplinary researchers will be held based on the benchmark. Tongzhou Mu · Zhan Ling · Fanbo Xiang · Derek Yang · Xuanlin Li · Stone Tao · Zhiao Huang · Zhiwei Jia · Hao Su 🔗 - AI and the Everything in the Whole Wide World Benchmark (Poster) []  [Paper] [ Visit Poster at Spot A0 in Virtual World ]    There is a tendency across different subfields in AI to see value in a small collection of influential benchmarks, which we term general'' benchmarks. These benchmarks operate as stand-ins or abstractions for a range of anointed common problems that are frequently framed as foundational milestones on the path towards flexible and generalizable AI systems. State-of-the-art performance on these benchmarks is widely understood as indicative of progress towards these long-term goals. In this position paper, we explore how such benchmarks are designed, constructed and used in order to reveal key limitations of their framing as the functionally`general'' broad measures of progress they are set up to be. Deborah Raji · Emily Denton · Emily M. Bender · Alex Hanna · Amandalynne Paullada 🔗 - Are We Learning Yet? A Meta Review of Evaluation Failures Across Machine Learning (Poster) []  [Paper] [ Visit Poster at Spot C2 in Virtual World ]    Many subfields of machine learning share a common stumbling block: evaluation. Advances in machine learning often evaporate under closer scrutiny or turn out to be less widely applicable than originally hoped. We conduct a meta-review of 107 survey papers from natural language processing, recommender systems, computer vision, reinforcement learning, computational biology, graph learning, and more, organizing the wide range of surprisingly consistent critique into a concrete taxonomy of observed failure modes. Inspired by measurement and evaluation theory, we divide failure modes into two categories: internal and external validity. Internal validity issues pertain to evaluation on a learning problem in isolation, such as improper comparisons to baselines or overfitting from test set re-use. External validity relies on relationships between different learning problems, for instance, whether progress on a learning problem translates to progress on seemingly related tasks. Thomas I. Liao · Rohan Taori · Deborah Raji · Ludwig Schmidt 🔗 - Isaac Gym: High Performance GPU Based Physics Simulation For Robot Learning (Poster) []  [Paper] [ Visit Poster at Spot C0 in Virtual World ]    Isaac Gym offers a high-performance learning platform to train policies for a wide variety of robotics tasks entirely on GPU. Both physics simulation and neural network policy training reside on GPU and communicate by directly passing data from physics buffers to PyTorch tensors without ever going through CPU bottlenecks. This leads to blazing fast training times for complex robotics tasks on a single GPU with 2-3 orders of magnitude improvements compared to conventional RL training that uses a CPU-based simulator and GPUs for neural networks. We host the results and videos at https://sites.google.com/view/isaacgym-nvidia and Isaac Gym can be downloaded at https://developer.nvidia.com/isaac-gym. The benchmark and environments are available at https://github.com/NVIDIA-Omniverse/IsaacGymEnvs. Viktor Makoviychuk · Lukasz Wawrzyniak · Yunrong Guo · Michelle Lu · Kier Storey · Miles Macklin · David Hoeller · Nikita Rudin · Arthur Allshire · Ankur Handa · Gavriel State 🔗 - Hardware Design and Accurate Simulation for Benchmarking of 3D Reconstruction Algorithms (Poster) []  [Paper] [ Visit Poster at Spot D0 in Virtual World ]    Images of a real scene taken with a camera commonly differ from synthetic images of a virtual replica of the same scene, despite advances in light transport simulation and calibration. By explicitly co-developing the scanning hardware and rendering pipeline we are able to achieve negligible per-pixel difference between the real image taken by the camera and the synthesized image on geometrically complex calibration object with known material properties. This approach provides an ideal test-bed for developing data-driven algorithms in the area of 3D reconstruction, as the synthetic data is indistinguishable from real data and can be generated at large scale.Pixel-wise matching also provides an effective way to quantitatively evaluate data-driven reconstruction algorithms.We introduce three benchmark problems using the data generated with our system: (1) a benchmark for surface reconstruction from dense point clouds, (2) a denoising procedure tailored to structured light scanning, and (3) a range scan completion algorithm for CAD models. We also provide a large collection of high-resolution scans that allow our system and benchmarks to be used without having to reproduce the hardware setup. Sebastian Koch · Yurii Piadyk · Markus Worchel · Marc Alexa · Claudio Silva · Denis Zorin · Daniele Panozzo 🔗 - The Medkit-Learn(ing) Environment: Medical Decision Modelling through Simulation (Poster) []  [Paper] [ Visit Poster at Spot D1 in Virtual World ]    The goal of understanding decision-making behaviours in clinical environments is of paramount importance if we are to bring the strengths of machine learning to ultimately improve patient outcomes. Mainstream development of algorithms is often geared towards optimal performance in tasks that do not necessarily translate well into the medical regime---due to several factors including the lack of public availability of realistic data, the intrinsically offline nature of the problem, as well as the complexity and variety of human behaviours. We therefore present a new benchmarking suite designed specifically for medical sequential decision modelling: the Medkit-Learn(ing) Environment, a publicly available Python package providing simple and easy access to high-fidelity synthetic medical data. While providing a standardised way to compare algorithms in a realistic medical setting, we employ a generating process that disentangles the policy and environment dynamics to allow for a range of customisations, thus enabling systematic evaluation of algorithms’ robustness against specific challenges prevalent in healthcare. Alex Chan · Ioana Bica · Alihan Hüyük · Dan Jarrett · Mihaela van der Schaar 🔗 - URLB: Unsupervised Reinforcement Learning Benchmark (Poster) []  [Paper] [ Visit Poster at Spot E2 in Virtual World ]    Deep Reinforcement Learning (RL) has emerged as a powerful paradigm to solve a range of complex yet specific control tasks. Training generalist agents that can quickly adapt to new tasks remains an outstanding challenge. Recent advances in unsupervised RL have shown that pre-training RL agents with self-supervised intrinsic rewards can result in efficient adaptation. However, these algorithms have been hard to compare and develop due to the lack of a unified benchmark. To this end, we introduce the Unsupervised Reinforcement Learning Benchmark (URLB). URLB consists of two phases: reward-free pre-training and downstream task adaptation with extrinsic rewards. Building on the DeepMind Control Suite, we provide twelve continuous control tasks from three domains for evaluation and open-source code for eight leading unsupervised RL methods. We find that the implemented baselines make progress but are not able to solve URLB and propose directions for future research. Misha Laskin · Denis Yarats · Hao Liu · Kimin Lee · Albert Zhan · Kevin Lu · Catherine Cang · Lerrel Pinto · Pieter Abbeel 🔗 - What Would Jiminy Cricket Do? Towards Agents That Behave Morally (Poster) []  [Paper] [ Visit Poster at Spot G2 in Virtual World ]    When making everyday decisions, people are guided by their conscience, an internal sense of right and wrong, to behave morally. By contrast, artificial agents may behave immorally when trained on environments that ignore moral concerns, such as violent video games. With the advent of generally capable agents that pretrain on many environments, mitigating inherited biases towards immoral behavior will become necessary. However, prior work on aligning agents with human values and morals focuses on small-scale settings lacking in semantic complexity. To enable research in larger, more realistic settings, we introduce Jiminy Cricket, an environment suite of 25 text-based adventure games with thousands of semantically rich, morally salient scenarios. Via dense annotations for every possible action, Jiminy Cricket environments robustly evaluate whether agents can act morally while maximizing reward. To improve moral behavior, we leverage language models with commonsense moral knowledge and develop strategies to mediate this knowledge into actions. In extensive experiments, we find that our artificial conscience approach can steer agents towards moral behavior without sacrificing performance. Dan Hendrycks · Mantas Mazeika · Andy Zou · Sahil Patel · Christine Zhu · Jesus Navarro · Dawn Song · Bo Li · Jacob Steinhardt 🔗

#### Author Information

##### Joaquin Vanschoren (Eindhoven University of Technology)

Joaquin Vanschoren is an Assistant Professor in Machine Learning at the Eindhoven University of Technology. He holds a PhD from the Katholieke Universiteit Leuven, Belgium. His research focuses on meta-learning and understanding and automating machine learning. He founded and leads OpenML.org, a popular open science platform that facilitates the sharing and reuse of reproducible empirical machine learning data. He obtained several demo and application awards and has been invited speaker at ECDA, StatComp, IDA, AutoML@ICML, CiML@NIPS, AutoML@PRICAI, MLOSS@NIPS, and many other occasions, as well as tutorial speaker at NIPS and ECMLPKDD. He was general chair at LION 2016, program chair of Discovery Science 2018, demo chair at ECMLPKDD 2013, and co-organizes the AutoML and meta-learning workshop series at NIPS 2018, ICML 2016-2018, ECMLPKDD 2012-2015, and ECAI 2012-2014. He is also editor and contributor to the book 'Automatic Machine Learning: Methods, Systems, Challenges'.