Sat 7:30 a.m. - 7:45 a.m.
Opening Remarks
Divyansh Kaushik
Sat 7:45 a.m. - 8:15 a.m.
Invited Keynote by Jason Weston (Keynote)
Jason Weston
Sat 8:15 a.m. - 8:25 a.m.
Near-Negative Distinction: Giving a Second Life to Human Evaluation Datasets (Oral)
Precisely assessing the progress in natural language generation (NLG) tasks is challenging, and human evaluation is often necessary. However, human evaluation is usually costly, difficult to reproduce, and non-reusable. In this paper, we propose a new and simple automatic evaluation method for NLG called Near-Negative Distinction (NND) that repurposes prior human annotations into NND tests. To pass an NND test, an NLG model must place a higher likelihood on a high-quality output candidate than on a near-negative candidate with a known error. Model performance is established by the number of NND tests a model passes, as well as the distribution over task-specific errors the model fails on. Through experiments on three NLG tasks (question generation, question answering, and summarization), we show that NND achieves a higher correlation with human judgments than standard NLG evaluation metrics. We invite the community to adopt NND as a generic method for NLG evaluation and to contribute new NND test collections.
Philippe Laban · Chien-Sheng Wu · Wenhao Liu · Caiming Xiong
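As a concrete illustration of the NND test described in the abstract, the following is a minimal Python sketch. The log_likelihood helper and the (context, good, bad, error_type) test tuples are assumptions for illustration, not the authors' released code.

from collections import Counter

def run_nnd_tests(model, tests, log_likelihood):
    """tests: iterable of (context, good_candidate, near_negative, error_type)."""
    passed, failed_by_error = 0, Counter()
    for context, good, bad, error_type in tests:
        # A model passes the test if it assigns higher likelihood to the
        # high-quality candidate than to the near-negative with a known error.
        if log_likelihood(model, context, good) > log_likelihood(model, context, bad):
            passed += 1
        else:
            failed_by_error[error_type] += 1
    # Pass rate plus the distribution of error types the model fails on.
    return passed / len(tests), failed_by_error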
Sat 8:25 a.m. - 8:35 a.m.
Are GAN Biased? Evaluating GAN-Generated Facial Images via Crowdsourcing (Oral)
Generative models produce astonishingly high-resolution and realistic facial images. However, reliably evaluating the quality of these images remains challenging, not to mention performing a systematic investigation of the potential biases in generative adversarial networks (GANs). In this paper, we argue that crowdsourcing can be used to measure the biases in GANs quantitatively. We showcase an investigation that examines whether GAN-generated facial images with darker skin tones are of worse quality. We ask crowd workers to guess whether an image is real or fake, and use this as a proxy metric for estimating the quality of facial images generated by state-of-the-art GANs. The results show that GANs generate worse-quality images for darker skin tones than for lighter skin tones.
Hangzhi Guo · Lizhen Zhu · Ting-Hao Huang
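A minimal sketch of the crowdsourced proxy metric described above: the fraction of generated images that workers mistake for real, broken down by skin-tone group. The field names (group, guessed_real) are illustrative assumptions.

from collections import defaultdict

def fool_rate_by_group(judgments):
    """judgments: iterable of dicts like {"group": "darker", "guessed_real": True}."""
    totals, fooled = defaultdict(int), defaultdict(int)
    for j in judgments:
        totals[j["group"]] += 1
        fooled[j["group"]] += int(j["guessed_real"])
    # A higher fool rate means workers more often believed the generated image
    # was real, which serves as a proxy for higher image quality.
    return {g: fooled[g] / totals[g] for g in totals}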
Sat 8:35 a.m. - 8:45 a.m.
Towards Credible Human Evaluation of Open-Domain Dialog Systems Using Interactive Setup (Oral)
Evaluating open-domain conversation models has been an open challenge due to the open-ended nature of conversations. In addition to static evaluations, recent work has started to explore a variety of per-turn and per-dialog interactive evaluation mechanisms and to provide advice on the best setup. In this work, we adopt the interactive evaluation framework and apply it to multiple models, with a focus on per-turn evaluation techniques. Apart from the widely used setting in which participants select the best response among different candidates at each turn, we adopt a further, novel per-turn evaluation setting in which participants can select all appropriate responses, with different fallback strategies to continue the conversation when no response is selected. We evaluate these settings based on sensitivity and consistency using four GPT2-based models that differ in model size or fine-tuning data. To generalize to model groups with no prior assumptions on their rankings, and to control evaluation costs for all setups, we also propose a methodology to estimate the required sample size given a minimum performance gap of interest before running most experiments. Our comprehensive human evaluation results shed light on how to conduct credible human evaluations of open-domain dialog systems using the interactive setup, and suggest additional future directions.
Sijia Liu · Patrick Lange · Behnam Hedayatnia · Alexandros Papangelis · Di Jin · Andrew Wirth · Yang Liu · Dilek Hakkani-Tur
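To make the sample-size idea concrete, here is a generic two-proportion power calculation for detecting a minimum gap between two models' per-turn win rates. This is a standard statistical formula used as a stand-in, not necessarily the authors' exact methodology.

from math import ceil
from statistics import NormalDist

def required_sample_size(p_base, min_gap, alpha=0.05, power=0.8):
    z_a = NormalDist().inv_cdf(1 - alpha / 2)   # two-sided significance threshold
    z_b = NormalDist().inv_cdf(power)           # desired statistical power
    p1, p2 = p_base, p_base + min_gap
    p_bar = (p1 + p2) / 2
    n = ((z_a * (2 * p_bar * (1 - p_bar)) ** 0.5
          + z_b * (p1 * (1 - p1) + p2 * (1 - p2)) ** 0.5) ** 2) / min_gap ** 2
    return ceil(n)  # ratings needed per model

# Example: detecting a 5-point gap around a 50% base rate requires
# required_sample_size(0.50, 0.05) -> about 1,565 ratings per model.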
Sat 8:45 a.m. - 9:30 a.m.
Panel on Technical Challenges Associated with Reliable Human Evaluations of Generative Models (Discussion Panel)
Long Ouyang · Tongshuang Wu · Zachary Lipton
Sat 9:30 a.m. - 11:00 a.m.
Lunch Break
Sat 11:00 a.m. - 11:50 a.m.
Discussion on Policy Challenges Associated with Generative Models (Discussion Panel)
Irene Solaiman · Russell Wald · Yonadav Shavit · Long Ouyang
Sat 11:50 a.m. - 12:00 p.m.
Human Evaluation of Text-to-Image Models on a Multi-Task Benchmark (Oral)
We provide a new multi-task benchmark for evaluating text-to-image models and perform a human evaluation comparing two of the most common open-source (Stable Diffusion) and commercial (DALL-E 2) models. Twenty computer science AI graduate students evaluated the two models on three tasks, at three difficulty levels, across ten prompts each, providing 3,600 ratings. Text-to-image generation has seen rapid progress to the point that many recent models can create realistic high-resolution images for various prompts. However, current text-to-image methods and the broader body of research in vision-language understanding still struggle with intricate text prompts that contain many objects with multiple attributes and relationships. We introduce a new text-to-image benchmark that contains a suite of fifty tasks and applications capturing a model's ability to handle different features of a text prompt: for example, asking a model to generate a varying number of the same object to measure its ability to count, or providing a text prompt with several objects that each have a different attribute to measure its ability to match objects and attributes. Rather than subjectively evaluating text-to-image results on a set of prompts, our new multi-task benchmark consists of challenge tasks at three difficulty levels (easy, medium, and hard) along with human ratings for each generated image.
Vitali Petsiuk · Alexander E. Siemenn · Saisamrit Surbehera · Qi Qi Chin · Keith Tyser · Gregory Hunter · Arvind Raghavan · Yann Hicke · Bryan Plummer · Ori Kerret · Tonio Buonassisi · Kate Saenko · Armando Solar-Lezama · Iddo Drori
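A small sketch of how the 3,600 ratings factor across the study's conditions and how per-condition scores might be aggregated. The field names are illustrative assumptions, not the benchmark's schema.

from collections import defaultdict

RATERS, MODELS, TASKS, LEVELS, PROMPTS = 20, 2, 3, 3, 10
assert RATERS * MODELS * TASKS * LEVELS * PROMPTS == 3600  # total ratings in the study

def mean_rating_by_condition(ratings):
    """ratings: iterable of dicts like {"model": "A", "task": "counting", "level": "hard", "score": 4}."""
    sums, counts = defaultdict(float), defaultdict(int)
    for r in ratings:
        key = (r["model"], r["task"], r["level"])
        sums[key] += r["score"]
        counts[key] += 1
    return {k: sums[k] / counts[k] for k in sums}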
Sat 12:00 p.m. - 12:10 p.m.
Can There be Art Without an Artist? (Oral)
Generative Adversarial Network (GAN) based art has proliferated in the past year, going from a shiny new tool to generate fake human faces to a stage where anyone can generate thousands of artistic images with minimal effort. Some of these images are now "good" enough to win accolades from qualified judges. In this paper, we explore how generative models have impacted artistry, not only from a qualitative point of view, but also from the angle of exploitation of artisans: both via plagiarism, where models are trained on their artwork without permission, and via profit shifting, where profits in the art market have shifted from art creators to model owners or to traders in the unregulated secondary crypto market. This confluence of factors risks completely detaching humans from the artistic process, devaluing the labor of artists and distorting the public perception of the value of art.
Avijit Ghosh · Genoveva Fossas
Sat 12:10 p.m. - 12:20 p.m.
Best Prompts for Text-to-Image Models and How to Find Them (Oral)
Recent progress in generative models, especially in text-guided diffusion models, has enabled the production of aesthetically pleasing imagery resembling the works of professional human artists. However, one has to carefully compose the textual description, called the prompt, and augment it with a set of clarifying keywords. Since aesthetics are challenging to evaluate computationally, human feedback is needed to determine the optimal prompt formulation and keyword combination. In this paper, we present a human-in-the-loop approach to learning the most useful combination of prompt keywords using a genetic algorithm. We also show how such an approach can improve the aesthetic appeal of images depicting the same descriptions.
Nikita Pavlichenko · Fedor Zhdanov · Dmitry Ustalov
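A minimal sketch of a human-in-the-loop genetic search over prompt keywords, in the spirit of the approach described above. The keyword list and the human_score callback (which stands in for collecting aesthetic judgments from raters) are illustrative assumptions, not the authors' code.

import random

KEYWORDS = ["highly detailed", "trending on artstation", "4k", "oil painting", "cinematic lighting"]

def mutate(keywords):
    k = keywords.copy()
    if k and random.random() < 0.5:
        k.remove(random.choice(k))          # drop a keyword
    else:
        k.append(random.choice(KEYWORDS))   # add a random keyword
    return list(dict.fromkeys(k))           # deduplicate, keep order

def evolve_prompt(base_prompt, human_score, generations=5, population_size=8):
    population = [[random.choice(KEYWORDS)] for _ in range(population_size)]
    for _ in range(generations):
        # Humans rate one generated image per candidate keyword set.
        ranked = sorted(population, key=lambda k: human_score(base_prompt, k), reverse=True)
        parents = ranked[: population_size // 2]          # keep the best half
        children = [mutate(random.choice(parents)) for _ in range(population_size - len(parents))]
        population = parents + children
    return max(population, key=lambda k: human_score(base_prompt, k))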
Sat 12:20 p.m. - 12:30 p.m.
Evaluation of Synthetic Datasets for Conversational Recommender Systems (Oral)
For researchers leveraging large language models (LLMs) to generate training datasets, especially for conversational recommender systems, the absence of robust evaluation frameworks has been a long-standing problem. The efficiency brought about by LLMs in the data-generation phase is impeded during evaluation of the generated data, since evaluation generally requires human raters to ensure that the data is of high quality and has sufficient diversity. Since the quality of training data is critical for downstream applications, it is important to develop metrics that evaluate quality holistically and identify biases. In this paper, we present a framework that takes a multi-faceted approach to evaluating datasets produced by generative models, and we discuss the advantages and limitations of various evaluation methods.
Harsh Lara · Manoj Tiwari
Sat 12:30 p.m. - 12:45 p.m.
Coffee Break
Sat 12:45 p.m. - 1:35 p.m.
Panel and QnA with Science Funders Interested in Reliable Human Evaluation of Generative Models (Panel)
Brittany Smith · Eric Sears · Yonadav Shavit
Sat 1:35 p.m. - 1:45 p.m.
Operationalizing Specifications, In Addition to Test Sets for Evaluating Constrained Generative Models (Oral)
In this work we present some recommendations on the evaluation of state-of-the-art generative models for constrained generation tasks. Progress on generative models has been rapid in recent years, and these large-scale models have had three impacts. First, the fluency of generation in both language and vision modalities has rendered common average-case evaluation metrics much less useful in diagnosing system errors. Second, user expectations around these models and their feted public releases have made the technical problem of out-of-domain generalization less useful. Third, the same substrate models now form the basis of a number of applications, driven both by the utility of their representations and by phenomena such as in-context learning. Our evaluation methodologies, however, have not adapted to these changes. More concretely, while the methods of interacting with models have risen in abstraction level, a similar rise has not been observed in evaluation practices. In this paper, we argue that the scale of generative models could be exploited to raise the abstraction level at which evaluation itself is conducted, and we provide recommendations for doing so. Our recommendations are based on leveraging specifications as a powerful instrument to evaluate generation quality and are readily applicable to a variety of tasks.
Vikas Raunak · Matt Post · Arul Menezes
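One way to read "specifications as an instrument to evaluate generation quality" is as named predicates over (prompt, constraints, output), with a model scored by the fraction of outputs satisfying each predicate. The sketch below is an assumption-laden illustration of that idea, not the authors' recommendations verbatim.

def evaluate_against_specs(examples, specs):
    """examples: list of (prompt, constraints, output); specs: {name: predicate}."""
    results = {}
    for name, predicate in specs.items():
        satisfied = sum(predicate(p, c, o) for p, c, o in examples)
        results[name] = satisfied / len(examples)
    return results

specs = {
    # Every required term from the constraint list appears in the output.
    "terminology_coverage": lambda p, c, o: all(term in o for term in c),
    # Output stays within a length budget relative to the prompt.
    "length_budget": lambda p, c, o: len(o.split()) <= 2 * len(p.split()),
}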
Sat 1:45 p.m. - 1:55 p.m.
Sensemaking Interfaces for Human Evaluation of Language Model Outputs (Oral)
Ensuring a language model doesn't generate problematic text is difficult. Traditional evaluation methods, like automatic measures or human annotation, can fail to detect all problems, whether because system designers were not aware of a kind of problem they should attempt to detect, or because an automatic measure fails to reliably detect certain kinds of problems. In this paper we propose sensemaking tools as a robust and open-ended method to evaluate the large number of linguistic outputs produced by a language model. We demonstrate one potential sensemaking interface based on concordance tables, showing that we are able to detect problematic outputs and distributional shifts in minutes, despite not knowing exactly what kind of problems to look for.
Katy Gero · Jonathan Kummerfeld · Elena Glassman
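A minimal sketch of a keyword-in-context (concordance) table over model outputs, the kind of view such an interface is built around. The window size and whitespace tokenization are illustrative choices, not details from the paper.

def concordance(outputs, keyword, window=5):
    """Return (left context, keyword, right context) rows for every occurrence."""
    rows = []
    for text in outputs:
        tokens = text.split()
        for i, tok in enumerate(tokens):
            if tok.lower().strip(".,!?") == keyword.lower():
                left = " ".join(tokens[max(0, i - window):i])
                right = " ".join(tokens[i + 1:i + 1 + window])
                rows.append((left, tok, right))
    return rows

# Skimming the aligned rows for a sensitive keyword lets an evaluator review
# hundreds of occurrences at once and spot problematic usage patterns quickly.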
Sat 1:55 p.m. - 2:05 p.m.
The Reasonable Effectiveness of Diverse Evaluation Data (Oral)
In this paper, we present findings from a semi-experimental exploration of rater diversity and its influence on safety annotations of conversations generated by humans talking to a generative AI chatbot. We find significant differences in the judgments produced by raters from different geographic regions and annotation platforms, and we correlate these perspectives with demographic sub-groups. Our work helps define best practices in model development, specifically human evaluation of generative models, against the backdrop of growing work on socio-technical evaluations.
Lora Aroyo · Mark Diaz · Christopher M. Homan · Vinodkumar Prabhakaran · Alex Taylor · Ding Wang
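A minimal sketch of one way to test whether safety judgments differ across rater groups, using a chi-square test of independence. The contingency table below is invented for illustration; it is not data from the paper.

from scipy.stats import chi2_contingency

# rows = rater groups, columns = (labelled unsafe, labelled safe)
table = [
    [120, 380],   # group A
    [180, 320],   # group B
]
chi2, p_value, dof, expected = chi2_contingency(table)
print(f"chi2={chi2:.2f}, p={p_value:.4f}")  # a small p suggests group-dependent judgments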
Sat 2:05 p.m. - 2:15 p.m.
Closing Remarks
Jessica Huynh
Author Information
Divyansh Kaushik (Carnegie Mellon University)
Jennifer Hsia (Carnegie Mellon University)
Jessica Huynh (Carnegie Mellon University)
Yonadav Shavit (Harvard University)
Samuel Bowman (NYU + Anthropic)
Ting-Hao Huang (Pennsylvania State University)
Douwe Kiela (Hugging Face)
Zachary Lipton (Carnegie Mellon University)
Eric Michael Smith (Meta AI)
Eric is a research engineer at Meta AI, focusing on algorithmic bias in language models and chatbot evaluation. Prior to Meta AI, Eric was a machine learning engineer at Blue Apron, creating and maintaining demand forecast models. Eric was a fellow of Insight Data Science and holds a doctorate in physics from Princeton University for biophysics research in precision measurements of gene expression in the fruit fly embryo.