Workshop
Evaluating Evaluations: Examining Best Practices for Measuring Broader Impacts of Generative AI
Avijit Ghosh · Usman Gohar · Yacine Jernite · Lucie-Aimée Kaffee · Alberto Lusoli · Jennifer Mickel · Irene Solaiman · Arjun Subramonian · Zeerak Talat · Felix Friedrich · Cedric Whitney · Michelle Lin
East Meeting Room 16
Sun 15 Dec, 9:15 a.m. PST
Generative AI systems are becoming increasingly prevalent in society across modalities, producing content such as text, images, audio, and video, with far-reaching implications. The NeurIPS Broader Impact statement has notably shifted norms for AI publications toward considering negative societal impact. However, no standard exists for how to approach these impact assessments. While new methods for evaluating social impact are being developed, notably through the NeurIPS Datasets and Benchmarks track, the lack of a standard for documenting their applicability and utility, together with their uneven coverage of different social impact categories, stands in the way of broad adoption by developers and researchers of generative AI systems. By bringing together experts on the science and context of evaluation and practitioners who develop and analyze technical systems, we aim to help address this issue through the work of the NeurIPS community.
Schedule
Sun 9:15 a.m. - 9:30 a.m. | Opening Remarks (Oral)
Sun 9:30 a.m. - 10:30 a.m. | Panel Discussion (Panel Discussion)
Sun 10:30 a.m. - 11:30 a.m. | Oral Session 1 (Oral)
Sun 11:30 a.m. - 12:30 p.m. | Oral Session 2 (Oral)
Sun 12:30 p.m. - 2:30 p.m. | Lunch and Poster Session (Poster)
Sun 2:30 p.m. - 3:00 p.m. | Oral Session 3 (Oral)
Sun 3:30 p.m. - 4:05 p.m. | Oral Session 3, part 2 after break (Oral)
Sun 4:05 p.m. - 5:00 p.m. | What's Next - Coalition Development (Oral)
Sun 5:00 p.m. - 5:30 p.m. | Closing Remarks (Oral)
- Surveying Surveys: Surveys' Role in Evaluating AI's Labor Market Impact (Poster) | Cassandra Solis
- Evaluating Generative AI Systems is a Social Science Measurement Challenge (Oral) | Hanna Wallach · Meera Desai · Nicholas Pangakis · A. Feder Cooper · Angelina Wang · Solon Barocas · Alexandra Chouldechova · Chad Atalla · Su Lin Blodgett · Emily Corvi · Alex Dow · Jean Garcia-Gathright · Alexandra Olteanu · Stefanie Reed · Emily Sheng · Dan Vann · Jennifer Wortman Vaughan · Matthew Vogel · Hannah Washington · Abigail Jacobs
- Cascaded to End-to-End: New Safety, Security, and Evaluation Questions for Audio Language Models (Oral) | Luxi He · Xiangyu Qi · Inyoung Cheong · Prateek Mittal · Danqi Chen · Peter Henderson
- Can Vision-Language Models Replace Human Annotators: A Case Study with CelebA Dataset (Poster) | Haoming Lu · Feifei Zhong
- Evaluating Refusal (Oral) | Shira Abramovich · Anna J. Ma
- Fairness Dynamics During Training (Poster) | Krishna Patel · Nivedha Sivakumar · Barry-John Theobald · Luca Zappella · Nicholas Apostoloff
- Provocation on Expertise in Social Impact Evaluations for Generative AI (and Beyond) (Poster) | Zoe Kahn · Nitin Kohli
- Is ETHICS about ethics? Evaluating the ETHICS benchmark (Poster) | Leif Hancox-Li · Borhane Blili-Hamelin
- GenAI Evaluation Maturity Framework (GEMF) to assess and improve GenAI Evaluations (Oral) | Yilin Zhang · Frank J. Kanayet
- Democratic Perspectives and Corporate Captures of Crowdsourced Evaluations (Poster) | parth sarin · Michelle Bao
- Motivations for Reframing Large Language Model Benchmarking for Legal Applications (Poster) | Riya Ranjan · Megan Ma
- Evaluations Using Wikipedia without Data Leakage: From Trusting Articles to Trusting Edit Processes (Poster) | Lucie-Aimée Kaffee · Isaac Johnson
- Towards Leveraging News Media to Support Impact Assessment of AI Technologies (Poster) | Mowafak Allaham · Kimon Kieslich · Nicholas Diakopoulos
- AIR-Bench 2024: Safety Evaluation Based on Risk Categories from Regulations and Policies (Oral) | Kevin Klyman
- Rethinking Artistic Copyright Infringements in the Era of Text-to-Image Generative Models (Poster) | Mazda Moayeri · Samyadeep Basu · Sriram Balasubramanian · Priyatham Kattakinda · Atoosa Chegini · Robert Brauneis · Soheil Feizi
- (Mis)use of nude images in machine learning research (Oral) | Arshia Arya · Princessa Cintaqia · Deepak Kumar · Allison McDonald · Lucy Qin · Elissa Redmiles
- Critical human-AI use scenarios and interaction modes for societal impact evaluations (Oral) | Lujain Ibrahim · Saffron Huang · Lama Ahmad · Markus Anderljung
- Gaps Between Research and Practice When Measuring Representational Harms Caused by LLM-Based Systems (Poster) | Emma Harvey · Emily Sheng · Su Lin Blodgett · Alexandra Chouldechova · Jean Garcia-Gathright · Alexandra Olteanu · Hanna Wallach
- Dimensions of Generative AI Evaluation Design (Poster) | Alex Dow · Jennifer Wortman Vaughan · Solon Barocas · Chad Atalla · Alexandra Chouldechova · Hanna Wallach
- LLMs and Personalities: Inconsistencies Across Scales (Poster) | Tommaso Tosato · David Lemay · Mahmood Hegazy · Irina Rish · Guillaume Dumas
- JMMMU: A Japanese Massive Multi-discipline Multimodal Understanding Benchmark (Oral) | Shota Onohara · Atsuyuki Miyai · Yuki Imajuku · Kazuki Egashira · Jeonghun Baek · Xiang Yue · Graham Neubig · Kiyoharu Aizawa
- Contamination Report for Multilingual Benchmarks (Poster) | Sanchit Ahuja · Varun Gumma · Sunayana Sitaram
- Provocation: Who benefits from "inclusion" in Generative AI? (Oral) | Samantha Dalal · Siobhan Mackenzie Hall · Nari Johnson
- Troubling taxonomies in GenAI evaluation (Poster) | Glen Berman · Ned Cooper · Wesley Deng · Ben Hutchinson
- Rethinking CyberSecEval: An LLM-Aided Approach to Evaluation Critique (Poster) | Suhas Hariharan · Zainab Ali Majid · Jaime Raldua Veuthey · Jacob Haimes
- Statistical Bias in Bias Benchmark Design (Poster) | Hannah Powers · Ioana Baldini · Dennis Wei · Kristin P Bennett
- Assessing Bias in Metric Models for LLM Open-Ended Generation Bias Benchmarks (Poster) | Nathaniel Demchak · Xin Guan · Zekun Wu · Ziyi Xu · Adriano Koshiyama · Emre Kazim
- Using Scenario-Writing for Identifying and Mitigating Impacts of Generative AI (Poster) | Kimon Kieslich · Nicholas Diakopoulos · Natali Helberger
- A Framework for Evaluating LLMs Under Task Indeterminacy (Poster) | Luke Guerdan · Hanna Wallach · Solon Barocas · Alexandra Chouldechova