Tutorial

Experimental Design and Analysis for AI Researchers

Michael Mozer ⋅ Katherine Hermann ⋅ Jennifer Hu

2024 Tutorial

Abstract

Simulation comparisons are often used in machine learning to argue for the superiority of one model or method over another. However, the conclusions that can be drawn from such studies are only as robust as the forethought that is put into their design and analysis. We discuss core techniques used in the experimental sciences (e.g., medicine and psychology) that are too often sidestepped by AI researchers. Topics include: classical statistical inference, hypothesis testing, one-way and multi-factor ANOVA, within- and between-subjects designs, planned vs. post-hoc contrasts, visualization of outcomes and uncertainty, and modern standards of experimental practice. We then focus on two topics of particular interest to AI researchers: (1) human evaluations of foundation models (LLMs, MLMs), e.g., in domains like intelligent tutoring; and (2) psycholinguistic explorations of foundation models, in which models are used as subjects of behavioral studies in order to reverse engineer their operation, just as psychologists and psycholinguists have done with human participants over the past century.
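As a small illustration of why the within- vs. between-subjects distinction matters for model comparisons, the sketch below computes a paired t statistic and a Welch (unpaired) t statistic on the same data. It is a minimal example, not material from the tutorial: the accuracy values are invented, and the helper names `paired_t` and `welch_t` are hypothetical. It assumes both models were evaluated on the same seeds or data splits, so that scores can be matched pairwise.

```python
import math
import statistics


def paired_t(a, b):
    """t statistic for a within-subjects (paired) design.

    Each element of `a` is matched with the element of `b` obtained
    under the same condition (e.g., the same random seed or split).
    """
    if len(a) != len(b):
        raise ValueError("paired samples must have equal length")
    diffs = [x - y for x, y in zip(a, b)]
    standard_error = statistics.stdev(diffs) / math.sqrt(len(diffs))
    return statistics.mean(diffs) / standard_error


def welch_t(a, b):
    """t statistic for a between-subjects design (Welch, unequal variances)."""
    standard_error = math.sqrt(
        statistics.variance(a) / len(a) + statistics.variance(b) / len(b)
    )
    return (statistics.mean(a) - statistics.mean(b)) / standard_error


# Hypothetical accuracies: two models, same five seeds (illustrative only).
model_a = [0.81, 0.74, 0.88, 0.69, 0.79]
model_b = [0.79, 0.73, 0.85, 0.68, 0.77]

t_paired = paired_t(model_a, model_b)
t_welch = welch_t(model_a, model_b)
print(f"paired t = {t_paired:.2f}, Welch t = {t_welch:.2f}")
```

In this made-up data, seed-to-seed variation is large compared with the difference between models. Pairing removes that shared variation, so the paired statistic is much larger than the unpaired one. Turning either statistic into a p-value also requires the appropriate degrees of freedom (n - 1 for the paired test), which this sketch omits.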
