Workshop: Human Evaluation of Generative Models

Sensemaking Interfaces for Human Evaluation of Language Model Outputs

Katy Gero · Jonathan Kummerfeld · Elena Glassman


Ensuring a language model doesn't generate problematic text is difficult. Traditional evaluation methods, like automatic measures or human annotation, can miss problems, either because system designers were not aware of a kind of problem they should attempt to detect, or because an automatic measure fails to reliably detect certain kinds of problems. In this paper, we propose sensemaking tools as a robust and open-ended method to evaluate the large number of linguistic outputs produced by a language model. We demonstrate one potential sensemaking interface based on concordance tables, showing that we are able to detect problematic outputs and distributional shifts in minutes, despite not knowing exactly what kind of problems to look for.
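
For readers unfamiliar with concordance tables, the following is a minimal sketch of a keyword-in-context table built over a handful of model outputs. It is not the interface described in the paper; the function names (`concordance`, `format_table`), the tokenization, the context width, and the sample outputs are all illustrative assumptions.

```python
import re


def concordance(outputs, keyword, width=3):
    """Collect (left context, keyword, right context) rows for each keyword hit.

    `width` is the number of tokens kept on each side (an assumed default).
    """
    rows = []
    for text in outputs:
        # Simple tokenization: runs of word characters, or single punctuation marks.
        tokens = re.findall(r"\w+|[^\w\s]", text)
        for i, tok in enumerate(tokens):
            if tok.lower() == keyword.lower():
                left = " ".join(tokens[max(0, i - width):i])
                right = " ".join(tokens[i + 1:i + 1 + width])
                rows.append((left, tok, right))
    # Sort by right context so that similar continuations sit next to each other.
    rows.sort(key=lambda row: row[2].lower())
    return rows


def format_table(rows):
    """Right-align the left context so the keyword column lines up."""
    left_width = max((len(row[0]) for row in rows), default=0)
    return "\n".join(f"{l:>{left_width}}  {k}  {r}" for l, k, r in rows)


# Hypothetical model outputs, invented for illustration only.
sample_outputs = [
    "The nurse said she would be back soon.",
    "The doctor said he was busy.",
    "The engineer said he would fix it.",
]

table = concordance(sample_outputs, "said")
print(format_table(table))
```

Aligning every occurrence of a keyword and sorting by the text that follows it lets a reader scan many outputs at once and notice repeated patterns without having decided in advance what to look for.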
