Keywords: [ ENLSP-Main ]
A large number of natural language processing (NLP) datasets contain crowdsourced labels. Most of the time, training-set labels are generated by majority vote over individual raters' labels, which discards a significant amount of information. This work focuses on improving data efficiency when training a model for "marginally abusive" Tweet classification. We compare majority vote to two families of alternative methods, each changing a different step of the training process: (1) aggregating individual labels with weak supervision to improve the quality of the labels used for model training, and (2) predicting individual labels with the multi-rater models proposed by Davani et al. We find that majority vote is a strong baseline. Dawid-Skene and multi-rater models perform well, although the latter are more prone to overfitting. Finally, we identify a number of practical considerations for the practitioner, such as setting a minimum number of labels per rater and preferring soft to hard labels.
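To make the hard- vs. soft-label distinction concrete, the following is a minimal sketch (not the paper's implementation) of aggregating one item's rater labels both ways: majority vote yields a single hard label, while the empirical distribution over rater labels yields a soft label. The function name and label encoding are illustrative assumptions.

```python
from collections import Counter

def aggregate(labels):
    """Aggregate one item's crowdsourced labels.

    Returns (hard_label, soft_label): the majority-vote class and the
    empirical distribution over classes. Illustrative only; the choice
    of tie-breaking and smoothing is left to the practitioner.
    """
    counts = Counter(labels)
    total = len(labels)
    hard = counts.most_common(1)[0][0]              # majority vote
    soft = {c: n / total for c, n in counts.items()}  # soft label
    return hard, soft

# Example: 5 raters label one Tweet (1 = abusive, 0 = not abusive)
hard, soft = aggregate([1, 1, 1, 0, 0])
# hard == 1; soft == {1: 0.6, 0: 0.4}
```

Training on the soft distribution (e.g. with a cross-entropy loss against it) retains the disagreement signal that the hard label discards, which is one motivation for the alternatives compared in this work.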