In Machine Learning, a benchmark is a collection of datasets, each associated with one or several metrics, together with a way to aggregate the performances of different systems. Benchmarks are instrumental in {\it (i)} assessing the progress of new methods along different axes and {\it (ii)} selecting the best systems for practical use. This is particularly true for NLP, with the development of large pre-trained models (\textit{e.g.} GPT, BERT) that are expected to generalize well across a variety of tasks. While the community has mainly focused on developing new datasets and metrics, little attention has been paid to the aggregation procedure, which is often reduced to a simple average over the various performance measures. However, this procedure can be problematic when the metrics lie on different scales and may lead to spurious conclusions. This paper proposes a new procedure for ranking systems based on their performance across different tasks. Motivated by social choice theory, the final system ordering is obtained by aggregating the rankings induced by each task and is theoretically grounded. We conduct extensive numerical experiments (on over 270k scores) to assess the soundness of our approach on both synthetic and real scores (\textit{e.g.} GLUE, EXTREM, SEVAL, TAC, FLICKR). In particular, we show that our method yields different conclusions about state-of-the-art systems than the mean-aggregation procedure, while being both more reliable and more robust.
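As a rough illustration of the abstract's point, and not the paper's actual procedure, the Python sketch below contrasts mean aggregation with a simple Borda-count ranking aggregation on made-up scores; the system names, task scores, and the choice of Borda count are all assumptions for illustration. It shows how a task whose metric lives on a larger scale can dominate the average, while per-task rankings weigh every task equally.

```python
# Minimal sketch (assumed example, not the authors' implementation):
# mean aggregation vs. Borda-count ranking aggregation over tasks.
import numpy as np

# rows = systems, columns = tasks; the third task is on a much larger scale
scores = np.array([
    [0.71, 0.68, 12.0],   # system A
    [0.69, 0.72, 30.0],   # system B
    [0.75, 0.74, 11.0],   # system C
])
systems = ["A", "B", "C"]

# Mean aggregation: the large-scale task dominates the average.
mean_ranking = np.argsort(-scores.mean(axis=1))

# Ranking aggregation: rank systems within each task (0 = worst on that task),
# then sum the per-task ranks as Borda points, so every task weighs equally.
per_task_ranks = scores.argsort(axis=0).argsort(axis=0)
borda_points = per_task_ranks.sum(axis=1)
borda_ranking = np.argsort(-borda_points)

print("mean aggregation :", [systems[i] for i in mean_ranking])   # B first (scale effect)
print("rank aggregation :", [systems[i] for i in borda_ranking])  # C first
```

On these toy numbers the two procedures disagree: averaging puts system B on top because of its score on the large-scale task, whereas the rank-based aggregation prefers system C, which is best on two of the three tasks.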
Author Information
Pierre Colombo (MICS CentraleSupelec)
Nathan Noiry (Télécom Paris)
Ekhine Irurozki (Télécom ParisTech)
Stephan Clémençon (Télécom ParisTech)
More from the Same Authors
- 2022 Poster: Beyond Mahalanobis Distance for Textual OOD Detection »
  Pierre Colombo · Eduardo Dadalto · Guillaume Staerman · Nathan Noiry · Pablo Piantanida
- 2022 Poster: The BigScience ROOTS Corpus: A 1.6TB Composite Multilingual Dataset »
  Hugo Laurençon · Lucile Saulnier · Thomas Wang · Christopher Akiki · Albert Villanova del Moral · Teven Le Scao · Leandro Von Werra · Chenghao Mou · Eduardo González Ponferrada · Huu Nguyen · Jörg Frohberg · Mario Šaško · Quentin Lhoest · Angelina McMillan-Major · Gerard Dupont · Stella Biderman · Anna Rogers · Loubna Ben allal · Francesco De Toni · Giada Pistilli · Olivier Nguyen · Somaieh Nikpoor · Maraim Masoud · Pierre Colombo · Javier de la Rosa · Paulo Villegas · Tristan Thrush · Shayne Longpre · Sebastian Nagel · Leon Weber · Manuel Muñoz · Jian Zhu · Daniel Van Strien · Zaid Alyafeai · Khalid Almubarak · Minh Chien Vu · Itziar Gonzalez-Dios · Aitor Soroa · Kyle Lo · Manan Dey · Pedro Ortiz Suarez · Aaron Gokaslan · Shamik Bose · David Adelani · Long Phan · Hieu Tran · Ian Yu · Suhas Pai · Jenny Chim · Violette Lepercq · Suzana Ilic · Margaret Mitchell · Sasha Alexandra Luccioni · Yacine Jernite
- 2021 Poster: Online Matching in Sparse Random Graphs: Non-Asymptotic Performances of Greedy Algorithm »
  Nathan Noiry · Vianney Perchet · Flore Sentenac
- 2018 Poster: On Binary Classification in Extreme Regions »
  Hamid Jalalzai · Stephan Clémençon · Anne Sabourin
- 2016 Poster: On Graph Reconstruction via Empirical Risk Minimization: Fast Learning Rates and Scalability »
  Guillaume Papa · Aurélien Bellet · Stephan Clémençon
- 2008 Poster: Empirical performance maximization for linear rank statistics »
  Stephan Clémençon · Nicolas Vayatis
- 2008 Poster: On Bootstrapping the ROC Curve »
  Patrice Bertail · Stephan Clémençon · Nicolas Vayatis
- 2008 Poster: Overlaying classifiers: a practical approach for optimal ranking »
  Stephan Clémençon · Nicolas Vayatis