The BigScience Workshop was a value-driven initiative that spanned one and a half years of interdisciplinary research and culminated in the creation of ROOTS, a 1.6TB multilingual dataset that was used to train BLOOM, one of the largest multilingual language models to date. In addition to the technical outcomes and artifacts, the workshop fostered multidisciplinary collaborations around large models, datasets, and their analysis. This in turn led to a wide range of research publications spanning topics from ethics to law, data governance, modeling choices, and distributed training. This paper focuses on the collaborative research aspects of BigScience and takes a step back to look at the challenges of large-scale participatory research, with respect to participant diversity and the tasks required to successfully carry out such a project. Our main goal is to share the lessons we learned from this experience: what we did well and what we could have done better. We show how the impact of such a social approach to scientific research goes well beyond the technical artifacts that were the basis of its inception.
Author Information
Christopher Akiki (Leipzig University)
Giada Pistilli (Sorbonne Université & CNRS)
Margot Mieskes (University of Applied Sciences Darmstadt)
Matthias Gallé (Cohere)
Machine Learning Manager at Cohere
Thomas Wolf (HuggingFace Inc.)
Suzana Ilic (Universität Innsbruck)
Yacine Jernite (Hugging Face)
Related Events (a corresponding poster, oral, or spotlight)
-
2022 : BigScience: A Case Study in the Social Construction of a Multilingual Large Language Model »
Sat. Dec 3rd 06:30 -- 06:45 PM
More from the Same Authors
-
2023 Poster: Scaling Data-Constrained Language Models »
Niklas Muennighoff · Alexander Rush · Boaz Barak · Teven Le Scao · Nouamane Tazi · Aleksandra Piktus · Thomas Wolf · Colin Raffel · Sampo Pyysalo -
2023 Poster: Stable Bias: Evaluating Societal Representations in Diffusion Models »
Sasha Alexandra Luccioni · Christopher Akiki · Margaret Mitchell · Yacine Jernite -
2023 Oral: Scaling Data-Constrained Language Models »
Niklas Muennighoff · Alexander Rush · Boaz Barak · Teven Le Scao · Nouamane Tazi · Aleksandra Piktus · Thomas Wolf · Colin Raffel · Sampo Pyysalo -
2022 : Towards Openness Beyond Open Access: User Journeys through 3 Open AI Collaboratives »
Jennifer Ding · Christopher Akiki · Yacine Jernite · Anne Steele · Temi Popo -
2022 Poster: The BigScience ROOTS Corpus: A 1.6TB Composite Multilingual Dataset »
Hugo Laurençon · Lucile Saulnier · Thomas Wang · Christopher Akiki · Albert Villanova del Moral · Teven Le Scao · Leandro Von Werra · Chenghao Mou · Eduardo González Ponferrada · Huu Nguyen · Jörg Frohberg · Mario Šaško · Quentin Lhoest · Angelina McMillan-Major · Gerard Dupont · Stella Biderman · Anna Rogers · Loubna Ben allal · Francesco De Toni · Giada Pistilli · Olivier Nguyen · Somaieh Nikpoor · Maraim Masoud · Pierre Colombo · Javier de la Rosa · Paulo Villegas · Tristan Thrush · Shayne Longpre · Sebastian Nagel · Leon Weber · Manuel Muñoz · Jian Zhu · Daniel Van Strien · Zaid Alyafeai · Khalid Almubarak · Minh Chien Vu · Itziar Gonzalez-Dios · Aitor Soroa · Kyle Lo · Manan Dey · Pedro Ortiz Suarez · Aaron Gokaslan · Shamik Bose · David Adelani · Long Phan · Hieu Tran · Ian Yu · Suhas Pai · Jenny Chim · Violette Lepercq · Suzana Ilic · Margaret Mitchell · Sasha Alexandra Luccioni · Yacine Jernite -
2021 : Training Transformers Together »
Alexander Borzunov · Max Ryabinin · Tim Dettmers · Quentin Lhoest · Lucile Saulnier · Michael Diskin · Yacine Jernite · Thomas Wolf -
2018 : The Conversational Intelligence Challenge 2 (ConvAI2) : Winners talks & spotlights »
Thomas Wolf · Xuezheng Peng · Christian Saam · Henry Elder · Rauf Kurbanov · Mohammad Shadab Alam -
2013 Poster: Discovering Hidden Variables in Noisy-Or Networks using Quartet Tests »
Yacine Jernite · Yoni Halpern · David Sontag