As language models grow ever larger, the need for large-scale high-quality text datasets has never been more pressing, especially in multilingual settings. The BigScience workshop, a 1-year international and multidisciplinary initiative, was formed with the goal of researching and training large language models as a values-driven undertaking, putting issues of ethics, harm, and governance in the foreground. This paper documents the data creation and curation efforts undertaken by BigScience to assemble the Responsible Open-science Open-collaboration Text Sources (ROOTS) corpus, a 1.6TB dataset spanning 59 languages that was used to train the 176-billion-parameter BigScience Large Open-science Open-access Multilingual (BLOOM) language model. We further release a large initial subset of the corpus and analyses thereof, and hope to empower large-scale monolingual and multilingual modeling projects with both the data and the processing tools, as well as stimulate research around this large multilingual corpus.
Author Information
Hugo Laurençon (Hugging Face)
Lucile Saulnier (Hugging Face)
Thomas Wang (Hugging Face)
Christopher Akiki (Leipzig University)
Albert Villanova del Moral (CNRS)
Teven Le Scao (Hugging Face)
Leandro Von Werra (ETHZ - ETH Zurich)
Chenghao Mou (Docusign, Inc.)
Eduardo González Ponferrada (Apple)
Huu Nguyen
Jörg Frohberg (Humboldt Universität Berlin)
Mario Šaško
Quentin Lhoest (Hugging Face)
Angelina McMillan-Major (University of Washington)
Gerard Dupont (Mavenoid)
Stella Biderman (EleutherAI)
Anna Rogers (University of Copenhagen)
Loubna Ben allal (Hugging Face)
Francesco De Toni (University of Western Australia)
Giada Pistilli (Sorbonne Université & CNRS)
Olivier Nguyen (Twitch)
Somaieh Nikpoor (Government)
Maraim Masoud (NA)
A recent graduate in Machine Learning.
Pierre Colombo (MICS CentraleSupelec)
Javier de la Rosa (National Library of Norway)
Paulo Villegas (Telefonica Research)
Tristan Thrush (Hugging Face)

I'm a research engineer at Hugging Face. Previously, I was a research associate at Facebook AI Research, supervised by Douwe Kiela and then Adina Williams. Before that, I was a research associate at MIT Brain and Cognitive Sciences, supervised by Roger Levy. I received my MEng in Computer Science with a concentration in Artificial Intelligence under Patrick Winston at the MIT Computer Science and Artificial Intelligence Lab. I also received my BS at MIT in Computer Science, with a minor in Linguistics and a minor in Math. While I was a student, I did research with the Perception Systems Group at NASA's Jet Propulsion Lab. My topics of research include natural language processing, dataset creation, model evaluation, and multimodal models.
Shayne Longpre (Massachusetts Institute of Technology)
Sebastian Nagel
Leon Weber (LMU Munich)
Manuel Muñoz
Jian Zhu (University of British Columbia)
Daniel Van Strien (British Library)
Zaid Alyafeai (King Fahad University of Petroleum and Minerals)
Khalid Almubarak
Minh Chien Vu (DETOMO Inc.)
Itziar Gonzalez-Dios (Universidad del País Vasco)
Aitor Soroa (University of the Basque Country. UPV/EHU.)
Kyle Lo (Allen Institute for AI)
Manan Dey (SAP)
Pedro Ortiz Suarez (German Research Center for AI)
Aaron Gokaslan (Cornell University)
Shamik Bose
David Adelani (University College London)

I am a Research Fellow (or DeepMind Academic Fellow) at University College London, UK.
Long Phan (VietAI)
Hieu Tran (VietAI Research)
Ian Yu (Groupby Inc)
Suhas Pai
Jenny Chim (Queen Mary University London)
Violette Lepercq (Hugging Face)
Suzana Ilic (Universität Innsbruck)
Margaret Mitchell (Hugging Face)
Sasha Alexandra Luccioni (Hugging Face)
Yacine Jernite (Hugging Face)
More from the Same Authors
-
2020 : Analyzing Sustainability Reports Using Natural Language Processing »
Alexandra Luccioni -
2021 Spotlight: Habitat 2.0: Training Home Assistants to Rearrange their Habitat »
Andrew Szot · Alexander Clegg · Eric Undersander · Erik Wijmans · Yili Zhao · John Turner · Noah Maestre · Mustafa Mukadam · Devendra Singh Chaplot · Oleksandr Maksymets · Aaron Gokaslan · Vladimír Vondruš · Sameer Dharur · Franziska Meier · Wojciech Galuba · Angel Chang · Zsolt Kira · Vladlen Koltun · Jitendra Malik · Manolis Savva · Dhruv Batra -
2021 : Habitat-Matterport 3D Dataset (HM3D): 1000 Large-scale 3D Environments for Embodied AI »
Santhosh Kumar Ramakrishnan · Aaron Gokaslan · Erik Wijmans · Oleksandr Maksymets · Alexander Clegg · John Turner · Eric Undersander · Wojciech Galuba · Andrew Westbury · Angel Chang · Manolis Savva · Yili Zhao · Dhruv Batra -
2022 : Bias Assessment of Text-to-Image Models »
Sasha Alexandra Luccioni · Clémentine Fourrier · Nathan Lambert · Unso Eun Seo Jo · Irene Solaiman · Helen Ngo · Nazneen Rajani · Giada Pistilli · Yacine Jernite · Margaret Mitchell -
2022 : Active Learning Over Multiple Domains in Natural Language Tasks »
Shayne Longpre · Julia Reisler · Edward Huang · Yi Lu · Andrew Frank · Nikhil Ramesh · Chris DuBois -
2022 : YOSM: Yorùbá Sentiment Corpus for Movie Reviews »
Iyanuoluwa Shode · David Adelani · Anna Feldman -
2022 : Panel: Assessing AI’s impacts on greenhouse gas emissions and climate change adaptation »
George Kamiya · Sasha Alexandra Luccioni · Costa Samaras -
2022 Panel: Panel 3C-6: LAION-5B: An open… & The BigScience ROOTS… »
Hugo Laurençon · Christoph Schuhmann -
2022 : Panel »
Vinay Prabhu · Lamtharn (Hanoi) Hantrakul · Stella Biderman · Deven Desai -
2022 : BigScience: A Case Study in the Social Construction of a Multilingual Large Language Model »
Christopher Akiki · Giada Pistilli · Margot Mieskes · Matthias Gallé · Thomas Wolf · Suzana Ilic · Yacine Jernite -
2022 : EleutherAI: Going Beyond "Open Science" to "Science in the Open" »
Jason Phang · Herbie Bradley · Leo Gao · Louis Castricato · Stella Biderman -
2022 : Towards Openness Beyond Open Access: User Journeys through 3 Open AI Collaboratives »
Jennifer Ding · Christopher Akiki · Yacine Jernite · Anne Steele · Temi Popo -
2022 Poster: BigBio: A Framework for Data-Centric Biomedical Natural Language Processing »
Jason Fries · Leon Weber · Natasha Seelam · Gabriel Altay · Debajyoti Datta · Samuele Garda · Sunny Kang · Rosaline Su · Wojciech Kusa · Samuel Cahyawijaya · Fabio Barth · Simon Ott · Matthias Samwald · Stephen Bach · Stella Biderman · Mario Sänger · Bo Wang · Alison Callahan · Daniel León Periñán · Théo Gigant · Patrick Haller · Jenny Chim · Jose Posada · John Giorgi · Karthik Rangasai Sivaraman · Marc Pàmies · Marianna Nezhurina · Robert Martin · Michael Cullan · Moritz Freidank · Nathan Dahlberg · Shubhanshu Mishra · Shamik Bose · Nicholas Broad · Yanis Labrak · Shlok Deshmukh · Sid Kiblawi · Ayush Singh · Minh Chien Vu · Trishala Neeraj · Jonas Golde · Albert Villanova del Moral · Benjamin Beilharz -
2022 Poster: What are the best Systems? New Perspectives on NLP Benchmarking »
Pierre Colombo · Nathan Noiry · Ekhine Irurozki · Stephan Clémençon -
2022 Poster: Beyond Mahalanobis Distance for Textual OOD Detection »
Pierre Colombo · Eduardo Dadalto · Guillaume Staerman · Nathan Noiry · Pablo Piantanida -
2022 Social: Ethics Review - Open Discussion »
Deborah Raji · William Isaac · Cherie Poland · Alexandra Luccioni -
2022 Poster: Multi-LexSum: Real-world Summaries of Civil Rights Lawsuits at Multiple Granularities »
Zejiang Shen · Kyle Lo · Lauren Yu · Nathan Dahlberg · Margo Schlanger · Doug Downey -
2022 : Cross-lingual Transfer for Named Entity Recognition: A study on African Languages »
David Adelani -
2021 Poster: Distributed Deep Learning In Open Collaborations »
Michael Diskin · Alexey Bukhtiyarov · Max Ryabinin · Lucile Saulnier · Quentin Lhoest · Anton Sinitsin · Dmitry Popov · Dmitry V. Pyrkin · Maxim Kashirin · Alexander Borzunov · Albert Villanova del Moral · Denis Mazur · Ilia Kobelev · Yacine Jernite · Thomas Wolf · Gennady Pekhimenko -
2021 Poster: TöRF: Time-of-Flight Radiance Fields for Dynamic Scene View Synthesis »
Benjamin Attal · Eliot Laidlaw · Aaron Gokaslan · Changil Kim · Christian Richardt · James Tompkin · Matthew O'Toole -
2021 Poster: FLEX: Unifying Evaluation for Few-Shot NLP »
Jonathan Bragg · Arman Cohan · Kyle Lo · Iz Beltagy -
2021 Poster: Habitat 2.0: Training Home Assistants to Rearrange their Habitat »
Andrew Szot · Alexander Clegg · Eric Undersander · Erik Wijmans · Yili Zhao · John Turner · Noah Maestre · Mustafa Mukadam · Devendra Singh Chaplot · Oleksandr Maksymets · Aaron Gokaslan · Vladimír Vondruš · Sameer Dharur · Franziska Meier · Wojciech Galuba · Angel Chang · Zsolt Kira · Vladlen Koltun · Jitendra Malik · Manolis Savva · Dhruv Batra -
2021 : Training Transformers Together »
Alexander Borzunov · Max Ryabinin · Tim Dettmers · Quentin Lhoest · Lucile Saulnier · Michael Diskin · Yacine Jernite · Thomas Wolf -
2019 : Lunch + Poster Session »
Frederik Gerzer · Bill Yang Cai · Pieter-Jan Hoedt · Kelly Kochanski · Soo Kyung Kim · Yunsung Lee · Sunghyun Park · Sharon Zhou · Martin Gauch · Jonathan Wilson · Joyjit Chatterjee · Shamindra Shrotriya · Dimitri Papadimitriou · Christian Schön · Valentina Zantedeschi · Gabriella Baasch · Willem Waegeman · Gautier Cosne · Dara Farrell · Brendan Lucier · Letif Mones · Caleb Robinson · Tafara Chitsiga · Victor Kristof · Hari Prasanna Das · Yimeng Min · Alexandra Puchko · Alexandra Luccioni · Kyle Story · Jason Hickey · Yue Hu · Björn Lütjens · Zhecheng Wang · Renzhi Jing · Genevieve Flaspohler · Jingfan Wang · Saumya Sinha · Qinghu Tang · Armi Tiihonen · Ruben Glatt · Muge Komurcu · Jan Drgona · Juan Gomez-Romero · Ashish Kapoor · Dylan J Fitzpatrick · Alireza Rezvanifar · Adrian Albert · Olya (Olga) Irzak · Kara Lamb · Ankur Mahesh · Kiwan Maeng · Frederik Kratzert · Sorelle Friedler · Niccolo Dalmasso · Alex Robson · Lindiwe Malobola · Lucas Maystre · Yu-wen Lin · Surya Karthik Mukkavili · Brian Hutchinson · Alexandre Lacoste · Yanbing Wang · Zhengcheng Wang · Yinda Zhang · Victoria Preston · Jacob Pettit · Draguna Vrabie · Miguel Molina-Solana · Tonio Buonassisi · Andrew Annex · Tunai P Marques · Catalin Voss · Johannes Rausch · Max Evans -
2018 : Bias and fairness in AI »
Timnit Gebru · Margaret Mitchell · Brittny-Jade E Saunders -
2013 Poster: Discovering Hidden Variables in Noisy-Or Networks using Quartet Tests »
Yacine Jernite · Yoni Halpern · David Sontag