Poster

The BigScience ROOTS Corpus: A 1.6TB Composite Multilingual Dataset

Hugo Laurençon ⋅ Lucile Saulnier ⋅ Thomas Wang ⋅ Christopher Akiki ⋅ Albert Villanova del Moral ⋅ Teven Le Scao ⋅ Leandro Von Werra ⋅ Chenghao Mou ⋅ Eduardo González Ponferrada ⋅ Huu Nguyen ⋅ Jörg Frohberg ⋅ Mario Šaško ⋅ Quentin Lhoest ⋅ Angelina McMillan-Major ⋅ Gerard Dupont ⋅ Stella Biderman ⋅ Anna Rogers ⋅ Loubna Ben allal ⋅ Francesco De Toni ⋅ Giada Pistilli ⋅ Olivier Nguyen ⋅ Somaieh Nikpoor ⋅ Maraim Masoud ⋅ Pierre Colombo ⋅ Javier de la Rosa ⋅ Paulo Villegas ⋅ Tristan Thrush ⋅ Shayne Longpre ⋅ Sebastian Nagel ⋅ Leon Weber ⋅ Manuel Muñoz ⋅ Jian Zhu ⋅ Daniel Van Strien ⋅ Zaid Alyafeai ⋅ Khalid Almubarak ⋅ Minh Chien Vu ⋅ Itziar Gonzalez-Dios ⋅ Aitor Soroa ⋅ Kyle Lo ⋅ Manan Dey ⋅ Pedro Ortiz Suarez ⋅ Aaron Gokaslan ⋅ Shamik Bose ⋅ David Adelani ⋅ Long Phan ⋅ Hieu Tran ⋅ Ian Yu ⋅ Suhas Pai ⋅ Jenny Chim ⋅ Violette Lepercq ⋅ Suzana Ilic ⋅ Margaret Mitchell ⋅ Sasha Alexandra Luccioni ⋅ Yacine Jernite

Keywords: Multilingual, BigScience, Language Modeling, dataset

2022 Poster


Abstract

As language models grow ever larger, the need for large-scale high-quality text datasets has never been more pressing, especially in multilingual settings. The BigScience workshop, a 1-year international and multidisciplinary initiative, was formed with the goal of researching and training large language models as a values-driven undertaking, putting issues of ethics, harm, and governance in the foreground. This paper documents the data creation and curation efforts undertaken by BigScience to assemble the Responsible Open-science Open-collaboration Text Sources (ROOTS) corpus, a 1.6TB dataset spanning 59 languages that was used to train the 176-billion-parameter BigScience Large Open-science Open-access Multilingual (BLOOM) language model. We further release a large initial subset of the corpus and analyses thereof, and hope to empower large-scale monolingual and multilingual modeling projects with both the data and the processing tools, as well as stimulate research around this large multilingual corpus.
