Training and evaluating language models increasingly requires the construction of meta-datasets -- diverse collections of curated data with clear provenance. Natural language prompting has recently led to improved zero-shot generalization by transforming existing, supervised datasets into a variety of novel instruction tuning tasks, highlighting the benefits of meta-dataset curation. While successful in general-domain text, translating these data-centric approaches to biomedical language modeling remains challenging, as labeled biomedical datasets are significantly underrepresented in popular data hubs. To address this challenge, we introduce BigBio, a community library of 126+ biomedical NLP datasets, currently covering 13 task categories and 10+ languages. BigBio facilitates reproducible meta-dataset curation via programmatic access to datasets and their metadata, and is compatible with current platforms for prompt engineering and end-to-end few/zero-shot language model evaluation. We discuss our process for task schema harmonization, data auditing, and contribution guidelines, and outline two illustrative use cases: zero-shot evaluation of biomedical prompts and large-scale, multi-task learning. BigBio is an ongoing community effort and is available at https://github.com/bigscience-workshop/biomedical
Author Information
Jason Fries (Stanford University)
Leon Weber (LMU Munich)
Natasha Seelam (Sherlock Biosciences)
Gabriel Altay (Tempus Labs)
Debajyoti Datta (University of Virginia)
Samuele Garda (Department of Computer Science, Humboldt University Berlin, Humboldt Universität Berlin)
Sunny Kang (Immuneering)
Rosaline Su (Workhuman)
Wojciech Kusa (TU Wien)
Samuel Cahyawijaya (The Hong Kong University of Science and Technology)
Fabio Barth (Humboldt Universität Berlin)
Simon Ott (Medical University of Vienna)
Matthias Samwald (Institute of Artificial Intelligence, Medical University of Vienna)
Stephen Bach (Brown University)
Stella Biderman (EleutherAI)
Mario Sänger (Humboldt Universität Berlin)
Bo Wang (Massachusetts General Hospital)
Alison Callahan (Stanford University)
Daniel León Periñán (Max Delbrück Center for Molecular Medicine)
Théo Gigant (L2S - Centrale Supélec)
Patrick Haller (Humboldt Universität Berlin)
Jenny Chim (Queen Mary University London)
Jose Posada (Universidad del Norte)
PhD in Biomedical Informatics from the University of Pittsburgh. Assistant Professor at the Department of Systems and Computer Engineering. His research interests are the generation of high-quality clinical evidence through the OHDSI federated network and the application of NLP to clinical text using AI. Prior to his current role, he was a Senior Clinical Data Scientist at Stanford University, where he co-developed the second generation of their clinical research data warehouse, STARR-OMOP.
John Giorgi
Karthik Rangasai Sivaraman (BITS Pilani)
Marc Pàmies (Barcelona Supercomputing Center)
Marianna Nezhurina (Kuban State University of Technology)
Robert Martin (Department of Computer Science, Humboldt University Berlin, Humboldt Universität Berlin)
Michael Cullan (Arizona State University)
Moritz Freidank (Visium SA)
Nathan Dahlberg
Shubhanshu Mishra (shubhanshu.com)
Shamik Bose
Nicholas Broad (Stanford University)
Yanis Labrak (Avignon University)
PhD student in Computer Science (CS), Avignon University 🇫🇷. Research Scientist, Machine Learning in Healthcare.
Shlok Deshmukh (Elucidata, Inc.)
Sid Kiblawi (Microsoft)
Ayush Singh (Cigna)

Ayush Singh is a Research Engineer working at the intersection of Natural Language Processing, Machine Learning, and Healthcare. He has worked on applications ranging from modeling the physics of fetal brain in-situ MRI scans to style transfer in natural language generation. Recently, he has been working on using clinical knowledge graphs for query reasoning in large language models. Ayush is driven to improve the state of education and healthcare using AI. When not working, he paints and watches cat and dog videos on the internet.
Minh Chien Vu (DETOMO Inc.)
Trishala Neeraj (Cornell University)
Jonas Golde (Department of Computer Science, Humboldt University Berlin, Humboldt Universität Berlin)
Albert Villanova del Moral (CNRS)
Benjamin Beilharz (Technische Universität Darmstadt)
More from the Same Authors
-
2021 : Evaluation of mathematical questioning strategies using data collected through weak supervision »
Debajyoti Datta · Maria Phillips · James P. Bywater · Jennifer L. Chiu · Ginger S. Watson · Laura E Barnes · Donald Brown -
2022 : scPerturb: Information Resource for Harmonized Single-Cell Perturbation Data »
Tessa Green · Stefan Peidli · Ciyue Shen · Torsten Gross · Joseph Min · Samuele Garda · Jake Taylor-King · Debora Marks · Augustin Luna · Nils Blüthgen · Chris Sander -
2022 : PyTAIL - Interactive and Incremental Learning of NLP Models with Human in the Loop for Online Data »
Shubhanshu Mishra · Jana Diesner -
2022 : Panel »
Vinay Prabhu · Lamtharn (Hanoi) Hantrakul · Stella Biderman · Deven Desai -
2022 : EleutherAI: Going Beyond "Open Science" to "Science in the Open" »
Jason Phang · Herbie Bradley · Leo Gao · Louis Castricato · Stella Biderman -
2022 Poster: Tight Lower Bounds on Worst-Case Guarantees for Zero-Shot Learning with Attributes »
Alessio Mazzetto · Cristina Menghini · Andrew Yuan · Eli Upfal · Stephen Bach -
2022 Poster: TweetNERD - End to End Entity Linking Benchmark for Tweets »
Shubhanshu Mishra · Aman Saini · Raheleh Makki · Sneha Mehta · Aria Haghighi · Ali Mollahosseini -
2022 Poster: The BigScience ROOTS Corpus: A 1.6TB Composite Multilingual Dataset »
Hugo Laurençon · Lucile Saulnier · Thomas Wang · Christopher Akiki · Albert Villanova del Moral · Teven Le Scao · Leandro Von Werra · Chenghao Mou · Eduardo González Ponferrada · Huu Nguyen · Jörg Frohberg · Mario Šaško · Quentin Lhoest · Angelina McMillan-Major · Gerard Dupont · Stella Biderman · Anna Rogers · Loubna Ben allal · Francesco De Toni · Giada Pistilli · Olivier Nguyen · Somaieh Nikpoor · Maraim Masoud · Pierre Colombo · Javier de la Rosa · Paulo Villegas · Tristan Thrush · Shayne Longpre · Sebastian Nagel · Leon Weber · Manuel Muñoz · Jian Zhu · Daniel Van Strien · Zaid Alyafeai · Khalid Almubarak · Minh Chien Vu · Itziar Gonzalez-Dios · Aitor Soroa · Kyle Lo · Manan Dey · Pedro Ortiz Suarez · Aaron Gokaslan · Shamik Bose · David Adelani · Long Phan · Hieu Tran · Ian Yu · Suhas Pai · Jenny Chim · Violette Lepercq · Suzana Ilic · Margaret Mitchell · Sasha Alexandra Luccioni · Yacine Jernite -
2022 Poster: Multi-LexSum: Real-world Summaries of Civil Rights Lawsuits at Multiple Granularities »
Zejiang Shen · Kyle Lo · Lauren Yu · Nathan Dahlberg · Margo Schlanger · Doug Downey -
2021 : Poster Session 2 »
Yueqiu Sun · Haewon Jeong · Nrupatunga . · Hengyao Bao · Tongwen Huang · Debajyoti Datta -
2021 : H+AI:: Improving mathematical questioning in teacher training »
Debajyoti Datta -
2021 Poster: Distributed Deep Learning In Open Collaborations »
Michael Diskin · Alexey Bukhtiyarov · Max Ryabinin · Lucile Saulnier · Quentin Lhoest · Anton Sinitsin · Dmitry Popov · Dmitry V. Pyrkin · Maxim Kashirin · Alexander Borzunov · Albert Villanova del Moral · Denis Mazur · Ilia Kobelev · Yacine Jernite · Thomas Wolf · Gennady Pekhimenko -
2020 Session: Orals & Spotlights Track 23: Graph/Meta Learning/Software »
Stephen Bach · Tom Goldstein -
2017 Workshop: Learning with Limited Labeled Data: Weak Supervision and Beyond »
Isabelle Augenstein · Stephen Bach · Eugene Belilovsky · Matthew Blaschko · Christoph Lampert · Edouard Oyallon · Emmanouil Antonios Platanios · Alexander Ratner · Christopher Ré -
2012 Poster: Scaling MPE Inference for Constrained Continuous Markov Random Fields with Consensus Optimization »
Stephen H Bach · Matthias Broecheler · Lise Getoor · Dianne P O'Leary -
2010 Poster: A Bayesian Approach to Concept Drift »
Stephen H Bach · Mark Maloof