BigBio: A Framework for Data-Centric Biomedical Natural Language Processing
Jason Fries · Leon Weber · Natasha Seelam · Gabriel Altay · Debajyoti Datta · Samuele Garda · Sunny Kang · Rosaline Su · Wojciech Kusa · Samuel Cahyawijaya · Fabio Barth · Simon Ott · Matthias Samwald · Stephen Bach · Stella Biderman · Mario Sänger · Bo Wang · Alison Callahan · Daniel León Periñán · Théo Gigant · Patrick Haller · Jenny Chim · Jose Posada · John Giorgi · Karthik Rangasai Sivaraman · Marc Pàmies · Marianna Nezhurina · Robert Martin · Michael Cullan · Moritz Freidank · Nathan Dahlberg · Shubhanshu Mishra · Shamik Bose · Nicholas Broad · Yanis Labrak · Shlok Deshmukh · Sid Kiblawi · Ayush Singh · Minh Chien Vu · Trishala Neeraj · Jonas Golde · Albert Villanova del Moral · Benjamin Beilharz

Thu Dec 01 02:00 PM -- 04:00 PM (PST) @ Hall J #1012

Training and evaluating language models increasingly requires the construction of meta-datasets -- diverse collections of curated data with clear provenance. Natural language prompting has recently led to improved zero-shot generalization by transforming existing, supervised datasets into a variety of novel instruction tuning tasks, highlighting the benefits of meta-dataset curation. While successful in general-domain text, translating these data-centric approaches to biomedical language modeling remains challenging, as labeled biomedical datasets are significantly underrepresented in popular data hubs. To address this challenge, we introduce BigBio, a community library of 126+ biomedical NLP datasets, currently covering 13 task categories and 10+ languages. BigBio facilitates reproducible meta-dataset curation via programmatic access to datasets and their metadata, and is compatible with current platforms for prompt engineering and end-to-end few/zero-shot language model evaluation. We discuss our process for task schema harmonization, data auditing, and contribution guidelines, and outline two illustrative use cases: zero-shot evaluation of biomedical prompts and large-scale, multi-task learning. BigBio is an ongoing community effort and is available at https://github.com/bigscience-workshop/biomedical.

Author Information

Jason Fries (Stanford University)
Leon Weber (LMU Munich)
Natasha Seelam (Sherlock Biosciences)
Gabriel Altay (Tempus Labs)
Debajyoti Datta (University of Virginia)
Samuele Garda (Department of Computer Science, Humboldt Universität Berlin)
Sunny Kang (Immuneering)
Rosaline Su (Workhuman)
Wojciech Kusa (TU Wien)
Samuel Cahyawijaya (The Hong Kong University of Science and Technology)
Fabio Barth (Humboldt Universität Berlin)
Simon Ott (Medical University of Vienna)
Matthias Samwald (Institute of Artificial Intelligence, Medical University of Vienna)
Stephen Bach (Brown University)
Stella Biderman (EleutherAI)
Mario Sänger (Humboldt Universität Berlin)
Bo Wang (Massachusetts General Hospital)
Alison Callahan (Stanford University)
Daniel León Periñán (Max Delbrück Center for Molecular Medicine)
Théo Gigant (L2S - Centrale Supélec)
Patrick Haller (Humboldt Universität Berlin)
Jenny Chim (Queen Mary University London)
Jose Posada (Universidad del Norte)

Jose Posada holds a PhD in Biomedical Informatics from the University of Pittsburgh and is an Assistant Professor in the Department of Systems and Computer Engineering. His research interests are the generation of high-quality clinical evidence through the OHDSI federated network and the application of NLP to clinical text using AI. Prior to his current role, he was a Senior Clinical Data Scientist at Stanford University, where he co-developed the second generation of its clinical research data warehouse, STARR-OMOP.

John Giorgi
Karthik Rangasai Sivaraman (BITS Pilani)
Marc Pàmies (Barcelona Supercomputing Center)
Marianna Nezhurina (Kuban State University of Technology)
Robert Martin (Department of Computer Science, Humboldt Universität Berlin)
Michael Cullan (Arizona State University)
Moritz Freidank (Visium SA)
Nathan Dahlberg
Shubhanshu Mishra (shubhanshu.com)
Shamik Bose
Nicholas Broad (Stanford University)
Yanis Labrak (Avignon University)

👨🏻‍🎓 PhD. student in Computer Science (CS), Avignon University 🇫🇷 🏛 Research Scientist - Machine Learning in Healthcare

Shlok Deshmukh (Elucidata, Inc.)
Sid Kiblawi (Microsoft)
Ayush Singh (Cigna)

Ayush Singh is a Research Engineer working at the intersection of Natural Language Processing, Machine Learning, and Healthcare. He has worked on applications ranging from modeling the physics of fetal brain in-situ MRI scans to style transfer in natural language generation. Recently, he has been working on using clinical knowledge graphs for query reasoning in large language models. Ayush is driven to improve the state of education and healthcare using AI. When not working, he paints and watches cat and dog videos on the internet.

Minh Chien Vu (DETOMO Inc.)
Trishala Neeraj (Cornell University)
Jonas Golde (Department of Computer Science, Humboldt Universität Berlin)
Albert Villanova del Moral (CNRS)
Benjamin Beilharz (Technische Universität Darmstadt)
