
Workshop: Deep Generative Models for Health

Large Language Models with Retrieval-Augmented Generation for Zero-Shot Disease Phenotyping

Will Thompson · David Vidmar · Jessica De Freitas · Gabriel Altay · Kabir Manghnani · Andrew Nelsen · Kellie Morland · John Pfeifer · Brandon Fornwalt · RuiJun Chen · Martin Stumpe · Riccardo Miotto

Abstract: Identifying disease phenotypes from electronic health records (EHRs) is critical for numerous secondary uses. Manually encoding physician knowledge into rules is particularly challenging for rare diseases due to inadequate EHR coding, necessitating review of clinical notes. Large language models (LLMs) offer promise in text understanding but may not efficiently handle real-world clinical documentation. We propose a zero-shot LLM-based method enriched by retrieval-augmented generation and MapReduce, which pre-identifies disease-related text snippets to be used in parallel as queries for the LLM to establish diagnosis. This method, as applied to pulmonary hypertension, a rare disease, significantly outperforms physician logic rules ($F_1$ score of 0.75 vs. 0.62 for the rules). It has the potential to enhance rare disease cohort and care gap identification, expanding the scope of robust clinical research.
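
The retrieve, map, and reduce flow described in the abstract can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration, not the authors' implementation: the retrieval pattern, the `stub_llm` stand-in (a keyword heuristic in place of a real model call), and the vote-threshold reduce step are all hypothetical.

```python
import re
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List

# Hypothetical retrieval pattern; the abstract does not list the actual terms.
DISEASE_TERMS = re.compile(r"pulmonary hypertension", re.IGNORECASE)


def retrieve_snippets(notes: List[str], window: int = 80) -> List[str]:
    """Retrieval step: keep only text windows around disease-related mentions."""
    snippets = []
    for note in notes:
        for match in DISEASE_TERMS.finditer(note):
            start = max(0, match.start() - window)
            end = min(len(note), match.end() + window)
            snippets.append(note[start:end])
    return snippets


def stub_llm(snippet: str) -> bool:
    """Stand-in for a zero-shot LLM query (assumption: returns True if the
    snippet supports a diagnosis). A real system would prompt a model here."""
    text = snippet.lower()
    return "diagnosed with" in text and "no evidence of" not in text


def phenotype(
    notes: List[str],
    llm: Callable[[str], bool] = stub_llm,
    min_positive: int = 1,
) -> bool:
    """Map each retrieved snippet to an LLM verdict in parallel, then reduce
    the verdicts to one patient-level label (hypothetical threshold rule)."""
    snippets = retrieve_snippets(notes)
    if not snippets:
        return False  # nothing disease-related was retrieved
    with ThreadPoolExecutor() as pool:
        votes = list(pool.map(llm, snippets))  # map step
    return sum(votes) >= min_positive  # reduce step
```

Restricting the LLM to retrieved snippets keeps each query short, and because the per-snippet queries are independent they can run concurrently before a single aggregation step.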
