
Bridging the Gap: from Machine Learning Research to Clinical Practice
Julia Vogt · Ece Ozkan · Sonali Parbhoo · Melanie F. Pradier · Patrick Schwab · Shengpu Tang · Mario Wieser · Jiayu Yao

Tue Dec 14 05:30 AM -- 02:30 PM (PST) @ None
Event URL: https://sites.google.com/g.harvard.edu/research2clinics

Machine learning (ML) methods often achieve superhuman performance levels. However, most existing ML research in the medical domain stalls at the research paper level and is not implemented in daily clinical practice. To realize the promise of cutting-edge ML techniques and bring this research to fruition, we must bridge the gap between research and clinics. In this workshop, we aim to bring together ML researchers and clinicians to discuss the challenges of, and potential solutions for, using state-of-the-art ML techniques in daily clinical practice and ultimately improving healthcare. We will try to answer questions such as the following, among many others: What procedures bring humans into the loop for auditing ML systems for healthcare? Are the proposed ML methods robust to changes in population, distribution shifts, or other types of biases? What requirements should ML methods and systems fulfill to be deployed successfully in clinics? What are the failure modes of ML models for healthcare? How can we develop methods for improved interpretability of ML predictions in the context of healthcare? We will further discuss translational and implementational aspects, as well as challenges and lessons learned from integrating an ML system into the clinical workflow.

Tue 5:30 a.m. - 5:40 a.m.
Opening remarks by the organizers (Short intro)
Tue 5:40 a.m. - 6:00 a.m.
Invited talk (Clinical) - Sven Wellmann (Invited talk)   
Sven Wellmann
Tue 6:05 a.m. - 6:25 a.m.
Invited talk (ML) - Michael Brudno (Invited talk)
Michael Brudno
Tue 6:30 a.m. - 7:10 a.m.
Moderated Q&A (Topic: pediatrics) (Moderated Q&A)
Tue 7:10 a.m. - 7:25 a.m.
Tue 7:25 a.m. - 8:10 a.m.
Informal exchange to encourage discussions and collaborations (Round table discussion)
Tue 8:15 a.m. - 8:35 a.m.
Spotlight Presentations (Spotlight)
Tue 8:40 a.m. - 9:25 a.m.
Poster Session 1 (Poster Session)
Tue 9:30 a.m. - 10:15 a.m.
Lunch Break (Break)
Tue 10:15 a.m. - 10:35 a.m.
Invited talk (ML) - Rich Caruana (Invited talk)   
Rich Caruana
Tue 10:40 a.m. - 11:00 a.m.
Invited talk (Clinical) - Bram Stieltjes (Invited talk)
Bram Stieltjes
Tue 11:05 a.m. - 11:45 a.m.
Moderated Q&A (Topic: tbd) (Moderated Q&A)
Tue 11:45 a.m. - 12:00 p.m.
Tue 12:00 p.m. - 12:20 p.m.
Invited talk (ML) - Barbara Engelhardt (Invited talk (ML))
Barbara Engelhardt
Tue 12:25 p.m. - 12:45 p.m.
Invited talk (Clinical) - Roy Perlis (Invited talk)
Roy Perlis
Tue 12:50 p.m. - 1:30 p.m.
Moderated Q&A (Topic: tbd) (Moderated Q&A)
Tue 1:35 p.m. - 2:15 p.m.
Poster Session 2 (Poster Session)
Tue 2:20 p.m. - 2:30 p.m.
Closing remarks by the organizers (Short intro)
Tue 2:30 p.m. - 2:30 p.m.
Workshop ends (Break)
[ Visit Poster at Spot A1 in Virtual World ]

Deep learning excels in the analysis of unstructured data, and recent advancements allow these techniques to be extended to survival analysis. In clinical radiology, this makes it possible, for example, to relate unstructured volumetric images to a risk score or a prognosis of life expectancy and thereby support clinical decision making. Medical applications are, however, highly critical, and consequently neither medical personnel nor patients usually accept black-box models as a reason or basis for decisions. Apart from aversion to new technologies, this is due to the missing interpretability, transparency, and accountability of many machine learning methods. We propose a hazard-regularized variational autoencoder that supports straightforward interpretation of deep neural architectures in the context of survival analysis, a field highly relevant in healthcare. We apply the proposed approach to abdominal CT scans of patients with liver tumors and their corresponding survival times.

Tobias Weber · Bernd Bischl · David Ruegamer
[ Visit Poster at Spot A2 in Virtual World ]

Recent strides in interpretable machine learning (ML) research reveal that models exploit undesirable patterns in the data to make predictions, which potentially causes harms in deployment. However, it is unclear how we can fix these models. We present our ongoing work, GAM Changer, an open-source interactive system to help data scientists and domain experts easily and responsibly edit their Generalized Additive Models (GAMs). With novel visualization techniques, our tool puts interpretability into action—empowering human users to analyze, validate, and align model behaviors with their knowledge and values. Built using modern web technologies, this tool runs locally in users’ computational notebooks or web browsers without requiring extra compute resources, lowering the barrier to creating more responsible ML models. GAM Changer is available at https://r2c-submission.surge.sh.

Zijie Jay Wang · Harsha Nori · Duen Horng Chau · Jennifer Wortman Vaughan · Rich Caruana
[ Visit Poster at Spot A3 in Virtual World ]

Decision support systems based on clinical notes have the potential to improve patient care by pointing doctors towards overlooked risks. Predicting a patient's outcome is an essential part of such systems, for which the use of deep neural networks has shown promising results. However, the patterns learned by these networks are mostly opaque, and previous work revealed flaws regarding the reproduction of unintended biases. We thus introduce an extendable testing framework that evaluates the behavior of clinical outcome models with respect to changes in the input. The framework helps to understand learned patterns and their influence on model decisions. In this work, we apply it to analyse the change in behavior with regard to the patient characteristics gender, age and ethnicity. Our evaluation of three current clinical NLP models demonstrates the concrete effects of these characteristics on the models' decisions. The results show that model behavior varies drastically even when fine-tuned on the same data and that allegedly best-performing models have not always learned the most medically plausible patterns.

Betty van Aken · Alexander Löser
[ Visit Poster at Spot A4 in Virtual World ]

Recent work in artificial intelligence fairness tackles discrimination by constraining optimization programs to achieve parity of some fairness statistic. Most assume certainty on the class label which is impractical in many clinical practices, such as risk stratification, medication-assisted treatment and precision medicine. Instead, we consider fairness in longitudinal censored decision making environments, where the time to an event of interest might be unknown for a subset of the study group, resulting in censorship on the class label and inapplicability of existing fairness studies. To this end, we extend and devise applicable fairness statistics as well as a new debiasing algorithm, thus providing necessary complements to these important socially sensitive tasks. Experiments on real-world censored and discriminated datasets illustrate and confirm the utility of our approach.

Wenbin Zhang · Jeremy Weiss
[ Visit Poster at Spot A5 in Virtual World ]

Accurately estimating personalized treatment effects within a single study has been challenging due to the limited sample size. Here we propose a tree-based model averaging approach to improve the estimation efficiency of conditional average treatment effects for the population of a target research site by leveraging models derived from potentially heterogeneous populations of other sites, without those sites sharing individual-level data. To the best of our knowledge, there is no established model averaging approach for distributed data with a focus on improving the estimation of treatment effects. Under distributed data networks, we develop an efficient and interpretable tree-based ensemble of personalized treatment effect estimators to join results across hospital sites, while actively modeling the heterogeneity in data sources through site partitioning. The efficiency of this approach is demonstrated by a study of causal effects of oxygen saturation on hospital mortality and backed up by comprehensive numerical results.

Xiaoqing Tan
[ Visit Poster at Spot A6 in Virtual World ]

Despite the state-of-the-art performance of deep learning models, their lack of explainability impedes their deployment in day-to-day clinical practice. We propose REM, an explainable methodology for extracting rules from deep neural networks and combining them with rules from non-deep learning models. This allows integrating machine learning and reasoning for investigating basic and applied biological research questions. We evaluate the utility of REM in two cancer case studies and demonstrate that it can efficiently extract accurate and comprehensible rulesets from neural networks that can be readily integrated with rulesets obtained from tree-based approaches. REM provides explanation facilities for predictions and enables clinicians to validate and calibrate the extracted rulesets with their domain knowledge. With these functionalities, REM caters for a novel and direct human-in-the-loop approach in clinical decision-making.

Zohreh Shams · Botty Dimanov · Nikola Simidjievski · Helena Andres-Terre · Paul Scherer · Urška Matjašec · Mateja Jamnik · Pietro Lió
[ Visit Poster at Spot B0 in Virtual World ]

Deploying survival models in clinical settings requires both interpretability and transferability (that is, models that are easy to deploy) [Klau et al., 2018, Boulesteix et al., 2017]. The gold standard are linear models trained on only clinical data and at most one molecular data group, such as gene expression. However, black-box methods for multi-omics integration such as BlockForest [Hornung and Wright, 2019] have recently been shown to outperform both the clinical Cox Proportional Hazards model and multi-omics adapted linear models in terms of concordance [Herrmann et al., 2021, Hornung and Wright, 2019]. Thus, there is a need to make multi-omics methods amenable to clinical settings to leverage their excellent performance. We propose to use surrogate models, a technique long used in interpretable machine learning [Molnar, 2020], to create sparse linear models as surrogates for black-box multi-omics models. We show that these surrogates yield better performance than linear models trained directly on the input datasets and still achieve relatively high sparsity levels. Our implementation is available on Github (Link is embedded when clicking on "Github" - note that the repo may give an indication as to the authors’ affiliation).

David Wissel
[ Visit Poster at Spot B1 in Virtual World ]

Representation learning is an important component in solving most Natural Language Processing (NLP) problems, including Word Sense Disambiguation (WSD). The WSD task tries to find the best meaning in a knowledge base for a word with multiple meanings (an ambiguous word). WSD methods choose this best meaning based on the context, i.e., the words around the ambiguous word in the input text document. Thus, word representations may improve the effectiveness of disambiguation models if they carry useful information from the context and the knowledge base. A limitation of most current representation learning approaches is that they are trained on general English text and are not domain-specific. In this paper, we present a novel sense representation method for the biomedical domain that is aware of both the context and the knowledge base. The novelty of our representation is the integration of the knowledge base and the context. This representation lies in a space comparable to that of contextualized word vectors, thus allowing a word occurrence to be easily linked to its meaning by applying a simple nearest neighbor approach. Comparing our approach with state-of-the-art methods shows the effectiveness of our method in terms of text coherence.
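The nearest neighbor linking step mentioned in the abstract can be illustrated with a minimal sketch. It assumes that sense vectors and the contextualized vector of the ambiguous word already live in the same space; the toy vectors, the sense labels, and the function names `cosine` and `nearest_sense` are hypothetical and are not taken from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def nearest_sense(context_vec, sense_vecs):
    """Return the sense label whose vector is most similar to the context vector."""
    return max(sense_vecs, key=lambda label: cosine(context_vec, sense_vecs[label]))


# Hypothetical 3-dimensional sense vectors for the ambiguous word "cold".
SENSES = {
    "cold (temperature)": [0.9, 0.1, 0.0],
    "cold (common cold)": [0.1, 0.8, 0.3],
}

# Hypothetical contextualized vector of "cold" in a clinical note.
context = [0.2, 0.7, 0.4]
best = nearest_sense(context, SENSES)
print(best)
```

In practice the vectors would come from a trained encoder and the candidate senses from a biomedical knowledge base; the sketch only shows the final lookup.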

Mozhgan Saeidi
[ Visit Poster at Spot B2 in Virtual World ]

As the application of deep neural networks expands to areas requiring expertise, e.g., medicine and law, more careful annotation processes for training on expert knowledge are required. In particular, it is difficult to guarantee generalization performance in the clinical field when training on expert knowledge, where opinions on annotations may differ even among experts. To raise the issue of the annotation generation process for expertise training of CNNs, we verified the annotations for surgical phase recognition of laparoscopic cholecystectomy and subtotal gastrectomy for gastric cancer. We produce calibrated annotations for the seven phases of cholecystectomy by analyzing the discrepancies of previously annotated labels and by discussing the criteria of surgical phases. Because gastrectomy for gastric cancer has twenty-one more complex surgical phases, we generate consensus annotations through a revision process with five specialists. By training the CNN-based surgical phase recognition networks with the revised annotations, we achieved improved generalization performance over models trained with the original annotations under the same cross-validation settings. We showed that the expert data annotation pipeline for deep neural networks should be more rigorous, depending on the type of clinical problem to which it is applied.

Seungbum Hong · Jiwon Lee · Bokyung Park · Ahmed Abbas Alwusaibie · Anwar Hudaish Alfadhel · SungHyun Park · Woo Jin Hyung · Min-Kook Choi
[ Visit Poster at Spot B3 in Virtual World ]

Left ventricular ejection fraction (EF) is an important echocardiographic indicator of heart disease. Because echocardiography is costly and time-consuming, there is a need for a simpler method to predict a low EF in clinical practice. In recent years, deep neural network (DNN)-based models have been used to predict low EF with high accuracy based on electrocardiography (ECG), which is easy to perform. However, DNN-based models are incomprehensible to clinicians, and lack of interpretability is one of the biggest barriers to their use. In this paper, two new methods are proposed: a pre-processing method and an analysis method. The pre-processing method extracts one heartbeat of a fixed size from the ECG; it therefore enables many traditional machine learning approaches to be applied to ECG data. The analysis method involves interpretable and unsupervised mapping of the ECG using the pre-processing method and reveals, upon numerical evaluation, that one heartbeat on the ECG holds information on a low EF. The findings of an inverse analysis corresponded to previous clinical research, which suggests that the proposed method is reliable.

Hirotoshi Takeuchi · Mitsuhiko Nakamoto
[ Visit Poster at Spot B4 in Virtual World ]

Off-policy policy evaluation methods for sequential decision making can be used to help identify if a proposed decision policy is better than a current baseline policy. However, a new decision policy may be better than a baseline policy for some individuals but not others. This has motivated a push towards personalization and accurate per-state estimates of heterogeneous treatment effects (HTEs). Given the limited data present in many important applications such as health care, individual predictions can come at a cost to accuracy and confidence in such predictions. We develop a method to balance the need for personalization with confident predictions by identifying subgroups where it is possible to confidently estimate the expected difference in a new decision policy relative to a baseline. We propose a novel loss function that accounts for uncertainty during the subgroup partitioning phase. In experiments, we show that our method can be used to form accurate predictions of HTEs where other methods struggle.
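As background for the abstract above, off-policy evaluation estimates the value of a new policy from data logged under a baseline policy. The sketch below shows the ordinary importance sampling estimator, a standard textbook estimator and not the subgroup method proposed in the paper. The function name, the trajectory format, and the toy policies are assumptions made for illustration.

```python
def importance_sampling_value(trajectories, target_prob, behavior_prob, gamma=1.0):
    """Ordinary importance sampling estimate of the target policy's expected return.

    trajectories: list of trajectories, each a list of (state, action, reward).
    target_prob / behavior_prob: functions (state, action) -> action probability.
    """
    total = 0.0
    for trajectory in trajectories:
        weight = 1.0  # product of per-step probability ratios
        discounted_return = 0.0
        for t, (state, action, reward) in enumerate(trajectory):
            weight *= target_prob(state, action) / behavior_prob(state, action)
            discounted_return += (gamma ** t) * reward
        total += weight * discounted_return
    return total / len(trajectories)


# Hypothetical one-step example: the baseline picks each of two actions with
# probability 0.5; the new policy always picks action 1.
logged = [
    [("s", 1, 1.0)],
    [("s", 0, 0.0)],
]
estimate = importance_sampling_value(
    logged,
    target_prob=lambda s, a: 1.0 if a == 1 else 0.0,
    behavior_prob=lambda s, a: 0.5,
)
print(estimate)
```

With little data the importance weights make such estimates highly variable, which is one reason confident per-individual estimates are hard to obtain.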

Ramtin Keramati · Omer Gottesman · Leo Celi · Finale Doshi-Velez · Emma Brunskill
[ Visit Poster at Spot B5 in Virtual World ]

Agitation is a highly prevalent neuropsychiatric symptom in dementia that can negatively impact the Activities of Daily Living (ADL) and the independence of individuals. Detecting agitation episodes can assist in providing People Living with Dementia (PLWD) with early and timely interventions. Analysing agitation episodes will also help identify modifiable factors, such as ambient temperature and sleep, as possible components causing agitation in an individual. This preliminary study presents a supervised learning model to analyse the risk of agitation in PLWD using in-home monitoring data. The in-home monitoring data includes motion sensors, physiological measurements, and the use of kitchen appliances from 46 homes of PLWD between April 2019 and June 2021.
We apply a recurrent deep learning model to identify agitation episodes validated and recorded by a clinical monitoring team. We present the experiments to assess the efficacy of the proposed model. The proposed model achieves an average of 79.78% recall, 27.66% precision and 37.64% F1 score when employing the optimal parameters, suggesting a good ability to recognise agitation events. We also discuss using machine learning models for analysing behavioural patterns in continuous monitoring data and explore clinical applicability and the trade-off between sensitivity and specificity in home monitoring applications.
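For readers less familiar with the reported metrics, the following minimal sketch computes precision, recall and F1 from confusion counts. The counts are hypothetical and were chosen only to show a high-recall, low-precision regime; they are not the study's data. Note that when metrics are averaged over several runs, the averaged F1 need not equal the harmonic mean of the averaged precision and recall.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1 from true/false positive and false negative counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return precision, recall, f1


# Hypothetical counts: most true episodes are caught (high recall),
# but many alerts are false alarms (low precision).
precision, recall, f1 = precision_recall_f1(tp=8, fp=24, fn=2)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```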

Francesca Palermo · Ramin Nilforooshan · David Sharp · Payam Barnaghi
[ Visit Poster at Spot B6 in Virtual World ]

For researchers bridging the gap between machine learning and clinical practice, predictive models drawn from a variety of data streams remain an area of intense interest. In the specialty of psychiatry, the quality of such results is at times limited by type errors, with depression being a particularly egregious example of an overloaded concept. Here, we attempt to disambiguate the notion of depression by exploring its nuances spanning various data types, including diagnosis, mood episode, and symptom axis. A proposed type system resolution is provided for fortification of type safety, in the interest of improved interpretability of predictions in the context of healthcare.

Michael A Yee
[ Visit Poster at Spot C0 in Virtual World ]

An automated feature selection pipeline was developed using several state-of-the-art feature selection techniques to select optimal features for Differentiating Patterns of Care (DPOC). The pipeline included three types of feature selection techniques (filters, wrappers, and embedded methods) to select the top K features. Five different datasets with binary dependent variables were used, and the top K optimal features were selected for each. The selected features were tested in the existing multi-dimensional subset scanning (MDSS) pipeline, where the most anomalous subpopulations, most anomalous subsets, propensity scores, and effect of measures were recorded to test their performance. This performance was compared with four similar metrics obtained after using all covariates in the dataset in the MDSS pipeline. We found that, across the different feature selection techniques used, the data distribution is a key factor to consider when determining which technique to use.

Catherine Wanjiru · William Ogallo · Girmaw Abebe Tadesse · Charles Wachira · Isaiah Onando Mulang' · Aisha Walcott-Bryant
[ Visit Poster at Spot C1 in Virtual World ]

Learning meaningful representations is challenging when the training data is scarce. Attention maps can be used to verify that a model learned the target representations. Those representations should match human understanding, be generalizable to unseen data, and not focus on potential bias in the dataset. Attention maps are designed to highlight regions of the model’s input that were discriminative for its predictions. However, different attention maps computation methods often highlight different regions of the input, with sometimes contradictory explanations for a prediction. This effect is exacerbated when the training set is small. This indicates that either the model learned incorrect representations or that the attention maps methods did not accurately estimate the model’s representations. We propose an unsupervised fine-tuning method that optimizes the consistency of attention maps and show that it improves both classification performance and the quality of attention maps. We propose an implementation for two state-of-the-art attention computation methods, Grad-CAM and Guided Backpropagation, which relies on an input masking technique. We evaluate this method on our own dataset of event detection in continuous video recordings of hospital patients aggregated and curated for this work. As a sanity check, we also evaluate the proposed method on PASCAL VOC. On the video data, we show that the method can be combined with SimCLR, a state-of-the-art self-supervised training method, to further improve classification performance. With the proposed method, we achieve a 6.6 points lift of F1 score over SimCLR alone for classification on our video dataset, a 2.9 point lift of F1 score over ResNet for classification on

Ali Mirzazadeh · Florian Dubost · Daniel Fu · Khaled Saab · Christopher Lee-Messer · Daniel Rubin
[ Visit Poster at Spot C2 in Virtual World ]

Analyzing the behavior of a population in response to disease and interventions is critical to unearth variability in healthcare, to understand sub-populations that require specialized attention, and to assist in designing future interventions. Two aspects are essential in such analysis: i) discovery of differentiating patterns exhibited by sub-populations, and ii) characterization of the identified sub-populations. For the discovery phase, an array of approaches from the anomalous pattern detection literature has been employed to reveal differentiating patterns, especially to identify anomalous subgroups. However, these techniques are limited to describing the anomalous subgroups and offer little in the form of insightful characterization, thereby limiting the interpretability and understanding of these data-driven techniques in clinical practice. In this work, we propose an analysis of differentiated output (rather than discovery) and quantify anomalousness similarly to the counterfactual setting. To this end, we design an approach to perform post-discovery analysis of anomalous subsets: we first identify the features most important to the anomalousness of the subsets, and then, by perturbation, seek the smallest number of changes necessary to lose anomalousness. We present our approach, and the evaluation results on the 2019 MarketScan Commercial Claims and Medicare data show that extra insights can be obtained by extrapolated examination of the identified subgroups.

Isaiah Onando Mulang' · William Ogallo · Girmaw Abebe Tadesse · Aisha Walcott-Bryant
[ Visit Poster at Spot C3 in Virtual World ]

Differences in clinical outcomes and costs within and between healthcare sites are a result of varying patient populations. We aim to pragmatically leverage this population heterogeneity and identify opportunities for beneficial transfer of knowledge across healthcare sites. We propose an algorithmic approach that is robust to sampling variance and yields reliable and human-interpretable insights into knowledge transfer opportunities. Our experimental results, obtained with two intensive care monitoring datasets, demonstrate the potential utility of the proposed method in clinical practice.

Willa Potosnak · Sebastian Caldas Rivera · Gilles Clermont · Kyle Miller · Artur Dubrawski
[ Visit Poster at Spot C4 in Virtual World ]

For fluid resuscitation of critically ill to be effective, it must be well calibrated in terms of timing and dosages of treatments. Both under-resuscitation due to delayed or inadequate treatment and over-resuscitation can lead to unfavorable patient outcomes. In current practice, sufficiency of resuscitation is determined using primarily invasively measured vital signs, including Arterial Pressure and SvO2. These measurements may not be available in non-acute care settings and outside of hospitals, in particular in the field when treating subjects injured in traffic accidents or wounded in combat, where only non-invasive monitoring is available to drive care. We propose a Machine Learning (ML) approach to estimate the sufficiency of fluid resuscitation utilizing only non-invasively measured vital signs. We also aim at addressing another challenge known from literature: the impact of inter-patient diversity on the ability of ML models to generalize well to previously unseen subjects. The reference to a stable personal baseline, though an effective remedy for the inter-patient diversity, is usually not available for e.g. trauma patients rushed in for care and presenting in already acute states. We propose a novel framework to address those challenges. It uses only non-invasively measured vital signs to predict sufficiency of resuscitation, and compensates for the lack of personal baselines by leveraging reference data collected from previous patients. Through comprehensive evaluation on the physiological data collected in laboratory animal experiments, we demonstrate that the proposed approach can achieve competitive performance on new patients using only non-invasive measurements without access to their personal baselines. These characteristics enable effective monitoring of fluid resuscitation in real-world acute settings with limited monitoring resources, and can help facilitate broader adoption of ML in this important subfield of healthcare.

Xinyu Li · Michael Pinsky · Artur Dubrawski
[ Visit Poster at Spot C5 in Virtual World ]

Closed-loop neuromodulation provides a powerful paradigm for treating diseases, restoring function, and understanding the causal links between neural and behavioral processes. However, the complexities of interacting with the nervous system create several challenges for designing optimal closed-loop neuromodulation control systems and translating them into clinical settings. Artificial Intelligence (AI) and Reinforcement Learning (RL) can be leveraged to design intelligent closed-loop neuromodulation (iCLON) systems that can autonomously learn and adapt neuromodulation control policies in clinical settings and bridge the translational gap between pre-clinical design and clinical deployment of neuromodulation therapies. We are developing an open-source AI platform, called Neuroweaver, to enable algorithm-software-hardware co-design and deployment of translatable iCLON systems. In this paper, we present the design elements of the Neuroweaver platform that relate to the translatability of iCLON systems.

Parisa Sarikhani · Hao-Lun Hsu · Sean Kinzer · Hadi Esmaeilzadeh · Babak Mahmoudi
[ Visit Poster at Spot C6 in Virtual World ]

Sepsis is a leading cause of mortality and its treatment is very expensive. Sepsis treatment is also very challenging because there is no consensus on which interventions work best, and different patients respond very differently to the same treatment. Deep Reinforcement Learning methods can be used to derive optimal policies for treatment strategies mirroring physician actions. In the health care scenario, the available data is mostly collected offline with no interaction with the environment, which necessitates the use of offline RL techniques. However, the offline RL paradigm suffers from action distribution shifts, which in turn negatively affect learning an optimal policy for the treatment. In this work, we propose to use the Conservative Q-Learning (CQL) algorithm to mitigate this shift. Experimental results on the MIMIC-III dataset demonstrate that the learned policy is more similar to the physicians' policy than the policies learned from conventional deep Q-learning algorithms. The policy learned from the proposed CQL approach could help clinicians in Intensive Care Units to make better decisions while treating septic patients and improve the survival rate.
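To illustrate how Conservative Q-Learning counteracts action distribution shift, the sketch below computes a per-sample loss for discrete actions: a standard squared temporal-difference error plus a conservative penalty that pushes down Q-values of all actions (via log-sum-exp) while pushing up the Q-value of the action actually logged in the data. This follows the commonly cited discrete-action form of the CQL objective; the function names, the weight `alpha`, and the toy numbers are assumptions, and this is not the authors' implementation.

```python
import math


def logsumexp(values):
    """Numerically stable log(sum(exp(v))) over a list of floats."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def cql_loss(q_values, action, td_target, alpha=1.0):
    """Per-sample CQL loss for one state with discrete actions.

    q_values: Q(s, a) for every action a in this state.
    action: index of the action logged in the offline dataset.
    td_target: bootstrapped target r + gamma * max_a' Q_target(s', a').
    alpha: hypothetical weight of the conservative penalty.
    """
    q_taken = q_values[action]
    td_error = 0.5 * (q_taken - td_target) ** 2
    # Penalty is >= 0 and grows when unlogged actions have high Q-values.
    conservative_penalty = logsumexp(q_values) - q_taken
    return td_error + alpha * conservative_penalty


# Hypothetical example: the logged action is 0, but an action rarely seen
# in the data (index 2) has an optimistic Q-value and is penalized.
loss = cql_loss(q_values=[1.0, 0.5, 3.0], action=0, td_target=1.0, alpha=1.0)
print(round(loss, 4))
```

Setting `alpha` to zero recovers the plain temporal-difference loss, so the penalty term is what distinguishes this from conventional deep Q-learning.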

Pramod Kaushik · Raju Bapi

Author Information

Julia Vogt (ETH Zurich)
Ece Ozkan (ETH Zurich)
Sonali Parbhoo (Harvard University)
Melanie F. Pradier (Microsoft Research)
Patrick Schwab (GSK)
Shengpu Tang (University of Michigan)
Mario Wieser (Genedata AG)
Jiayu Yao (Harvard University)
