Learning Representations from Incomplete EHR Data with Dual-Masked Autoencoding
Abstract
Learning from electronic health record (EHR) time series is challenging due to irregular sampling, heterogeneous missingness patterns, and the signal encoded in missingness itself. Prior self-supervised methods either impute data before learning or use imputation as the training objective, which limits their capacity to learn robust representations for downstream clinical tasks. We propose the Augmented-Intrinsic Dual-Masked Autoencoder (AID-MAE), which learns directly from incomplete EHR tables by combining an intrinsic mask, which marks naturally missing values, with an augmented mask, which hides a subset of observed values as reconstruction targets during training. AID-MAE consistently outperforms strong baselines, including XGBoost and DuETT, on mortality and length-of-stay prediction with MIMIC-IV, while learning embeddings that naturally stratify patient cohorts.
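To make the dual-masking idea concrete, the minimal sketch below illustrates one way an intrinsic mask and an augmented mask could be combined so that reconstruction is scored only on observed values hidden during training. It is an illustrative assumption rather than the paper's implementation: the `model` signature, the NaN encoding of missing entries, and the `aug_ratio` value are all hypothetical.

```python
# Hypothetical sketch of dual masking (not the paper's code): `model`, the NaN
# convention for missing values, and `aug_ratio` are illustrative assumptions.
import torch

def dual_mask_loss(x, model, aug_ratio=0.15):
    """x: (batch, time, features) tensor with NaN marking naturally missing entries."""
    intrinsic_mask = ~torch.isnan(x)                      # True where a value was actually observed
    # Augmented mask: randomly hide a subset of the *observed* entries.
    aug_mask = (torch.rand_like(torch.nan_to_num(x)) < aug_ratio) & intrinsic_mask
    # Zero-fill both naturally missing and augmented-masked positions before encoding.
    x_in = torch.nan_to_num(x, nan=0.0).masked_fill(aug_mask, 0.0)
    # The model also receives the visibility mask (observed and not hidden).
    x_hat = model(x_in, intrinsic_mask & ~aug_mask)
    # Reconstruction loss only on entries that were observed but hidden by the augmented mask.
    return ((x_hat - torch.nan_to_num(x, nan=0.0)) ** 2)[aug_mask].mean()
```

In this sketch, the loss never penalizes positions that were intrinsically missing, so the objective stays grounded in real observations while the visibility mask preserves the missingness pattern as input signal.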