Artificial Intelligence for Spatial Transcriptomics: A Scoping Review of Architectures and Models
Abstract
Spatial transcriptomics maps the transcriptome onto tissue sections, producing multimodal data that pairs histology images with spatially resolved gene expression. Learning meaningful representations from these data is a fundamental challenge for biomedical artificial intelligence. This paper presents a scoping review of representation learning in spatial omics, critically comparing more than 40 deep learning frameworks. We group these models by the core tasks their learned representations are designed to solve: cell type deconvolution, spatial domain identification, gene expression imputation, 3D tissue reconstruction, and cell-cell interaction simulation. Special attention is given to the dominant architectures for representation learning in this domain, including graph neural networks, contrastive learning, and multimodal fusion methods. We evaluate representative models such as ADCL, CellMirror, and MuST with respect to the scalability, interpretability, and biological impact of their learned embeddings. The review also addresses common challenges that hinder representation learning, including spatial noise, modality imbalance, and low-resolution data. Finally, we outline future directions centered on building foundation models for spatial biology and improving 3D alignment. This review provides a critical guide for researchers developing foundational and task-specific representations from multimodal spatial data.