Multi-modal Self-supervised Pre-training for Large-scale Genome Data
Shentong Mo · Xi Fu · Chenyang Hong · Yizhen Chen · Yuxuan Zheng · Xiangru Tang · Yanyan Lan · Zhiqiang Shen · Eric Xing
Event URL: https://openreview.net/forum?id=fdV-GZ4LPfn

Open genomic regions, being accessible to regulatory proteins, can act as on/off switches or amplifiers/attenuators of gene expression, and thus reflect the defining characteristics of cell types. Many previous models predict regulatory regions from sequence, but the interactions between regulatory regions and genes can be complex and differ across cell types. Moreover, current models usually perform well only on the cell types in the training set and do not generalize to data-scarce scenarios. In this work, we propose a simple yet effective approach for pre-training on genome data in a multi-modal and self-supervised manner, which we call GeneBERT. Specifically, we simultaneously take the 1D sequence of genome data and a 2D matrix of (transcription factors × regions) as input, and propose three pre-training tasks to improve the robustness and generalizability of our model. We pre-train our model on an ATAC-seq dataset with 17 million gene sequences. We evaluate GeneBERT on various downstream tasks, including promoter prediction, transcription factor binding site prediction, disease risk estimation, and RNA splicing. Extensive experiments demonstrate the effectiveness of multi-modal and self-supervised pre-training for large-scale genome data.
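To make the two-modality input concrete, here is a minimal PyTorch sketch of an encoder that consumes a tokenized 1D genome sequence alongside a 2D (transcription factors × regions) matrix and fuses the two representations. This is not the authors' implementation; all class names, dimensions, and the fusion scheme are hypothetical illustrations of the input structure the abstract describes.

```python
# Hypothetical sketch of GeneBERT-style multi-modal input handling;
# not the authors' code. Shapes and module choices are illustrative.
import torch
import torch.nn as nn

class MultiModalGenomeEncoder(nn.Module):
    def __init__(self, vocab_size=6, d_model=128, n_tfs=64, n_regions=32):
        super().__init__()
        # 1D branch: embed sequence tokens, encode with a small Transformer
        self.seq_embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.seq_encoder = nn.TransformerEncoder(layer, num_layers=2)
        # 2D branch: project each region's TF-accessibility column to d_model
        self.region_proj = nn.Linear(n_tfs, d_model)
        # Fuse the pooled sequence and region representations
        self.fuse = nn.Linear(2 * d_model, d_model)

    def forward(self, seq_tokens, tf_region_matrix):
        # seq_tokens: (batch, seq_len) integer-encoded bases
        # tf_region_matrix: (batch, n_tfs, n_regions) accessibility scores
        seq_repr = self.seq_encoder(self.seq_embed(seq_tokens)).mean(dim=1)
        region_repr = self.region_proj(tf_region_matrix.transpose(1, 2)).mean(dim=1)
        return self.fuse(torch.cat([seq_repr, region_repr], dim=-1))

# Toy usage: a batch of 2 sequences with matching TF x region matrices.
model = MultiModalGenomeEncoder()
seq = torch.randint(0, 6, (2, 100))
mat = torch.randn(2, 64, 32)
print(model(seq, mat).shape)  # torch.Size([2, 128])
```

The fused embedding would then feed the paper's three self-supervised pre-training objectives, whose exact form is not specified in this abstract.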

Author Information

Shentong Mo (CMU)
Xi Fu (Columbia University)
Chenyang Hong (The Chinese University of Hong Kong)
Yizhen Chen (The Chinese University of Hong Kong)
Yuxuan Zheng (East China Normal University)
Xiangru Tang (Yale University)
Yanyan Lan (Tsinghua University)
Zhiqiang Shen (CMU)
Eric Xing (Petuum Inc. / Carnegie Mellon University)
