Poster in Workshop: AI for New Drug Modalities
Probing the Embedding Space of Protein Foundation Models through Intrinsic Dimension Analysis
Soojung Yang · Juno Nam · Tynan Perez · Jinyeop Song · Xiaochen Du · Rafael Gomez-Bombarelli
Abstract:
Protein foundation models produce embeddings that are valuable for various downstream tasks, yet the structure and information content of these embeddings remain poorly understood, particularly in relation to diverse pre-training tasks and input modalities. We apply intrinsic dimension ($I_d$) analysis to quantify the complexity of protein embeddings from several widely used models, including ESM-2, ESM-IF, ProstT5, and ProteinMPNN. We also employ $I_d$ correlation ($I_d$Cor) to measure the information shared between different embeddings. Our results reveal a universality in protein embeddings, with similar $I_d$ scales across models and strong correlations between protein and residue embeddings. We observe significant redundancy, with $I_d$ values much smaller than the original embedding dimensions. We also show that models capture both spatial and sequential long-range correlations, with correlation decay rates that differ by input modality and pre-training task. Lastly, we analyze mutant embeddings, revealing that mutations cluster effectively by site and that fine-tuning further reduces the $I_d$ to capture task-specific representations.
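To make the notion of intrinsic dimension concrete, the sketch below estimates it from a matrix of embedding vectors. The abstract does not say which estimator the authors use; this example assumes the TwoNN maximum-likelihood estimator, which relies only on the ratio of each point's second to first nearest-neighbor distance. The function name `two_nn_intrinsic_dimension` and the synthetic data are illustrative, not taken from the poster.

```python
import numpy as np


def two_nn_intrinsic_dimension(x):
    """TwoNN maximum-likelihood estimate of intrinsic dimension.

    x: array of shape (n_points, n_features), e.g. one embedding per row.
    Assumption: this estimator is chosen for illustration only.
    """
    x = np.asarray(x, dtype=float)

    # Pairwise squared Euclidean distances via the Gram matrix.
    sq_norms = np.sum(x ** 2, axis=1)
    d2 = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (x @ x.T)
    np.maximum(d2, 0.0, out=d2)      # clip tiny negatives from round-off
    np.fill_diagonal(d2, np.inf)     # exclude each point's distance to itself

    # Distances to the first and second nearest neighbors of every point.
    nearest = np.sqrt(np.partition(d2, 1, axis=1)[:, :2])
    r1, r2 = nearest[:, 0], nearest[:, 1]

    # Drop exact duplicates, for which the ratio r2 / r1 is undefined.
    keep = r1 > 0
    mu = r2[keep] / r1[keep]

    # Maximum-likelihood estimate: d = N / sum(log(mu)).
    return keep.sum() / np.sum(np.log(mu))


# Synthetic check: a 2-dimensional plane embedded linearly in 50 dimensions.
rng = np.random.default_rng(0)
latent = rng.uniform(size=(1500, 2))
embedded = latent @ rng.normal(size=(2, 50))
print(two_nn_intrinsic_dimension(embedded))
```

On data like this the estimate should land near 2 even though each vector has 50 coordinates, which is the kind of gap between intrinsic and ambient dimension that the abstract describes as redundancy.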