Poster in Workshop: AI for New Drug Modalities

Probing the Embedding Space of Protein Foundation Models through Intrinsic Dimension Analysis

Soojung Yang · Juno Nam · Tynan Perez · Jinyeop Song · Xiaochen Du · Rafael Gomez-Bombarelli


Abstract: Protein foundation models produce embeddings that are valuable for a range of downstream tasks, yet the structure and information content of these embeddings remain poorly understood, particularly in relation to the models' pre-training tasks and input modalities. We apply intrinsic dimension (Id) analysis to quantify the complexity of protein embeddings from several widely used models, including ESM-2, ESM-IF, ProstT5, and ProteinMPNN. We also use Id correlation (IdCor) to measure the information shared between different embeddings. Our results reveal a universality in protein embeddings: Id scales are similar across models, and protein and residue embeddings are strongly correlated. We observe significant redundancy, with Id values much smaller than the original embedding dimensions. We also show that the models capture both spatial and sequential long-range correlations, with correlation decay rates that differ by input modality and pre-training task. Lastly, we analyze mutant embeddings and find that mutations cluster effectively by site and that fine-tuning further reduces the Id to capture task-specific representations.
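
To make the redundancy claim concrete, the sketch below estimates the intrinsic dimension of a point cloud whose ambient dimension is much larger than its true dimension. It assumes the TwoNN estimator (the ratio of second to first nearest-neighbor distances, fitted by maximum likelihood); the abstract does not say which Id estimator the authors use, so this is an illustrative stand-in rather than their method. The names `two_nn_id` and `demo_embeddings` are hypothetical, and the synthetic data stands in for real model embeddings.

```python
import numpy as np


def two_nn_id(X):
    """Estimate intrinsic dimension with the TwoNN maximum-likelihood formula.

    Assumption: TwoNN is used here only as an example estimator; it is not
    confirmed to be the one used in the poster.
    """
    X = np.asarray(X, dtype=float)
    # Pairwise squared Euclidean distances via the Gram matrix.
    sq_norms = np.sum(X ** 2, axis=1)
    d2 = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (X @ X.T)
    np.maximum(d2, 0.0, out=d2)       # clip tiny negatives from round-off
    np.fill_diagonal(d2, np.inf)      # exclude each point's distance to itself
    # Two smallest squared distances per row: first and second neighbors.
    nearest = np.partition(d2, 1, axis=1)[:, :2]
    r1 = np.sqrt(nearest[:, 0])
    r2 = np.sqrt(nearest[:, 1])
    keep = r1 > 0                     # drop exact duplicates
    mu = r2[keep] / r1[keep]
    # Under the TwoNN model mu is Pareto-distributed with exponent d,
    # so the maximum-likelihood estimate is N / sum(log mu).
    return keep.sum() / np.sum(np.log(mu))


def demo_embeddings(n_points, latent_dim, ambient_dim, seed=0):
    """Hypothetical stand-in for embeddings: a low-dimensional uniform cloud
    mapped linearly into a higher-dimensional space."""
    rng = np.random.default_rng(seed)
    latent = rng.uniform(size=(n_points, latent_dim))
    projection = rng.normal(size=(latent_dim, ambient_dim))
    return latent @ projection


if __name__ == "__main__":
    X = demo_embeddings(n_points=1000, latent_dim=2, ambient_dim=64)
    print(f"ambient dimension: {X.shape[1]}, estimated Id: {two_nn_id(X):.2f}")
```

On this synthetic cloud the estimate should land near the latent dimension of 2 rather than the ambient dimension of 64, which is the same kind of gap the abstract reports between Id values and embedding dimensions. The dense distance matrix limits this sketch to a few thousand points; a nearest-neighbor index would be needed for larger embedding sets.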
