EmbedSimScore: Advancing Protein Similarity Analysis with Structural and Contextual Embeddings
Abstract
Accurately computing protein similarity is challenging because of the intricate interplay between local substructures and the global structure of a protein. Traditional metrics such as TM-score focus on algorithmically aligning the global structures of two proteins and can overlook critical local-global relations and contextual comparisons. We introduce EmbedSimScore, a novel self-supervised method that generates structural and contextual embeddings by jointly considering the local substructures and the global structures of proteins. Using contrastive language-structure pre-training (CLSP) and structural contrastive learning, EmbedSimScore captures features across different scales of protein structure. These embeddings provide a more precise and holistic means of computing protein similarity and reveal intrinsic relations among proteins that traditional approaches overlook.
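As an illustration of how a similarity score can be read off from embeddings, the following minimal sketch compares fixed-length protein embedding vectors with cosine similarity. The abstract does not specify the scoring function, so cosine similarity is an assumption here, and the vectors `protein_a`, `protein_b`, and `protein_c` are hypothetical placeholders rather than outputs of EmbedSimScore.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length embedding vectors.

    Assumption: cosine similarity is used only as an example scoring
    function; the source does not state which one EmbedSimScore uses.
    """
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Hypothetical 4-dimensional embeddings; real ones would come from a
# trained encoder and would typically have many more dimensions.
protein_a = [0.2, 0.7, -0.1, 0.4]
protein_b = [0.1, 0.8, -0.2, 0.3]
protein_c = [-0.6, 0.1, 0.9, -0.2]

print("sim(a, b) =", cosine_similarity(protein_a, protein_b))
print("sim(a, c) =", cosine_similarity(protein_a, protein_c))
```

In this toy example, `protein_a` scores higher against `protein_b` than against `protein_c` because the first two vectors point in nearly the same direction. Unlike an alignment-based metric, a score of this kind depends entirely on what the encoder has learned to place close together in embedding space.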