

Poster in Workshop: New Frontiers of AI for Drug Discovery and Development

SALSA: Semantically-Aware Latent Space Autoencoder

Kathryn E. Kirchoff · Travis Maxfield · Alexander Tropsha · Shawn Gomez

Keywords: [ Contrastive Learning ] [ Molecular Data ] [ Drug Discovery ] [ Autoencoders ] [ Representation Learning ] [ Embedding Approaches ] [ Transformers ]


Abstract:

For molecular representations, SMILES strings are a popular choice, as they allow modern NLP methodologies, such as the sequence-to-sequence autoencoder, to be brought to bear. However, an autoencoder trained solely on SMILES is insufficient to learn semantically meaningful representations, i.e., representations that capture structural similarities between molecules. We define native chemical similarity using chemical graphs, which enables the use of a rigorous metric such as graph edit distance (GED). We demonstrate by example that a standard SMILES autoencoder may map structurally similar molecules to distant latent vectors, resulting in an incoherent latent space. To address this shortcoming, we propose the Semantically-Aware Latent Space Autoencoder (SALSA), a transformer autoencoder augmented with a contrastive objective that maps structurally similar molecules to nearby vectors in the latent space. We evaluate the semantic awareness of SALSA representations by comparing them to those of a naive autoencoder and to ECFP4, a molecular fingerprint commonly used in cheminformatics. We show empirically that SALSA learns a representation that maintains 1) structural awareness, 2) physicochemical property awareness, 3) biological property awareness, and 4) semantic continuity.
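To make the graph-based notion of similarity concrete, here is a minimal sketch of computing GED between two small molecular graphs. It assumes RDKit and NetworkX are available; the atom/bond labeling scheme is an illustrative choice, not necessarily the one used in the paper.

```python
# Sketch: graph edit distance between molecular graphs (assumes RDKit, NetworkX).
from rdkit import Chem
import networkx as nx

def mol_to_graph(smiles: str) -> nx.Graph:
    """Convert a SMILES string to a NetworkX graph with atom/bond labels."""
    mol = Chem.MolFromSmiles(smiles)
    g = nx.Graph()
    for atom in mol.GetAtoms():
        g.add_node(atom.GetIdx(), symbol=atom.GetSymbol())
    for bond in mol.GetBonds():
        g.add_edge(bond.GetBeginAtomIdx(), bond.GetEndAtomIdx(),
                   order=bond.GetBondTypeAsDouble())
    return g

g1 = mol_to_graph("c1ccccc1O")  # phenol
g2 = mol_to_graph("c1ccccc1N")  # aniline

# Exact GED is NP-hard; NetworkX searches edit paths, so keep inputs small.
ged = nx.graph_edit_distance(
    g1, g2,
    node_match=lambda a, b: a["symbol"] == b["symbol"],
    edge_match=lambda a, b: a["order"] == b["order"],
)
print(ged)  # 1.0: a single node substitution (O -> N)
```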
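The contrastive objective can also be sketched generically. The abstract does not spell out SALSA's exact loss, so the snippet below uses a standard NT-Xent-style formulation in PyTorch as a stand-in: each anchor molecule's latent vector is pulled toward that of a structurally similar analog (e.g., a small-GED neighbor), with the rest of the batch acting as negatives. In a SALSA-style model this term would presumably be combined with the usual sequence-to-sequence reconstruction loss.

```python
# Sketch: NT-Xent-style contrastive term over latent vectors (PyTorch assumed;
# SALSA's exact architecture and objective may differ).
import torch
import torch.nn.functional as F

def contrastive_loss(anchor_z: torch.Tensor,
                     positive_z: torch.Tensor,
                     temperature: float = 0.1) -> torch.Tensor:
    """anchor_z[i] and positive_z[i] encode an anchor molecule and a
    structurally similar analog; all other pairs in the batch are negatives."""
    a = F.normalize(anchor_z, dim=1)
    p = F.normalize(positive_z, dim=1)
    logits = a @ p.t() / temperature          # (B, B) cosine similarities
    targets = torch.arange(a.size(0), device=a.device)
    return F.cross_entropy(logits, targets)   # matched pairs on the diagonal

# Toy usage: 8 anchor/analog pairs of 64-dim latent vectors.
z_anchor = torch.randn(8, 64)
z_analog = z_anchor + 0.05 * torch.randn(8, 64)  # stand-in for an encoder
loss = contrastive_loss(z_anchor, z_analog)
```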
