Group SELFIES: A Robust Fragment-Based Molecular String Representation
Austin Cheng · Andy Cai · Santiago Miret · Gustavo Malkomes · Mariano Phielipp · Alan Aspuru-Guzik
Event URL: https://openreview.net/forum?id=n8hGHUfZ3Sy

The design of functional molecules relies on the representation used: a flexible and informative representation can improve downstream generation tasks. String representations such as SMILES and SELFIES serve as the basis for chemical language models, and the robustness of SELFIES makes it naturally suited for molecular optimization with genetic algorithms. But while SMILES and SELFIES are atomic representations, several recent approaches take advantage of the inductive bias of molecular fragments. In this work, we present Group SELFIES, introducing group tokens that represent functional groups or entire substructures while maintaining robustness. Group tokens give control over which structures should be preserved during optimization. Experiments indicate that Group SELFIES improves both distribution learning and the quality of molecules generated by simply sampling random Group SELFIES strings. The code is available at https://anonymous.4open.science/r/group-selfies-4D87/.

Author Information

Austin Cheng (University of Toronto)
Andy Cai (University of Toronto)
Santiago Miret (Intel AI Lab)
Gustavo Malkomes (Intel)
Mariano Phielipp (Intel AI Labs)

Dr. Mariano Phielipp works at the Intel AI Lab inside the Intel Artificial Intelligence Products Group. His work includes research and development in deep learning, deep reinforcement learning, machine learning, and artificial intelligence. Since joining Intel, Dr. Phielipp has worked on computer vision, face recognition, face detection, object categorization, recommendation systems, online learning, automatic rule learning, natural language processing, knowledge representation, energy-based algorithms, and other machine learning and AI-related efforts. Dr. Phielipp has also contributed to different disclosure committees, won an Intel division award related to robotics, and has a large number of patents and pending patents. He has published at NeurIPS, ICML, ICLR, AAAI, IROS, IEEE, SPIE, IASTED, and EUROGRAPHICS-IEEE conferences and workshops.

Alan Aspuru-Guzik (University of Toronto)

More from the Same Authors