

Poster in Workshop: AI for Accelerated Materials Design (AI4Mat-2023)

Tokenizer Effect on Functional Material Prediction: Investigating Contextual Word Embeddings for Knowledge Discovery

Tong Xie · Yuwei Wan · Ke Lu · Wenjie Zhang · Chunyu Kit · Bram Hoex

Keywords: [ Natural Language Processing ] [ Large Language Models ] [ Contextual Embeddings ] [ Materials Science ]


Abstract:

Exploring the predictive capabilities of natural language processing models in materials science is a subject of ongoing interest. This study examines material property prediction, in which models are relied on to extract latent knowledge from compound names and material properties. We assessed various methods for deriving contextual embeddings and explored pre-trained models such as BERT and GPT. Our findings indicate that using information-dense embeddings from the third layer of domain-specific BERT models, such as MatBERT, combined with the context-average method, is the optimal approach for utilizing unsupervised word embeddings from materials science literature to identify material-property relationships. The stark contrast between the domain-specific MatBERT and the general BERT model emphasizes the value of domain-specific training and tokenization for material prediction. Our research identifies a "tokenizer effect," highlighting the importance of specialized tokenization techniques that capture material names effectively during the pretraining phase. We found that a tokenizer that preserves compound names in their entirety, while maintaining a consistent token count, enhances the efficacy of context-aware embeddings in functional material prediction.
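
To make the embedding pipeline concrete, the sketch below (not the authors' code) shows how layer-3 contextual embeddings for a compound name can be extracted from a Hugging Face BERT-family model and context-averaged over the sentences in which the compound appears, and how a general-domain WordPiece vocabulary fragments a compound name. The bert-base-uncased checkpoint, the example sentences, and the compound_embedding helper are illustrative assumptions; a MatBERT checkpoint would be substituted in the domain-specific setting.

# Minimal sketch, assuming a Hugging Face BERT-family checkpoint; the model name,
# example sentences, and helper below are illustrative, not the paper's exact setup.
import torch
from transformers import AutoModel, AutoTokenizer

MODEL_NAME = "bert-base-uncased"  # swap in a domain-specific MatBERT checkpoint (path assumed)

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME, output_hidden_states=True)
model.eval()

def compound_embedding(sentences, compound, layer=3):
    """Context-average the layer-`layer` embeddings of `compound` over `sentences`."""
    vectors = []
    comp_ids = tokenizer(compound, add_special_tokens=False)["input_ids"]
    for sent in sentences:
        enc = tokenizer(sent, return_tensors="pt", truncation=True)
        with torch.no_grad():
            hidden = model(**enc).hidden_states[layer]  # shape: (1, seq_len, hidden_dim)
        ids = enc["input_ids"][0].tolist()
        # Locate the compound's sub-token span in this sentence.
        for i in range(len(ids) - len(comp_ids) + 1):
            if ids[i:i + len(comp_ids)] == comp_ids:
                # Average over the compound's sub-tokens in this context.
                vectors.append(hidden[0, i:i + len(comp_ids)].mean(dim=0))
                break
    # Context-average: mean over all contexts in which the compound occurs.
    return torch.stack(vectors).mean(dim=0) if vectors else None

# Tokenizer effect: a general-domain WordPiece vocabulary splits compound names
# into several sub-tokens, e.g. ['life', '##po', '##4'] (exact split varies by vocabulary).
print(tokenizer.tokenize("LiFePO4"))

emb = compound_embedding(
    ["LiFePO4 is a widely studied cathode material.",
     "The thermal stability of LiFePO4 enables safer batteries."],
    "LiFePO4",
)

Averaging the compound's sub-token vectors within each sentence and then averaging across sentences corresponds to the context-average strategy described in the abstract; a domain-specific tokenizer that keeps the compound name as a single token removes the inner averaging step and leaves the token count of the surrounding context unchanged.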
