Skip to yearly menu bar Skip to main content

Workshop: Math AI for Education (MATHAI4ED): Bridging the Gap Between Research and Smart Education

MathBERT: A Pre-trained Language Model for General NLP Tasks in Mathematics Education

Tracy Jia Shen · Michiharu Yamashita · Ethan Prihar · Neil Heffernan · Xintao Wu · Ben Graff · Dongwon Lee


Since the introduction of the original BERT (i.e., BASE BERT), researchers have developed various customized BERT models with improved performance for specific domains and tasks by exploiting the benefits of transfer learning. Due to the nature of mathematical texts, which often use domain specific vocabulary along with equations and math symbols, we posit that the development of a new BERT model for mathematics would be useful for many mathematical downstream tasks. In this paper, we introduce our multi-institutional effort (i.e., two learning platforms and three academic institutions in the US) toward this need: MathBERT, a model created by pre-training the BASE BERT model on a large mathematical corpus ranging from pre-kindergarten (pre-k), to high-school, to college graduate level mathematical content. In addition, we select three general NLP tasks that are often used in mathematics education: prediction of knowledge component, auto-grading open-ended Q&A, and knowledge tracing, to demonstrate the superiority of m over BASE BERT. Our experiments show that MathBERT outperforms prior best methods by 1.2-22% and BASE BERT by 2-8% on these tasks. In addition, we build a mathematics specific vocabulary mathVocab to train with MathBERT. We release MathBERT for public usage at: