
Workshop: Gaussian Processes, Spatiotemporal Modeling, and Decision-making Systems

Gaussian Process Thompson sampling for Bayesian optimization of dynamic masking-based language model pre-training

Iñigo Urteaga · Moulay Zaidane Draidia · Tomer Lancewicki · Shahram Khadivi


We design and evaluate a Thompson sampling-based Bayesian optimization algorithm that uses a Gaussian process reward model of the Masked Language Model (MLM) pre-training objective to minimize that objective sequentially. Transformer-based language model (TLM) pre-training requires large volumes of data and substantial computational resources, and it leaves many design choices unresolved, such as the selection of pre-training hyperparameters. Here we fit TLM pre-training validation losses with a Gaussian process and formulate a Thompson sampling bandit policy that maximizes the cumulative rewards it attains sequentially. Instead of MLM pre-training with fixed masking probabilities, the proposed Gaussian process-based Thompson sampling (GP-TS) accelerates and improves MLM pre-training by sequentially selecting the masking hyperparameters of the language model. GP-TS provides a fast and efficient framework for pre-training TLMs: it attains better MLM pre-training loss in fewer epochs and avoids costly hyperparameter selection techniques.
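The loop described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the masking probability is treated as a single scalar arm on a fixed grid, the reward is taken to be the negative validation loss, the kernel is a squared-exponential with hand-picked hyperparameters, and `toy_validation_loss` is a hypothetical stand-in for actually running pre-training epochs. All function names are invented for the example.

```python
import numpy as np


def rbf_kernel(a, b, length_scale=0.1, variance=1.0):
    """Squared-exponential kernel between two 1-D arrays of points."""
    diff = a[:, None] - b[None, :]
    return variance * np.exp(-0.5 * (diff / length_scale) ** 2)


def gp_posterior(x_obs, y_obs, x_cand, noise=1e-2):
    """Posterior mean and covariance of a zero-mean GP at the candidates."""
    k_cc = rbf_kernel(x_cand, x_cand)
    if len(x_obs) == 0:
        # No data yet: the posterior is just the prior.
        return np.zeros(len(x_cand)), k_cc
    k_oo = rbf_kernel(x_obs, x_obs) + noise * np.eye(len(x_obs))
    k_oc = rbf_kernel(x_obs, x_cand)
    chol = np.linalg.cholesky(k_oo)
    alpha = np.linalg.solve(chol.T, np.linalg.solve(chol, y_obs))
    v = np.linalg.solve(chol, k_oc)
    return k_oc.T @ alpha, k_cc - v.T @ v


def thompson_select(x_obs, y_obs, x_cand, rng):
    """Draw one function from the GP posterior and return its argmax index."""
    if len(y_obs) > 0:
        # Standardize rewards so the zero-mean, unit-variance prior is sensible.
        y_obs = (y_obs - y_obs.mean()) / (y_obs.std() + 1e-8)
    mean, cov = gp_posterior(x_obs, y_obs, x_cand)
    cov = cov + 1e-8 * np.eye(len(x_cand))  # jitter for numerical stability
    sample = rng.multivariate_normal(mean, cov, check_valid="ignore")
    return int(np.argmax(sample))


def toy_validation_loss(p, rng):
    """Hypothetical loss curve (assumption): best masking probability is 0.3."""
    return 2.0 + 20.0 * (p - 0.3) ** 2 + 0.05 * rng.normal()


def run_gp_ts(n_rounds=30, seed=0):
    """Sequentially pick masking probabilities with GP Thompson sampling."""
    rng = np.random.default_rng(seed)
    candidates = np.linspace(0.05, 0.5, 19)  # assumed grid of arms
    xs, rewards = [], []
    for _ in range(n_rounds):
        idx = thompson_select(np.array(xs), np.array(rewards), candidates, rng)
        p = candidates[idx]
        loss = toy_validation_loss(p, rng)  # stands in for pre-training epochs
        xs.append(p)
        rewards.append(-loss)  # assumed reward: lower loss is higher reward
    return np.array(xs), np.array(rewards)


if __name__ == "__main__":
    chosen, rewards = run_gp_ts()
    print("last five masking probabilities:", np.round(chosen[-5:], 3))
```

Sampling a whole function from the posterior and acting greedily on that sample is what distinguishes Thompson sampling from acquisition rules such as expected improvement: exploration comes from the posterior uncertainty itself rather than from a separate exploration bonus.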
