Leveraging Large Language Models to Enhance Machine-Learning-Driven HPC Job Scheduling
Abstract
High-performance computing (HPC) systems rely on job schedulers such as Slurm to allocate compute resources to submitted workloads. Recently, machine learning models have been used to predict job runtimes, which schedulers can use to optimize utilization. However, many of these models struggle to effectively encode string-type job features, typically relying on integer-based Label or One-hot encoding methods. In this paper, we use Transformer-based large language models, particularly Sentence-BERT (SBERT), to semantically encode job features for regression-based job runtime prediction. Using a 90,000-record, 169-feature Slurm dataset, we evaluate four SBERT variants and compare them against traditional encodings using four regression models. Our results show that SBERT-based encodings, especially those from the all-MiniLM-L6-v2 model, substantially outperform conventional methods, achieving an R² score of up to 0.88, which is 2.3× higher than the traditionally used Label encoding. Moreover, we highlight practical trade-offs, such as model memory size versus accuracy, to guide the selection of efficient encoders for production HPC systems.