
LightGBM: A Highly Efficient Gradient Boosting Decision Tree
Guolin Ke · Qi Meng · Thomas Finley · Taifeng Wang · Wei Chen · Weidong Ma · Qiwei Ye · Tie-Yan Liu

Mon Dec 04 06:30 PM -- 10:30 PM (PST) @ Pacific Ballroom #31

Gradient Boosting Decision Tree (GBDT) is a popular machine learning algorithm with quite a few effective implementations, such as XGBoost and pGBRT. Although many engineering optimizations have been adopted in these implementations, their efficiency and scalability are still unsatisfactory when the feature dimension is high and the data size is large. A major reason is that, for each feature, they need to scan all the data instances to estimate the information gain of all possible split points, which is very time consuming. To tackle this problem, we propose two novel techniques: Gradient-based One-Side Sampling (GOSS) and Exclusive Feature Bundling (EFB). With GOSS, we exclude a significant proportion of data instances with small gradients and use only the rest to estimate the information gain. We prove that, since the data instances with larger gradients play a more important role in the computation of information gain, GOSS can obtain a quite accurate estimate of the information gain with a much smaller data size. With EFB, we bundle mutually exclusive features (i.e., features that rarely take nonzero values simultaneously) to reduce the number of features. We prove that finding the optimal bundling of exclusive features is NP-hard, but that a greedy algorithm can achieve a quite good approximation ratio (and thus can effectively reduce the number of features without hurting the accuracy of split point determination by much). We call our new GBDT implementation with GOSS and EFB LightGBM. Our experiments on multiple public datasets show that LightGBM speeds up the training process of conventional GBDT by up to over 20 times while achieving almost the same accuracy.
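The GOSS idea in the abstract can be illustrated with a minimal sketch. The abstract does not spell out the sampling procedure, so the details here are assumptions based on the scheme in the full paper: keep the instances with the largest absolute gradients, draw a random sample from the remainder, and up-weight that sample by (1 - a) / b so the gain estimate stays roughly unbiased. The function name `goss_sample` and the parameters `top_rate`, `other_rate`, and `seed` are hypothetical and chosen for illustration only; this is not the LightGBM implementation.

```python
import random


def goss_sample(gradients, top_rate=0.2, other_rate=0.1, seed=0):
    """Sketch of one-side sampling: return (indices, weights).

    Assumption: top_rate (a) and other_rate (b) are fractions of the
    full data set, with b > 0 and a + b <= 1.
    """
    if not (0.0 <= top_rate and 0.0 < other_rate and top_rate + other_rate <= 1.0):
        raise ValueError("need 0 <= top_rate, 0 < other_rate, top_rate + other_rate <= 1")

    n = len(gradients)
    n_top = int(top_rate * n)
    n_other = int(other_rate * n)

    # Rank instances by absolute gradient, largest first.
    order = sorted(range(n), key=lambda i: abs(gradients[i]), reverse=True)
    top_idx = order[:n_top]    # always kept
    rest_idx = order[n_top:]   # small-gradient instances

    # Randomly keep a subset of the small-gradient instances.
    rng = random.Random(seed)
    sampled_idx = rng.sample(rest_idx, min(n_other, len(rest_idx)))

    # Amplify the sampled small-gradient instances to compensate
    # for the ones that were dropped.
    amplify = (1.0 - top_rate) / other_rate
    indices = top_idx + sampled_idx
    weights = [1.0] * len(top_idx) + [amplify] * len(sampled_idx)
    return indices, weights


if __name__ == "__main__":
    grads = [0.1 * i for i in range(100)]
    idx, w = goss_sample(grads, top_rate=0.2, other_rate=0.1, seed=1)
    print(len(idx), "of", len(grads), "instances kept")
```

In this sketch, split gains would then be estimated on the returned subset only, with each instance's gradient statistics multiplied by its weight.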

Author Information

Guolin Ke (Microsoft Research)
Qi Meng (Peking University)
Thomas Finley
Taifeng Wang (Microsoft Research)

Taifeng Wang is a lead researcher in the Machine Learning group at Microsoft Research Asia. His research interests include machine learning, distributed systems, search ads click prediction, and graph mining. Many of his technologies have been transferred to Microsoft’s products and online services, such as Bing, Microsoft Advertising, and Azure. Currently, he is working on distributed machine learning and leading Microsoft’s open source project DMTK (Microsoft Distributed Machine Learning Toolkit). He has published tens of papers at top conferences and journals and has served as a PC member of many premier conferences such as KDD, WWW, SIGIR, IJCAI, and WSDM. He has been a tutorial speaker at WWW 2011, SIGIR 2012, and ACML 2016, and he organized a workshop on deep learning at WSDM 2015.

Wei Chen (Microsoft Research)
Weidong Ma (Microsoft Research)
Qiwei Ye (Microsoft Research)
Tie-Yan Liu (Microsoft Research)

More from the Same Authors