Timezone: »

A Communication-Efficient Parallel Algorithm for Decision Tree
Qi Meng · Guolin Ke · Taifeng Wang · Wei Chen · Qiwei Ye · Zhi-Ming Ma · Tie-Yan Liu

Wed Dec 07 09:00 AM -- 12:30 PM (PST) @ Area 5+6+7+8 #24
Decision tree (and its extensions such as Gradient Boosting Decision Trees and Random Forest) is a widely used machine learning algorithm, due to its practical effectiveness and model interpretability. With the emergence of big data, there is an increasing need to parallelize the training process of decision tree. However, most existing attempts along this line suffer from high communication costs. In this paper, we propose a new algorithm, called \emph{Parallel Voting Decision Tree (PV-Tree)}, to tackle this challenge. After partitioning the training data onto a number of (e.g., $M$) machines, this algorithm performs both local voting and global voting in each iteration. For local voting, the top-$k$ attributes are selected from each machine according to its local data. Then, the indices of these top attributes are aggregated by a server, and the globally top-$2k$ attributes are determined by a majority voting among these local candidates. Finally, the full-grained histograms of the globally top-$2k$ attributes are collected from local machines in order to identify the best (most informative) attribute and its split point. PV-Tree can achieve a very low communication cost (independent of the total number of attributes) and thus can scale out very well. Furthermore, theoretical analysis shows that this algorithm can learn a near optimal decision tree, since it can find the best attribute with a large probability. Our experiments on real-world datasets show that PV-Tree significantly outperforms the existing parallel decision tree algorithms in the tradeoff between accuracy and efficiency.

Author Information

Qi Meng (Peking University)
Guolin Ke (Microsoft Research)
Taifeng Wang (Microsoft Research)

Taifeng Wang is a lead researcher in Machine Learning group, Microsoft Research Asia. His research interests include machine learning, distributed system, search ads click prediction, graph mining. many of his technologies have been transferred to Microsoft’s products and online services, such as Bing, Microsoft Advertising, and Azure. Currently, he is working on distributed machine learning, and leading Microsoft’s open source project DMTK (Microsoft Distributed Machine Learning Toolkit). He has published tens of papers at top conference and journals and served as the PC member of many premium conferences such as KDD, WWW, SIGIR, IJCAI, and WSDM. He has been tutorial speakers in WWW 2011, SIGIR 2012 and ACML2016, and he has organized a workshop on Deep learning in WSDM 2015.

Wei Chen (Microsoft Research)
Qiwei Ye (Microsoft Research)
Zhi-Ming Ma (Academy of Mathematics and Systems Science)
Tie-Yan Liu (Microsoft Research)

Tie-Yan Liu is an assistant managing director of Microsoft Research Asia, leading the machine learning research area. He is very well known for his pioneer work on learning to rank and computational advertising, and his recent research interests include deep learning, reinforcement learning, and distributed machine learning. Many of his technologies have been transferred to Microsoft’s products and online services (such as Bing, Microsoft Advertising, Windows, Xbox, and Azure), and open-sourced through Microsoft Cognitive Toolkit (CNTK), Microsoft Distributed Machine Learning Toolkit (DMTK), and Microsoft Graph Engine. He has also been actively contributing to academic communities. He is an adjunct/honorary professor at Carnegie Mellon University (CMU), University of Nottingham, and several other universities in China. He has published 200+ papers in refereed conferences and journals, with over 17000 citations. He has won quite a few awards, including the best student paper award at SIGIR (2008), the most cited paper award at Journal of Visual Communications and Image Representation (2004-2006), the research break-through award (2012) and research-team-of-the-year award (2017) at Microsoft Research, and Top-10 Springer Computer Science books by Chinese authors (2015), and the most cited Chinese researcher by Elsevier (2017). He has been invited to serve as general chair, program committee chair, local chair, or area chair for a dozen of top conferences including SIGIR, WWW, KDD, ICML, NIPS, IJCAI, AAAI, ACL, ICTIR, as well as associate editor of ACM Transactions on Information Systems, ACM Transactions on the Web, and Neurocomputing. Tie-Yan Liu is a fellow of the IEEE, and a distinguished member of the ACM.

More from the Same Authors