Few-shot Task-agnostic Neural Architecture Search for Distilling Large Language Models »
Traditional knowledge distillation (KD) methods manually design student architectures to compress large models to a pre-specified computational cost. This requires several trials to find viable students, and the process must be repeated whenever the computational budget changes. We use Neural Architecture Search (NAS) to automatically distill several compressed students with variable cost from a large model. Existing NAS methods train a single SuperLM consisting of millions of subnetworks with weight sharing, resulting in interference between subnetworks of different sizes. Additionally, many of these works are task-specific, requiring task labels for SuperLM training. Our framework AutoDistil addresses these challenges with the following steps: (a) it incorporates inductive bias and heuristics to partition the Transformer search space into K compact sub-spaces (e.g., K=3 can generate typical student sizes of base, small, and tiny); (b) it trains one SuperLM for each sub-space using a task-agnostic objective (e.g., self-attention distillation) with weight sharing among students; (c) it performs a lightweight search for the optimal student without re-training. Task-agnostic training and search allow the students to be reused for fine-tuning on any downstream task. Experiments on the GLUE benchmark demonstrate that AutoDistil outperforms state-of-the-art KD and NAS methods with up to 3x reduction in computational cost and negligible loss in task performance. Code and model checkpoints are available at https://github.com/microsoft/autodistil.
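The three-step framework in the abstract (partition the search space into K compact sub-spaces, train per-sub-space SuperLMs, then run a lightweight search under a cost budget) can be sketched as a toy pipeline. This is a hypothetical illustration, not the released code: the architecture dimensions, the `12 * layers * hidden^2` parameter-count proxy, and the use of parameter count as the proxy score are assumptions made purely for the example.

```python
from itertools import product

def build_space(layer_opts, hidden_opts, head_opts):
    """Enumerate candidate student architectures over Transformer dimensions."""
    return [dict(layers=l, hidden=h, heads=a)
            for l, h, a in product(layer_opts, hidden_opts, head_opts)]

def params(arch):
    """Rough parameter-count proxy for a Transformer:
    attention + FFN weights scale as ~12 * layers * hidden^2."""
    return 12 * arch["layers"] * arch["hidden"] ** 2

def partition(space, k=3):
    """Split the space into K compact sub-spaces by model size
    (e.g., k=3 yields tiny / small / base groups)."""
    ranked = sorted(space, key=params)
    step = len(ranked) // k
    return [ranked[i * step:(i + 1) * step if i < k - 1 else len(ranked)]
            for i in range(k)]

def search(subspace, budget, proxy_score):
    """Lightweight search: among students within the cost budget,
    return the one with the best proxy score (no re-training)."""
    feasible = [a for a in subspace if params(a) <= budget]
    return max(feasible, key=proxy_score) if feasible else None

# Example: 18 candidates, partitioned into 3 sub-spaces, then searched
# under an illustrative 11M-parameter budget.
space = build_space([4, 6, 12], [256, 384, 768], [4, 12])
tiny, small, base = partition(space, k=3)
best = search(space, budget=11_000_000, proxy_score=params)
```

In the actual framework, each sub-space would be trained as one weight-shared SuperLM with a task-agnostic self-attention distillation objective, and the proxy score would come from evaluating sub-networks extracted from that SuperLM rather than from a size heuristic.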
Author Information
Dongkuan (DK) Xu (North Carolina State University)

Dongkuan (DK) Xu is an Assistant Professor in the CS Department at North Carolina State University, where he leads the NCSU Efficient & Intelligent Computing Lab. His research interest is resource-efficient deep learning for AI at scale: investigating how to improve the efficiency of deep learning systems to achieve Pareto optimality between computing resources and model performance. DK's research has been published in top conferences and journals in AI, NLP, and other fields. He serves as the Column Editor for the ACM SIGAI Newsletter and will chair The First Workshop on DL-Hardware Co-Design for AI Acceleration at AAAI 2023. He served as session chair for the New Deep Learning Architectures and Scalable & Trustable AI sessions at KDD 2022, and has been a (senior) PC member and regular reviewer for over 30 major conferences and 15 journals. In addition, he launched the Machine Learning Algorithms & Natural Language Processing community, which has over 500,000 followers worldwide. DK also has extensive research experience in industry: he has interned at Microsoft Research Redmond, Moffett AI, and NEC Labs America, and holds 8 US patents/applications. DK's long-term research goal is to democratize AI to serve a broader range of populations and domains.
Subhabrata Mukherjee (Microsoft)
Xiaodong Liu (Microsoft)
Debadeepta Dey (Microsoft Research)
I am a researcher in the Adaptive Systems and Interaction (ASI) group led by Dr. Eric Horvitz at Microsoft Research, Redmond, USA. I finished my PhD at the Robotics Institute, Carnegie Mellon University, USA, where I was advised by Prof. J. Andrew (Drew) Bagnell. I do fundamental as well as applied research in machine learning, control, and computer vision, with applications to autonomous agents in general and robotics in particular. My interests include decision-making under uncertainty, reinforcement learning, artificial intelligence, and machine learning. As of January 2019, I am also serving as Affiliate Assistant Professor at the School of Computer Science and Engineering, University of Washington, Seattle, USA. I regularly review for NeurIPS, ICLR, ICML, ICRA, IROS, IJRR, and JFR, and occasionally for CVPR, ECCV, ICCV, and Autonomous Robots.
Wenhui Wang (Microsoft Research)
Xiang Zhang (The Pennsylvania State University)
Ahmed Awadallah (Microsoft)
Jianfeng Gao (Microsoft Research, Redmond, WA)
More from the Same Authors
- 2021 Spotlight: Focal Attention for Long-Range Interactions in Vision Transformers »
  Jianwei Yang · Chunyuan Li · Pengchuan Zhang · Xiyang Dai · Bin Xiao · Lu Yuan · Jianfeng Gao
- 2021 : Adversarial GLUE: A Multi-Task Benchmark for Robustness Evaluation of Language Models »
  Boxin Wang · Chejian Xu · Shuohang Wang · Zhe Gan · Yu Cheng · Jianfeng Gao · Ahmed Awadallah · Bo Li
- 2021 : Few-Shot Learning Evaluation in Natural Language Understanding »
  Subhabrata Mukherjee · Xiaodong Liu · Guoqing Zheng · Saghar Hosseini · Hao Cheng · Ge Yang · Christopher Meek · Ahmed Awadallah · Jianfeng Gao
- 2022 : Fifteen-minute Competition Overview Video »
  Maartje Anne ter Hoeve · Mikhail Burtsev · Zoya Volovikova · Ziming Li · Yuxuan Sun · Shrestha Mohanty · Negar Arabzadeh · Mohammad Aliannejadi · Milagro Teruel · Marc-Alexandre Côté · Kavya Srinet · arthur szlam · Artem Zholus · Alexey Skrynnik · Aleksandr Panov · Ahmed Awadallah · Julia Kiseleva
- 2022 Spotlight: Lightning Talks 5B-2 »
  Conglong Li · Mohammad Azizmalayeri · Mojan Javaheripi · Pratik Vaishnavi · Jon Hasselgren · Hao Lu · Kevin Eykholt · Arshia Soltani Moakhar · Wenze Liu · Gustavo de Rosa · Nikolai Hofmann · Minjia Zhang · Zixuan Ye · Jacob Munkberg · Amir Rahmati · Arman Zarei · Subhabrata Mukherjee · Yuxiong He · Shital Shah · Reihaneh Zohrabi · Hongtao Fu · Tomasz Religa · Yuliang Liu · Mohammad Manzuri · Mohammad Hossein Rohban · Zhiguo Cao · Caio Cesar Teodoro Mendes · Sebastien Bubeck · Farinaz Koushanfar · Debadeepta Dey
- 2022 Spotlight: LiteTransformerSearch: Training-free Neural Architecture Search for Efficient Language Models »
  Mojan Javaheripi · Gustavo de Rosa · Subhabrata Mukherjee · Shital Shah · Tomasz Religa · Caio Cesar Teodoro Mendes · Sebastien Bubeck · Farinaz Koushanfar · Debadeepta Dey
- 2022 Spotlight: Focal Modulation Networks »
  Jianwei Yang · Chunyuan Li · Xiyang Dai · Jianfeng Gao
- 2022 Spotlight: ELEVATER: A Benchmark and Toolkit for Evaluating Language-Augmented Visual Models »
  Chunyuan Li · Haotian Liu · Liunian Li · Pengchuan Zhang · Jyoti Aneja · Jianwei Yang · Ping Jin · Houdong Hu · Zicheng Liu · Yong Jae Lee · Jianfeng Gao
- 2022 Spotlight: Fault-Aware Neural Code Rankers »
  Jeevana Priya Inala · Chenglong Wang · Mei Yang · Andres Codas · Mark Encarnación · Shuvendu Lahiri · Madanlal Musuvathi · Jianfeng Gao
- 2022 Competition: IGLU: Interactive Grounded Language Understanding in a Collaborative Environment »
  Julia Kiseleva · Alexey Skrynnik · Artem Zholus · Shrestha Mohanty · Negar Arabzadeh · Marc-Alexandre Côté · Mohammad Aliannejadi · Milagro Teruel · Ziming Li · Mikhail Burtsev · Maartje Anne ter Hoeve · Zoya Volovikova · Aleksandr Panov · Yuxuan Sun · arthur szlam · Ahmed Awadallah · Kavya Srinet
- 2022 Poster: K-LITE: Learning Transferable Visual Models with External Knowledge »
  Sheng Shen · Chunyuan Li · Xiaowei Hu · Yujia Xie · Jianwei Yang · Pengchuan Zhang · Zhe Gan · Lijuan Wang · Lu Yuan · Ce Liu · Kurt Keutzer · Trevor Darrell · Anna Rohrbach · Jianfeng Gao
- 2022 Poster: Coarse-to-Fine Vision-Language Pre-training with Fusion in the Backbone »
  Zi-Yi Dou · Aishwarya Kamath · Zhe Gan · Pengchuan Zhang · Jianfeng Wang · Linjie Li · Zicheng Liu · Ce Liu · Yann LeCun · Nanyun Peng · Jianfeng Gao · Lijuan Wang
- 2022 Poster: ELEVATER: A Benchmark and Toolkit for Evaluating Language-Augmented Visual Models »
  Chunyuan Li · Haotian Liu · Liunian Li · Pengchuan Zhang · Jyoti Aneja · Jianwei Yang · Ping Jin · Houdong Hu · Zicheng Liu · Yong Jae Lee · Jianfeng Gao
- 2022 Poster: VLMo: Unified Vision-Language Pre-Training with Mixture-of-Modality-Experts »
  Hangbo Bao · Wenhui Wang · Li Dong · Qiang Liu · Owais Khan Mohammed · Kriti Aggarwal · Subhojit Som · Songhao Piao · Furu Wei
- 2022 Poster: Focal Modulation Networks »
  Jianwei Yang · Chunyuan Li · Xiyang Dai · Jianfeng Gao
- 2022 Poster: Fault-Aware Neural Code Rankers »
  Jeevana Priya Inala · Chenglong Wang · Mei Yang · Andres Codas · Mark Encarnación · Shuvendu Lahiri · Madanlal Musuvathi · Jianfeng Gao
- 2022 Poster: LiteTransformerSearch: Training-free Neural Architecture Search for Efficient Language Models »
  Mojan Javaheripi · Gustavo de Rosa · Subhabrata Mukherjee · Shital Shah · Tomasz Religa · Caio Cesar Teodoro Mendes · Sebastien Bubeck · Farinaz Koushanfar · Debadeepta Dey
- 2022 Poster: GLIPv2: Unifying Localization and Vision-Language Understanding »
  Haotian Zhang · Pengchuan Zhang · Xiaowei Hu · Yen-Chun Chen · Liunian Li · Xiyang Dai · Lijuan Wang · Lu Yuan · Jenq-Neng Hwang · Jianfeng Gao
- 2021 : IGLU: Interactive Grounded Language Understanding in a Collaborative Environment + Q&A »
  Julia Kiseleva · Ziming Li · Mohammad Aliannejadi · Maartje Anne ter Hoeve · Mikhail Burtsev · Alexey Skrynnik · Artem Zholus · Aleksandr Panov · Katja Hofmann · Kavya Srinet · arthur szlam · Michel Galley · Ahmed Awadallah
- 2021 Poster: Focal Attention for Long-Range Interactions in Vision Transformers »
  Jianwei Yang · Chunyuan Li · Pengchuan Zhang · Xiyang Dai · Bin Xiao · Lu Yuan · Jianfeng Gao
- 2021 Poster: Tuning Large Neural Networks via Zero-Shot Hyperparameter Transfer »
  Ge Yang · Edward Hu · Igor Babuschkin · Szymon Sidor · Xiaodong Liu · David Farhi · Nick Ryder · Jakub Pachocki · Weizhu Chen · Jianfeng Gao
- 2021 : WebQA Competition + Q&A »
  Yingshan CHANG · Yonatan Bisk · Mridu Narang · Levi Melnick · Jianfeng Gao · Hisami Suzuki · Guihong Cao
- 2021 Poster: InfoGCL: Information-Aware Graph Contrastive Learning »
  Dongkuan Xu · Wei Cheng · Dongsheng Luo · Haifeng Chen · Xiang Zhang
- 2020 Poster: Parameterized Explainer for Graph Neural Network »
  Dongsheng Luo · Wei Cheng · Dongkuan Xu · Wenchao Yu · Bo Zong · Haifeng Chen · Xiang Zhang
- 2020 Poster: Uncertainty-aware Self-training for Few-shot Text Classification »
  Subhabrata Mukherjee · Ahmed Awadallah
- 2020 Spotlight: Uncertainty-aware Self-training for Few-shot Text Classification »
  Subhabrata Mukherjee · Ahmed Awadallah
- 2020 Poster: MiniLM: Deep Self-Attention Distillation for Task-Agnostic Compression of Pre-Trained Transformers »
  Wenhui Wang · Furu Wei · Li Dong · Hangbo Bao · Nan Yang · Ming Zhou
- 2019 Poster: Unified Language Model Pre-training for Natural Language Understanding and Generation »
  Li Dong · Nan Yang · Wenhui Wang · Furu Wei · Xiaodong Liu · Yu Wang · Jianfeng Gao · Ming Zhou · Hsiao-Wuen Hon
- 2019 Poster: Efficient Forward Architecture Search »
  Hanzhang Hu · John Langford · Rich Caruana · Saurajit Mukherjee · Eric Horvitz · Debadeepta Dey
- 2018 Poster: M-Walk: Learning to Walk over Graphs using Monte Carlo Tree Search »
  Yelong Shen · Jianshu Chen · Po-Sen Huang · Yuqing Guo · Jianfeng Gao
- 2018 Poster: Generating Informative and Diverse Conversational Responses via Adversarial Information Maximization »
  Yizhe Zhang · Michel Galley · Jianfeng Gao · Zhe Gan · Xiujun Li · Chris Brockett · Bill Dolan
- 2018 Poster: Navigating with Graph Representations for Fast and Scalable Decoding of Neural Language Models »
  Minjia Zhang · Wenhan Wang · Xiaodong Liu · Jianfeng Gao · Yuxiong He
- 2017 : Invited Talk: Microsoft (Asli and Jianfeng) »
  Jianfeng Gao
- 2015 Poster: End-to-end Learning of LDA by Mirror-Descent Back Propagation over a Deep Architecture »
  Jianshu Chen · Ji He · Yelong Shen · Lin Xiao · Xiaodong He · Jianfeng Gao · Xinying Song · Li Deng