Deep learning models such as the Transformer are often constructed from heuristics and experience. To provide a complementary foundation, in this work we study the following problem: Is it possible to find an energy function underlying the Transformer model, such that descent steps along this energy correspond to the Transformer forward pass? By finding such a function, we can reinterpret Transformers as the unfolding of an interpretable optimization process. This unfolding perspective has frequently been adopted in the past to elucidate more straightforward deep models such as MLPs and CNNs; however, obtaining a similar equivalence for more complex models with self-attention mechanisms, like the Transformer, has thus far remained elusive. To this end, we first outline several major obstacles before providing companion techniques to at least partially address them, demonstrating for the first time a close association between energy function minimization and deep layers with self-attention. This interpretation contributes to our intuition and understanding of Transformers, while potentially laying the groundwork for new model designs.
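To make the "descent step as forward pass" idea concrete, the following minimal sketch uses a much simpler energy than the one studied in this work. It is an illustrative assumption, not the authors' construction: a single query vector q, a fixed set of key vectors that also serve as values, no learned projections, and the energy E(q) = 0.5 * ||q||^2 - log(sum_j exp(k_j . q)). For this toy energy, one gradient step with step size 1 reproduces a softmax attention readout. All function names are hypothetical.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(scores):
    # Subtract the max for numerical stability.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(q, keys):
    # Softmax attention readout where the keys double as values (assumption).
    weights = softmax([dot(k, q) for k in keys])
    return [sum(w * k[d] for w, k in zip(weights, keys)) for d in range(len(q))]

def energy(q, keys):
    # Toy energy: E(q) = 0.5 * ||q||^2 - logsumexp_j(k_j . q).
    scores = [dot(k, q) for k in keys]
    m = max(scores)
    lse = m + math.log(sum(math.exp(s - m) for s in scores))
    return 0.5 * dot(q, q) - lse

def energy_grad(q, keys):
    # Gradient of the toy energy: q - sum_j softmax_j(k . q) * k_j.
    attended = attention(q, keys)
    return [qi - ai for qi, ai in zip(q, attended)]

def descent_step(q, keys, step_size=1.0):
    # Plain gradient descent step on the toy energy.
    g = energy_grad(q, keys)
    return [qi - step_size * gi for qi, gi in zip(q, g)]

# Hypothetical example data.
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
q0 = [0.5, -0.2]
q1 = descent_step(q0, keys)
```

With step size 1, `q1` coincides with `attention(q0, keys)` up to floating-point error, and `energy(q1, keys)` is lower than `energy(q0, keys)` for this data. A full Transformer layer adds learned projections, interactions among all tokens, and feed-forward components, which this sketch does not attempt to model.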
Author Information
Yongyi Yang (University of Michigan)
zengfeng Huang (Fudan University)
David P Wipf (AWS)
More from the Same Authors
-
2021 : A Closer Look at Distribution Shifts and Out-of-Distribution Generalization on Graphs »
Mucong Ding · Kezhi Kong · Jiuhai Chen · John Kirchenbauer · Micah Goldblum · David P Wipf · Furong Huang · Tom Goldstein -
2022 Poster: Learning Enhanced Representation for Tabular Data via Neighborhood Propagation »
Kounianhua Du · Weinan Zhang · Ruiwen Zhou · Yangkun Wang · Xilong Zhao · Jiarui Jin · Quan Gan · Zheng Zhang · David P Wipf -
2022 Poster: Lipschitz Bandits with Batched Feedback »
Yasong Feng · zengfeng Huang · Tianyu Wang -
2022 : Are Neurons Actually Collapsed? On the Fine-Grained Structure in Neural Representations »
Yongyi Yang · Jacob Steinhardt · Wei Hu -
2023 Poster: Going Beyond Linear Mode Connectivity: The Layerwise Linear Feature Connectivity »
Zhanpeng Zhou · Yongyi Yang · Xiaojiang Yang · Junchi Yan · Wei Hu -
2023 Poster: Adversarially Robust Distributed Count Tracking via Partial Differential Privacy »
Zhongzheng Xiong · Xiaoyi Zhu · zengfeng Huang -
2023 Poster: Rethinking Semi-Supervised Imbalanced Node Classification from Bias-Variance Decomposition »
Divin Yan · Gengchen Wei · Chen Yang · Shengzhong Zhang · zengfeng Huang -
2022 Spotlight: Lightning Talks 5B-3 »
Yanze Wu · Jie Xiao · Nianzu Yang · Jieyi Bi · Jian Yao · Yiting Chen · Qizhou Wang · Yangru Huang · Yongqiang Chen · Peixi Peng · Yuxin Hong · Xintao Wang · Feng Liu · Yining Ma · Qibing Ren · Xueyang Fu · Yonggang Zhang · Kaipeng Zeng · Jiahai Wang · GEN LI · Yonggang Zhang · Qitian Wu · Yifan Zhao · Chiyu Wang · Junchi Yan · Feng Wu · Yatao Bian · Xiaosong Jia · Ying Shan · Zhiguang Cao · Zheng-Jun Zha · Guangyao Chen · Tianjun Xiao · Han Yang · Jing Zhang · Jinbiao Chen · MA Kaili · Yonghong Tian · Junchi Yan · Chen Gong · Tong He · Binghui Xie · Yuan Sun · Francesco Locatello · Tongliang Liu · Yeow Meng Chee · David P Wipf · Tongliang Liu · Bo Han · Bo Han · Yanwei Fu · James Cheng · Zheng Zhang -
2022 Spotlight: Self-supervised Amodal Video Object Segmentation »
Jian Yao · Yuxin Hong · Chiyu Wang · Tianjun Xiao · Tong He · Francesco Locatello · David P Wipf · Yanwei Fu · Zheng Zhang -
2022 Spotlight: Lipschitz Bandits with Batched Feedback »
Yasong Feng · zengfeng Huang · Tianyu Wang -
2022 Spotlight: Lightning Talks 2A-1 »
Caio Kalil Lauand · Ryan Strauss · Yasong Feng · lingyu gu · Alireza Fathollah Pour · Oren Mangoubi · Jianhao Ma · Binghui Li · Hassan Ashtiani · Yongqi Du · Salar Fattahi · Sean Meyn · Jikai Jin · Nisheeth Vishnoi · zengfeng Huang · Junier B Oliva · yuan zhang · Han Zhong · Tianyu Wang · John Hopcroft · Di Xie · Shiliang Pu · Liwei Wang · Robert Qiu · Zhenyu Liao -
2022 Spotlight: NodeFormer: A Scalable Graph Structure Learning Transformer for Node Classification »
Qitian Wu · Wentao Zhao · Zenan Li · David P Wipf · Junchi Yan -
2022 Spotlight: Lightning Talks 1B-1 »
Qitian Wu · Runlin Lei · Rongqin Chen · Luca Pinchetti · Yangze Zhou · Abhinav Kumar · Hans Hao-Hsun Hsu · Wentao Zhao · Chenhao Tan · Zhen Wang · Shenghui Zhang · Yuesong Shen · Tommaso Salvatori · Gitta Kutyniok · Zenan Li · Amit Sharma · Leong Hou U · Yordan Yordanov · Christian Tomani · Bruno Ribeiro · Yaliang Li · David P Wipf · Daniel Cremers · Bolin Ding · Beren Millidge · Ye Li · Yuhang Song · Junchi Yan · Zhewei Wei · Thomas Lukasiewicz -
2022 Poster: NodeFormer: A Scalable Graph Structure Learning Transformer for Node Classification »
Qitian Wu · Wentao Zhao · Zenan Li · David P Wipf · Junchi Yan -
2022 Poster: Descent Steps of a Relation-Aware Energy Produce Heterogeneous Graph Neural Networks »
Hongjoon Ahn · Yongyi Yang · Quan Gan · Taesup Moon · David P Wipf -
2022 Poster: Self-supervised Amodal Video Object Segmentation »
Jian Yao · Yuxin Hong · Chiyu Wang · Tianjun Xiao · Tong He · Francesco Locatello · David P Wipf · Yanwei Fu · Zheng Zhang -
2022 Poster: Learning Manifold Dimensions with Conditional Variational Autoencoders »
Yijia Zheng · Tong He · Yixuan Qiu · David P Wipf -
2021 Poster: Understanding Bandits with Graph Feedback »
Houshuang Chen · zengfeng Huang · Shuai Li · Chihao Zhang -
2021 Poster: BernNet: Learning Arbitrary Graph Spectral Filters via Bernstein Approximation »
Mingguo He · Zhewei Wei · zengfeng Huang · Hongteng Xu -
2020 Poster: Further Analysis of Outlier Detection with Deep Generative Models »
Ziyu Wang · Bin Dai · David P Wipf · Jun Zhu -
2019 Poster: Optimal Sparsity-Sensitive Bounds for Distributed Mean Estimation »
zengfeng Huang · Ziyue Huang · Yilei WANG · Ke Yi -
2012 Poster: Dual-Space Analysis of the Sparse Linear Model »
David P Wipf -
2011 Poster: Sparse Estimation with Structured Dictionaries »
David P Wipf -
2011 Spotlight: Sparse Estimation with Structured Dictionaries »
David P Wipf -
2009 Poster: Sparse Estimation Using General Likelihoods and Non-Factorial Priors »
David P Wipf · Sri Nagarajan -
2008 Poster: Estimating the Location and Orientation of Complex, Correlated Neural Activity using MEG »
David P Wipf · Julia Owen · Hagai Attias · Kensuke Sekihara · Sri Nagarajan -
2008 Spotlight: Estimating the Location and Orientation of Complex, Correlated Neural Activity using MEG »
David P Wipf · Julia Owen · Hagai Attias · Kensuke Sekihara · Sri Nagarajan -
2007 Poster: A New View of Automatic Relevance Determination »
David P Wipf · Srikantan Nagarajan -
2006 Poster: Analysis of Empirical Bayesian Methods for Neuroelectromagnetic Source Localization »
David P Wipf · Rey R Ramirez · Jason A Palmer · Scott Makeig · Bhaskar Rao -
2006 Spotlight: Analysis of Empirical Bayesian Methods for Neuroelectromagnetic Source Localization »
David P Wipf · Rey R Ramirez · Jason A Palmer · Scott Makeig · Bhaskar Rao