Timezone: »
Vision-language (VL) pre-training has recently received considerable attention. However, most existing end-to-end pre-training approaches either only aim to tackle VL tasks such as image-text retrieval, visual question answering (VQA) and image captioning that test high-level understanding of images, or only target region-level understanding for tasks such as phrase grounding and object detection. We present FIBER (Fusion-In-the-Backbone-based transformER), a new VL model architecture that can seamlessly handle both these types of tasks. Instead of having dedicated transformer layers for fusion after the uni-modal backbones, FIBER pushes multimodal fusion deep into the model by inserting cross-attention into the image and text backbones to better capture multimodal interactions. In addition, unlike previous work that is either only pre-trained on image-text data or on fine-grained data with box-level annotations, we present a two-stage pre-training strategy that uses both these kinds of data efficiently: (i) coarse-grained pre-training based on image-text data; followed by (ii) fine-grained pre-training based on image-text-box data. We conduct comprehensive experiments on a wide range of VL tasks, ranging from VQA, image captioning, and retrieval, to phrase grounding, referring expression comprehension, and object detection. Using deep multimodal fusion coupled with the two-stage pre-training, FIBER provides consistent performance improvements over strong baselines across all tasks, often outperforming methods using magnitudes more data. Code is released at https://github.com/microsoft/FIBER.
Author Information
Zi-Yi Dou (UCLA)
Aishwarya Kamath (New York University; DeepMind)
Zhe Gan (Microsoft)
Pengchuan Zhang (California Institute of Technology)
Jianfeng Wang (University of Science and Technology of China)
Linjie Li (Microsoft)
Zicheng Liu (Microsoft)
Ce Liu (Microsoft)
Yann LeCun (Facebook)
Yann LeCun is Director of AI Research at Facebook, and Silver Professor of Data Science, Computer Science, Neural Science, and Electrical Engineering at New York University. He received the Electrical Engineer Diploma from ESIEE, Paris in 1983, and a PhD in Computer Science from Université Pierre et Marie Curie (Paris) in 1987. After a postdoc at the University of Toronto, he joined AT&T Bell Laboratories in Holmdel, NJ in 1988. He became head of the Image Processing Research Department at AT&T Labs-Research in 1996, and joined NYU as a professor in 2003, after a brief period as a Fellow of the NEC Research Institute in Princeton. From 2012 to 2014 he directed NYU's initiative in data science and became the founding director of the NYU Center for Data Science. He was named Director of AI Research at Facebook in late 2013 and retains a part-time position on the NYU faculty. His current interests include AI, machine learning, computer perception, mobile robotics, and computational neuroscience. He has published over 180 technical papers and book chapters on these topics as well as on neural networks, handwriting recognition, image processing and compression, and on dedicated circuits for computer perception.
Nanyun Peng (University of California, Los Angeles)
Jianfeng Gao (Microsoft Research, Redmond, WA)
Lijuan Wang
More from the Same Authors
-
2021 : VALUE: A Multi-Task Benchmark for Video-and-Language Understanding Evaluation »
Linjie Li · Jie Lei · Zhe Gan · Licheng Yu · Yen-Chun Chen · Rohit Pillai · Yu Cheng · Luowei Zhou · Xin Wang · William Yang Wang · Tamara L Berg · Mohit Bansal · Jingjing Liu · Lijuan Wang · Zicheng Liu -
2021 Spotlight: Focal Attention for Long-Range Interactions in Vision Transformers »
Jianwei Yang · Chunyuan Li · Pengchuan Zhang · Xiyang Dai · Bin Xiao · Lu Yuan · Jianfeng Gao -
2021 Spotlight: ViSER: Video-Specific Surface Embeddings for Articulated 3D Shape Reconstruction »
Gengshan Yang · Deqing Sun · Varun Jampani · Daniel Vlasic · Forrester Cole · Ce Liu · Deva Ramanan -
2021 : Adversarial GLUE: A Multi-Task Benchmark for Robustness Evaluation of Language Models »
Boxin Wang · Chejian Xu · Shuohang Wang · Zhe Gan · Yu Cheng · Jianfeng Gao · Ahmed Awadallah · Bo Li -
2021 : Few-Shot Learning Evaluation in Natural Language Understanding »
Subhabrata Mukherjee · Xiaodong Liu · Guoqing Zheng · Saghar Hosseini · Hao Cheng · Ge Yang · Christopher Meek · Ahmed Awadallah · Jianfeng Gao -
2022 Poster: OmniVL: One Foundation Model for Image-Language and Video-Language Tasks »
Junke Wang · Dongdong Chen · Zuxuan Wu · Chong Luo · Luowei Zhou · Yucheng Zhao · Yujia Xie · Ce Liu · Yu-Gang Jiang · Lu Yuan -
2022 : Parameter-Efficient Low-Resource Dialogue State Tracking by Prompt Tuning »
Mingyu Derek Ma · Jiun-Yu Kao · Shuyang Gao · arpit gupta · Di Jin · Tagyoung Chung · Nanyun Peng -
2023 Poster: Bridging Discrete and Backpropagation: Straight-Through and Beyond »
Liyuan Liu · Chengyu Dong · Xiaodong Liu · Bin Yu · Jianfeng Gao -
2023 Poster: Self-Supervised Learning with Lie Symmetries for Partial Differential Equations »
Grégoire Mialon · Quentin Garrido · Hannah Lawrence · Danyal Rehman · Bobak Kiani · Yann LeCun -
2023 Poster: Localized Symbolic Knowledge Distillation for Visual Commonsense Models »
Jae Sung Park · Jack Hessel · Khyathi Chandu · Paul Pu Liang · Ximing Lu · Qiuyuan Huang · Peter West · Jianfeng Gao · Ali Farhadi · Yejin Choi -
2023 Poster: Guiding Large Language Models via Directional Stimulus Prompting »
Zekun Li · Baolin Peng · Pengcheng He · Michel Galley · Jianfeng Gao · Xifeng Yan -
2023 Poster: Segment Everything Everywhere All at Once »
Xueyan Zou · Jianwei Yang · Hao Zhang · Feng Li · Linjie Li · Jianfeng Wang · Lijuan Wang · Jianfeng Gao · Yong Jae Lee -
2023 Poster: Reverse Engineering Self-Supervised Learning »
Ido Ben-Shaul · Ravid Shwartz-Ziv · Tomer Galanti · Shai Dekel · Yann LeCun -
2023 Poster: An Information Theory Perspective on Variance-Invariance-Covariance Regularization »
Ravid Shwartz-Ziv · Randall Balestriero · Kenji Kawaguchi · Tim G. J. Rudner · Yann LeCun -
2023 Poster: DesCo: Learning Object Recognition with Rich Language Descriptions »
Liunian Li · Zi-Yi Dou · Nanyun Peng · Kai-Wei Chang -
2023 Poster: Chameleon: Plug-and-Play Compositional Reasoning with Large Language Models »
Pan Lu · Baolin Peng · Hao Cheng · Michel Galley · Kai-Wei Chang · Ying Nian Wu · Song-Chun Zhu · Jianfeng Gao -
2023 Poster: PaintSeg: Painting Pixels for Training-free Segmentation »
Xiang Li · Chung-Ching Lin · Yinpeng Chen · Zicheng Liu · Jinglu Wang · Bhiksha Raj -
2023 Poster: Language Models Augmented with Decoupled Memory »
Weizhi Wang · Li Dong · Hao Cheng · Xiaodong Liu · Xifeng Yan · Jianfeng Gao · Furu Wei -
2023 Poster: LLaVA-Med: Training a Large Language-and-Vision Assistant for Biomedicine in One Day »
Chunyuan Li · Cliff Wong · Sheng Zhang · Naoto Usuyama · Haotian Liu · Jianwei Yang · Tristan Naumann · Hoifung Poon · Jianfeng Gao -
2023 Poster: NAVI: Category-Agnostic Image Collections with High-Quality 3D Shape and Pose Annotations »
Varun Jampani · Kevis-kokitsi Maninis · Andreas Engelhardt · Arjun Karpur · Karen Truong · Kyle Sargent · Stefan Popov · Andre Araujo · Ricardo Martin Brualla · Kaushal Patel · Daniel Vlasic · Vittorio Ferrari · Ameesh Makadia · Ce Liu · Yuanzhen Li · Howard Zhou -
2023 Oral: Bridging Discrete and Backpropagation: Straight-Through and Beyond »
Liyuan Liu · Chengyu Dong · Xiaodong Liu · Bin Yu · Jianfeng Gao -
2022 Spotlight: Focal Modulation Networks »
Jianwei Yang · Chunyuan Li · Xiyang Dai · Jianfeng Gao -
2022 Spotlight: ELEVATER: A Benchmark and Toolkit for Evaluating Language-Augmented Visual Models »
Chunyuan Li · Haotian Liu · Liunian Li · Pengchuan Zhang · Jyoti Aneja · Jianwei Yang · Ping Jin · Houdong Hu · Zicheng Liu · Yong Jae Lee · Jianfeng Gao -
2022 Spotlight: OmniVL: One Foundation Model for Image-Language and Video-Language Tasks »
Junke Wang · Dongdong Chen · Zuxuan Wu · Chong Luo · Luowei Zhou · Yucheng Zhao · Yujia Xie · Ce Liu · Yu-Gang Jiang · Lu Yuan -
2022 Spotlight: Fault-Aware Neural Code Rankers »
Jeevana Priya Inala · Chenglong Wang · Mei Yang · Andres Codas · Mark Encarnación · Shuvendu Lahiri · Madanlal Musuvathi · Jianfeng Gao -
2022 Poster: K-LITE: Learning Transferable Visual Models with External Knowledge »
Sheng Shen · Chunyuan Li · Xiaowei Hu · Yujia Xie · Jianwei Yang · Pengchuan Zhang · Zhe Gan · Lijuan Wang · Lu Yuan · Ce Liu · Kurt Keutzer · Trevor Darrell · Anna Rohrbach · Jianfeng Gao -
2022 Poster: The Effects of Regularization and Data Augmentation are Class Dependent »
Randall Balestriero · Leon Bottou · Yann LeCun -
2022 Poster: VICRegL: Self-Supervised Learning of Local Visual Features »
Adrien Bardes · Jean Ponce · Yann LeCun -
2022 Poster: ELEVATER: A Benchmark and Toolkit for Evaluating Language-Augmented Visual Models »
Chunyuan Li · Haotian Liu · Liunian Li · Pengchuan Zhang · Jyoti Aneja · Jianwei Yang · Ping Jin · Houdong Hu · Zicheng Liu · Yong Jae Lee · Jianfeng Gao -
2022 Poster: Pre-Train Your Loss: Easy Bayesian Transfer Learning with Informative Priors »
Ravid Shwartz-Ziv · Micah Goldblum · Hossein Souri · Sanyam Kapoor · Chen Zhu · Yann LeCun · Andrew Wilson -
2022 Poster: Few-shot Task-agnostic Neural Architecture Search for Distilling Large Language Models »
Dongkuan (DK) Xu · Subhabrata Mukherjee · Xiaodong Liu · Debadeepta Dey · Wenhui Wang · Xiang Zhang · Ahmed Awadallah · Jianfeng Gao -
2022 Poster: NUWA-Infinity: Autoregressive over Autoregressive Generation for Infinite Visual Synthesis »
Jian Liang · Chenfei Wu · Xiaowei Hu · Zhe Gan · Jianfeng Wang · Lijuan Wang · Zicheng Liu · Yuejian Fang · Nan Duan -
2022 Poster: InsNet: An Efficient, Flexible, and Performant Insertion-based Text Generation Model »
Sidi Lu · Tao Meng · Nanyun Peng -
2022 Poster: A Data-Augmentation Is Worth A Thousand Samples: Analytical Moments And Sampling-Free Training »
Randall Balestriero · Ishan Misra · Yann LeCun -
2022 Poster: projUNN: efficient method for training deep networks with unitary matrices »
Bobak Kiani · Randall Balestriero · Yann LeCun · Seth Lloyd -
2022 Poster: Visual Clues: Bridging Vision and Language Foundations for Image Paragraph Captioning »
Yujia Xie · Luowei Zhou · Xiyang Dai · Lu Yuan · Nguyen Bach · Ce Liu · Michael Zeng -
2022 Poster: Focal Modulation Networks »
Jianwei Yang · Chunyuan Li · Xiyang Dai · Jianfeng Gao -
2022 Poster: Fault-Aware Neural Code Rankers »
Jeevana Priya Inala · Chenglong Wang · Mei Yang · Andres Codas · Mark Encarnación · Shuvendu Lahiri · Madanlal Musuvathi · Jianfeng Gao -
2022 Poster: Contrastive and Non-Contrastive Self-Supervised Learning Recover Global and Local Spectral Embedding Methods »
Randall Balestriero · Yann LeCun -
2022 Poster: Controllable Text Generation with Neurally-Decomposed Oracle »
Tao Meng · Sidi Lu · Nanyun Peng · Kai-Wei Chang -
2022 Poster: GLIPv2: Unifying Localization and Vision-Language Understanding »
Haotian Zhang · Pengchuan Zhang · Xiaowei Hu · Yen-Chun Chen · Liunian Li · Xiyang Dai · Lijuan Wang · Lu Yuan · Jenq-Neng Hwang · Jianfeng Gao -
2022 Poster: 3DB: A Framework for Debugging Computer Vision Models »
Guillaume Leclerc · Hadi Salman · Andrew Ilyas · Sai Vemprala · Logan Engstrom · Vibhav Vineet · Kai Xiao · Pengchuan Zhang · Shibani Santurkar · Greg Yang · Ashish Kapoor · Aleksander Madry -
2021 : Adversarial GLUE: A Multi-Task Benchmark for Robustness Evaluation of Language Models »
Boxin Wang · Chejian Xu · Shuohang Wang · Zhe Gan · Yu Cheng · Jianfeng Gao · Ahmed Awadallah · Bo Li -
2021 Poster: Stronger NAS with Weaker Predictors »
Junru Wu · Xiyang Dai · Dongdong Chen · Yinpeng Chen · Mengchen Liu · Ye Yu · Zhangyang Wang · Zicheng Liu · Mei Chen · Lu Yuan -
2021 Poster: Neural-PIL: Neural Pre-Integrated Lighting for Reflectance Decomposition »
Mark Boss · Varun Jampani · Raphael Braun · Ce Liu · Jonathan Barron · Hendrik PA Lensch -
2021 Poster: ViSER: Video-Specific Surface Embeddings for Articulated 3D Shape Reconstruction »
Gengshan Yang · Deqing Sun · Varun Jampani · Daniel Vlasic · Forrester Cole · Ce Liu · Deva Ramanan -
2021 Poster: Focal Attention for Long-Range Interactions in Vision Transformers »
Jianwei Yang · Chunyuan Li · Pengchuan Zhang · Xiyang Dai · Bin Xiao · Lu Yuan · Jianfeng Gao -
2021 Poster: Tuning Large Neural Networks via Zero-Shot Hyperparameter Transfer »
Ge Yang · Edward Hu · Igor Babuschkin · Szymon Sidor · Xiaodong Liu · David Farhi · Nick Ryder · Jakub Pachocki · Weizhu Chen · Jianfeng Gao -
2021 : WebQA Competition + Q&A »
Yingshan CHANG · Yonatan Bisk · Mridu Narang · Levi Melnick · Jianfeng Gao · Hisami Suzuki · Guihong Cao -
2020 : Panel Discussion & Closing »
Yejin Choi · Alexei Efros · Chelsea Finn · Kristen Grauman · Quoc V Le · Yann LeCun · Ruslan Salakhutdinov · Eric Xing -
2020 : QA: Yann LeCun »
Yann LeCun -
2020 : Invited Talk: Yann LeCun »
Yann LeCun -
2020 : Invited talk - Creative Language Generation: Stories, Sarcasms, and Similes - Nanyun Peng »
Nanyun Peng -
2020 Poster: Large-Scale Adversarial Training for Vision-and-Language Representation Learning »
Zhe Gan · Yen-Chun Chen · Linjie Li · Chen Zhu · Yu Cheng · Jingjing Liu -
2020 Spotlight: Large-Scale Adversarial Training for Vision-and-Language Representation Learning »
Zhe Gan · Yen-Chun Chen · Linjie Li · Chen Zhu · Yu Cheng · Jingjing Liu -
2019 : TBD »
Yann LeCun -
2019 Poster: Unified Language Model Pre-training for Natural Language Understanding and Generation »
Li Dong · Nan Yang · Wenhui Wang · Furu Wei · Xiaodong Liu · Yu Wang · Jianfeng Gao · Ming Zhou · Hsiao-Wuen Hon -
2018 Poster: M-Walk: Learning to Walk over Graphs using Monte Carlo Tree Search »
Yelong Shen · Jianshu Chen · Po-Sen Huang · Yuqing Guo · Jianfeng Gao -
2018 Poster: Generating Informative and Diverse Conversational Responses via Adversarial Information Maximization »
Yizhe Zhang · Michel Galley · Jianfeng Gao · Zhe Gan · Xiujun Li · Chris Brockett · Bill Dolan -
2018 Poster: Navigating with Graph Representations for Fast and Scalable Decoding of Neural Language Models »
Minjia Zhang · Wenhan Wang · Xiaodong Liu · Jianfeng Gao · Yuxiong He -
2017 : Panel Session »
Neil Lawrence · Finale Doshi-Velez · Zoubin Ghahramani · Yann LeCun · Max Welling · Yee Whye Teh · Ole Winther -
2017 : Invited Talk: Microsoft (Asli and Jianfeng) »
Jianfeng Gao -
2017 Tutorial: Geometric Deep Learning on Graphs and Manifolds »
Michael Bronstein · Joan Bruna · arthur szlam · Xavier Bresson · Yann LeCun -
2016 : Discussion panel »
Ian Goodfellow · Soumith Chintala · Arthur Gretton · Sebastian Nowozin · Aaron Courville · Yann LeCun · Emily Denton -
2016 : Energy-Based Adversarial Training and Video Prediction »
Yann LeCun -
2016 Symposium: Deep Learning Symposium »
Yoshua Bengio · Yann LeCun · Navdeep Jaitly · Roger Grosse -
2015 Poster: Learning to Linearize Under Uncertainty »
Ross Goroshin · Michael Mathieu · Yann LeCun -
2015 Poster: End-to-end Learning of LDA by Mirror-Descent Back Propagation over a Deep Architecture »
Jianshu Chen · Ji He · Yelong Shen · Lin Xiao · Xiaodong He · Jianfeng Gao · Xinying Song · Li Deng -
2015 Poster: Character-level Convolutional Networks for Text Classification »
Xiang Zhang · Junbo (Jake) Zhao · Yann LeCun -
2015 Poster: Deep learning with Elastic Averaging SGD »
Sixin Zhang · Anna Choromanska · Yann LeCun -
2015 Spotlight: Deep learning with Elastic Averaging SGD »
Sixin Zhang · Anna Choromanska · Yann LeCun -
2015 Tutorial: Deep Learning »
Geoffrey E Hinton · Yoshua Bengio · Yann LeCun