Vision-Language Pre-training (VLP) aims to learn multi-modal representations from image-text pairs and serves downstream vision-language tasks through fine-tuning. The dominant VLP models adopt a CNN-Transformer architecture, which embeds images with a CNN and then aligns images and text with a Transformer. Visual relationships between visual contents play an important role in image understanding and are the basis for inter-modal alignment learning. However, CNNs have limitations in visual relation learning because their local receptive fields are weak at modeling long-range dependencies. Thus the two objectives of learning visual relations and inter-modal alignment are encapsulated in the same Transformer network. Such a design might restrict inter-modal alignment learning in the Transformer by ignoring the specialized characteristics of each objective. To tackle this, we propose a fully Transformer visual embedding for VLP to better learn visual relations and further promote inter-modal alignment. Specifically, we propose a metric named Inter-Modality Flow (IMF) to measure the interaction between the vision and language modalities (i.e., inter-modality). We also design a novel masking optimization mechanism named Masked Feature Regression (MFR) in the Transformer to further promote inter-modality learning. To the best of our knowledge, this is the first study to explore the benefit of Transformers for visual feature learning in VLP. We verify our method on a wide range of vision-language tasks, including Visual Question Answering (VQA), Visual Entailment, and Visual Reasoning. Our approach not only outperforms state-of-the-art VLP models, but also shows benefits on the IMF metric.
Author Information
Hongwei Xue (University of Science and Technology of China)
Yupan Huang (Sun Yat-sen University)
Bei Liu (Microsoft Research Asia)
Houwen Peng (Microsoft Research)
Jianlong Fu (Microsoft Research)
Houqiang Li (University of Science and Technology of China)
Jiebo Luo (U. Rochester)
More from the Same Authors
-
2022 Poster: Long-Form Video-Language Pre-Training with Multimodal Temporal Contrastive Learning »
Yuchong Sun · Hongwei Xue · Ruihua Song · Bei Liu · Huan Yang · Jianlong Fu -
2022 Poster: LDSA: Learning Dynamic Subtask Assignment in Cooperative Multi-Agent Reinforcement Learning »
Mingyu Yang · Jian Zhao · Xunhan Hu · Wengang Zhou · Jiangcheng Zhu · Houqiang Li -
2022 : Multi-Agent Reinforcement Learning with Shared Resources for Inventory Management »
Yuandong Ding · Mingxiao Feng · Guozi Liu · Wei Jiang · Chuheng Zhang · Li Zhao · Lei Song · Houqiang Li · Yan Jin · Jiang Bian -
2023 Poster: Hierarchical Multi-Agent Skill Discovery »
Mingyu Yang · Yaodong Yang · Zhenbo Lu · Wengang Zhou · Houqiang Li -
2023 Poster: CLIP4HOI: Towards Adapting CLIP for Practical Zero-Shot HOI Detection »
Yunyao Mao · Jiajun Deng · Wengang Zhou · Li Li · Yao Fang · Houqiang Li -
2023 Poster: TextDiffuser: Diffusion Models as Text Painters »
Jingye Chen · Yupan Huang · Tengchao Lv · Lei Cui · Qifeng Chen · Furu Wei -
2023 Poster: Multi-Agent First Order Constrained Optimization in Policy Space »
Youpeng Zhao · Yaodong Yang · Zhenbo Lu · Wengang Zhou · Houqiang Li -
2023 Poster: State Sequences Prediction via Fourier Transform for Representation Learning »
Mingxuan Ye · Yufei Kuang · Jie Wang · Yang Rui · Wengang Zhou · Houqiang Li · Feng Wu -
2023 Poster: ImageBrush: Learning Visual In-Context Instructions for Exemplar-Based Image Manipulation »
ya sheng sun · Yifan Yang · Houwen Peng · Yifei Shen · Yuqing Yang · Han Hu · Lili Qiu · Hideki Koike -
2023 Poster: DIFFER: Decomposing Individual Reward for Fair Experience Replay in Multi-Agent Reinforcement Learning »
Xunhan Hu · Jian Zhao · Wengang Zhou · Ruili Feng · Houqiang Li -
2023 Poster: Wyze Rule: Federated Rule Dataset for Rule Recommendation Benchmarking »
Mohammad Mahdi Kamani · Yuhang Yao · Hanjia Lyu · Zhongwei Cheng · Lin Chen · Liangju Li · Carlee Joe-Wong · Jiebo Luo -
2022 Spotlight: Lightning Talks 3A-3 »
Xu Yan · Zheng Dong · Qiancheng Fu · Jing Tan · Hezhen Hu · Fukun Yin · Weilun Wang · Ke Xu · Heshen Zhan · Wen Liu · Qingshan Xu · Xiaotong Zhao · Chaoda Zheng · Ziheng Duan · Zilong Huang · Xintian Shi · Wengang Zhou · Yew Soon Ong · Pei Cheng · Hujun Bao · Houqiang Li · Wenbing Tao · Jiantao Gao · Bin Kang · Weiwei Xu · Limin Wang · Ruimao Zhang · Tao Chen · Gang Yu · Rynson Lau · Shuguang Cui · Zhen Li -
2022 Spotlight: Hand-Object Interaction Image Generation »
Hezhen Hu · Weilun Wang · Wengang Zhou · Houqiang Li -
2022 Poster: PointNeXt: Revisiting PointNet++ with Improved Training and Scaling Strategies »
Guocheng Qian · Yuchen Li · Houwen Peng · Jinjie Mai · Hasan Hammoud · Mohamed Elhoseiny · Bernard Ghanem -
2022 Poster: Hand-Object Interaction Image Generation »
Hezhen Hu · Weilun Wang · Wengang Zhou · Houqiang Li -
2021 Poster: Dual Progressive Prototype Network for Generalized Zero-Shot Learning »
Chaoqun Wang · Shaobo Min · Xuejin Chen · Xiaoyan Sun · Houqiang Li -
2021 Poster: Contextual Similarity Aggregation with Self-attention for Visual Re-ranking »
Jianbo Ouyang · Hui Wu · Min Wang · Wengang Zhou · Houqiang Li -
2021 Poster: Improving Visual Quality of Image Synthesis by A Token-based Generator with Transformers »
Yanhong Zeng · Huan Yang · Hongyang Chao · Jianbo Wang · Jianlong Fu -
2021 Poster: Searching the Search Space of Vision Transformer »
Minghao Chen · Kan Wu · Bolin Ni · Houwen Peng · Bei Liu · Jianlong Fu · Hongyang Chao · Haibin Ling -
2021 Poster: Multi-modal Dependency Tree for Video Captioning »
Wentian Zhao · Xinxiao Wu · Jiebo Luo -
2020 Poster: Cream of the Crop: Distilling Prioritized Paths For One-Shot Neural Architecture Search »
Houwen Peng · Hao Du · Hongyuan Yu · QI LI · Jing Liao · Jianlong Fu -
2020 Poster: Promoting Stochasticity for Expressive Policies via a Simple and Efficient Regularization Method »
Qi Zhou · Yufei Kuang · Zherui Qiu · Houqiang Li · Jie Wang -
2020 Poster: Learning Semantic-aware Normalization for Generative Adversarial Networks »
Heliang Zheng · Jianlong Fu · Yanhong Zeng · Jiebo Luo · Zheng-Jun Zha -
2020 Spotlight: Learning Semantic-aware Normalization for Generative Adversarial Networks »
Heliang Zheng · Jianlong Fu · Yanhong Zeng · Jiebo Luo · Zheng-Jun Zha -
2019 Poster: Learning Deep Bilinear Transformation for Fine-grained Image Representation »
Heliang Zheng · Jianlong Fu · Zheng-Jun Zha · Jiebo Luo