Large-scale vision-language pre-training has shown impressive advances on a wide range of downstream tasks. Existing methods mainly model cross-modal alignment through the similarity of global image and text representations, or through cross-modal attention over image and text features. However, they fail to explicitly learn the fine-grained semantic alignment between visual regions and textual phrases, as only global image-text alignment information is available. In this paper, we introduce LOUPE, a fine-grained semantically aLigned visiOn-langUage PrE-training framework, which learns fine-grained semantic alignment from the novel perspective of game-theoretic interactions. To efficiently estimate the game-theoretic interactions, we further propose an uncertainty-aware neural Shapley interaction learning module. Experiments show that LOUPE achieves state-of-the-art performance on a variety of vision-language tasks. Without any object-level human annotations or fine-tuning, LOUPE achieves competitive performance on object detection and visual grounding. More importantly, LOUPE opens a promising new direction for learning fine-grained semantics from large-scale raw image-text pairs.
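As context for the game-theoretic framing in the abstract, the sketch below gives a minimal Monte Carlo estimator of the Shapley interaction index between two players (for instance, an image region and a textual phrase), assuming only a generic coalition value function. The function names, the sampling budget, and the payoff are illustrative assumptions; this is not the paper's uncertainty-aware neural Shapley interaction learning module, which is proposed precisely to estimate such interactions more efficiently than direct sampling.

    import random

    def shapley_interaction(value_fn, players, i, j, num_samples=1000, seed=0):
        # Monte Carlo estimate of the Shapley interaction index between players
        # i and j in a cooperative game. value_fn maps a frozenset of players to
        # a scalar payoff; players is the full player set (e.g., image regions
        # and textual phrases). All names here are illustrative.
        rng = random.Random(seed)
        others = [p for p in players if p not in (i, j)]
        total = 0.0
        for _ in range(num_samples):
            # Treat {i, j} as one entity placed at a uniformly random position
            # of a random permutation: the coalition S is the set of players
            # preceding it, so |S| is uniform and, given its size, S is a
            # uniform subset, matching the Shapley interaction weighting.
            rng.shuffle(others)
            k = rng.randint(0, len(others))
            S = frozenset(others[:k])
            # How much i and j contribute together beyond the sum of their
            # individual marginal contributions on top of S.
            total += (value_fn(S | {i, j}) - value_fn(S | {i})
                      - value_fn(S | {j}) + value_fn(S))
        return total / num_samples

A large positive estimate indicates that a region and a phrase contribute jointly, beyond their individual contributions, to the game's payoff (e.g., an image-text similarity score), which is the intuition behind casting fine-grained alignment as a game-theoretic interaction.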
Author Information
Juncheng Li (Zhejiang University)
Xin He (Wuhan University of Technology)
Longhui Wei (Peking University)
Long Qian (Zhejiang University)
Linchao Zhu (Zhejiang University)
Lingxi Xie (Huawei Noah's Ark Lab)
Yueting Zhuang (Zhejiang University)
Qi Tian (Huawei Noah's Ark Lab)
Siliang Tang (Zhejiang University)
Dr. Siliang Tang is currently an associate professor at the College of Computer Science, Zhejiang University. He received his Ph.D. from the National University of Ireland, Maynooth, in 2012. His research interests include information extraction, knowledge base construction, and multimodal data analysis. To date, he has published more than 70 papers in top-tier conferences and journals such as AAAI and IJCAI (artificial intelligence); ACL, EMNLP, NAACL, SIGIR, and IEEE TKDE (NLP and information extraction); ACM MM, CVPR, and IEEE Trans. on Multimedia (multimodal understanding and reasoning); IEEE Trans. on Image Processing and IEEE Trans. on Circuits and Systems for Video Technology (image processing and understanding); and IEEE VIS and IEEE Trans. on Visualization and Computer Graphics (data visualization). He has served as an area chair or program committee member for conferences such as NIPS, ICML, AAAI, IJCAI, ACL, EMNLP, and NAACL, and as a reviewer for journals such as IEEE TIP, IEEE TMM, IEEE TSMC, ACM Computing Surveys, and Nature Scientific Reports. At present, he works with CKCEST on projects for automatic knowledge base construction from semi-structured and unstructured text. Since 2015, he and his team have participated in the NIST TAC (https://tac.nist.gov/) knowledge base population competition and have placed in the top three in several tracks (English EDL 2016, 1st place; TEDL 2017, 2nd place; DDI 2018, tasks 1 & 2, 1st place).
More from the Same Authors
- 2023 Poster: HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in Hugging Face »
  Yongliang Shen · Kaitao Song · Xu Tan · Dongsheng Li · Weiming Lu · Yueting Zhuang
- 2023 Poster: Zero-shot Visual Relation Detection via Composite Visual Cues from Large Language Models »
  Lin Li · Jun Xiao · Guikun Chen · Jian Shao · Yueting Zhuang · Long Chen
- 2023 Poster: Mode Approximation Makes Good Multimodal Prompts »
  Haixin Wang · Xinlong Yang · Jianlong Chang · Dian Jin · Jinan Sun · Shikun Zhang · Xiao Luo · Qi Tian
- 2023 Poster: Learning to Parameterize Visual Attributes for Open-set Fine-grained Retrieval »
  Shijie Wang · Jianlong Chang · Haojie Li · Zhihui Wang · Wanli Ouyang · Qi Tian
- 2023 Poster: AiluRus: A Scalable ViT Framework for Dense Prediction »
  Jin Li · Yaoming Wang · Xiaopeng Zhang · Bowen Shi · Dongsheng Jiang · Chenglin Li · Wenrui Dai · Hongkai Xiong · Qi Tian
- 2023 Poster: Segment Anything in 3D with NeRFs »
  Jiazhong Cen · Zanwei Zhou · Jiemin Fang · Chen Yang · Wei Shen · Lingxi Xie · Dongsheng Jiang · Xiaopeng Zhang · Qi Tian
- 2022 Spotlight: Fine-Grained Semantically Aligned Vision-Language Pre-Training »
  Juncheng Li · Xin He · Longhui Wei · Long Qian · Linchao Zhu · Lingxi Xie · Yueting Zhuang · Qi Tian · Siliang Tang
- 2021 Poster: Learning to Generate Visual Questions with Noisy Supervision »
  Shen Kai · Lingfei Wu · Siliang Tang · Yueting Zhuang · Zhen He · Zhuoye Ding · Yun Xiao · Bo Long
- 2021 Poster: Rectifying the Shortcut Learning of Background for Few-Shot Learning »
  Xu Luo · Longhui Wei · Liangjian Wen · Jinrong Yang · Lingxi Xie · Zenglin Xu · Qi Tian
- 2021 Poster: Learning High-Precision Bounding Box for Rotated Object Detection via Kullback-Leibler Divergence »
  Xue Yang · Xiaojiang Yang · Jirui Yang · Qi Ming · Wentao Wang · Qi Tian · Junchi Yan
- 2020 Poster: Self-Adaptively Learning to Demoiré from Focused and Defocused Image Pairs »
  Lin Liu · Shanxin Yuan · Jianzhuang Liu · Liping Bao · Gregory Slabaugh · Qi Tian
- 2020 Poster: One-bit Supervision for Image Classification »
  Hengtong Hu · Lingxi Xie · Zewei Du · Richang Hong · Qi Tian
- 2020 Poster: Pixel-Level Cycle Association: A New Perspective for Domain Adaptive Semantic Segmentation »
  Guoliang Kang · Yunchao Wei · Yi Yang · Yueting Zhuang · Alexander Hauptmann
- 2020 Oral: Pixel-Level Cycle Association: A New Perspective for Domain Adaptive Semantic Segmentation »
  Guoliang Kang · Yunchao Wei · Yi Yang · Yueting Zhuang · Alexander Hauptmann
- 2019 Poster: Connective Cognition Network for Directional Visual Commonsense Reasoning »
  Aming Wu · Linchao Zhu · Yahong Han · Yi Yang
- 2019 Poster: Information Competing Process for Learning Diversified Representations »
  Jie Hu · Rongrong Ji · ShengChuan Zhang · Xiaoshuai Sun · Qixiang Ye · Chia-Wen Lin · Qi Tian
- 2018 Poster: MacNet: Transferring Knowledge from Machine Comprehension to Sequence-to-Sequence Models »
  Boyuan Pan · Yazheng Yang · Hao Li · Zhou Zhao · Yueting Zhuang · Deng Cai · Xiaofei He