Fine-Grained Semantically Aligned Vision-Language Pre-Training
Juncheng Li · XIN HE · Longhui Wei · Long Qian · Linchao Zhu · Lingxi Xie · Yueting Zhuang · Qi Tian · Siliang Tang

Thu Dec 08 09:00 AM -- 11:00 AM (PST)

Large-scale vision-language pre-training has shown impressive advances in a wide range of downstream tasks. Existing methods mainly model cross-modal alignment either through the similarity of global image and text representations or through advanced cross-modal attention over image and text features. However, they fail to explicitly learn the fine-grained semantic alignment between visual regions and textual phrases, because only global image-text alignment information is available. In this paper, we introduce LOUPE, a fine-grained semantically aLigned visiOn-langUage PrE-training framework, which learns fine-grained semantic alignment from the novel perspective of game-theoretic interactions. To efficiently estimate the game-theoretic interactions, we further propose an uncertainty-aware neural Shapley interaction learning module. Experiments show that LOUPE achieves state-of-the-art performance on a variety of vision-language tasks. Without any object-level human annotations or fine-tuning, LOUPE achieves competitive performance on object detection and visual grounding. More importantly, LOUPE opens a promising new direction of learning fine-grained semantics from large-scale raw image-text pairs.

Author Information

Juncheng Li (Zhejiang University)
XIN HE (Wuhan University of Technology)
Longhui Wei (Peking University)
Long Qian (Zhejiang University)
Linchao Zhu (Zhejiang University)
Lingxi Xie (Huawei Noah's Ark Lab)
Yueting Zhuang (Zhejiang University)
Qi Tian (Huawei Noah’s Ark Lab)
Siliang Tang (Zhejiang University)

Dr. Siliang Tang is currently an associate professor at the College of Computer Science, Zhejiang University. He received his Ph.D. from the National University of Ireland, Maynooth, Ireland, in 2012. His research interests include information extraction, knowledge base construction, and multimodal data analysis. So far, he has published more than 70 papers in top-tier scientific conferences and journals such as AAAI, IJCAI (Artificial Intelligence); ACL, EMNLP, NAACL, SIGIR, IEEE TKDE (NLP and Information Extraction); ACM MM, CVPR, IEEE Trans. on Multimedia (Multimodal Understanding and Reasoning); IEEE Trans. on Image Processing, IEEE Trans. on Circuits and Systems for Video Technology (Image Processing and Understanding); IEEE VIS, IEEE Trans. on Visualization and Computer Graphics (Data Visualization). He has served as an area chair or program committee member for conferences such as NIPS, ICML, AAAI, IJCAI, ACL, EMNLP, and NAACL, and as a reviewer for journals such as IEEE TIP, IEEE TMM, IEEE TSMC, ACM Computing Surveys, and Nature - Scientific Reports. At present, he is mainly working for CKCEST on projects for automatic knowledge base construction from semi-structured and unstructured text. He and his team have participated in the NIST TAC (https://tac.nist.gov/) competition for knowledge base population since 2015 and have placed in the top three in several tracks (English EDL 2016, 1st place; TEDL 2017, 2nd place; DDI 2018, tasks 1 & 2, 1st place).
