Counterfactual Contrastive Learning for Weakly-Supervised Vision-Language Grounding
Zhu Zhang · Zhou Zhao · Zhijie Lin · Jieming Zhu · Xiuqiang He

Tue Dec 08 09:00 PM -- 11:00 PM (PST) @ Poster Session 2 #628

Weakly-supervised vision-language grounding (WSVLG) aims to localize a target moment in a video or a specific region in an image according to a given sentence query, where only video-level or image-level sentence annotations are provided during training. Most existing approaches employ MIL-based or reconstruction-based paradigms for the WSVLG task, but the former depends heavily on the quality of randomly selected negative samples, and the latter cannot directly optimize the visual-textual alignment score. In this paper, we propose a novel Counterfactual Contrastive Learning (CCL) paradigm to develop sufficient contrastive training between counterfactual positive and negative results, which are based on robust and destructive counterfactual transformations. Concretely, we design three counterfactual transformation strategies at the feature, interaction, and relation levels: the feature-level method damages the visual features of selected proposals, the interaction-level approach confuses the vision-language interaction, and the relation-level strategy destroys the context clues in proposal relationships. Extensive experiments on five vision-language grounding datasets verify the effectiveness of our CCL paradigm.
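
The feature-level idea in the abstract can be illustrated with a toy sketch. This is not the authors' implementation. It assumes that the robust transformation masks inessential proposals (giving the counterfactual positive), that the destructive transformation masks critical proposals (giving the counterfactual negative), and that the two results are compared with a margin ranking loss. The abstract specifies none of these details. The function names, the dot-product scorer, the caller-supplied proposal indices, and the margin value are all hypothetical.

```python
def score(proposals, query):
    """Toy alignment score: mean dot product between each proposal and the query.
    (Hypothetical stand-in for a learned vision-language grounding model.)"""
    total = 0.0
    for p in proposals:
        total += sum(p_i * q_i for p_i, q_i in zip(p, query))
    return total / len(proposals)


def mask_proposals(proposals, indices):
    """Feature-level transformation (assumed form): zero out the visual
    features of the selected proposals and leave the others unchanged."""
    chosen = set(indices)
    return [[0.0] * len(p) if i in chosen else list(p)
            for i, p in enumerate(proposals)]


def counterfactual_margin_loss(proposals, query, critical, inessential, margin=0.5):
    """Contrast a counterfactual positive against a counterfactual negative.

    critical / inessential are proposal indices supplied by the caller;
    how they are selected is outside the scope of this sketch.
    """
    positive = mask_proposals(proposals, inessential)  # robust: damage unimportant proposals
    negative = mask_proposals(proposals, critical)     # destructive: damage important proposals
    s_pos = score(positive, query)
    s_neg = score(negative, query)
    # Margin ranking loss (an assumption): push s_pos above s_neg by at least `margin`.
    loss = max(0.0, margin - s_pos + s_neg)
    return loss, s_pos, s_neg


# Three toy proposals with 2-d features; proposal 0 matches the query best.
proposals = [[1.0, 0.0], [0.0, 1.0], [0.1, 0.1]]
query = [1.0, 0.0]
loss, s_pos, s_neg = counterfactual_margin_loss(
    proposals, query, critical=[0], inessential=[1]
)
print(loss, s_pos, s_neg)
```

In this toy setting, masking the inessential proposal leaves the alignment score unchanged, while masking the critical proposal lowers it. The loss is nonzero only while the gap between the two scores is smaller than the margin.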

Author Information

Zhu Zhang (Zhejiang University)
Zhou Zhao (Zhejiang University)
Zhijie Lin (Zhejiang University)
Jieming Zhu (Huawei Noah's Ark Lab)
Xiuqiang He (Huawei Noah's Ark Lab)

More from the Same Authors