Poster
Modality-Independent Teachers Meet Weakly-Supervised Audio-Visual Event Parser
Yung-Hsuan Lai · Yen-Chun Chen · Frank Wang
Audio-visual learning has been a major pillar of multi-modal machine learning, where the community has mostly focused on its $\textit{modality-aligned}$ setting, $\textit{i.e.}$, the audio and visual modality are $\textit{both}$ assumed to signal the prediction target. With the Look, Listen, and Parse dataset (LLP), we investigate the under-explored $\textit{unaligned}$ setting, where the goal is to recognize audio and visual events in a video with only weak labels observed. Such weak video-level labels only tell what events happen, without specifying the modality in which they are perceived (audio, visual, or both). To enhance learning in this challenging setting, we incorporate large-scale contrastively pre-trained models as modality teachers. A simple, effective, and generic method, termed $\textbf{V}$isual-$\textbf{A}$udio $\textbf{L}$abel Elab$\textbf{or}$ation (VALOR), is introduced to harvest modality labels for the training events. Empirical studies show that the harvested labels significantly improve an attentional baseline by $\textbf{8.0}$ in average F-score (Type@AV). Surprisingly, we find that modality-independent teachers outperform their modality-fused counterparts, since they are insulated from noise in the other, potentially unaligned modality. Moreover, our best model achieves a new state of the art on all LLP metrics by a substantial margin ($\textbf{+5.4}$ F-score for Type@AV). VALOR further generalizes to Audio-Visual Event Localization and achieves a new state of the art there as well.
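The label-harvesting idea can be made concrete with a small sketch. The snippet below is illustrative only and not the paper's exact procedure: it assumes frozen CLIP-style (visual) and CLAP-style (audio) teachers whose per-segment and class-name embeddings have already been extracted, and the function names, dimensions, and threshold are hypothetical placeholders. It shows how per-segment modality pseudo-labels could be restricted to the events listed in the weak video-level label.

```python
# Minimal sketch of harvesting modality labels with modality-independent
# teachers. All names and the threshold below are illustrative assumptions.
import numpy as np

def cosine_sim(a, b):
    """Cosine similarity between rows of a (T, D) and rows of b (C, D)."""
    a = a / np.linalg.norm(a, axis=-1, keepdims=True)
    b = b / np.linalg.norm(b, axis=-1, keepdims=True)
    return a @ b.T  # (T, C)

def harvest_modality_labels(vis_emb, aud_emb, vis_text_emb, aud_text_emb,
                            weak_labels, thresh=0.25):
    """
    vis_emb:      (T, Dv) per-segment embeddings from a visual teacher
    aud_emb:      (T, Da) per-segment embeddings from an audio teacher
    vis_text_emb: (C, Dv) class-name embeddings from the visual teacher's text encoder
    aud_text_emb: (C, Da) class-name embeddings from the audio teacher's text encoder
    weak_labels:  indices of events known (from the weak label) to occur in the video
    Returns (T, C) boolean masks for visible and audible events per segment.
    """
    num_classes = vis_text_emb.shape[0]
    vis_score = cosine_sim(vis_emb, vis_text_emb)  # teacher scores, (T, C)
    aud_score = cosine_sim(aud_emb, aud_text_emb)  # teacher scores, (T, C)

    # Only events in the weak video-level label may be assigned;
    # all other classes remain negative.
    allowed = np.zeros(num_classes, dtype=bool)
    allowed[weak_labels] = True

    visual_pseudo = (vis_score > thresh) & allowed  # "visible here" pseudo-labels
    audio_pseudo = (aud_score > thresh) & allowed   # "audible here" pseudo-labels
    return visual_pseudo, audio_pseudo

# Toy usage with random features; in practice the embeddings would come
# from the frozen teachers applied to 1-second video segments.
rng = np.random.default_rng(0)
v_pseudo, a_pseudo = harvest_modality_labels(
    vis_emb=rng.normal(size=(10, 512)),       # 10 segments, visual teacher
    aud_emb=rng.normal(size=(10, 512)),       # 10 segments, audio teacher
    vis_text_emb=rng.normal(size=(25, 512)),  # 25 LLP event classes
    aud_text_emb=rng.normal(size=(25, 512)),
    weak_labels=[3, 7],                       # events in the weak video label
)
```

Because each teacher scores its own modality in isolation, a segment where an event is audible but not visible is not forced to carry a visual label, which is the intuition behind modality-independent teachers being robust to the unaligned modality.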
Author Information
Yung-Hsuan Lai (National Taiwan University)
Yen-Chun Chen (Microsoft)
Frank Wang (NVIDIA & NTU)
More from the Same Authors
- 2021: VALUE: A Multi-Task Benchmark for Video-and-Language Understanding Evaluation
  Linjie Li · Jie Lei · Zhe Gan · Licheng Yu · Yen-Chun Chen · Rohit Pillai · Yu Cheng · Luowei Zhou · Xin Wang · William Yang Wang · Tamara L Berg · Mohit Bansal · Jingjing Liu · Lijuan Wang · Zicheng Liu
- 2021 Spotlight: Adversarial Teacher-Student Representation Learning for Domain Generalization
  Fu-En Yang · Yuan-Chia Cheng · Zu-Yun Shiau · Frank Wang
- 2023 Poster: Uni-ControlNet: All-in-One Control to Text-to-Image Diffusion Models
  Shihao Zhao · Dongdong Chen · Yen-Chun Chen · Jianmin Bao · Shaozhe Hao · Lu Yuan · Kwan-Yee K. Wong
- 2022 Poster: SPoVT: Semantic-Prototype Variational Transformer for Dense Point Cloud Semantic Completion
  Sheng Yu Huang · Hao-Yu Hsu · Frank Wang
- 2022 Poster: Paraphrasing Is All You Need for Novel Object Captioning
  Cheng-Fu Yang · Yao-Hung Hubert Tsai · Wan-Cyuan Fan · Russ Salakhutdinov · Louis-Philippe Morency · Frank Wang
- 2022 Poster: GLIPv2: Unifying Localization and Vision-Language Understanding
  Haotian Zhang · Pengchuan Zhang · Xiaowei Hu · Yen-Chun Chen · Liunian Li · Xiyang Dai · Lijuan Wang · Lu Yuan · Jenq-Neng Hwang · Jianfeng Gao
- 2021 Poster: Adversarial Teacher-Student Representation Learning for Domain Generalization
  Fu-En Yang · Yuan-Chia Cheng · Zu-Yun Shiau · Frank Wang
- 2020 Poster: Large-Scale Adversarial Training for Vision-and-Language Representation Learning
  Zhe Gan · Yen-Chun Chen · Linjie Li · Chen Zhu · Yu Cheng · Jingjing Liu
- 2020 Spotlight: Large-Scale Adversarial Training for Vision-and-Language Representation Learning
  Zhe Gan · Yen-Chun Chen · Linjie Li · Chen Zhu · Yu Cheng · Jingjing Liu
- 2018 Poster: A Unified Feature Disentangler for Multi-Domain Image Translation and Manipulation
  Alexander H. Liu · Yen-Cheng Liu · Yu-Ying Yeh · Frank Wang