Discriminative Sounding Objects Localization via Self-supervised Audiovisual Matching
Di Hu · Rui Qian · Minyue Jiang · Xiao Tan · Shilei Wen · Errui Ding · Weiyao Lin · Dejing Dou

Wed Dec 09 09:00 AM -- 11:00 AM (PST) @ Poster Session 3 #1016

Discriminatively localizing sounding objects in cocktail-party scenes, i.e., scenes with mixed sounds, is commonplace for humans but still challenging for machines. In this paper, we propose a two-stage learning framework to perform self-supervised class-aware sounding object localization. First, we propose to learn robust object representations by aggregating the candidate sound localization results in single-source scenes. Then, class-aware object localization maps are generated in the cocktail-party scenarios by referring to the pre-learned object knowledge, and the sounding objects are accordingly selected by matching audio and visual object category distributions, where the audiovisual consistency is viewed as the self-supervised signal. Experimental results on both realistic and synthesized cocktail-party videos demonstrate that our model is superior in filtering out silent objects and pointing out the location of sounding objects of different classes. Code is available at https://github.com/DTaoo/Discriminative-Sounding-Objects-Localization.
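
The distribution-matching idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation (see the linked repository for that). It assumes the class-aware localization maps are given as nested lists, that the visual category distribution comes from average pooling followed by a softmax, and that consistency is measured with a KL divergence. The function names and the `threshold` parameter are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def visual_distribution(class_maps):
    """Average-pool each class map (H x W nested list), then normalize
    across classes. Assumed pooling scheme, for illustration only."""
    pooled = [
        sum(sum(row) for row in cmap) / (len(cmap) * len(cmap[0]))
        for cmap in class_maps
    ]
    return softmax(pooled)


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q), used here as an assumed audiovisual consistency loss."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def select_sounding(audio_dist, threshold=0.3):
    """Return class indices whose audio probability exceeds a
    hypothetical threshold; the rest are treated as silent."""
    return [k for k, p in enumerate(audio_dist) if p > threshold]


# Toy example: two object classes on a 2 x 2 grid.
maps = [
    [[4.0, 3.0], [0.0, 1.0]],  # class 0 responds strongly
    [[0.0, 0.0], [1.0, 0.0]],  # class 1 responds weakly
]
audio = [0.9, 0.1]  # assumed output of an audio classifier

visual = visual_distribution(maps)
loss = kl_divergence(audio, visual)
sounding = select_sounding(audio)
```

In this toy setup a lower `loss` means the visual category distribution agrees more closely with the audio one, and `sounding` lists the classes that would be kept as sounding objects.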

Author Information

Di Hu (Renmin University of China)
Rui Qian (Shanghai Jiao Tong University)
Minyue Jiang (Baidu Inc.)
Xiao Tan (Baidu Inc.)
Shilei Wen (Baidu)
Errui Ding (Baidu Inc.)
Weiyao Lin (Shanghai Jiao Tong University)
Dejing Dou (Baidu)
