
BinauralGrad: A Two-Stage Conditional Diffusion Probabilistic Model for Binaural Audio Synthesis
Yichong Leng · Zehua Chen · Junliang Guo · Haohe Liu · Jiawei Chen · Xu Tan · Danilo Mandic · Lei He · Xiangyang Li · Tao Qin · Sheng Zhao · Tie-Yan Liu

Wed Dec 07 05:00 PM -- 07:00 PM (PST)
Binaural audio plays a significant role in constructing immersive augmented and virtual realities. As it is expensive to record binaural audio from the real world, synthesizing it from mono audio has attracted increasing attention. This synthesis process involves not only the basic physical warping of the mono audio, but also room reverberation and head/ear-related filtration, which are difficult to simulate accurately with traditional digital signal processing. In this paper, we formulate the synthesis process from a different perspective by decomposing the binaural audio into a common part that is shared by the left and right channels and a specific part that differs between the channels. Accordingly, we propose BinauralGrad, a novel two-stage framework equipped with diffusion models to synthesize these parts respectively. Specifically, in the first stage, the common information of the binaural audio is generated with a single-channel diffusion model conditioned on the mono audio, based on which the binaural audio is generated by a two-channel diffusion model in the second stage. Combining this novel two-stage synthesis perspective with advanced generative models (i.e., diffusion models), the proposed BinauralGrad is able to generate accurate and high-fidelity binaural audio samples. Experimental results show that on a benchmark dataset, BinauralGrad outperforms the existing baselines by a large margin in terms of both objective and subjective evaluation metrics (Wave L2: $0.128$ vs. $0.157$, MOS: $3.80$ vs. $3.61$). The generated audio samples\footnote{\url{https://speechresearch.github.io/binauralgrad}} and code\footnote{\url{https://github.com/microsoft/NeuralSpeech/tree/master/BinauralGrad}} are available online.
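The decomposition described in the abstract can be illustrated with a minimal sketch. Note that the abstract does not specify how the common part is computed; this example assumes the channel mean as the common part, with per-channel residuals as the specific parts, purely for illustration.

```python
import numpy as np

def decompose_binaural(left: np.ndarray, right: np.ndarray):
    """Split binaural audio into a common part shared by both channels
    and channel-specific residuals (illustrative assumption: channel mean)."""
    common = 0.5 * (left + right)      # part shared by both channels
    left_specific = left - common      # residual unique to the left channel
    right_specific = right - common    # residual unique to the right channel
    return common, left_specific, right_specific

# Reconstruction is exact: each channel = common part + its specific part.
left = np.array([0.1, 0.4, -0.2])
right = np.array([0.0, 0.5, -0.1])
common, ls, rs = decompose_binaural(left, right)
assert np.allclose(common + ls, left)
assert np.allclose(common + rs, right)
```

Under this reading, the first-stage diffusion model would target the common signal and the second stage would refine it into the two full channels.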

Author Information

Yichong Leng (University of Science and Technology of China)
Zehua Chen (Imperial College London)
Junliang Guo (Microsoft Research Asia)
Haohe Liu (University of Surrey)

I’m Haohe Liu, a first-year Ph.D. student at the Centre for Vision, Speech and Signal Processing (CVSSP), University of Surrey. My research covers topics related to speech, music, and general audio. At the University of Surrey, I am fortunate to be co-advised by Prof. Mark D. Plumbley and Prof. Wenwu Wang, and to be jointly funded by BBC Research & Development (R&D) and the Doctoral College. At CVSSP, I work on the AI for Sound project, whose goal is to develop new methods for automatically labeling sound environments and events in broadcast audio, assisting production staff in finding and searching through content, and helping the general public access archive content. I also work closely with the BBC R&D Audio Team on putting our audio recognition algorithms into production, such as generating machine tags for the BBC sound effects library.

Jiawei Chen (South China University of Technology)
Xu Tan (Microsoft Research)
Danilo Mandic (Imperial College London)
Lei He (Microsoft)
Xiangyang Li (University of Science and Technology of China)
Tao Qin (Microsoft Research)
Sheng Zhao (Tsinghua University)
Tie-Yan Liu (Microsoft Research)
