Workshop
Audio Imagination: NeurIPS 2024 Workshop AI-Driven Speech, Music, and Sound Generation
Anurag Kumar · Zhaoheng Ni · Shinji Watanabe · Wenwu Wang · Yapeng Tian · Berrak Sisman
Sat 14 Dec, 8:15 a.m. PST
Generative AI has been at the forefront of AI research in recent years. A large body of work across modalities (e.g., text, image, and audio) has shown remarkable generation capabilities. Audio generation brings its own unique challenges, and this workshop aims to highlight these challenges and their solutions. It will bring together researchers working on different audio generation problems and enable a concentrated discussion of the topic. The workshop will include invited talks, high-quality papers presented through oral and poster sessions, and a panel discussion with experts in the area to further enrich the discussion of audio generation research. A crucial part of audio generation research is how humans perceive the generated audio. To support this, we also propose an onsite demo session during the workshop where presenters can showcase their audio generation methods and technologies, leading to a unique experience for all workshop participants.
Schedule
Sat 8:15 a.m. - 8:30 a.m. | Welcome and opening remarks (Opening)
Sat 8:30 a.m. - 9:00 a.m. | Alexis Conneau (Invited Talk) | Alexis Conneau
Sat 9:00 a.m. - 9:30 a.m. | Joon Son Chung (Invited Talk) | Joon Son Chung
Sat 9:30 a.m. - 9:45 a.m. | Improving Musical Accompaniment Co-creation via Diffusion Transformers (Oral) | Javier Nistal · Marco Pasini · Stefan Lattner
Sat 9:45 a.m. - 10:00 a.m. | AV-DiT: Efficient Audio-Visual Diffusion Transformer for Joint Audio and Video Generation (Oral) | Kai Wang · Shijian Deng · Jing Shi · Dimitrios Hatzinakos · Yapeng Tian
Sat 10:00 a.m. - 10:15 a.m. | AudioSetCaps: Enriched Audio Captioning Dataset Generation Using Large Audio Language Models (Oral) | Jisheng Bai · Haohe Liu · Mou Wang · Dongyuan Shi · Wenwu Wang · Mark Plumbley · Woon-Seng Gan · Jianfeng Chen
Sat 10:15 a.m. - 10:30 a.m. | Short Break
Sat 10:30 a.m. - 12:00 p.m. | VALL-E R: Robust and Efficient Zero-Shot Text-to-Speech Synthesis via Monotonic Alignment (Poster+Demo Session) | Bing Han · Long Zhou · Shujie LIU · Sanyuan Chen · Lingwei Meng · Yanmin Qian · Eric Liu · sheng zhao · Jinyu Li · Furu Wei
Sat 10:30 a.m. - 12:00 p.m. | Decoding Musical Perception: Music Stimuli Reconstruction from Brain Activity (Poster+Demo Session) | Matteo Ciferri · Matteo Ferrante · Nicola Toschi
Sat 10:30 a.m. - 12:00 p.m. | Neural Audio Codec for Latent Music Representations (Poster+Demo Session) | Luca Lanzendörfer · Florian Grötschla · Amir Dellali · Roger Wattenhofer
Sat 10:30 a.m. - 12:00 p.m. | Do music LLMs learn symbolic concepts? A pilot study using probing and intervention (Poster+Demo Session) | Wenye Ma · Xinyue Li · Gus Xia
Sat 10:30 a.m. - 12:00 p.m. | Improving Voice Quality in Speech Anonymization With Just Perception-Informed Losses (Poster+Demo Session) | Suhita Ghosh · Frank Dreyer · Tim Thiele · Frederic Lorbeer · Sebastian Stober
Sat 10:30 a.m. - 12:00 p.m. | Contextual Speech Emotion Recognition with Large Language Models and ASR-Based Transcriptions (Poster+Demo Session) | Enshi Zhang · Christian Poellabauer
Sat 10:30 a.m. - 12:00 p.m. | What do MLLMs hear? Examining the interaction between LLM and audio encoder components in Multimodal Large Language Models (Poster+Demo Session) | Enis Çoban · Michael Mandel · Johanna Devaney
Sat 10:30 a.m. - 12:00 p.m. | Articulatory Synthesis of Speech and Diverse Vocal Sounds via Optimization (Poster+Demo Session) | Luke Mo · Manuel Cherep · Nikhil Singh · Quinn Langford · Patricia Maes
Sat 10:30 a.m. - 12:00 p.m. | Internalizing ASR with Implicit Chain of Thought for Efficient Speech-to-Speech Conversational LLM (Poster+Demo Session) | Robin Shing-Hei Yuen · Timothy Tse · Jian Zhu
Sat 10:30 a.m. - 12:00 p.m. | A Closer Look at Neural Codec Resynthesis: Bridging the Gap between Codec and Waveform Generation (Poster+Demo Session) | Alexander Liu · Qirui Wang · Yuan Gong · Jim Glass
Sat 10:30 a.m. - 12:00 p.m. | Continuous Autoregressive Models with Noise Augmentation Avoid Error Accumulation (Poster+Demo Session) | Marco Pasini · Javier Nistal · Stefan Lattner · George Fazekas
Sat 10:30 a.m. - 12:00 p.m. | One-shot Text-aligned Virtual Instrument Generation Utilizing Diffusion Transformer (Poster+Demo Session) | Qihui Yang · Jiahe Lei · Qiuqiang Kong
Sat 10:30 a.m. - 12:00 p.m. | Three-modal guidance for symbolic music generation: melody, structure, texture (Poster+Demo Session) | Daniel Lucht · David Leins · Dimitri von Rütte · Alexandra Moringen
Sat 10:30 a.m. - 12:00 p.m. | Parrot: Autoregressive Spoken Dialogue Language Modeling with Decoder-only Transformers (Poster+Demo Session) | Ziqiao Meng · Qichao Wang · Wenqian Cui · Yifei Zhang · Bingzhe Wu · Irwin King · Liang Chen · Peilin Zhao
Sat 10:30 a.m. - 12:00 p.m. | Challenge on Sound Scene Synthesis: Evaluating Text-to-Audio Generation (Poster+Demo Session) | Junwon Lee · Modan Tailleur · Laurie Heller · Keunwoo Choi · Mathieu Lagrange · Brian McFee · Keisuke Imoto · Yuki Okamoto
Sat 10:30 a.m. - 12:00 p.m. | High Fidelity Text-Guided Music Editing via Single-Stage Flow Matching (Poster+Demo Session) | Gael Le Lan · Bowen Shi · Zhaoheng Ni · Sidd Srinivasan · Anurag Kumar · Brian Ellis · David Kant · Varun Nagaraja · Ernie Chang · Wei-Ning Hsu · Yangyang Shi · Vikas Chandra
Sat 10:30 a.m. - 12:00 p.m. | SNAC: Multi-Scale Neural Audio Codec (Poster+Demo Session) | Hubert Siuzdak · Florian Grötschla · Luca Lanzendörfer
Sat 10:30 a.m. - 12:00 p.m. | Text Prompt is Not Enough: Sound Event Enhanced Prompt Adapter for Target Style Audio Generation (Poster+Demo Session) | Chenxu Xiong · Ruibo Fu · Shuchen Shi · Zhengqi Wen · Tao Wang · Chenxing Li · Chunyu Qiang · Yuankun Xie · XinQi · Guanjun Li · Zizheng Yang
Sat 10:30 a.m. - 12:00 p.m. | Latent Diffusion Model for Audio: Generation, Quality Enhancement, and Neural Audio Codec (Poster+Demo Session) | Haohe Liu · Wenwu Wang · Mark Plumbley
Sat 10:30 a.m. - 12:00 p.m. | 3D Audio-Visual Segmentation (Poster+Demo Session) | Artem Sokolov · Swapnil Bhosale · Xiatian Zhu
Sat 10:30 a.m. - 12:00 p.m. | Decoding Strategy with Perceptual Rating Prediction for Language Model-Based Text-to-Speech Synthesis (Poster+Demo Session) | Kazuki Yamauchi · Wataru Nakata · Yuki Saito · Hiroshi Saruwatari
Sat 10:30 a.m. - 12:00 p.m. | Efficient Generative Multimodal Integration (EGMI): Enabling Audio Generation from Text-Image Pairs through Alignment with Large Language Models (Poster+Demo Session) | Taemin Kim · Wooyeol Baek · Heeseok Oh
Sat 10:30 a.m. - 12:00 p.m. | MusicScore: A Dataset for Music Score Modeling and Generation (Poster+Demo Session) | Yuheng Lin · Zheqi DAI · Qiuqiang Kong
Sat 10:30 a.m. - 12:00 p.m. | Improving Musical Accompaniment Co-creation via Diffusion Transformers (Poster+Demo Session)
Sat 12:00 p.m. - 1:30 p.m. | Lunch Break
Sat 1:30 p.m. - 1:45 p.m. | LOCKEY: A Novel Approach to Model Authentication and Deepfake Tracking (Oral) | Mayank Kumar Singh · Naoya Takahashi · Wei-Hsiang Liao · Yuki Mitsufuji
Sat 1:45 p.m. - 2:00 p.m. | BLAP: Bootstrapping Language-Audio Pre-training for Music Captioning (Oral) | Luca Lanzendörfer · Constantin Pinkl · Nathanael Perraudin · Roger Wattenhofer
Sat 2:00 p.m. - 2:15 p.m. | Improving Source Extraction with Diffusion and Consistency Models (Oral) | Tornike Karchkhadze · Mohammad Rasool Izadi · Shuo Zhang
Sat 2:15 p.m. - 2:45 p.m. | Yao Xie (Invited Talk) | Yao Xie
Sat 2:45 p.m. - 3:15 p.m. | Vikas Chandra (Invited Talk) | Vikas Chandra
Sat 3:15 p.m. - 3:30 p.m. | Short Break
Sat 3:30 p.m. - 4:00 p.m. | Panel Discussion
Sat 4:00 p.m. - 4:15 p.m. | Closing Remarks
Sat 4:15 p.m. - 5:30 p.m. | Text-to-Audio Generation via Bridging Audio Language Model and Latent Diffusion (Poster+Demo Session) | ZHENYU WANG · Chenxing Li · YONG XU · Chunlei Zhang · John H. L. Hansen · Dong Yu
Sat 4:15 p.m. - 5:30 p.m. | Diffusion-based Speech Enhancement: Demonstration of Performance and Generalization (Poster+Demo Session) | Julius Richter · Timo Gerkmann
Sat 4:15 p.m. - 5:30 p.m. | Contrastive Lyrics Alignment with a Timestamp-Informed Loss (Poster+Demo Session) | Timon Kick · Florian Grötschla · Luca Lanzendörfer · Roger Wattenhofer
Sat 4:15 p.m. - 5:30 p.m. | Generating Vocals from Lyrics and Musical Accompaniment (Poster+Demo Session) | Georg Streich · Luca Lanzendörfer · Florian Grötschla · Roger Wattenhofer
Sat 4:15 p.m. - 5:30 p.m. | Sound-VECaps: Improving Audio Generation With Visual Enhanced Captions (Poster+Demo Session) | Yi Yuan · Dongya Jia · Xiaobin Zhuang · Yuanzhe Chen · Zhengxi Liu · Zhuo Chen · Wang Yuping · Yuxuan Wang · Xubo Liu · Xiyuan Kang · Mark Plumbley · Wenwu Wang
Sat 4:15 p.m. - 5:30 p.m. | DGFM: Full Body Dance Generation Driven by Music Foundation Models (Poster+Demo Session) | Xinran Liu · Zhenhua Feng · Diptesh Kanojia · Wenwu Wang
Sat 4:15 p.m. - 5:30 p.m. | MLADDC: Multi-Lingual Audio Deepfake Detection Corpus (Poster+Demo Session) | ARTH SHAH · Ravindrakumar M. Purohit · Dharmendra Vaghera · Hemant Patil
Sat 4:15 p.m. - 5:30 p.m. | Multi-Source Music Generation with Latent Diffusion (Poster+Demo Session) | Zhongweiyang Xu · Debottam Dutta · Yu-Lin Wei · Romit Roy Choudhury
Sat 4:15 p.m. - 5:30 p.m. | Disentangling Multi-instrument Music Audio for Source-level Pitch and Timbre Manipulation (Poster+Demo Session) | Yin-Jyun Luo · Kin Wai Cheuk · Woosung Choi · Wei-Hsiang Liao · Keisuke Toyama · Toshimitsu Uesaka · Koichi Saito · Chieh-Hsin Lai · Yuhta Takida · Simon Dixon · Yuki Mitsufuji
Sat 4:15 p.m. - 5:30 p.m. | Spatially-Aware Losses for Enhanced Neural Acoustic Fields (Poster+Demo Session) | Christopher Ick · Gordon Wichern · Yoshiki Masuyama · François Germain · Jonathan Le Roux
Sat 4:15 p.m. - 5:30 p.m. | DART: Disentanglement of Accent and Speaker Representation in Multispeaker Text-to-Speech (Poster+Demo Session) | Jan Melechovsky · Ambuj Mehrish · Berrak Sisman · Dorien Herremans
Sat 4:15 p.m. - 5:30 p.m. | Style Mixture of Experts for Expressive Text-To-Speech Synthesis (Poster+Demo Session) | Ahad Jawaid · Shreeram Suresh Chandra · Junchen Lu · Berrak Sisman
Sat 4:15 p.m. - 5:30 p.m. | Vision Language Models Are Few-Shot Audio Spectrogram Classifiers (Poster+Demo Session) | Satvik Dixit · Laurie Heller · Chris Donahue
Sat 4:15 p.m. - 5:30 p.m. | SoundCTM: Uniting Score-based and Consistency Models for Text-to-Sound Generation (Poster+Demo Session) | Koichi Saito · Dongjun Kim · Takashi Shibuya · Chieh-Hsin Lai · Zhi Zhong · Yuhta Takida · Yuki Mitsufuji
Sat 4:15 p.m. - 5:30 p.m. | FSD: Acoustic Echo Cancellation with Fewer Step Diffusion (Poster+Demo Session) | Yang Liu · Li Wan · Yiteng Huang · Ming Sun · Changsheng Zhao · Zhaoheng Ni · Xinhao Mei · Yangyang Shi · Florian Metze
Sat 4:15 p.m. - 5:30 p.m. | Towards Temporally Synchronized Visually Indicated Sounds Through Scale-Adapted Positional Embeddings (Poster+Demo Session) | Xinhao Mei · Gael Le Lan · Haohe Liu · Zhaoheng Ni · Varun Nagaraja · Anurag Kumar · Yangyang Shi · Vikas Chandra
Sat 4:15 p.m. - 5:30 p.m. | LoVA: Long-form Video-to-Audio Generation (Poster+Demo Session) | Xin Cheng · Xihua Wang · Yihan Wu · Yuyue Wang · Ruihua Song
Sat 4:15 p.m. - 5:30 p.m. | Coarse-to-Fine Text-to-Music Latent Diffusion (Poster+Demo Session) | Luca Lanzendörfer · Tongyu Lu · Nathanael Perraudin · Dorien Herremans · Roger Wattenhofer
Sat 4:15 p.m. - 5:30 p.m. | Benchmarking Music Generation Models and Metrics via Human Preference Studies (Poster+Demo Session) | Ahmet Solak · Florian Grötschla · Luca Lanzendörfer · Roger Wattenhofer
Sat 4:15 p.m. - 5:30 p.m. | AV-DiT: Efficient Audio-Visual Diffusion Transformer for Joint Audio and Video Generation (Poster+Demo Session)
Sat 4:15 p.m. - 5:30 p.m. | AudioSetCaps: Enriched Audio Captioning Dataset Generation Using Large Audio Language Models (Poster+Demo Session)
Sat 4:15 p.m. - 5:30 p.m. | LOCKEY: A Novel Approach to Model Authentication and Deepfake Tracking (Poster+Demo Session)
Sat 4:15 p.m. - 5:30 p.m. | BLAP: Bootstrapping Language-Audio Pre-training for Music Captioning (Poster+Demo Session)
Sat 4:15 p.m. - 5:30 p.m. | Improving Source Extraction with Diffusion and Consistency Models (Poster+Demo Session)