Workshop
Machine Learning for Audio
Brian Kulis · Sadie Allen · Sander Dieleman · Shrikanth Narayanan · Rachel Manzelli · Alice Baird · Alan Cowen
Room 228 - 230
The Machine Learning for Audio Workshop at NeurIPS 2023 will bring together audio practitioners and machine learning researchers in a venue focused on a range of problems in audio, including music information retrieval, acoustic event detection, computational paralinguistics, speech transcription, multimodal modeling, and generative modeling of speech and other sounds. Our team has previously held multiple audio-related workshops at top machine learning venues, and both the organizing team and the invited speakers represent broad diversity in terms of gender identity, affiliation, seniority, and geography. We also plan to solicit workshop papers on these topics.
Schedule
Sat 6:30 a.m. - 6:40 a.m.
Opening remarks
Brian Kulis

Sat 6:40 a.m. - 7:00 a.m.
Computer Audition Disrupted 2.0: The Foundation Models Era (Invited talk)
Computer Audition is changing. Since the advent of Large Audio, Language, and Multimodal Models, or, more generally, Foundation Models, a new age has begun. The emergence of abilities in such large models through zero- or few-shot learning renders it partially unnecessary to collect task-specific data and train a corresponding model. After the last major disruption – learning representations and model architectures directly from data – this can be judged as the second major disruption in a field that was once shaped by highly specialized features, approaches, and datasets, and is now shifting towards being absorbed by the sheer size of models and of the data used for their training. In this talk, I will first argue that Computer Audition will be massively influenced by this “plate displacement” in Artificial Intelligence as a whole. I will then turn to some “informed tea-leaf reading” on how present and tomorrow’s Computer Audition will change in more detail. This includes prompt optimisation, fine-tuning, and the synergistic combination of different foundation models and traditional approaches. Finally, I will turn towards the dangers to this new glittery era – among many, the “nightshades” of audio may soon start to poison audio data. A new time has begun – it will empower Computer Audition at a whole new level while challenging us in whole new ways – so let’s get ready.
Bjoern Schuller

Sat 7:00 a.m. - 7:20 a.m.
Explainable AI for Audio via Virtual Inspection Layers (Oral)
The field of eXplainable Artificial Intelligence (XAI) has made significant advancements in recent years. However, most progress has focused on computer vision and natural language processing. There has been limited research on XAI specifically for audio or other time series data, where the input itself is often hard to interpret. In this study, we introduce a virtual inspection layer that transforms time series data into an interpretable representation and enables the use of local XAI methods to attribute relevance to this representation.
Johanna Vielhaben · Sebastian Lapuschkin · Grégoire Montavon · Wojciech Samek

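To make the idea of a virtual inspection layer concrete, here is a minimal sketch (not the authors' implementation): an STFT/iSTFT pair is inserted in front of an arbitrary waveform classifier, leaving its prediction essentially unchanged while exposing a time-frequency representation on which a simple local attribution (gradient × input, standing in for the paper's relevance method) can be computed. The `model` argument, shapes, and attribution rule are assumptions for illustration.

```python
import torch

def attribute_in_time_frequency(model, waveform, n_fft=512, hop=128):
    """Attribute a waveform model's prediction onto an STFT 'virtual inspection layer'.

    model:    any differentiable module mapping (batch, samples) -> (batch, classes)
    waveform: (batch, samples) audio tensor
    Returns a per-bin relevance map of shape (batch, freq, frames).
    """
    window = torch.hann_window(n_fft)
    spec = torch.stft(waveform, n_fft, hop_length=hop, window=window, return_complex=True)
    spec = spec.detach().requires_grad_(True)                 # inspection point
    recon = torch.istft(spec, n_fft, hop_length=hop, window=window,
                        length=waveform.shape[-1])            # iSTFT(STFT(x)) ~= x
    score = model(recon).max(dim=-1).values.sum()             # explain the top class score
    score.backward()
    # gradient x input as a simple stand-in for a local XAI method
    return (spec.grad.conj() * spec).real
```

Because the inserted transform pair is (nearly) the identity, the wrapped model behaves like the original one, but any local explanation method can now be applied at the marked inspection point.
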
Sat 7:20 a.m. - 7:40 a.m.
Self-Supervised Speech Enhancement using Multi-Modal Data (Oral)
We consider the problem of speech enhancement in earphones. While microphones are classical speech sensors, motion sensors embedded in modern earphones also pick up faint components of the user’s speech. While this faint motion data has generally been ignored, we show that it can serve as a pathway for self-supervised speech enhancement. Our proposed model is an iterative framework in which the motion data offers a hint to the microphone (in the form of an estimated posterior); the microphone SNR improves from the hint, which then helps the motion data to refine its next hint. Results show that this alternating self-supervision converges even in the presence of strong ambient noise, and the performance is comparable to supervised denoisers. When only a small amount of training data is available, our model outperforms the same denoisers.
Yu-Lin Wei · Rajalaxmi Rajagopalan · Bashima Islam · Romit Roy Choudhury

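A highly simplified sketch of the alternating self-supervision loop described in the abstract, using random tensors as stand-ins for microphone and motion-sensor spectrograms; the shapes, network, and losses are assumptions, not the paper's actual system.

```python
import torch
import torch.nn as nn

# Toy magnitude-spectrogram setup: the microphone carries speech + noise, while the
# motion (IMU) channel carries a faint, band-limited copy of the speech.
class MaskNet(nn.Module):
    """Predicts a soft time-frequency mask from magnitude frames."""
    def __init__(self, n_freq=129):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(n_freq, 256), nn.ReLU(),
                                 nn.Linear(256, n_freq), nn.Sigmoid())

    def forward(self, mag):                      # mag: (batch, frames, freq)
        return self.net(mag)

mic_net, imu_net = MaskNet(), MaskNet()
opt = torch.optim.Adam(list(mic_net.parameters()) + list(imu_net.parameters()), lr=1e-3)

mic = torch.rand(4, 100, 129)                    # stand-in noisy microphone magnitudes
imu = torch.rand(4, 100, 129)                    # stand-in faint motion-sensor magnitudes

for round_ in range(5):                          # alternating self-supervision
    hint = imu_net(imu).detach()                 # IMU branch proposes a speech "posterior"
    mic_loss = ((mic_net(mic) - hint) ** 2).mean()       # mic branch learns from the hint

    refined = mic_net(mic).detach()              # the improved mic estimate feeds back ...
    imu_loss = ((imu_net(imu) - refined) ** 2).mean()    # ... to refine the next hint

    opt.zero_grad()
    (mic_loss + imu_loss).backward()
    opt.step()

enhanced = mic_net(mic) * mic                    # masked (enhanced) microphone spectrogram
```
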
Sat 7:40 a.m. - 8:10 a.m.
A multi-view approach for audio-based speech emotion recognition (Invited talk)
The area of speech emotion recognition (SER) has seen significant advances with the wider availability of pre-trained models and embeddings, and the creation of larger publicly available corpora. In this talk we will touch upon some of the challenges that continue to riddle audio-based SER, such as domain adaptation, data augmentation and output generalization, and further discuss the advantages of a multi-view model approach, one that jointly learns from both categorical and dimensional affect labels.
Dimitra Emmanouilidou

Sat 8:10 a.m. - 8:50 a.m.
Coffee break

Sat 8:50 a.m. - 9:10 a.m.
Audio Language Models (Invited talk)
Audio analysis and audio synthesis require modeling long-term, complex phenomena and have historically been tackled in an asymmetric fashion, with specific analysis models that differ from their synthesis counterparts. In this presentation, we will introduce the concept of audio language models, a recent innovation aimed at overcoming these limitations. By discretizing audio signals using a neural audio codec, we can frame both audio generation and audio comprehension as similar autoregressive sequence-to-sequence tasks, capitalizing on the well-established Transformer architecture commonly used in language modeling. This approach unlocks novel capabilities in areas such as textless speech modeling, zero-shot voice conversion, and even text-to-music generation. Furthermore, we will illustrate how the integration of analysis and synthesis within a single model enables the creation of versatile audio models capable of handling a wide range of tasks involving audio as inputs or outputs. We will conclude by highlighting the promising prospects offered by these models and discussing the key challenges that lie ahead in their development.
Neil Zeghidour

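The core recipe described above can be sketched in a few lines: a neural audio codec turns waveforms into discrete tokens, and a decoder-only Transformer is trained to predict the next token. In this sketch the codec is replaced by random integer codes and all model sizes are arbitrary placeholders; it is not any specific published audio language model.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AudioTokenLM(nn.Module):
    """Decoder-only language model over discrete audio-codec tokens."""
    def __init__(self, vocab=1024, d_model=256, n_layers=4, n_heads=4, max_len=2048):
        super().__init__()
        self.tok = nn.Embedding(vocab, d_model)
        self.pos = nn.Embedding(max_len, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, 4 * d_model,
                                           batch_first=True, norm_first=True)
        self.backbone = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, vocab)

    def forward(self, tokens):                                   # tokens: (batch, time)
        t = tokens.shape[1]
        h = self.tok(tokens) + self.pos(torch.arange(t, device=tokens.device))
        causal = torch.full((t, t), float("-inf")).triu(1).to(tokens.device)
        return self.head(self.backbone(h, mask=causal))          # next-token logits

# Stand-in for codec output: in practice these integers come from a neural audio codec.
codes = torch.randint(0, 1024, (2, 250))
model = AudioTokenLM()
logits = model(codes[:, :-1])
loss = F.cross_entropy(logits.reshape(-1, 1024), codes[:, 1:].reshape(-1))
loss.backward()
```

Generation then works exactly as in text language modeling: sample tokens autoregressively and feed them back through the codec's decoder to obtain a waveform.
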
Sat 9:10 a.m. - 9:30 a.m.
Zero-shot audio captioning with audio-language model guidance and audio context keywords (Oral)
Zero-shot audio captioning aims at automatically generating descriptive textual captions for audio content without prior training for this task. Different from speech recognition which translates audio content that contains spoken language into text, audio captioning is commonly concerned with ambient sounds, or those produced by a human performing an action. Inspired by zero-shot image captioning methods, we propose a novel approach for understanding and summarising such general audio signals in a text caption. In particular, our framework exploits a pre-trained large language model (LLM) for generating the text, guided by a pre-trained audio-language model that steers the text generation to produce captions that describe the audio content. Additionally, we use audio context keywords that prompt the language model to generate text that is broadly relevant to sounds. Our proposed framework achieves state-of-the-art results in zero-shot audio captioning on the AudioCaps and Clotho datasets. Code will be released upon acceptance.
Leonard Salewski · Stefan Fauth · A. Sophia Koepke · Zeynep Akata

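A schematic of one step of audio-guided decoding, with the pretrained pieces abstracted away as callables: `lm_logits` are next-token logits from a frozen LLM, and `audio_text_score` is any audio-language model (CLAP-style) returning a similarity between the fixed audio clip and a candidate caption. The function names, top-k rescoring scheme, and fusion weight are illustrative assumptions rather than the paper's exact formulation.

```python
import torch

def guided_decoding_step(lm_logits, partial_caption, id_to_text, audio_text_score,
                         alpha=0.5, top_k=20):
    """Pick the next token by fusing LLM fluency with audio-text similarity.

    lm_logits:        (vocab,) next-token logits from a pretrained language model
    partial_caption:  the caption generated so far (string)
    id_to_text:       callable(token_id) -> decoded token string
    audio_text_score: callable(text) -> similarity of the audio clip to `text`
    """
    top = torch.topk(lm_logits, top_k)
    lm_logp = torch.log_softmax(lm_logits, dim=-1)[top.indices]
    guide = torch.tensor([audio_text_score(partial_caption + id_to_text(int(i)))
                          for i in top.indices])
    fused = (1.0 - alpha) * lm_logp + alpha * guide       # fluency + audio relevance
    return int(top.indices[int(torch.argmax(fused))])
```

Audio context keywords can simply be prepended to the LLM prompt before `lm_logits` is computed, which nudges the model towards sound-related vocabulary.
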
Sat 9:30 a.m. - 10:00 a.m.
Lark: A Multimodal Foundation Model for Music (Invited talk)
Music has a unique and complex structure which is challenging for both expert humans and existing AI systems to understand, and presents unique challenges relative to other forms of audio. We present LLARK, an instruction-tuned multimodal model for music understanding. We detail our process for dataset creation, which involves augmenting the annotations of diverse open-source music datasets and converting them to a unified instruction-tuning format. We propose a multimodal architecture for LLARK, integrating a pretrained generative model for music with a pretrained language model. In evaluations on three types of tasks (music understanding, captioning, and reasoning), we show that our model matches or outperforms existing baselines in zero-shot generalization for music understanding, and that humans show a high degree of agreement with the model’s responses in captioning and reasoning tasks.
Rachel Bittner

Sat 10:00 a.m. - 11:30 a.m.
Lunch break

Sat 11:30 a.m. - 1:00 p.m.
Poster & Demo Session
Accepted submissions will participate in a poster session alongside demos of their work where applicable.

Sat 1:00 p.m. - 1:30 p.m.
Coffee break

Sat 1:30 p.m. - 2:00 p.m.
Uninformative Gradients: Optimisation pathologies in differentiable digital signal processing (Invited talk)
Differentiable digital signal processing (DDSP) allows us to constrain the outputs of a neural network to those of a known class of signal processor. This can help us train with limited data, reduce audio artefacts, infer parameters of signal models, and expose human interpretable controls. However, numerous failure modes still exist for certain important families of signal processor. This talk illustrates two such challenges, frequency parameter non-convexity and permutation symmetry, and introduces promising approaches to solving them.
Ben Hayes

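The frequency non-convexity failure mode is easy to reproduce: sweeping the frequency of a candidate sinusoidal oscillator against a fixed 440 Hz target under a waveform MSE loss gives an oscillatory loss surface, so gradient descent on the frequency parameter stalls unless it starts close to the optimum. A minimal illustration (values chosen only for demonstration):

```python
import numpy as np

sr = 16000
t = np.arange(int(0.25 * sr)) / sr
target = np.sin(2 * np.pi * 440.0 * t)              # fixed target oscillator

freqs = np.linspace(100.0, 1000.0, 901)
losses = np.array([np.mean((np.sin(2 * np.pi * f * t) - target) ** 2) for f in freqs])

# The surface has a narrow basin at 440 Hz surrounded by ripples and near-flat
# regions, so a gradient step taken at, say, 700 Hz points nowhere useful.
print("global minimum at %.1f Hz" % freqs[losses.argmin()])
print("loss at 430 Hz: %.3f, loss at 700 Hz: %.3f" % (
    losses[np.abs(freqs - 430).argmin()], losses[np.abs(freqs - 700).argmin()]))
```
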
Sat 2:00 p.m. - 2:20 p.m.
EDMSound: Spectrogram Based Diffusion Models for Efficient and High-Quality Audio Synthesis (Oral)
Diffusion models have showcased their capabilities in audio synthesis ranging over a variety of sounds. Existing models often operate in the latent domain with cascaded phase recovery modules to reconstruct the waveform, which potentially introduces challenges in generating high-fidelity audio. In this paper, we propose EDMSound, a diffusion-based generative model in the spectrogram domain under the framework of elucidated diffusion models (EDM). Combined with an efficient deterministic sampler, we achieve a Fréchet audio distance (FAD) score similar to the top-ranked baseline with only 10 steps, and reach state-of-the-art performance with 50 steps on the DCASE2023 foley sound generation benchmark. We also reveal a potential concern with diffusion-based audio generation models: they tend to generate samples with high perceptual similarity to their training data. Project page: https://tinyurl.com/4rds3bnn
Ge Zhu · Yutong Wen · Marc-André Carbonneau · Zhiyao Duan

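For readers unfamiliar with the elucidated diffusion model (EDM) framework, the deterministic sampler referred to above is essentially Heun's method over a noise-level schedule. Below is a generic sketch with a placeholder denoiser; EDMSound's actual network, conditioning, and hyperparameters are not reproduced here.

```python
import torch

def edm_heun_sampler(denoise, shape, steps=10, sigma_min=0.002, sigma_max=80.0, rho=7.0):
    """Deterministic second-order (Heun) EDM sampler.

    denoise(x, sigma) should return the denoised estimate of x at noise level sigma
    (here a placeholder for a trained spectrogram diffusion model).
    """
    i = torch.arange(steps)
    sigmas = (sigma_max ** (1 / rho)
              + i / (steps - 1) * (sigma_min ** (1 / rho) - sigma_max ** (1 / rho))) ** rho
    sigmas = torch.cat([sigmas, torch.zeros(1)])           # append sigma = 0

    x = torch.randn(shape) * sigmas[0]
    for k in range(steps):
        s, s_next = sigmas[k], sigmas[k + 1]
        d = (x - denoise(x, s)) / s                         # probability-flow ODE derivative
        x_next = x + (s_next - s) * d                       # Euler step
        if s_next > 0:                                      # Heun correction
            d_next = (x_next - denoise(x_next, s_next)) / s_next
            x_next = x + (s_next - s) * 0.5 * (d + d_next)
        x = x_next
    return x

# Example with a trivial placeholder denoiser (always predicts silence):
spec = edm_heun_sampler(lambda x, s: x * 0.0, shape=(1, 128, 256))
```
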
Sat 2:20 p.m. - 2:40 p.m.
Towards Generalizable SER: Soft Labeling and Data Augmentation for Modeling Temporal Emotion Shifts in Large-Scale Multilingual Speech (Oral)
Recognizing emotions in spoken communication is crucial for advanced human-machine interaction. Current emotion detection methodologies often display biases when applied cross-corpus. To address this, our study amalgamates 16 diverse datasets, resulting in 375 hours of data across languages like English, Chinese, and Japanese. We propose a soft labeling system capturing gradational emotional intensities. Using the Whisper encoder and a data augmentation inspired by contrastive learning, our method emphasizes the temporal dynamics of emotions. Our validation on 4 multilingual datasets demonstrates notable zero-shot generalization. We further fine-tune on Hume-Prosody and publish initial promising results.
Mohamed Osman · Tamer Nadeem · Ghada Khoriba

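The soft-labeling component can be illustrated with a tiny example: instead of a one-hot emotion target, each segment carries a distribution over classes reflecting gradational intensity, and a classifier can be trained with a KL-divergence loss against it. The class set and values below are made up purely for illustration.

```python
import torch
import torch.nn.functional as F

# Per-segment soft targets over (angry, happy, neutral, sad) -- illustrative values only.
soft_targets = torch.tensor([[0.70, 0.20, 0.10, 0.00],
                             [0.05, 0.15, 0.60, 0.20]])

logits = torch.randn(2, 4, requires_grad=True)            # classifier outputs
loss = F.kl_div(F.log_softmax(logits, dim=-1), soft_targets, reduction="batchmean")
loss.backward()
print(float(loss))
```
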
Sat 2:40 p.m. - 3:00 p.m.
Audio Personalization through Human-in-the-loop Optimization (Oral)
We consider the problem of personalizing audio to maximize user experience. Briefly, we aim to find a filter $h^*$ which, applied to any music or speech, will maximize the user's satisfaction. This is a black-box optimization problem since the user's satisfaction function is unknown. The key idea is to play audio samples to the user, each shaped by a different filter $h_i$, and query the user for their satisfaction scores $f(h_i)$. A family of "surrogate" functions is then designed to fit these scores, and the optimization method gradually refines these functions to arrive at the filter $\hat{h}^*$ that maximizes satisfaction. In this paper, we observe that a second type of querying is possible, where users can tell us the individual elements $h^*[j]$ of the optimal filter $h^*$. Given a budget of $B$ queries, where a query can be of either type, our goal is to find the filter that will maximize this user's satisfaction. Our proposal builds on Sparse Gaussian Process Regression (GPR) and shows how a hybrid approach can outperform any one type of querying. Our results are validated through simulations and real-world experiments, where volunteers gave feedback on music/speech audio and were able to achieve high satisfaction levels. We believe this idea of hybrid querying opens new problems in black-box optimization, and solutions can benefit other applications beyond audio personalization.
Rajalaxmi Rajagopalan · Yu-Lin Wei · Romit Roy Choudhury

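A toy simulation of the hybrid querying idea, using a plain (rather than sparse) Gaussian process surrogate from scikit-learn; the filter dimensionality, query counts, and the synthetic satisfaction function are all assumptions made purely for illustration.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

rng = np.random.default_rng(0)
dim = 8                                                 # number of filter bands (assumed)
h_star = rng.uniform(-1, 1, dim)                        # hidden "preferred" filter
satisfaction = lambda h: -np.sum((h - h_star) ** 2)     # stand-in for the user's score f(h)

# Type-1 queries: play audio shaped by random filters, record satisfaction scores.
H = rng.uniform(-1, 1, (15, dim))
y = np.array([satisfaction(h) for h in H])
gp = GaussianProcessRegressor(kernel=RBF(length_scale=1.0), normalize_y=True).fit(H, y)

# Type-2 queries: the user directly reveals two elements h*[j] of the optimal filter.
known = {0: h_star[0], 1: h_star[1]}

# Optimize the surrogate over the free coordinates, clamping the revealed ones.
candidates = rng.uniform(-1, 1, (5000, dim))
for j, value in known.items():
    candidates[:, j] = value
h_hat = candidates[np.argmax(gp.predict(candidates))]
print("true satisfaction of the estimated filter:", round(satisfaction(h_hat), 3))
```

Clamping the revealed coordinates shrinks the space the surrogate has to model, which is the intuition for why mixing the two query types can beat either one alone under a fixed budget $B$.
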
Sat 3:00 p.m. - 3:20 p.m.
Multi-channel speech enhancement for moving sources (Invited talk)
Speech enhancement technology has made remarkable progress in recent years. While many single-channel methods have been proposed and their performance has improved, multi-channel speech enhancement technology remains important due to its high performance in estimating and retaining sound source spatial information. Many multi-channel processing methods have been proposed so far for cases where the sound source and noise positions are fixed. However, for real-world applications, it is necessary to consider sound source movement and improve robustness to moving sources. In this presentation, I will introduce multi-channel audio enhancement technologies for moving sources. First, I will present an extension of mask-based neural beamforming, which is widely used as an ASR front-end, to moving sound sources. This extension is achieved by integrating model-based array signal processing and data-driven deep learning approaches. Then, I will discuss model-based, unsupervised multi-channel source separation and extraction approaches, e.g., independent component/vector analysis (ICA/IVA). For multi-channel processing, in addition to dealing with moving sources, it is also essential to devise techniques that limit the increase in computational complexity as the number of microphones increases. To address this issue, I will introduce a fast online IVA algorithm for tracking a single moving source that achieves optimal time complexity and operates significantly faster than conventional approaches.
Shoko Araki

EDMSound: Spectrogram Based Diffusion Models for Efficient and High-Quality Audio Synthesis (Poster)
Diffusion models have showcased their capabilities in audio synthesis ranging over a variety of sounds. Existing models often operate in the latent domain with cascaded phase recovery modules to reconstruct the waveform, which potentially introduces challenges in generating high-fidelity audio. In this paper, we propose EDMSound, a diffusion-based generative model in the spectrogram domain under the framework of elucidated diffusion models (EDM). Combined with an efficient deterministic sampler, we achieve a Fréchet audio distance (FAD) score similar to the top-ranked baseline with only 10 steps, and reach state-of-the-art performance with 50 steps on the DCASE2023 foley sound generation benchmark. We also reveal a potential concern with diffusion-based audio generation models: they tend to generate samples with high perceptual similarity to their training data. Project page: https://tinyurl.com/4rds3bnn
Ge Zhu · Yutong Wen · Marc-André Carbonneau · Zhiyao Duan

Explainable AI for Audio via Virtual Inspection Layers (Poster)
The field of eXplainable Artificial Intelligence (XAI) has made significant advancements in recent years. However, most progress has focused on computer vision and natural language processing. There has been limited research on XAI specifically for audio or other time series data, where the input itself is often hard to interpret. In this study, we introduce a virtual inspection layer that transforms time series data into an interpretable representation and enables the use of local XAI methods to attribute relevance to this representation.
Johanna Vielhaben · Sebastian Lapuschkin · Grégoire Montavon · Wojciech Samek

Audio classification with Dilated Convolution with Learnable Spacings (Poster)
Dilated convolution with learnable spacings (DCLS) is a recent convolution method in which the positions of the kernel elements are learned throughout training by backpropagation. Its benefits have recently been demonstrated in computer vision (ImageNet classification and downstream tasks). Here, we show that DCLS is also useful for audio tagging using the AudioSet classification benchmark. We took two state-of-the-art convolutional architectures using depthwise separable convolutions (DSC), ConvNeXt and ConvFormer, as well as a hybrid one that additionally uses attention, FastViT, and replaced all their DSC layers with DCLS layers as drop-in substitutes. This significantly improved the mean average precision (mAP) with all three architectures without increasing the number of parameters and with only a small cost in throughput. The method code is based on PyTorch and is available at https://anonymous.4open.science/r/DCLS-Audio/.
Ismail Khalfaoui Hassani · Timothée Masquelier · Thomas Pellegrini

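A condensed 1D illustration of the learnable-spacing mechanism (the released PyTorch code linked above is the reference implementation; this sketch only shows the core idea): each kernel weight has a real-valued position inside a larger receptive field, and linear interpolation onto the two nearest integer taps keeps those positions differentiable.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DCLS1d(nn.Module):
    """Minimal 1D dilated convolution with learnable spacings."""
    def __init__(self, in_ch, out_ch, kernel_count=3, dilated_size=17):
        super().__init__()
        self.dilated_size = dilated_size
        self.weight = nn.Parameter(0.1 * torch.randn(out_ch, in_ch, kernel_count))
        # real-valued tap positions, initialised evenly inside the receptive field
        init = torch.linspace(1.0, dilated_size - 2.0, kernel_count)
        self.pos = nn.Parameter(init.expand(out_ch, in_ch, kernel_count).clone())

    def dense_kernel(self):
        p = self.pos.clamp(0.0, self.dilated_size - 1.001)
        left = p.floor().long()                          # lower integer tap
        frac = p - p.floor()                             # differentiable fraction
        kernel = self.weight.new_zeros(*self.weight.shape[:2], self.dilated_size)
        kernel = kernel.scatter_add(2, left, self.weight * (1.0 - frac))
        kernel = kernel.scatter_add(2, left + 1, self.weight * frac)
        return kernel

    def forward(self, x):                                # x: (batch, in_ch, time)
        return F.conv1d(x, self.dense_kernel(), padding=self.dilated_size // 2)

layer = DCLS1d(in_ch=8, out_ch=16)
out = layer(torch.randn(2, 8, 100))
out.sum().backward()                 # gradients flow to both the weights and the positions
```
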
Creative Text-to-Audio Generation via Synthesizer Programming (Poster)
Sound designers have long harnessed the power of abstraction to distill and highlight the semantic essence of real-world auditory phenomena, akin to how simple sketches can vividly convey visual concepts. However, current neural audio synthesis methods lean heavily towards capturing acoustic realism. We introduce a novel, open-source method centered on meaningful abstraction. Our approach takes a text prompt and iteratively refines the parameters of a virtual modular synthesizer to produce sounds with high semantic alignment, as predicted by a pretrained audio-language model. Our results underscore the distinctiveness of our method compared with both real recordings and state-of-the-art generative models.
Nikhil Singh · Manuel Cherep · Jessica Shand

Jointly Recognizing Speech and Singing Voices Based on Multi-Task Audio Source Separation (Poster)
In short videos and live streams, speech, singing voice, and background music often overlap and obscure each other. This complexity creates difficulties in structuring and recognizing the audio content, which may impair subsequent ASR and music understanding applications. This paper proposes a multi-task audio source separation-based ASR model called JRSV, which Jointly Recognizes Speech and singing Voices. Specifically, the separation module separates the mixed audio into distinct speech and singing voice tracks while removing background music. The CTC/attention hybrid recognition module recognizes both tracks. Online distillation is proposed to further improve the robustness of recognition. A benchmark dataset is constructed and released to evaluate the proposed methods. Experimental results demonstrate that JRSV can significantly improve recognition accuracy on each track of the mixed audio.
Ye Bai · Chenxing Li · Xiaorui Wang · Yuanyuan Zhao · Hao Li

Leveraging Content-based Features from Multiple Acoustic Models for Singing Voice Conversion (Poster)
Singing voice conversion (SVC) is a technique to enable an arbitrary singer to sing an arbitrary song. To achieve that, it is important to obtain speaker-agnostic representations from source audio, which is a challenging task. A common solution is to extract content-based features (e.g., PPGs) from a pretrained acoustic model. However, the choices of acoustic models are vast and varied. It remains to be explored what the characteristics of content features from different acoustic models are, and whether integrating multiple content features can help each other. This study investigates three distinct content features, sourced from WeNet, Whisper, and ContentVec, respectively. We explore their complementary roles in intelligibility, prosody, and conversion similarity for SVC. By integrating the multiple content features with a diffusion-based SVC model, our SVC system achieves superior conversion performance on both objective and subjective evaluations in comparison to a single source of content features.
Xueyao Zhang · Yicheng Gu · Haopeng Chen · Zihao Fang · Lexiao Zou · Liumeng Xue · Zhizheng Wu

Diffusion Models as Masked Audio-Video Learners (Poster)
Over the past several years, the synchronization between audio and visual signals has been leveraged to learn richer audio-visual representations. Aided by the large availability of unlabeled videos, many unsupervised training frameworks have demonstrated impressive results in various downstream audio and video tasks. Recently, Masked Audio-Video Learners (MAViL) has emerged as a state-of-the-art audio-video pre-training framework. MAViL couples contrastive learning with masked autoencoding to jointly reconstruct audio spectrograms and video frames by fusing information from both modalities. In this paper, we study the potential synergy between diffusion models and MAViL, seeking to derive mutual benefits from these two frameworks. The incorporation of diffusion into MAViL, combined with various training efficiency methodologies that include the utilization of a masking ratio curriculum and adaptive batch sizing, results in a notable 32% reduction in pre-training Floating-Point Operations (FLOPS) and an 18% decrease in pre-training wall clock time. Crucially, this enhanced efficiency does not compromise the model's performance in downstream audio-classification tasks when compared to MAViL's performance.
Elvis Nunez · Yanzi Jin · Mohammad Rastegari · Sachin Mehta · Maxwell Horton

InstrumentGen: Generating Sample-Based Musical Instruments From Text (Poster)
We introduce the text-to-instrument task, which aims at generating sample-based musical instruments based on textual prompts. We propose InstrumentGen, a model that extends a text-prompted generative audio framework to condition on instrument family, source type, pitch (across an 88-key spectrum), velocity, and a joint text/audio embedding. Furthermore, we present a differentiable loss function to evaluate the intra-instrument timbral consistency of sample-based instruments. Our results establish a foundational text-to-instrument baseline, extending research in the domain of automatic sample-based instrument generation.
Shahan Nercessian · Johannes Imort

Multi-Resolution Audio-Visual Feature Fusion for Temporal Action Localization (Poster)
Temporal Action Localization (TAL) aims to identify actions' start, end, and class labels in untrimmed videos. While recent advancements using transformer networks and Temporal Pyramid Networks (TPN) have enhanced visual feature recognition in TAL tasks, there's an under-explored area of integrating multi-resolution audio features into such frameworks. This paper introduces Multi-Resolution Audio-Visual Feature Fusion (MRAV-FF), an innovative method to merge audio-visual data across different temporal resolutions. Central to our approach is a hierarchical gated cross-attention mechanism, which discerningly weighs the importance of audio information at diverse temporal scales. Such a technique not only refines the precision of regression boundaries but also bolsters classification accuracy. Importantly, MRAV-FF is versatile, making it compatible with existing TPN TAL architectures and offering a significant enhancement in performance when audio data is available.
Edward Fish · Jon Weinbren · Andrew Gilbert

Composing and Validating Large-Scale Datasets for Training Open Foundation Models for Audio (Poster)
Obtaining strong, reproducible foundation language-audio models requires open datasets of sufficient scale and quality. To pre-train a contrastive language-audio model, we compose a large-scale sound effects dataset with detailed text descriptions for each sample. Generating music, as a special type of audio, presents further challenges due to the limited availability of music-text pairs with sufficiently expressive captions. We show here how we combine various composed datasets to pre-train a large-scale contrastive language-audio model (CLAP). We then train, on music samples we collected, a state-of-the-art text-to-music model, MusicLDM, which adapts AudioLDM (based on the Stable Diffusion architecture) to the music domain by utilizing the pre-trained CLAP model and the HiFi-GAN vocoder as components. This modelling work validates the composed text-audio and text-music datasets as a strong basis for further studies on language-rooted foundation models for audio at larger scales.
Marianna Nezhurina · Ke Chen · Yusong Wu · Tianyu Zhang · Haohe Liu · Yuchen Hui · Taylor Berg-Kirkpatrick · Shlomo Dubnov · Jenia Jitsev

Unsupervised Musical Object Discovery from Audio (Poster)
Current object-centric learning models such as the popular SlotAttention architecture allow for unsupervised visual scene decomposition. Our novel MusicSlots method adapts SlotAttention to the audio domain, to achieve unsupervised music decomposition. Since concepts of opacity and occlusion in vision have no auditory analogues, the softmax normalization of alpha masks in the decoders of visual object-centric models is not well-suited for decomposing audio objects. MusicSlots overcomes this problem. We introduce a spectrogram-based multi-object music dataset tailored to evaluate object-centric learning on western tonal music. MusicSlots achieves good performance on unsupervised note discovery and outperforms several established baselines on supervised note property prediction tasks.
Joonsu Gha · Vincent Herrmann · Benjamin F. Grewe · Jürgen Schmidhuber · Anand Gopalakrishnan

Data is Overrated: Perceptual Metrics Can Lead Learning in the Absence of Training Data (Poster)
Perceptual metrics are traditionally used to evaluate the quality of natural signals, such as images and audio. They are designed to mimic the perceptual behaviour of human observers and usually reflect structures found in natural signals. This motivates their use as loss functions for training generative models such that models will learn to capture the structure held in the metric. We take this idea to the extreme in the audio domain by training a compressive autoencoder to reconstruct uniform noise, in lieu of natural data. We show that training with perceptual losses improves the reconstruction of spectrograms and re-synthesized audio at test time over models trained with a standard Euclidean loss. This demonstrates better generalisation to unseen natural signals when using perceptual metrics.
Tashi Namgyal · Alexander Hepburn · Raul Santos-Rodriguez · Valero Laparra · Jesús Malo

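A compact sketch of the training setup described above: a small autoencoder is fit to reconstruct uniform noise, with a multi-scale spectrogram distance standing in for the perceptual metric (the paper's actual metric and architecture are not reproduced here). Swapping `spectral_loss` for a plain mean-squared error gives the Euclidean baseline for comparison.

```python
import torch
import torch.nn as nn

def spectral_loss(x, y, fft_sizes=(128, 256, 512)):
    """Multi-resolution spectrogram distance, used here as a stand-in perceptual loss."""
    loss = 0.0
    for n_fft in fft_sizes:
        w = torch.hann_window(n_fft)
        X = torch.stft(x, n_fft, hop_length=n_fft // 4, window=w, return_complex=True).abs()
        Y = torch.stft(y, n_fft, hop_length=n_fft // 4, window=w, return_complex=True).abs()
        loss = loss + (X - Y).abs().mean()
    return loss

class TinyAE(nn.Module):
    """Small compressive autoencoder over short waveform chunks."""
    def __init__(self, n_samples=1024, bottleneck=64):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(n_samples, 256), nn.Tanh(), nn.Linear(256, bottleneck))
        self.dec = nn.Sequential(nn.Linear(bottleneck, 256), nn.Tanh(), nn.Linear(256, n_samples))
    def forward(self, x):
        return self.dec(self.enc(x))

model = TinyAE()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for step in range(200):
    noise = 2.0 * torch.rand(16, 1024) - 1.0     # uniform noise stands in for training data
    loss = spectral_loss(model(noise), noise)     # perceptual-style loss instead of MSE
    opt.zero_grad()
    loss.backward()
    opt.step()
```
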
Self-Supervised Speech Enhancement using Multi-Modal Data (Poster)
We consider the problem of speech enhancement in earphones. While microphones are classical speech sensors, motion sensors embedded in modern earphones also pick up faint components of the user’s speech. While this faint motion data has generally been ignored, we show that it can serve as a pathway for self-supervised speech enhancement. Our proposed model is an iterative framework in which the motion data offers a hint to the microphone (in the form of an estimated posterior); the microphone SNR improves from the hint, which then helps the motion data to refine its next hint. Results show that this alternating self-supervision converges even in the presence of strong ambient noise, and the performance is comparable to supervised denoisers. When only a small amount of training data is available, our model outperforms the same denoisers.
Yu-Lin Wei · Rajalaxmi Rajagopalan · Bashima Islam · Romit Roy Choudhury

Improved sound quality human-inspired DNN-based audio applications (Poster)
The human auditory system evolved into a structure that provides sharp frequency tuning while transforming sound into a neural code that is optimized for speech understanding in challenging acoustic environments. Employing hallmark features of human hearing in audio applications might thus take these systems beyond what is currently possible with purely data-driven approaches. A key requirement for such bio-inspired audio applications is a fully differentiable closed-loop system that includes a biophysically realistic model of (hearing-impaired) auditory processing. However, existing state-of-the-art models introduce tonal artifacts within their processing that end up as detrimental audible artifacts in the resulting audio application. We propose a solution that improves the architecture of the CNN-based auditory processing block to avoid the creation of spurious distortions, while we optimize computations to ensure that the audio applications have real-time capabilities (latency < 10 ms). We provide a proof-of-principle example for the case of closed-loop, CNN-based hearing-aid algorithms, and conclude that CNN-based auditory models embedded in closed-loop training systems hold great promise for the next generation of bio-inspired audio applications.
Chuan Wen · Sarah Verhulst

Audio Personalization through Human-in-the-loop Optimization (Poster)
We consider the problem of personalizing audio to maximize user experience. Briefly, we aim to find a filter $h^*$ which, applied to any music or speech, will maximize the user's satisfaction. This is a black-box optimization problem since the user's satisfaction function is unknown. The key idea is to play audio samples to the user, each shaped by a different filter $h_i$, and query the user for their satisfaction scores $f(h_i)$. A family of "surrogate" functions is then designed to fit these scores, and the optimization method gradually refines these functions to arrive at the filter $\hat{h}^*$ that maximizes satisfaction. In this paper, we observe that a second type of querying is possible, where users can tell us the individual elements $h^*[j]$ of the optimal filter $h^*$. Given a budget of $B$ queries, where a query can be of either type, our goal is to find the filter that will maximize this user's satisfaction. Our proposal builds on Sparse Gaussian Process Regression (GPR) and shows how a hybrid approach can outperform any one type of querying. Our results are validated through simulations and real-world experiments, where volunteers gave feedback on music/speech audio and were able to achieve high satisfaction levels. We believe this idea of hybrid querying opens new problems in black-box optimization, and solutions can benefit other applications beyond audio personalization.
Rajalaxmi Rajagopalan · Yu-Lin Wei · Romit Roy Choudhury

Synthia's Melody: A Benchmark Framework for Unsupervised Domain Adaptation in Audio (Poster)
Despite significant advancements in deep learning for vision and natural language, unsupervised domain adaptation in audio remains relatively unexplored. We, in part, attribute this to the lack of an appropriate benchmark dataset. To address this gap, we present Synthia’s melody, a novel audio data generation framework capable of simulating an infinite variety of 4-second melodies with user-specified confounding structures characterised by musical keys, timbre, and loudness. Unlike existing datasets collected under observational settings, Synthia’s melody is free of unobserved biases, ensuring the reproducibility and comparability of experiments. To showcase its utility, we generate two types of distribution shifts—domain shift and sample selection bias—and evaluate the performance of acoustic deep learning models under these shifts. Our evaluations reveal that Synthia’s melody provides a robust testbed for examining the susceptibility of these models to varying levels of distribution shift.
Harry Coppock · Chia-Hsin Lin

Zero-shot audio captioning with audio-language model guidance and audio context keywords (Poster)
Zero-shot audio captioning aims at automatically generating descriptive textual captions for audio content without prior training for this task. Different from speech recognition which translates audio content that contains spoken language into text, audio captioning is commonly concerned with ambient sounds, or those produced by a human performing an action. Inspired by zero-shot image captioning methods, we propose a novel approach for understanding and summarising such general audio signals in a text caption. In particular, our framework exploits a pre-trained large language model (LLM) for generating the text, guided by a pre-trained audio-language model that steers the text generation to produce captions that describe the audio content. Additionally, we use audio context keywords that prompt the language model to generate text that is broadly relevant to sounds. Our proposed framework achieves state-of-the-art results in zero-shot audio captioning on the AudioCaps and Clotho datasets. Code will be released upon acceptance.
Leonard Salewski · Stefan Fauth · A. Sophia Koepke · Zeynep Akata

AttentionStitch: How Attention Solves the Speech Editing Problem (Poster)
The generation of natural and high-quality speech from text is a challenging problem in the field of natural language processing. In addition to speech generation, speech editing is also a crucial task, which requires the seamless and unnoticeable integration of edited speech into synthesized speech. We propose a novel approach to speech editing by leveraging a pre-trained text-to-speech (TTS) model, such as FastSpeech 2, and incorporating a double attention block network on top of it to automatically merge the synthesized mel-spectrogram with the mel-spectrogram of the edited text. We refer to this model as AttentionStitch, as it harnesses attention to stitch audio samples together. We evaluate the proposed AttentionStitch model against state-of-the-art baselines on both single and multi-speaker datasets, namely LJSpeech and VCTK. We demonstrate its superior performance through an objective and a subjective evaluation test involving 15 human participants. AttentionStitch is capable of producing high-quality speech, even for words not seen during training, while operating automatically without the need for human intervention. Moreover, AttentionStitch is fast during both training and inference and is able to generate human-sounding edited speech.
Antonios Alexos · Pierre Baldi

MusT3: Unified Multi-Task Model for Fine-Grained Music Understanding (Poster)
Recent advances in sequence-to-sequence modelling have enabled new powerful multi-task models in the text, vision, and speech domains. This work attempts to leverage these advances for music. We propose MusT3: Music-To-Tags Transformer, a novel model for fine-grained music understanding. First, we design the unified music-to-tags form, which enables us to cast any music understanding task as a sequence prediction problem. Second, we utilize a Transformer-based model to predict that sequence given a music representation. Third, we leverage a multi-task learning framework to train a single model for many tasks. We validate our approach on four tasks: beat tracking, chord recognition, key detection, and vocal melody extraction. Our model performs significantly better than the current state-of-the-art models on two of these tasks, while staying competitive on the remaining two. Finally, in a controlled experiment, we demonstrate that our model can reuse knowledge between tasks, leading to better performance on low-resource tasks with limited training data.
Martin Kukla · Minz Won · Yun-Ning Hung · Duc Le

Benchmarks and deep learning models for localizing rodent vocalizations in social interactions (Poster)
Social animals congregate in groups and vocalize to communicate. To study the dynamics of vocal communication and their neural basis, ethologists and neuroscientists have developed a multitude of approaches to attribute vocal calls to individual animals within an interacting social group. Invasive surgical procedures, such as affixing custom-built miniature sensors to each animal, are often needed to obtain precise measurements of which individual is vocalizing. In addition to being labor intensive and species specific, these surgeries are often not tractable in very small or young animals and may alter an animal’s natural behavioral repertoire. Thus, there is considerable interest in developing non-invasive sound source localization and vocal call attribution methods that work off-the-shelf in typical laboratory settings. To advance these aims in the domain of rodent neuroscience, we acquired synchronized video and multi-channel audio recordings with >300,000 annotated sound sources in small reverberant environments, and publicly release them as benchmarks. We then trained deep neural networks to localize and attribute vocal calls. This approach outperformed current protocols in the field, achieving ~5 mm accuracy on speaker-emitted sounds. Further, deep network ensembles produced well-calibrated estimates of uncertainty for each prediction. However, network performance was not robust to distributional shifts in the data, highlighting limitations and open challenges for future work.
Ralph Peterson · Aramis Tanelus · Aman Choudhri · Violet Ivan · Aaditya Prasad · David Schneider · Dan Sanes · Alex Williams

Towards Generalizable SER: Soft Labeling and Data Augmentation for Modeling Temporal Emotion Shifts in Large-Scale Multilingual Speech (Poster)
Recognizing emotions in spoken communication is crucial for advanced human-machine interaction. Current emotion detection methodologies often display biases when applied cross-corpus. To address this, our study amalgamates 16 diverse datasets, resulting in 375 hours of data across languages like English, Chinese, and Japanese. We propose a soft labeling system capturing gradational emotional intensities. Using the Whisper encoder and a data augmentation inspired by contrastive learning, our method emphasizes the temporal dynamics of emotions. Our validation on 4 multilingual datasets demonstrates notable zero-shot generalization. We further fine-tune on Hume-Prosody and publish initial promising results.
Mohamed Osman · Tamer Nadeem · Ghada Khoriba

The Song Describer Dataset: a Corpus of Audio Captions for Music-and-Language Evaluation (Poster)
We introduce the Song Describer dataset (SDD), a new crowdsourced corpus of high-quality audio-caption pairs, designed for the evaluation of music-and-language models. The dataset consists of 1.1k natural language descriptions of 706 music recordings, all publicly accessible and released under Creative Commons licenses. To showcase the use of our dataset, we benchmark popular models on three key music-and-language tasks (music captioning, text-to-music generation and music-language retrieval). Our experiments highlight the importance of cross-dataset evaluation and offer insights into how researchers can use SDD to gain a broader understanding of model performance.
Ilaria Manco · Benno Weck · Seungheon Doh · Yixiao Zhang · Dmitry Bogdanov · Yusong Wu · Ke Chen · Philip Tovstogan · Emmanouil Benetos · Elio Quinton · George Fazekas · Juhan Nam · Minz Won

ScripTONES: Sentiment-Conditioned Music Generation for Movie Scripts (Poster)
Film scores are considered an essential part of the film cinematic experience, but the process of film score generation is often expensive and infeasible for small-scale creators. Automating the process of film score composition would provide useful starting points for music in small projects. In this paper, we propose a two-stage pipeline for generating music from a movie script. The first phase is the Sentiment Analysis phase, where the sentiment of a scene from the film script is encoded into the valence-arousal continuous space. The second phase is the Conditional Music Generation phase, which takes as input the valence-arousal vector and conditionally generates piano MIDI music to match the sentiment. We study the efficacy of various music generation architectures by performing a qualitative user survey and propose methods to improve sentiment-conditioning in VAE architectures.
Vishruth Veerendranath · Vibha Masti · Utkarsh Gupta · Hrishit Chaudhuri · Gowri Srinivasa

Self-Supervised Music Source Separation Using Vector-Quantized Source Category Estimates (Poster)
Music source separation is focused on extracting distinct sonic elements from composite tracks. Historically, many methods have been grounded in supervised learning, necessitating labeled data, which is occasionally constrained in its diversity. More recent methods have delved into N-shot techniques that utilize one or more audio samples to aid in the separation. However, a challenge with some of these methods is the necessity for an audio query during inference, making them less suited for genres with varied timbres and effects. This paper offers a proof-of-concept for a self-supervised music source separation system that eliminates the need for audio queries at inference time. In the training phase, while it adopts a query-based approach, we introduce a modification by substituting the continuous embedding of query audios with a Vector Quantized Variational Autoencoder (VQ-VAE). Trained end-to-end with up to N classes as determined by the VQ-VAE's codebook size, the model seeks to effectively categorize instrument classes. During inference, the input is partitioned into N sources, with some potentially left unutilized based on the mix's instrument makeup. This methodology suggests an alternative avenue for considering source separation across diverse music genres. We provide examples and additional results online.
Stefan Lattner · Marco Pasini

Deep Generative Models of Music Expectation (Poster)
A prominent theory of affective response to music revolves around the concepts of surprisal and expectation. In prior work, this idea has been operationalized in the form of probabilistic models of music which allow for precise computation of song (or note-by-note) probabilities, conditioned on a ‘training set’ of prior musical or cultural experiences. To date, however, these models have been limited to computing exact probabilities through hand-crafted features, or restricted to linear models which are likely not sufficient to represent the complex conditional distributions present in music. In this work, we propose to use modern deep probabilistic generative models in the form of a Diffusion Model to compute an approximate likelihood of a musical input sequence. Unlike prior work, such a generative model parameterized by deep neural networks is able to learn complex non-linear features directly from a training set itself. In doing so, we expect to find that such models are able to more accurately represent the ‘surprisal’ of music for human listeners. From the literature, it is known that there is an inverted U-shaped relationship between surprisal and the degree to which human subjects ‘like’ a given song. In this work we show that pre-trained diffusion models indeed yield musical surprisal values which exhibit a negative quadratic relationship with measured subject ‘liking’ ratings, and that the quality of this relationship is competitive with state-of-the-art methods such as IDyOM. We therefore present this model as a preliminary step in developing modern deep generative models of music expectation and subjective likability.
Ninon Lizé Masclef · Andy Keller

mir_ref: A Representation Evaluation Framework for Music Information Retrieval Tasks (Poster)
Music Information Retrieval (MIR) research is increasingly leveraging representation learning to obtain more compact, powerful music audio representations for various downstream MIR tasks. However, current representation evaluation methods are fragmented due to discrepancies in audio and label preprocessing, downstream model and metric implementations, data availability, and computational resources, often leading to inconsistent and limited results. In this work, we introduce mir_ref, an MIR Representation Evaluation Framework focused on seamless, transparent, local-first experiment orchestration to support representation development. It features implementations of a variety of components such as MIR datasets, tasks, embedding models, and tools for result analysis and visualization, while facilitating the implementation of custom components. To demonstrate its utility, we use it to conduct an extensive evaluation of several embedding models across various tasks and datasets, including evaluating their robustness to various audio perturbations and the ease of extracting relevant information from them.
Christos Plachouras · Dmitry Bogdanov · Pablo Alonso-Jiménez