Poster+Demo Session
in
Workshop: Audio Imagination: NeurIPS 2024 Workshop AI-Driven Speech, Music, and Sound Generation

What do MLLMs hear? Examining the interaction between LLM and audio encoder components in Multimodal Large Language Models

Enis Çoban ⋅ Michael Mandel ⋅ Johanna Devaney

Abstract

Large Language Models (LLMs) have demonstrated remarkable reasoning capabilities, notably in connecting ideas and adhering to logical rules to solve problems. These models have been extended to accommodate other data modalities, including sound and images; the resulting multimodal LLMs (MLLMs) can generate descriptions of images or sound recordings. We evaluate how MLLMs' separate representation of auditory and textual information may sever the reasoning pathway between the audio encoder and the LLM component. Through a captioning-based classification experiment with similar and hierarchical textual relationships, we demonstrate that audio MLLMs cannot fully leverage their LLMs' text-based reasoning when generating audio captions.
