
Keynote Talk
Workshop: Second Workshop on Efficient Natural Language and Speech Processing (ENLSP-II)

Efficient Controllable Generative Models for Music and Performance Synthesis

Cheng-Zhi Anna Huang


How can we design generative models whose structure improves both the efficiency of the models and their controllability for users? In this talk, I'll give two examples that illustrate how we can achieve this goal by taking inspiration from the nonlinear and hierarchical structure that underlies the human process of creating music.

Generative models of music composition typically assume music is written in a single pass from beginning to end, constraining the user to follow the same unnatural chronological process. To enable a more nonlinear creative workflow, we introduced Coconet (Huang et al., 2017), an Orderless NADE-like (Uria et al., 2014) generative model, similar to masked language and visual models, that models all permutations of the order in which the task of composition can be broken down. This allows the model to learn more efficiently from data sequences by traversing them in all directions, and allows users to put down notes in any order and have the model complete any partial score.
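To make the idea of completing a partial score in any order concrete, here is a minimal toy sketch. It is not the Coconet implementation: the grid layout (voices by time steps, with `None` for empty cells), the function names, and in particular `toy_predict`, which stands in for a learned model, are all assumptions made for illustration.

```python
import random

def toy_predict(score, voice, step):
    """Hypothetical stand-in for a learned model: copy the nearest known
    pitch in the same voice, or fall back to middle C (MIDI 60)."""
    row = score[voice]
    known = [(abs(t - step), t) for t, p in enumerate(row) if p is not None]
    if not known:
        return 60
    _, nearest = min(known)
    return row[nearest]

def complete_in_random_order(partial, predict=toy_predict, seed=0):
    """Fill every empty cell (None) of a voices-by-time grid, visiting the
    empty cells in a randomly sampled order."""
    rng = random.Random(seed)
    score = [list(row) for row in partial]  # copy; leave the input untouched
    empty = [(v, t) for v, row in enumerate(score)
             for t, p in enumerate(row) if p is None]
    rng.shuffle(empty)  # one ordering out of all possible permutations
    for v, t in empty:
        # Each prediction conditions on the user's notes plus earlier fills.
        score[v][t] = predict(score, v, t)
    return score

# A user has placed three notes (MIDI pitches) in arbitrary positions.
partial = [
    [72, None, None, 76],
    [None, None, 55, None],
]
completed = complete_in_random_order(partial)
```

Because the visiting order is sampled rather than fixed left to right, the same routine handles notes placed anywhere in the score; a trained model would replace `toy_predict` with a distribution over pitches conditioned on the filled cells.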

Neural audio synthesizers typically synthesize musical performance audio from MIDI end-to-end, resulting in a black box that offers few mechanisms for control. To enable detailed user control, we introduced MIDI-DDSP (Wu et al., 2022), a hierarchical model of musical performance synthesis that breaks down audio synthesis into a three-level hierarchy of notes, performance, and synthesis, analogous to how a creative process involves composers, performers, and instruments. Not only does this interpretable hierarchy allow users to intervene at each level or use trained priors (performance given notes, synthesis given performance) for creative assistance, it also allows models to leverage these inductive biases to learn more efficiently from data, making it possible to train high-fidelity performance synthesis models from only a few hours of recordings.

We hope these examples will encourage researchers to partner with creative practitioners to innovate in modeling, interaction, and human-AI co-creativity. We could see the goal as not only designing generative models that can model and generate creative artifacts well, but also working towards generative agents that we can coordinate and collaborate with in a creative setting.
