

Musical Speech: A Transformer-based Composition Tool

Jason d'Eon · Sri Harsha Dumpala · Chandramouli Shama Sastry · Daniel Oore · Mengyu Yang · Sageev Oore


In this demo we propose a compositional tool that generates musical sequences based on the prosody of speech recorded by the user. The tool allows any user, regardless of musical training, to use their own speech to generate musical melodies while hearing the direct connection between their recorded speech and the resulting music. This is achieved with a pipeline combining speech-based signal processing [1,2], musical heuristics, and a set of transformer models [3,4] trained for new musical tasks. Importantly, the pipeline is designed to work with any kind of speech input and does not require a paired dataset for training the transformer models.

Our approach consists of the following steps:

  1. Estimate the F0 values and loudness envelope of the speech signal.
  2. Convert these estimates into a sequence of musical constraints.
  3. Apply one or more transformer models—each trained on different musical tasks or datasets—to this constraint sequence to produce musical sequences that follow or accompany the speech patterns in a variety of ways.
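
As a rough illustration of step 2, the sketch below turns a frame-wise F0 track and loudness envelope into a list of note constraints. The abstract does not describe the actual heuristics, so everything here is an assumption made for illustration: the function names `hz_to_midi` and `f0_to_note_constraints`, the 10 ms frame length, the convention that 0 marks an unvoiced frame, and the rule that a note is a run of at least `min_frames` frames sharing the same rounded pitch. Only the Hz-to-MIDI mapping (A4 = 440 Hz = note 69) is standard.

```python
import math


def hz_to_midi(f0_hz):
    """Map a frequency in Hz to a fractional MIDI note number (A4 = 440 Hz = 69)."""
    return 69.0 + 12.0 * math.log2(f0_hz / 440.0)


def f0_to_note_constraints(f0_track, loudness, frame_s=0.01, min_frames=3):
    """Group consecutive frames with the same rounded pitch into note constraints.

    f0_track: per-frame F0 in Hz; 0 (or None) marks an unvoiced frame (assumed convention).
    loudness: per-frame loudness values, same length as f0_track.
    Returns a list of dicts with pitch, onset_s, duration_s and mean loudness.
    """
    # Quantize each voiced frame to the nearest semitone; unvoiced frames become None.
    pitches = [round(hz_to_midi(f)) if f and f > 0 else None for f in f0_track]

    notes = []
    start = 0
    for i in range(1, len(pitches) + 1):
        # Close the current run at the end of the track or when the pitch changes.
        if i == len(pitches) or pitches[i] != pitches[start]:
            run_len = i - start
            # Keep only voiced runs that are long enough to count as a note.
            if pitches[start] is not None and run_len >= min_frames:
                notes.append({
                    "pitch": pitches[start],
                    "onset_s": start * frame_s,
                    "duration_s": run_len * frame_s,
                    "loudness": sum(loudness[start:i]) / run_len,
                })
            start = i
    return notes


# Hypothetical input: two unvoiced frames, four frames at 220 Hz, a gap, three at 440 Hz.
f0 = [0, 0, 220.0, 220.0, 220.0, 220.0, 0, 440.0, 440.0, 440.0]
loud = [0.1] * len(f0)
for note in f0_to_note_constraints(f0, loud):
    print(note)
```

In a pipeline like the one described, a constraint list of this kind would be the input handed to the transformer models in step 3; how the real system encodes and conditions on these constraints is not specified in the abstract.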

The demo is self-explanatory: the audience can interact with the system either by making a live recording through a web-based recording interface or by uploading a pre-recorded speech sample. The system then visualizes the formant contours extracted from the speech sample, the set of note constraints obtained from the speech, and the sequence of musical notes generated by the transformers. The audience can also listen to, and interactively mix the levels (volume) of, the input speech sample, the initial note sequences, and the musical sequences generated by the transformer models.

[1] Rabiner & Huang. Fundamentals of Speech Recognition.
[2] Dumpala et al. Sine-wave speech as pre-processing for downstream tasks. Symp. FRSM 2020.
[3] Vaswani et al. Attention is all you need. NeurIPS 2017.
[4] Huang et al. Music Transformer. ICLR 2018.
