Improving Text-To-Audio Models with Synthetic Captions

🎵 We propose an audio captioning pipeline that uses an audio language model to synthesize accurate and diverse captions for audio at scale. We leverage this pipeline to produce a dataset of synthetic captions for AudioSet, named AF-AudioSet. We then pre-train our Tango family of text-to-audio models on these synthetic captions. 🎶

This checkpoint was fine-tuned on the MusicCaps dataset.

Read the paper: arXiv:2406.15487

Code

Our code is released here: https://github.com/declare-lab/tango

Please follow the instructions in the repository for installation, usage and experiments.

Quickstart Guide

Download the model and generate music from a text prompt:

import IPython
import soundfile as sf
from tango import Tango

tango = Tango("declare-lab/tango-music-af-ft-mc")

prompt = "The song has a traditional jazzy feel to it. The piano chord progression is bouncy and light. The electric guitar has a chorus applied to it, and we hear various licks being played."
audio = tango.generate(prompt)
sf.write(f"{prompt}.wav", audio, samplerate=16000)
IPython.display.Audio(data=audio, rate=16000)

The model is downloaded automatically and saved in the cache. Subsequent runs load it directly from the cache.
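The quickstart names the output file after the full prompt, which gives a long file name containing spaces and punctuation. If a shorter name is preferred, a small helper like the one below can be used. This is only a sketch: `prompt_to_filename` and its `max_len` and `ext` parameters are hypothetical names introduced here, not part of the Tango repository.

```python
import re


def prompt_to_filename(prompt: str, max_len: int = 60, ext: str = ".wav") -> str:
    """Turn a free-text prompt into a short, filesystem-friendly file name.

    Hypothetical helper for illustration; not part of the Tango code base.
    """
    # Lowercase, then replace every run of non-alphanumeric characters with "_".
    slug = re.sub(r"[^a-z0-9]+", "_", prompt.lower()).strip("_")
    # Truncate long prompts and drop any underscore left dangling by the cut.
    slug = slug[:max_len].rstrip("_")
    # Fall back to a generic name if nothing usable remains.
    return (slug or "audio") + ext
```

With this helper, the `sf.write` call in the quickstart could be written as `sf.write(prompt_to_filename(prompt), audio, samplerate=16000)`.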

The generate function uses 100 steps by default to sample from the latent diffusion model. We recommend 200 steps for higher-quality audio, at the cost of increased run time.

prompt = "The song has a traditional jazzy feel to it. The piano chord progression is bouncy and light. The electric guitar has a chorus applied to it, and we hear various licks being played."
audio = tango.generate(prompt, steps=200)
IPython.display.Audio(data=audio, rate=16000)
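To see how much extra run time the additional steps cost on a given machine, the call can be wrapped in a small timer. The sketch below uses only the standard library; `timed` is a hypothetical helper introduced here for illustration.

```python
import time


def timed(fn, *args, **kwargs):
    """Call fn(*args, **kwargs) and return (result, elapsed_seconds).

    Hypothetical helper for illustration; not part of the Tango code base.
    """
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed
```

For example, `audio, seconds = timed(tango.generate, prompt, steps=200)` returns the generated audio together with the wall-clock time of the call, which can be compared with the same call at the default 100 steps.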

Use the generate_for_batch function to generate multiple audio samples for a batch of text prompts:

prompts = [
    "This song is a fusion of alternative and folk genres, highlighting simple yet soulful melodies and minimalist instrumentals.",
    "The instrumental music features an ensemble that resembles the orchestra. The melody is being played by a brass section while strings provide harmonic accompaniment.",
    "This music is instrumental. The tempo is fast with a lively keyboard harmony, steady drumming, groovy bass lines and harmonica melodic. The song is fresh, groovy, sunny, happy; vivacious and spirited."
]
audios = tango.generate_for_batch(prompts, samples=2)

This will generate two samples for each of the three text prompts.
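To save every generated sample, each one needs a distinct file name. The sketch below assumes that, when `samples` is greater than 1, `audios` is a list with one entry per prompt and that each entry is itself a sequence of samples; this structure is an assumption and should be checked against the repository. `name_batch_outputs` is a hypothetical helper, not part of the Tango code base.

```python
from typing import Any, Iterator, Sequence, Tuple


def name_batch_outputs(
    prompts: Sequence[str],
    audios: Sequence[Sequence[Any]],
    prefix: str = "sample",
) -> Iterator[Tuple[str, Any]]:
    """Yield (file_name, audio) pairs for nested batch output.

    Assumes audios[i] holds all samples generated for prompts[i].
    """
    if len(prompts) != len(audios):
        raise ValueError("expected one group of samples per prompt")
    for prompt_idx, samples in enumerate(audios):
        for sample_idx, audio in enumerate(samples):
            # Encode the prompt index and sample index in the file name.
            yield f"{prefix}_p{prompt_idx}_s{sample_idx}.wav", audio
```

Each pair can then be written to disk as in the quickstart, for example `for name, audio in name_batch_outputs(prompts, audios): sf.write(name, audio, samplerate=16000)`.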

Citation

Please consider citing the following article if you found our work useful:

@article{kong2024improving,
  title={Improving Text-To-Audio Models with Synthetic Captions},
  author={Kong, Zhifeng and Lee, Sang-gil and Ghosal, Deepanway and Majumder, Navonil and Mehrish, Ambuj and Valle, Rafael and Poria, Soujanya and Catanzaro, Bryan},
  journal={arXiv preprint arXiv:2406.15487},
  year={2024}
}