license: apache-2.0
PicoAudio: Enabling Precise Timing and Frequency Controllability of Audio Events in Text-to-audio Generation
This repository is a mirror of the GitHub repo.
Contributions:
- A data simulation pipeline tailored specifically for controllable audio generation frameworks;
- A timing-controllable audio generation framework that enables precise control over the timing and frequency of sound events;
- Precise timing-related control achieved by integrating large language models.
Inference
You can try the demo at Huggingface Online Inference and Github Demo, or generate audio locally with the "inference.py" script provided at Huggingface Inference. Huggingface Online Inference uses Gemini as a preprocessor; we also provide a GPT preprocessing script consistent with the paper in "llm_preprocess.py".
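To give a sense of the timing caption the model consumes, the "onoffCaption" string can be assembled programmatically from a set of events and their timestamps. The sketch below is illustrative only: the helper name and the event dictionary are hypothetical, and only the caption format itself follows the dataset examples in this README.

```python
def build_onoff_caption(events):
    """Assemble a timing caption in the dataset's "onoffCaption" format.

    events: dict mapping an event name to a list of (onset, offset)
    pairs in seconds, e.g. {"cat meowing": [(0.5, 2.0), (3.0, 4.5)]}.
    """
    parts = []
    for name, spans in events.items():
        # Join each event's occurrences as "onset-offset" timestamps.
        stamps = ", ".join(f"{on}-{off}" for on, off in spans)
        parts.append(f"{name} at {stamps}")
    # Separate distinct events with " and ", as in the dataset captions.
    return " and ".join(parts)

caption = build_onoff_caption({
    "cat meowing": [(0.5, 2.0), (3.0, 4.5)],
    "whistling": [(5.0, 6.5)],
})
# "cat meowing at 0.5-2.0, 3.0-4.5 and whistling at 5.0-6.5"
```

A caption built this way matches the "onoffCaption" entries in the simulated dataset's metadata.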
Simulated Dataset
Simulated data can be downloaded from (1) HuggingfaceDataset or (2) BaiduNetDisk with the extraction code "pico".
The metadata is stored in "data/meta_data/{}.json"; one instance is as follows:

```json
{
    "filepath": "data/multi_event_test/syn_1.wav",
    "onoffCaption": "cat meowing at 0.5-2.0, 3.0-4.5 and whistling at 5.0-6.5 and explosion at 7.0-8.0, 8.5-9.5",
    "frequencyCaption": "cat meowing two times and whistling one times and explosion two times"
}
```
where:
- "filepath" indicates the path to the audio file.
- "frequencyCaption" contains information about the occurrence frequency.
- "onoffCaption" contains on- & off-set information.
- For the test file "test-frequency-control_onoffFromGpt_{}.json", the "onoffCaption" is derived from the "frequencyCaption" by GPT-4 and is used for evaluation in the frequency control task.
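Because the "onoffCaption" strings follow a regular pattern, they can be parsed back into per-event timestamps, which is convenient for timing evaluation. The parser below is a minimal sketch written for this README, not part of the released code.

```python
import re

def parse_onoff_caption(caption):
    """Split an "onoffCaption" string into (event, onset, offset) triples."""
    events = []
    # Distinct events are separated by " and ".
    for clause in caption.split(" and "):
        match = re.match(r"(.+?) at (.+)", clause)
        if not match:
            continue
        name, spans = match.groups()
        # Each occurrence is an "onset-offset" timestamp in seconds.
        for span in spans.split(", "):
            onset, offset = span.split("-")
            events.append((name, float(onset), float(offset)))
    return events

parse_onoff_caption("cat meowing at 0.5-2.0, 3.0-4.5 and whistling at 5.0-6.5")
# [("cat meowing", 0.5, 2.0), ("cat meowing", 3.0, 4.5), ("whistling", 5.0, 6.5)]
```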
Training
Download data into the "data" folder. The training and inference code can be found in the "picoaudio" folder.
```shell
cd picoaudio
pip install -r requirements.txt
```
To start training:
```shell
accelerate launch runner/controllable_train.py
```
Acknowledgement
Our code builds on AudioLDM and Tango. We appreciate the authors' open-sourcing of their code.