---
license: creativeml-openrail-m
---
---
license: cc-by-nc-nd-4.0
---
# AudioLDM 2
AudioLDM 2 is a latent text-to-audio diffusion model capable of generating realistic audio samples given any text input.
It is available in the 🧨 Diffusers library from v0.21.0 onwards.
# Model Details
AudioLDM 2 was proposed in the paper [AudioLDM 2: Learning Holistic Audio Generation with Self-supervised Pretraining](https://arxiv.org/abs/2308.05734) by Haohe Liu et al.
AudioLDM 2 takes a text prompt as input and predicts the corresponding audio. It can generate text-conditional sound effects,
human speech, and music.
# Checkpoint Details
This is the original, **base** version of the AudioLDM 2 model, also referred to as **audioldm2-full**.
There are three official AudioLDM 2 checkpoints. Two of these checkpoints are applicable to the general task of text-to-audio
generation. The third checkpoint is trained exclusively on text-to-music generation. All checkpoints share the same
model size for the text encoders and VAE. They differ in the size and depth of the UNet. The table below lists the
three official checkpoints, followed by two text-to-speech checkpoints hosted under the `anhnct` namespace:
| Checkpoint | Task | UNet Model Size | Total Model Size | Training Data / h |
|-----------------------------------------------------------------|---------------|-----------------|------------------|-------------------|
| [audioldm2](https://huggingface.co./cvssp/audioldm2) | Text-to-audio | 350M | 1.1B | 1150k |
| [audioldm2-large](https://huggingface.co./cvssp/audioldm2-large) | Text-to-audio | 750M | 1.5B | 1150k |
| [audioldm2-music](https://huggingface.co./cvssp/audioldm2-music) | Text-to-music | 350M | 1.1B | 665k |
| [audioldm2-gigaspeech](https://huggingface.co./anhnct/audioldm2_gigaspeech) | Text-to-speech | 350M | 1.1B | 10k |
| [audioldm2-ljspeech](https://huggingface.co./anhnct/audioldm2_ljspeech) | Text-to-speech | 350M | 1.1B | |
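
The statement that the checkpoints differ only in the UNet can be sanity-checked against the rounded figures in the table above. The sketch below subtracts the UNet size from the total size for each official checkpoint. The dictionary layout and the `non_unet_params` helper are illustrative names, not part of any library, and the numbers are the rounded values from the table rather than exact parameter counts.

```python
# Parameter counts in millions, copied from the table above (rounded figures).
checkpoints = {
    "audioldm2": {"unet": 350, "total": 1100},
    "audioldm2-large": {"unet": 750, "total": 1500},
    "audioldm2-music": {"unet": 350, "total": 1100},
}


def non_unet_params(entry):
    """Parameters (in millions) that sit outside the UNet for one checkpoint."""
    return entry["total"] - entry["unet"]


for name, entry in checkpoints.items():
    # Each official checkpoint leaves roughly 750M parameters outside the UNet.
    print(f"{name}: ~{non_unet_params(entry)}M parameters outside the UNet")
```

All three rows give the same remainder, which is consistent with the shared text encoders and VAE.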
## Model Sources
- [**Original Repository**](https://github.com/haoheliu/audioldm2)
- [**🧨 Diffusers Pipeline**](https://huggingface.co./docs/diffusers/api/pipelines/audioldm2)
- [**Paper**](https://arxiv.org/abs/2308.05734)
- [**Demo**](https://huggingface.co./spaces/haoheliu/audioldm2-text2audio-text2music)
# Usage
First, install the required packages:
```
pip install --upgrade diffusers transformers accelerate
```
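
Since AudioLDM 2 is available in Diffusers from v0.21.0 onwards, it can help to confirm the installed version after upgrading. The following is a minimal sketch using only the Python standard library. `release_tuple` and `meets_minimum` are hypothetical helper names, and the comparison is a simplification that ignores pre-release suffixes such as `.dev0`.

```python
from importlib.metadata import PackageNotFoundError, version

MIN_DIFFUSERS = "0.21.0"  # minimum version stated in this model card


def release_tuple(version_string):
    """Turn '0.21.0' or '0.21.0.dev0' into a tuple of three integers."""
    parts = []
    for piece in version_string.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break  # stop at the first non-numeric piece, e.g. 'dev0'
        parts.append(int(digits))
    # Pad short versions such as '1.0' so that tuples compare position by position.
    parts += [0] * (3 - len(parts))
    return tuple(parts[:3])


def meets_minimum(installed, minimum=MIN_DIFFUSERS):
    """Return True if the installed release is at least the minimum release."""
    return release_tuple(installed) >= release_tuple(minimum)


try:
    installed = version("diffusers")
    print(f"diffusers {installed}: supported = {meets_minimum(installed)}")
except PackageNotFoundError:
    print("diffusers is not installed")
```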
## Text-to-Speech
For text-to-speech generation, the [AudioLDM2Pipeline](https://huggingface.co./docs/diffusers/api/pipelines/audioldm2) can be
used to load pre-trained weights and generate text-conditional audio outputs:
```python
import scipy.io.wavfile
import torch
from diffusers import AudioLDM2Pipeline

# load the text-to-speech checkpoint in half precision and move it to the GPU
repo_id = "anhnct/audioldm2_ljspeech"
pipe = AudioLDM2Pipeline.from_pretrained(repo_id, torch_dtype=torch.float16)
pipe = pipe.to("cuda")

# define the prompts
prompt = "An female actor say with angry voice"
transcript = "wish you have a good day, i hope you never forget me"
negative_prompt = "low quality"

# set the seed for generator
generator = torch.Generator("cuda").manual_seed(1)

# run the generation
audio = pipe(
    prompt,
    negative_prompt=negative_prompt,
    transcription=transcript,
    num_inference_steps=200,
    audio_length_in_s=8.0,
    num_waveforms_per_prompt=1,
    generator=generator,
    max_new_tokens=512,
).audios

# save the generated audio sample (index 0) as a .wav file
scipy.io.wavfile.write("techno_2.wav", rate=16000, data=audio[0])
```
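
The example above writes the waveform at a sample rate of 16000 Hz, so an 8.0 s clip corresponds to 8.0 × 16000 = 128000 samples. If a 16-bit PCM file is preferred over the floating-point data passed to `scipy.io.wavfile.write`, the conversion can be sketched with the standard library alone. `expected_num_samples` and `write_pcm16_wav` are hypothetical helpers, and the sketch assumes a mono waveform with float samples in the range [-1, 1].

```python
import struct
import wave

SAMPLE_RATE = 16000  # matches rate=16000 in the example above


def expected_num_samples(audio_length_in_s, sample_rate=SAMPLE_RATE):
    """Number of samples in a clip of the given duration."""
    return int(round(audio_length_in_s * sample_rate))


def write_pcm16_wav(path, samples, sample_rate=SAMPLE_RATE):
    """Write mono float samples (assumed to lie in [-1, 1]) as 16-bit PCM."""
    frames = bytearray()
    for sample in samples:
        clipped = max(-1.0, min(1.0, float(sample)))  # guard against overshoot
        frames += struct.pack("<h", int(round(clipped * 32767)))
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)   # mono
        wav_file.setsampwidth(2)   # 2 bytes per sample = 16-bit
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(bytes(frames))


print(expected_num_samples(8.0))  # 128000
```

With the pipeline output from the example above, a call such as `write_pcm16_wav("speech_pcm16.wav", audio[0])` would store the first waveform. The output filename here is only an illustration.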
# Citation
**BibTeX:**
```
@article{liu2023audioldm2,
title={"AudioLDM 2: Learning Holistic Audio Generation with Self-supervised Pretraining"},
author={Haohe Liu and Qiao Tian and Yi Yuan and Xubo Liu and Xinhao Mei and Qiuqiang Kong and Yuping Wang and Wenwu Wang and Yuxuan Wang and Mark D. Plumbley},
journal={arXiv preprint arXiv:2308.05734},
year={2023}
}
```