Descript Audio Codec (DAC)

DAC is a state-of-the-art audio tokenizer that improves upon earlier tokenizers such as SoundStream and EnCodec.

This model card provides an easy-to-use API for a pretrained DAC [1] for 16 kHz audio, whose backbone and pretrained weights come from the original repository. With this API, you can encode or decode in a single line of code on either CPU or GPU. Furthermore, it supports chunk-based processing to keep memory usage low, which is especially important on GPU.
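To illustrate the general idea behind chunk-based processing, here is a minimal sketch that splits a long sequence of samples into fixed-size pieces so that only one piece needs to be held in memory at a time. It is not the model's internal implementation, which this card does not show; `iter_chunks` and `chunk_size` are hypothetical names chosen for the example.

```python
from typing import Iterator, Sequence


def iter_chunks(samples: Sequence[float], chunk_size: int) -> Iterator[Sequence[float]]:
    """Yield consecutive, non-overlapping chunks of at most `chunk_size` samples.

    Hypothetical helper for illustration only; the last chunk may be shorter.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be a positive integer")
    for start in range(0, len(samples), chunk_size):
        yield samples[start:start + chunk_size]


# Example: a 10-sample signal processed 4 samples at a time.
signal = list(range(10))
chunks = list(iter_chunks(signal, 4))
```

Each chunk could then be processed independently and the results concatenated, so peak memory scales with the chunk size rather than with the full audio length.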

Model variations

Three model variants are available, one for each input audio sampling rate.

Model	Input audio sampling rate [kHz]
`hance-ai/descript-audio-codec-44khz`	44.1
`hance-ai/descript-audio-codec-24khz`	24
`hance-ai/descript-audio-codec-16khz`	16
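Because each variant expects a specific sampling rate, it can help to pick the repository name programmatically. The sketch below builds a lookup from the table above; `DAC_MODELS` and `select_dac_model` are hypothetical names introduced for this example and are not part of the model's API.

```python
# Supported input sampling rates (in Hz) mapped to the repositories listed above.
DAC_MODELS = {
    44100: "hance-ai/descript-audio-codec-44khz",
    24000: "hance-ai/descript-audio-codec-24khz",
    16000: "hance-ai/descript-audio-codec-16khz",
}


def select_dac_model(sample_rate_hz: int) -> str:
    """Return the repository name matching the sampling rate (hypothetical helper)."""
    if sample_rate_hz not in DAC_MODELS:
        supported = ", ".join(str(rate) for rate in sorted(DAC_MODELS))
        raise ValueError(
            f"Unsupported sampling rate {sample_rate_hz} Hz; expected one of: {supported}"
        )
    return DAC_MODELS[sample_rate_hz]


# Example: choose the variant for 16 kHz audio.
model_name = select_dac_model(16000)
```

The returned name can be passed to `AutoModel.from_pretrained` as shown in the Usage section. Audio recorded at any other rate would need to be resampled first or handled by a different variant.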

Dependency

See requirements.txt.

Usage

Load

from transformers import AutoModel

# device setting
device = 'cpu'  # or 'cuda:0'

# load
model = AutoModel.from_pretrained('hance-ai/descript-audio-codec-16khz', trust_remote_code=True)
model.to(device)

Encode

audio_filename = 'path/example_audio.wav'
zq, s = model.encode(audio_filename)

`zq` contains the discrete embeddings, with shape (1, num_RVQ_codebooks, token_length), and `s` is the token sequence, with shape (1, num_RVQ_codebooks, token_length).

Decode

# decoding from `zq`
waveform = model.decode(zq=zq)  # (1, 1, audio_length); the output has a mono channel.

# decoding from `s`
waveform = model.decode(s=s)  # (1, 1, audio_length); the output has a mono channel.

Save a waveform as an audio file

model.waveform_to_audiofile(waveform, 'out.wav')

Save and load tokens

model.save_tensor(s, 'tokens.pt')
loaded_s = model.load_tensor('tokens.pt')

References

[1] Kumar, Rithesh, et al. "High-fidelity audio compression with improved RVQGAN." Advances in Neural Information Processing Systems 36 (2024).