Sounds very different than demo when running locally

#32
by chibop - opened

I generated audio from the text at the link below using the Sky voice.

https://huggingface.co./hexgrad/Kokoro-82M/raw/main/demo/af_sky.txt

However, the output is very different from the file in the demo folder.

https://huggingface.co./hexgrad/Kokoro-82M/blob/main/demo/af_sky.wav

Here's my output.

It's much grainier in the mid frequencies and has a slight pingy noise in the high frequencies.

I'd appreciate any help!

@chibop It seems like something is wrong with your local install. First, run this cell on Google Colab (CPU is fine):

# 1️⃣ Install dependencies silently
!git lfs install
!git clone https://huggingface.co./hexgrad/Kokoro-82M
%cd Kokoro-82M
!apt-get -qq -y install espeak-ng > /dev/null 2>&1
!pip install -q phonemizer torch transformers scipy munch

# 2️⃣ Build the model and load the default voicepack
from models import build_model
import torch
device = 'cuda' if torch.cuda.is_available() else 'cpu'
MODEL = build_model('kokoro-v0_19.pth', device)
VOICE_NAME = [
    'af', # Default voice is a 50-50 mix of Bella & Sarah
    'af_bella', 'af_sarah', 'am_adam', 'am_michael',
    'bf_emma', 'bf_isabella', 'bm_george', 'bm_lewis',
    'af_nicole', 'af_sky',
][-1]
VOICEPACK = torch.load(f'voices/{VOICE_NAME}.pt', weights_only=True).to(device)
print(f'Loaded voice: {VOICE_NAME}')

# 3️⃣ Call generate, which returns 24 kHz audio and the phonemes used
from kokoro import generate
text = """
Last September, I received an offer from Sam Altman, who wanted to hire me to voice the current ChatGPT 4 system. He told me that he felt that by my voicing the system, I could bridge the gap between tech companies and creatives and help consumers to feel comfortable with the seismic shift concerning humans and AI. He said he felt that my voice would be comforting to people.

After much consideration and for personal reasons, I declined the offer. Nine months later, my friends, family and the general public all noted how much the newest system named Sky sounded like me.
"""
audio, out_ps = generate(MODEL, text, VOICEPACK, lang=VOICE_NAME[0])
# Language is determined by the first letter of the VOICE_NAME:
# 🇺🇸 'a' => American English => en-us
# 🇬🇧 'b' => British English => en-gb

# 4️⃣ Display the 24 kHz audio and print the output phonemes
from IPython.display import display, Audio
display(Audio(data=audio, rate=24000, autoplay=True))
print(out_ps)

Edit: This cuts off at "Nine mon-", which is expected due to the context length. To generate the rest, you would chunk the text and loop over the chunks, ideally splitting at natural stops rather than in the middle of a word.
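
Here is a rough sketch of that chunking idea, not something taken from the repo. The helper name chunk_text and the max_chars limit are assumptions for illustration: the real limit is on the model's context length rather than on characters, so the number would need tuning. It only breaks after sentence-ending punctuation, so a word is never cut in half.

```python
import re

def chunk_text(text, max_chars=300):
    """Group sentences into chunks of roughly max_chars characters.

    max_chars is a hypothetical stand-in for the model's context limit.
    A single sentence longer than max_chars is kept whole as its own chunk.
    """
    # Split after '.', '!' or '?' followed by whitespace; punctuation stays attached.
    sentences = [s.strip() for s in re.split(r'(?<=[.!?])\s+', text.strip()) if s.strip()]
    chunks, current = [], ''
    for sentence in sentences:
        candidate = f'{current} {sentence}'.strip()
        if len(candidate) <= max_chars or not current:
            # Still fits (or the chunk is empty), so keep extending it.
            current = candidate
        else:
            # Adding this sentence would overflow: close the chunk and start a new one.
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks
```

With the cell above, you would then call generate(MODEL, chunk, VOICEPACK, lang=VOICE_NAME[0]) once per chunk and join the results, for example with numpy.concatenate if the returned audio is a NumPy array (I have not verified the return type here).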

Once you get the output audio, you need to find the difference between Colab and your local install. What does pip show torch transformers give you?
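
If it is easier than reading pip show output, a small script like the one below could be run in both Colab and the local environment and the two printouts compared. This is only a sketch: the helper name package_versions is made up for illustration, and the package list is simply the set installed in the cell above.

```python
from importlib.metadata import PackageNotFoundError, version

def package_versions(names):
    """Return {package name: installed version, or None if it is not installed}."""
    found = {}
    for name in names:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found

if __name__ == '__main__':
    # Packages taken from the pip install line in the Colab cell.
    packages = ['torch', 'transformers', 'phonemizer', 'scipy', 'munch']
    for name, ver in package_versions(packages).items():
        print(f'{name}: {ver}')
```

Any package whose version differs between the two environments, or shows None locally, is a reasonable first suspect.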
