---
tags:
- audio
- text-to-speech
- onnx
base_model:
- hexgrad/Kokoro-82M
inference: false
language: en
license: apache-2.0
library_name: txtai
---

# Kokoro fp16 Model for ONNX

[Kokoro 82M](https://huggingface.co/hexgrad/Kokoro-82M) exported to ONNX in fp16 precision. This model is from [this GitHub repo](https://github.com/taylorchu/kokoro-onnx/releases/). The voices file is from [this repository](https://github.com/thewh1teagle/kokoro-onnx/releases/tag/model-files).
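
As a rough illustration of what an fp16 export means, the sketch below estimates the raw storage needed for the weights at 32-bit and 16-bit precision. It assumes the nominal 82 million parameters implied by the model name and ignores graph structure, metadata and any tensors kept at higher precision, so it is an approximation rather than the actual file size.

```python
import numpy as np

# Assumption: nominal parameter count taken from the name "Kokoro-82M"
NUM_PARAMS = 82_000_000


def weight_bytes(num_params: int, dtype) -> int:
    """Bytes needed to store num_params values of the given dtype (weights only)."""
    return num_params * np.dtype(dtype).itemsize


fp32 = weight_bytes(NUM_PARAMS, np.float32)
fp16 = weight_bytes(NUM_PARAMS, np.float16)
print(f"fp32: {fp32 / 1e6:.0f} MB, fp16: {fp16 / 1e6:.0f} MB")

# Halving the size costs precision: 0.1 is not exactly representable in fp16
print(float(np.float16(0.1)))
```

Storing values in half precision halves the weight size at the cost of fewer significant bits per value.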

## Usage with txtai

[txtai](https://github.com/neuml/txtai) has a built-in Text to Speech (TTS) pipeline that makes using this model easy.

_Note: This requires txtai >= 8.3.0. Install from GitHub until that release._

```python
import soundfile as sf

from txtai.pipeline import TextToSpeech

# Build pipeline
tts = TextToSpeech("NeuML/kokoro-fp16-onnx")

# Generate speech
speech, rate = tts("Say something here")

# Write to file
sf.write("out.wav", speech, rate)
```

## Usage with ONNX

This model can also be run directly with ONNX Runtime, provided the input text is tokenized first. Tokenization can be done with [ttstokenizer](https://github.com/neuml/ttstokenizer), a permissively licensed library with no external dependencies (such as espeak).

Note that the txtai pipeline has additional functionality, such as batching large inputs, that would need to be reimplemented when using this method.
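
One way to handle long inputs is to split the text into smaller chunks, synthesize each chunk and concatenate the audio. The sketch below assumes a `synthesize` callable that maps a text chunk to a 1-D audio array (for example, your own wrapper around the tokenize-and-run steps in the ONNX example). `split_text` and `synthesize_long` are hypothetical helpers, and a character budget stands in for a proper token count; this is not necessarily how txtai batches its inputs.

```python
import re
from typing import Callable, List

import numpy as np


def split_text(text: str, max_chars: int = 200) -> List[str]:
    """Greedily pack sentences into chunks of at most max_chars characters.

    A single sentence longer than max_chars becomes its own chunk.
    """
    # Split after sentence-ending punctuation followed by whitespace
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            # Current chunk is full: emit it and start a new one
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


def synthesize_long(
    text: str, synthesize: Callable[[str], np.ndarray], max_chars: int = 200
) -> np.ndarray:
    """Synthesize each chunk separately and join the audio end to end."""
    pieces = [
        np.asarray(synthesize(chunk), dtype=np.float32).reshape(-1)
        for chunk in split_text(text, max_chars)
    ]
    return np.concatenate(pieces) if pieces else np.zeros(0, dtype=np.float32)
```

Splitting on sentence punctuation keeps most chunk boundaries at natural pauses; a sentence longer than `max_chars` is passed through as a single chunk.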

```python
import json

import numpy as np
import onnxruntime
import soundfile as sf

from ttstokenizer import IPATokenizer

# This example assumes the files have been downloaded locally
with open("kokoro-fp16-onnx/voices.json", "r", encoding="utf-8") as f:
    voices = json.load(f)

# Create model
model = onnxruntime.InferenceSession(
    "kokoro-fp16-onnx/model.onnx",
    providers=["CPUExecutionProvider"]
)

# Create tokenizer
tokenizer = IPATokenizer()

# Tokenize inputs
inputs = tokenizer("Say something here")

# Get speaker array
speaker = np.array(voices["af"], dtype=np.float32)

# Generate speech
outputs = model.run(None, {
    # Pad the token sequence with 0 at both ends and add a batch dimension
    "tokens": np.array([[0, *inputs, 0]], dtype=np.int64),
    # Select the style vector that matches the input length
    "style": speaker[len(inputs)],
    "speed": np.ones(1, dtype=np.float32) * 1.0
})

# Write to file (24 kHz sample rate)
sf.write("out.wav", outputs[0], 24000)
```
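
The input construction in the ONNX example can be factored into a small helper that also makes the expected shapes explicit. This is a sketch: `build_feed` is a hypothetical name, it assumes the voice array's first axis is indexed by token count (as `speaker[len(inputs)]` implies), and the bounds check is an addition that the example above does not perform.

```python
import numpy as np


def build_feed(tokens, voice, speed: float = 1.0) -> dict:
    """Assemble the ONNX input dictionary for one utterance.

    Assumes `tokens` is a 1-D sequence of token ids and that the first axis
    of `voice` is indexed by the (unpadded) number of tokens.
    """
    tokens = np.asarray(tokens, dtype=np.int64).reshape(-1)
    voice = np.asarray(voice, dtype=np.float32)
    if len(tokens) >= voice.shape[0]:
        raise ValueError(
            f"{len(tokens)} tokens exceeds the maximum of "
            f"{voice.shape[0] - 1} supported by this voice array"
        )

    return {
        # Pad with a 0 token at both ends, then add a batch axis: shape (1, n + 2)
        "tokens": np.pad(tokens, (1, 1))[np.newaxis, :],
        # Style vector chosen by the unpadded token count
        "style": voice[len(tokens)],
        # Speed multiplier as a 1-element float32 array
        "speed": np.full(1, speed, dtype=np.float32),
    }
```

With the objects from the example, the call would be `model.run(None, build_feed(inputs, voices["af"]))`.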