How to run the model in smartphone apps?
Hi, how can I use this model in an iOS or Android app?
I don't have mobile deployment recipes readily available right now, and FP16 inference is likely a soft blocker for (good) mobile deployment anyway.
A while back I confirmed that a Kokoro-type model runs FP32 inference on an iPhone 13 Pro at ~0.6 RTF after the model weights have been loaded (cold start of course takes longer). RTF = time_to_generate / length_of_audio_generated; lower is faster. At 0.6 RTF, it takes 3 seconds to produce 5 seconds of audio. Although that's technically "faster than realtime", it's also a lot of latency, and probably nobody wants to wait that long just for 5 seconds of audio. Further acceleration is required for a nice UX, which can hopefully be achieved via more optimized inference code; otherwise, an architectural switch could be in order.
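To make the arithmetic above concrete, here's a minimal sketch of the RTF relationship (the function names are just illustrative, not from any Kokoro codebase):

```python
# Illustrative only: the real-time factor (RTF) relationship described above.
# RTF = time_to_generate / length_of_audio_generated; lower is faster.

def rtf(time_to_generate: float, audio_length: float) -> float:
    """Real-time factor: generation time divided by duration of audio produced."""
    return time_to_generate / audio_length

def generation_time(rtf_value: float, audio_length: float) -> float:
    """Seconds needed to synthesize `audio_length` seconds of audio at a given RTF."""
    return rtf_value * audio_length

# At 0.6 RTF, 5 seconds of audio takes 3 seconds to generate:
print(generation_time(0.6, 5.0))  # → 3.0
```

Anything under 1.0 is "faster than realtime", but for interactive use you'd want the RTF well below that (or streaming output) so the user isn't waiting seconds before playback starts.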
I haven't tested it on Android, but some people in the Discord server may have deployed there with ONNX, iirc.
Closing to clean up the Community tab—recommend you also check out the Discord server for questions like this.