Is there a way to expose an additional output array of floats for phoneme timings?
After doing some experiments in Unity (it runs amazingly well), I looked inside the Lessac and Ryan models. We want to drive a skinned mesh renderer that speaks in sync with the phonemes' exact timing, which would require an additional output from the ONNX models themselves; it wouldn't need to change the main output tensor's shape.
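For the lip-sync side, if per-phoneme durations were available, turning them into clip-relative start/end times would just be a running sum. Here's a minimal sketch of what I mean; it assumes durations arrive as mel-frame counts (the way a VITS-style duration predictor emits them), and the `hop_length=256` / `sample_rate=22050` values are guesses based on the typical medium-model config, not something confirmed from the models:

```python
# Hypothetical consumer side: convert per-phoneme durations (mel-frame
# counts) into (phoneme_id, start_s, end_s) tuples for driving blend
# shapes. hop_length and sample_rate are assumed values, not taken
# from the actual Piper model configs.
def phoneme_timings(durations, phoneme_ids, hop_length=256, sample_rate=22050):
    timings, t = [], 0.0
    for pid, frames in zip(phoneme_ids, durations):
        dt = frames * hop_length / sample_rate  # frames -> seconds
        timings.append((pid, t, t + dt))
        t += dt
    return timings

# e.g. three phonemes lasting 10, 20 and 5 frames:
print(phoneme_timings([10, 20, 5], [14, 27, 31]))
```

Each tuple's end time is the next tuple's start time, so the mesh animation can key directly off the audio playback clock.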
I found the model hard to read after viewing it in onnx-modifier; I'm not too familiar with the conventions in there, but it looked like there were a couple of Slice and Concat spots that took multiple lines of throughput and combined or split them. I'm guessing one of those areas could be a buildup point for an array of ints describing the sample length of each synthesized phoneme's audio. If a second output tensor could be added carrying a supplementary array of the phoneme timings, then, since the phoneme IDs from the phonemizer output are already known, you could make a model speak live off of it accurately, no matter how eccentric the training audio is or how blown out the parameters are on the model input.
Anyway, just a thought. It doesn't seem too hard, but I'm not an expert.
Link to discussion on Github: https://github.com/rhasspy/piper/discussions/425