Errors in Tokenizer „convert_id_to_token“
#3, opened by michaelfeil
Dear Authors,
We are integrating CodeGen2.5 into CTranslate2, an open-source inference engine.
I previously wrote the code to do so for CodeGen1 and 2.
CTranslate2 encodes the vocabulary to ids. It seems like there is a problem with some tokens and their byte/utf-8/.. encoding.
See:
https://github.com/OpenNMT/CTranslate2/issues/1334
Do you have any guidance or intuition on how this bug might be resolved?
Not sure if this will help, but I was running into something similar. I found that tokens 94-187 aren't valid UTF-8 characters, and after experimenting a bit it seemed like they wanted to be decoded using 'latin-1' encoding.
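
In case it helps, here is a minimal sketch of that fallback. It is not the CodeGen2.5 tokenizer's or CTranslate2's actual code. The helper name `decode_token_bytes` and the sample byte strings are made up for illustration, and it assumes you already have each token's raw bytes in hand. It tries UTF-8 first and falls back to Latin-1 only when the bytes are not valid UTF-8 on their own.

```python
def decode_token_bytes(raw: bytes) -> str:
    """Decode a token's raw bytes, falling back to Latin-1 when UTF-8 fails.

    Hypothetical helper for illustration; not part of any tokenizer API.
    """
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Latin-1 maps every byte 0x00-0xFF to the code point with the same
        # value, so this decode cannot fail and the original bytes can be
        # recovered later with .encode("latin-1").
        return raw.decode("latin-1")


# Made-up sample inputs: plain ASCII, a valid two-byte UTF-8 sequence,
# and two lone high bytes that are not valid UTF-8 by themselves.
samples = [b"def", "é".encode("utf-8"), b"\xe9", b"\xc3"]

for raw in samples:
    print(raw, "->", repr(decode_token_bytes(raw)))
```

One caveat with this approach: a lone high byte and a genuine multi-byte character can end up as the same string (both `b"\xe9"` and the UTF-8 encoding of "é" come out as "é" here), so if the vocabulary needs a one-to-one mapping you may have to decode every token with Latin-1 consistently instead of mixing the two.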