What tokenizer.json did you use?
I'm currently trying to train a Dutch Tortoise model using the mrq repo. However, I'm getting gibberish. What tokenizer.json did you use? Did you modify the original, create a new one, or use a phonetic language tokenizer?
Thanks a lot in advance!
Since almost all the French letters exist in English, I didn't find it necessary to use another tokenizer.
Yes, but the tokenizer also groups together sounds that are often used together. I have had success using my own generated tokenizer for Dutch now. I believe it would speed up your training, if you ever decide to come back to it.
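For anyone else landing here: a minimal sketch of how a custom tokenizer.json can be generated with the HuggingFace `tokenizers` library (an assumption on my part — the thread doesn't say which tool was used, but Tortoise loads tokenizers in this format). The tiny inline Dutch corpus and the filename `dutch_tokenizer.json` are just placeholders; for real training you'd feed it your full transcript text.

```python
# Sketch: train a small BPE tokenizer so frequent Dutch letter
# combinations get merged into single tokens. Assumes the HuggingFace
# `tokenizers` package is installed; corpus and vocab size are
# illustrative placeholders, not the values used in this thread.
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()

trainer = trainers.BpeTrainer(
    vocab_size=256,  # keep it small, in the spirit of Tortoise's default tokenizer
    special_tokens=["[STOP]", "[UNK]", "[SPACE]"],
)

# Placeholder corpus; in practice, iterate over your dataset transcripts.
corpus = [
    "dit is een voorbeeldzin in het nederlands",
    "de tokenizer leert veelvoorkomende lettercombinaties",
    "een goede tokenizer versnelt de training",
]
tokenizer.train_from_iterator(corpus, trainer=trainer)

# Save as a drop-in replacement for the repo's tokenizer.json.
tokenizer.save("dutch_tokenizer.json")

print(tokenizer.encode("dit is een test").tokens)
```

Pointing the training config at the saved file instead of the stock English tokenizer.json is then the only change needed, at least in the setups I've seen.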
I thought I'd make one last good model anyway. I tried a new French tokenizer and got incomprehensible results with it, I don't know why. I'm going to do another training run, but using mrq's webui; maybe that will help.