What tokenizer.json did you use?

#3
by arrivederci19 - opened

I'm currently trying to train a Dutch Tortoise model using the mrq repo, but I'm getting gibberish. What tokenizer.json did you use? Did you modify the original, create a new one, or use a phonetic language tokenizer?

Thanks a lot in advance!

Since almost all French letters also appear in English, I didn't find it necessary to use a different tokenizer.

Yes, but the tokenizer also groups together sounds that are often used together. I have had success using my own generated tokenizer for Dutch now. I believe it would speed up your training, if you ever decide to come back to it.
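To illustrate what "groups together sounds that are often used together" means, here is a minimal pure-Python sketch of byte-pair-style merging. It is only a toy under stated assumptions: the function names (`most_frequent_pair`, `merge_pair`, `learn_merges`) and the four-word Dutch sample are made up for illustration, and this is not the code the mrq repo uses to build tokenizer.json.

```python
from collections import Counter


def most_frequent_pair(words):
    """Return the most frequent adjacent symbol pair, or None if there is none."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0] if pairs else None


def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def learn_merges(corpus, num_merges):
    """Learn up to `num_merges` merges from a whitespace-separated corpus."""
    # Start with every word split into single characters.
    words = dict(Counter(tuple(w) for w in corpus.lower().split()))
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(words)
        if pair is None:
            break
        merges.append(pair)
        words = merge_pair(words, pair)
    return merges, words


# Hypothetical sample: four Dutch words that all begin with "sch".
merges, words = learn_merges("schip school schaap schoen", 2)
print(merges)
print(list(words))
```

After two merges on this sample, each word starts with a single `sch` symbol, because that cluster is the most frequent in the data. A tokenizer trained on English text would learn English clusters instead, which is one possible reason a language-specific tokenizer helps.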

I thought I'd make one last good model anyway. I tried a French tokenizer, but I got incomprehensible results with the new tokenizer and I don't know why. I'm going to run another training, this time using mrq's web UI; maybe that will help.