What tokenizer.json did you use?
I'm currently trying to train a Dutch Tortoise model using the mrq repo. However, I'm getting gibberish. What tokenizer.json did you use? Did you modify the original, create a new one, or use a phonetic language tokenizer?
Thanks a lot in advance!
Since almost all the French letters exist in English, I didn't find it necessary to use another tokenizer.
Yes, but the tokenizer also groups together sounds that are often used together. I have had success using my own generated tokenizer for Dutch now. I believe it would speed up your training, if you ever decide to come back to it.
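For anyone else landing here: a minimal sketch of how a custom tokenizer.json can be generated with the HuggingFace `tokenizers` library (an assumption on my part — the thread doesn't say which tool was used, but Tortoise loads tokenizers in this format). The tiny inline Dutch corpus and the filename `dutch_tokenizer.json` are just placeholders; for real training you'd feed it your full transcript text.

```python
# Sketch: train a small BPE tokenizer so frequent Dutch letter
# combinations get merged into single tokens. Assumes the HuggingFace
# `tokenizers` package is installed; corpus and vocab size are
# illustrative placeholders, not the values used in this thread.
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()

trainer = trainers.BpeTrainer(
    vocab_size=256,  # keep it small, in the spirit of Tortoise's default tokenizer
    special_tokens=["[STOP]", "[UNK]", "[SPACE]"],
)

# Placeholder corpus; in practice, iterate over your dataset transcripts.
corpus = [
    "dit is een voorbeeldzin in het nederlands",
    "de tokenizer leert veelvoorkomende lettercombinaties",
    "een goede tokenizer versnelt de training",
]
tokenizer.train_from_iterator(corpus, trainer=trainer)

# Save as a drop-in replacement for the repo's tokenizer.json.
tokenizer.save("dutch_tokenizer.json")

print(tokenizer.encode("dit is een test").tokens)
```

Pointing the training config at the saved file instead of the stock English tokenizer.json is then the only change needed, at least in the setups I've seen.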
I thought I'd make one last good model anyway. I tried a new French tokenizer and got incomprehensible results with it, I don't know why. I'm going to do another training run, but using mrq's webui; maybe that will help.