Chinese very bad

#12

by lucasjin - opened Jun 22, 2024

Discussion

lucasjin

Jun 22, 2024

haipingwu

Microsoft org Jun 22, 2024

Hi, the released Florence-2 models are English only.

lucasjin

Jun 22, 2024

Would you consider adding Chinese support, since it's a "Foundation Vision Model"?

quyettv

Jun 24, 2024

I don't think Chinese is the basis. You are wrong to think that all models must support Chinese. By default they are in English; that is the foundation. You should not expect Chinese. Please consider English, not Chinese.

lucasjin

Jun 24, 2024

You are wrong to think that all models must be in English.

pcernuta

Jul 6, 2024

edited Jul 6, 2024

You are wrong, this is clear. All foundation vision models must be in Slovene. This way is fair for English and Chinese.

lucasjin

Jul 7, 2024

Yes, I am training a multilingual Florence-2 now. So far so good, but it does not include Slovene, sorry.

xray1111

Jul 18, 2024

I'm curious: since the vocabulary list doesn't include any CJK characters, how can the model output so many Chinese characters, even though most of them are incorrect?

lucasjin

Jul 18, 2024

No, the text encoder can output CJK. Actually, I have tuned a new Florence model, and it can now do Chinese OCR pretty well.

lucasjin

Jul 18, 2024

Although this is currently only roughly tuned, it could get better when trained on more data.

xray1111

Jul 18, 2024

I don't see any CJK characters in the original vocab.json of Florence-2-large, so I guess you must have extended the vocabulary before the Chinese OCR fine-tuning task?

And I still don't understand why it can output Chinese characters in your first post. Had you already extended the vocab before inference?
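
For anyone who wants to repeat the check, here is a minimal sketch of scanning a vocab for CJK entries. The helper names (`is_cjk`, `cjk_tokens`, `load_vocab`) are hypothetical, the sample dict is made up for illustration, and the code point ranges cover only the common CJK ideograph, kana, and hangul blocks, so treat it as a rough check rather than a complete one.

```python
import json


def is_cjk(ch: str) -> bool:
    """Rough test for common CJK code points (not exhaustive)."""
    cp = ord(ch)
    return (
        0x4E00 <= cp <= 0x9FFF      # CJK Unified Ideographs
        or 0x3400 <= cp <= 0x4DBF   # CJK Extension A
        or 0x3040 <= cp <= 0x30FF   # Hiragana and Katakana
        or 0xAC00 <= cp <= 0xD7AF   # Hangul syllables
    )


def cjk_tokens(vocab: dict) -> list:
    """Return the vocab keys that contain at least one CJK character."""
    return [tok for tok in vocab if any(is_cjk(ch) for ch in tok)]


def load_vocab(path: str) -> dict:
    """Load a token -> id mapping from a JSON file such as vocab.json."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


# Made-up sample entries, not taken from the real vocab file.
sample_vocab = {"Ġhello": 0, "Ġworld": 1, "ocr": 2}
print(cjk_tokens(sample_vocab))
```

With a local copy of the file, `cjk_tokens(load_vocab("vocab.json"))` would list any entries that literally contain CJK characters.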

xray1111

Jul 18, 2024

And still, good work!

lucasjin

Jul 18, 2024

Oh, yes, you were right.

I rechecked the vocab, and it doesn't have CJK.

Very strange....

xray1111

Jul 18, 2024

Does that mean you fine-tuned Florence-2 on Chinese OCR training data without extending the vocab, and still got a pretty decent result?

lucasjin

Jul 18, 2024

Yes, yes, but as you can see in the first post, the raw Florence-2 can also print Chinese.

I haven't tried, but I think it might be able to encode a Chinese character to an ID and decode it back.

allenz24

Nov 4, 2024

Maybe we should try to explore the logic behind this. I am also curious, and I am trying to understand why the model can output CJK.
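
One possible explanation, stated as an assumption rather than something verified in this thread: if the Florence-2 tokenizer is a GPT-2/BART-style byte-level BPE, then every UTF-8 byte has its own token, and vocab.json stores those bytes as stand-in printable characters instead of the original text. A Chinese character would then never appear literally in the vocab, yet it could still be encoded and decoded as a short sequence of byte tokens. The sketch below reimplements that byte-to-character table in plain Python; `to_vocab_form` and `from_vocab_form` are hypothetical helper names, and nothing here loads the real Florence-2 tokenizer.

```python
def bytes_to_unicode() -> dict:
    """Byte -> printable character table in the style of GPT-2 byte-level BPE."""
    printable = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    # Printable bytes map to themselves.
    table = {b: chr(b) for b in printable}
    # Remaining bytes (control chars, space, etc.) are shifted above U+0100.
    shift = 0
    for b in range(256):
        if b not in table:
            table[b] = chr(256 + shift)
            shift += 1
    return table


BYTE_TO_CHAR = bytes_to_unicode()
CHAR_TO_BYTE = {c: b for b, c in BYTE_TO_CHAR.items()}


def to_vocab_form(text: str) -> str:
    """Show how text would be spelled in a byte-level vocab file."""
    return "".join(BYTE_TO_CHAR[b] for b in text.encode("utf-8"))


def from_vocab_form(spelled: str) -> str:
    """Invert to_vocab_form: stand-in characters -> bytes -> text."""
    return bytes(CHAR_TO_BYTE[c] for c in spelled).decode("utf-8")


spelled = to_vocab_form("中文")
print(spelled)                   # Latin-looking stand-ins, no CJK characters
print(from_vocab_form(spelled))  # round-trips back to the original text
```

If this assumption holds, it would also fit the observation that the raw model prints mostly incorrect Chinese: it can emit byte sequences that decode to CJK characters, but it was not trained to choose the right ones.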
