calibration dataset language

#1
by Erilaz - opened

Since this is an i1 quant of a merge of two dedicated Russian-language fine-tunes, here's an obvious question: does it use the same standard calibration dataset as all other models, or does it account for the language and the Cyrillic script?

It uses the same standard dataset as all other models, which contains some Cyrillic, but very little. We don't know how this affects the quants, but it would be prudent to assume some cost to the Russian-language capabilities, so you might want to go for the static quants. We'd be happy if somebody made some objective measurements of this, by the way, because AFAICS nobody knows how big the effect is.
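
If somebody wants to put a rough number on "very little", here is a minimal sketch that estimates what share of the letters in a text belongs to the Cyrillic script. The function name `script_share` is hypothetical and you would have to supply the calibration text yourself. This is only a crude proxy for language coverage and says nothing about how the quants actually behave.

```python
import unicodedata


def script_share(text, script_prefix="CYRILLIC"):
    """Return the fraction of alphabetic characters whose Unicode name
    starts with script_prefix (0.0 if the text contains no letters)."""
    letters = 0
    matching = 0
    for ch in text:
        if not ch.isalpha():
            continue  # skip digits, whitespace, punctuation
        letters += 1
        # unicodedata.name returns e.g. "CYRILLIC SMALL LETTER PE"
        if unicodedata.name(ch, "").startswith(script_prefix):
            matching += 1
    return matching / letters if letters else 0.0
```

For example, `script_share(open(path, encoding="utf-8").read())` on a plain-text copy of the calibration data would give the Cyrillic share by letter count, where `path` is wherever you saved that text.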

You can check which set was used (at least for quants from the last few months), as every quant contains the filename of the imatrix training data (the quant browser on Hugging Face should be able to show it). Our current standard set is called "imatrix-training-full-3".
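
For a file you have already downloaded, the same information can be read from the GGUF header. Below is a minimal sketch of a standard-library-only metadata reader, written against the published GGUF layout (version 2 and later, little-endian). The function name `read_gguf_metadata` is hypothetical, and the assumption that the imatrix entries live under keys starting with `quantize.imatrix.` is mine and should be checked against what the quant browser shows.

```python
import struct

# GGUF scalar value types mapped to little-endian struct format characters.
_SCALARS = {0: "B", 1: "b", 2: "H", 3: "h", 4: "I", 5: "i",
            6: "f", 7: "?", 10: "Q", 11: "q", 12: "d"}
_STRING, _ARRAY = 8, 9


def _read(f, fmt):
    """Read one little-endian scalar described by a struct format char."""
    size = struct.calcsize("<" + fmt)
    data = f.read(size)
    if len(data) != size:
        raise ValueError("unexpected end of file")
    return struct.unpack("<" + fmt, data)[0]


def _read_string(f):
    """GGUF strings are a uint64 length followed by UTF-8 bytes."""
    length = _read(f, "Q")
    return f.read(length).decode("utf-8", errors="replace")


def _read_value(f, vtype):
    if vtype in _SCALARS:
        return _read(f, _SCALARS[vtype])
    if vtype == _STRING:
        return _read_string(f)
    if vtype == _ARRAY:
        elem_type = _read(f, "I")
        count = _read(f, "Q")
        return [_read_value(f, elem_type) for _ in range(count)]
    raise ValueError(f"unknown GGUF value type {vtype}")


def read_gguf_metadata(path, prefix=""):
    """Return the metadata key/value pairs whose key starts with prefix."""
    with open(path, "rb") as f:
        if f.read(4) != b"GGUF":
            raise ValueError("not a GGUF file")
        version = _read(f, "I")
        if version < 2:
            raise ValueError("GGUF v1 uses 32-bit counts; not handled here")
        _read(f, "Q")  # tensor count, not needed for metadata
        kv_count = _read(f, "Q")
        result = {}
        for _ in range(kv_count):
            key = _read_string(f)
            value = _read_value(f, _read(f, "I"))
            if key.startswith(prefix):
                result[key] = value
        return result
```

Calling `read_gguf_metadata("some-model.i1-Q4_K_M.gguf", prefix="quantize.imatrix.")` (the filename is a placeholder) should then list whatever imatrix-related entries the quant carries. The sketch parses every entry, including the tokenizer arrays, so it can take a moment on real models.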
