Question on target-language quantization

#1
by robbiemu - opened

Hey @bartowski I see you quantized this model :) I was considering building a custom imatrix (for the target languages) and quantizing this with my https://github.com/robbiemu/llama-gguf-optimize tool. I don't know how long it would take me to handle such a large model though, since I don't have the hardware. Wondering if you might be interested in collaborating?

@robbiemu hey, missed this one at the time. Yeah, I'm interested! Always looking for ways to improve the quants :)

I'm not positive there's as much room for gain as one might hope. I did some mild testing a few months back (that I want to do more of) which showed that performance is relatively universal even with none of the target language in the imatrix dataset. But like I said, improvements are improvements, and if the dataset size doesn't go up I'm definitely interested in refining it.

I submitted a request about their data to SEA AI Lab ( @sail ) via their website. I'll also start after the new year on formulating alternatives if they can't share any data or don't get back to me.

@bartowski Happy New Year! I got some welcome and orientation from the Sailor team, and have started to download their pretraining data. I mentioned you elsewhere with a link that shows the raw commands I am using so far. I also have a log of the work that I am keeping, but I haven't yet figured out where to post it. I was thinking it might be helpful to create an org for this project so we can work from the same repo?

I was hoping that maybe I could do the work on the 1B model, and we could discuss and make sure it's going the right way together, and then you could redo it on the 20B? I want to see how much the results change across model sizes. That's something I don't think anyone has done before in llama.cpp (that I can find, anyway). And this lets me give you a recipe and guide above and beyond the usage guide: https://github.com/robbiemu/llama-gguf-optimize/blob/main/USAGE.md

my current log of work:
This will record the process of setting up an importance matrix dataset.

Setup

Get the model

First, we just get the model and GGUFs.
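Something like this is enough for the download step (sketch only; the repo id shown is an assumption, adjust it to the exact Sailor2 checkpoint being quantized):

```python
# Minimal sketch: fetch the HF checkpoint locally before converting to GGUF.
# The repo id here is an assumption; swap in the actual Sailor2 checkpoint used.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(repo_id="sail/Sailor2-1B")
print(f"Model files downloaded to {local_dir}")
# GGUF conversion is then done with llama.cpp's convert script
# (e.g. convert_hf_to_gguf.py) pointed at local_dir.
```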

Get the dataset

Sailor2's language tags (from the readme) are:

language: 
- en 
- zh 
- id 
- th 
- vi 
- ms 
- lo 
- my 
- jv 
- km 
- su 
- tl

They provide various datasets publicly (see https://huggingface.co./sailor2)

We need to select from the dataset. Several different sources are provided, across both pre- and post-training. The post-training data will have structure and domains of competency that we may not want to consider when preparing an importance matrix aimed at language competencies.

We will sample the most promising sources to find the initial KL-div values (I think you could just use perplexity if you preferred, but let's stick to a single measure).

This one looks promising: sailor2/sea-commoncrawl-high-quality. Looking at the page on Hugging Face, I see that I want the text values from this dataset, which is what the Oscar plugin does. But looking closer, there are no per-language splits. I can still get the data by language using the folders they provide, but a second issue appears when we do: not all languages are available. There are no folders for the en, zh, or ms languages, so we will need to sample another source for those. I will need to write a custom plugin for this dataset, which is probably going to be a common task, but it's really quite simple. We write the plugin and download the data.
It turns out that the only data source listed in pre-training for English and Chinese is sailor2/sailor2-pretrain-data-stage1. It also contains Standard Malay, the other missing language. This is pre-shuffled, unlabelled data, so we will need to write a custom plugin that incorporates language detection. Luckily, someone in the Rust community has already written a high-quality language detection library supporting our three missing languages -- shout out to Peter M. Stahl for sharing. This makes the plugin easy to write; so easy, in fact, that an LLM can write it directly with just a couple of prompts to refine it to our preferences (a sketch of the detection step is below). We write the plugin and download the data. Finally, we mix the two datasets so the language representation remains shuffled in chunks that match the model's context size.
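The detection part of that plugin looks roughly like this (sketch only; the lingua Python bindings are shown, and the llama-gguf-optimize plugin wiring is omitted):

```python
# Rough sketch of the language-detection filter used to label the pre-shuffled
# stage-1 data for the three missing languages (en, zh, ms). Only the detection
# logic is shown; the actual plugin wires this into the dataset download.
from lingua import Language, LanguageDetectorBuilder

LANGS = [Language.ENGLISH, Language.CHINESE, Language.MALAY]
detector = LanguageDetectorBuilder.from_languages(*LANGS).build()

def label_language(text: str) -> str | None:
    """Return 'en', 'zh' or 'ms' for a sample, or None if undetected."""
    lang = detector.detect_language_of(text)
    if lang is None:
        return None
    return {"ENGLISH": "en", "CHINESE": "zh", "MALAY": "ms"}[lang.name]

samples = ["The quick brown fox.", "你好，世界。", "Selamat pagi, apa khabar?"]
for s in samples:
    print(label_language(s), s)
```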

Baseline comparison

On the blog it says "Currently, Sailor2 supports a context length of up to 4K tokens", but the model's config reports a context length of 32k. I discussed this with the model developers: they trained at 4k but did not adjust max_position_embeddings down to match, and RULER shows meaningful context loss above 8k. They recommended treating the model's context length as 4k and modified the config.json accordingly. So before we run the baseline we will go back and update our models to match (might as well).
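The adjustment itself is a one-field edit to the checkpoint's config.json (sketch; the path is illustrative):

```python
# Sketch: pin the advertised context length to the 4k the model was trained at,
# per the developers' recommendation. The path is illustrative.
import json

config_path = "Sailor2-1B/config.json"
with open(config_path) as f:
    config = json.load(f)

config["max_position_embeddings"] = 4096  # was 32768 in the original config
with open(config_path, "w") as f:
    json.dump(config, f, indent=2)
```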

Here are the overall stats for the baseline:

===== Overall KL-divergence statistics =====

[llama_gguf_optmize v0.6.0] 10:08:26 - INFO - Average : 0.023298
[llama_gguf_optmize v0.6.0] 10:08:26 - INFO - StdDev : 0.096553
[llama_gguf_optmize v0.6.0] 10:08:26 - INFO - Minimum : -0.000000
[llama_gguf_optmize v0.6.0] 10:08:26 - INFO - Maximum : 6.588896
[llama_gguf_optmize v0.6.0] 10:08:26 - INFO - KLD_99 : 0.204897
[llama_gguf_optmize v0.6.0] 10:08:26 - INFO - KLD_95 : 0.068418
[llama_gguf_optmize v0.6.0] 10:08:26 - INFO - KLD_90 : 0.043672
[llama_gguf_optmize v0.6.0] 10:08:26 - INFO - Median : 0.010238
[llama_gguf_optmize v0.6.0] 10:08:26 - INFO - KLD_10 : 0.000105
[llama_gguf_optmize v0.6.0] 10:08:26 - INFO - KLD_05 : 0.000026
[llama_gguf_optmize v0.6.0] 10:08:26 - INFO - KLD_01 : 0.000002

Lessons learned

There were no surprises with the above run (outside of just how well the model does before using the imatrix). Before doing it at the 4k context size, I first ran the same bench at 32k. The early stopping mechanism was able to lock in before the minimum sample count was reached, suggesting a sensitivity to context size. Eventually I could see going back and adjusting the algorithm so that, by default, this scaling does not need to be tuned via the provided parameters (but for now, such parameters are already available).
There are also some -0.0000 values, which affected the overall minimum. These are hallmarks of limited precision. The metrics were run at 64-bit precision, and avoiding genuinely negative output (compared to the earlier f32 scripts I could locate) was a design goal, so even though these are essentially 0, it is mildly disappointing.

Iterative Quantization

Just as a rule of thumb observed from another small model (2B), I think I will use increment sizes of 250 samples. We can see that we only needed 47 of the 2325 chunks that I made available in my initial download of 500 samples from the training data for the model. To maximize reuse, my current plan is to build the imatrix from data excluding the 47 chunks used in the calibration dataset, starting with nearly 250 samples (less the 47 chunks), then an additional 250 chunks (a small sketch of the split is below). Based on previous work, I'm hoping to see a nice plateau in the return on investment for imatrix data at roughly 1000 chunks. First up, 203 chunks from the unused calibration data we already downloaded:

(next: select the data, quantize the model)
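For bookkeeping, the split described above looks roughly like this (simplified sketch; the real chunk indices come from the kl-d-bench output):

```python
# Simplified sketch of the calibration / imatrix split described above.
# Chunk indices are illustrative; the real ones come from the kl-d-bench run.
total_chunks = 2325                      # chunks built from the initial 500-sample download
calibration = set(range(47))             # the 47 chunks locked in as the test set
available = [i for i in range(total_chunks) if i not in calibration]

first_imatrix_batch = available[:203]    # ~250 samples, less the 47 test chunks
print(f"{len(first_imatrix_batch)} chunks for the first imatrix pass, "
      f"{len(available) - len(first_imatrix_batch)} held in reserve")
```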

Happy new year to you as well!

I like this testing setup, looks great!

A heads up: I had heard that for llama.cpp's KL divergence, even if you run at 64-bit precision, there are parts internally that are stored at fp16. I discovered this when I tried to run a KLD measurement against FP32 and compare it to BF16 as a sanity check, and found that they diverged... it made me question everything until it was pointed out that the saved values are fp16, so the on-the-fly ones wouldn't line up :')

When you say "We can see that we only needed 47 of the 2325 chunks that I made available in my initial download of 500 samples from the training data for the model." what exactly are you referring to?

I'm definitely down to apply the learnings to larger models to attempt to validate the results!

I calculate KL divergence manually instead. I noticed that these numbers were showing significant rounding errors using the tools from the threads on the project, so I wrote my own. I saw no issues with my first model, but on Sailor2 1B I do see -0.000000 in a couple of places for the minimum, indicating tiny rounding errors even at fp64. I wasn't quite as slick as I had hoped, but at least I don't see anything non-zero in the significant range.
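For context, the per-token computation in my script is essentially this (minimal numpy sketch, not the exact code):

```python
# Minimal sketch of a per-token KL divergence computed in float64 from raw logits.
# log_softmax keeps things numerically stable; tiny negatives like -0.000000 can
# still appear when the two distributions are near-identical and rounding dominates.
import numpy as np
from scipy.special import log_softmax

def token_kl_divergence(logits_ref: np.ndarray, logits_quant: np.ndarray) -> float:
    """KL(P_ref || P_quant) for one token position; both inputs have shape (vocab,)."""
    logp = log_softmax(logits_ref.astype(np.float64))
    logq = log_softmax(logits_quant.astype(np.float64))
    p = np.exp(logp)
    return float(np.sum(p * (logp - logq)))

rng = np.random.default_rng(0)
ref = rng.normal(size=32000)
quant = ref + rng.normal(scale=1e-3, size=32000)  # simulate a small quantization error
print(token_kl_divergence(ref, quant))
```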

RE the 47 chunks: if you look at the gist now you can see what I am referring to. I first downloaded 500 samples that could be used as test data, then used my bench tool to measure the variability in that data and see how many chunks of it I need in the test set. In discussions I saw people occasionally noting they only need a chunk or two, but that seems wrong, since every so often you see chunks with lots of divergence from the norm. So kl-d-bench is written to try to detect when it has enough samples to predict the variability it is seeing, to some degree of confidence. The remaining samples, not used in this baseline or in later comparisons of KL-div values, can safely be used to quantize the model.

(So look at it this way: we set aside a test set to compare the different-sized datasets used to train, i.e. create, the imatrix. As we grow the imatrix we can just keep additively combining the data, but it feels like standard practice to set aside test data not used in training, so we can compare the results in KL-divergence. A rough sketch of the stopping idea is below.)
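Very roughly, the stopping idea looks like this (simplified sketch; the real kl-d-bench criterion is more involved and has tunable parameters):

```python
# Very simplified sketch of a variability-based stopping rule: stop once the 95%
# confidence interval on the mean chunk KL-divergence is narrow relative to the
# mean itself. The real kl-d-bench criterion differs and is configurable.
import numpy as np

def enough_chunks(chunk_kld: list[float], min_chunks: int = 16, rel_tol: float = 0.05) -> bool:
    if len(chunk_kld) < min_chunks:
        return False
    x = np.asarray(chunk_kld, dtype=np.float64)
    half_width = 1.96 * x.std(ddof=1) / np.sqrt(len(x))
    return half_width <= rel_tol * x.mean()

# Example: feed chunk-level KL values in one at a time and stop when stable.
rng = np.random.default_rng(1)
values: list[float] = []
for kld in rng.lognormal(mean=-4.0, sigma=1.0, size=500):
    values.append(float(kld))
    if enough_chunks(values):
        print(f"stopping after {len(values)} chunks")
        break
```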

I completed the baseline comparison (link to graphs: https://imgur.com/a/TLGtmA4 ) -- it looks like at 1B we already hit (strongly) diminishing returns by 250 samples per language from the pre-training data that I used.

===== Overall Composite Metrics =====
   File                          Composite Metric
0  kl_divergence.baseline.h5     0.170687
1  kl_divergence.im250.h5        0.149774
2  kl_divergence.im500.h5        0.149750

This is with a 4k context size; after a discussion with the SEA AI Lab team, they modified their config.json to reflect the training they actually did, so that the context size is 4k. I followed their recommendation and limited my approach to 4k.

I'm almost tempted to use half as many samples and see if returns are already diminishing. Or to go the other way: take 2500 samples and reduce to 250 by selecting the samples in the chunks with the top 10% highest KL-div values under the metric that I use.
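That reduction would be a short selection over per-chunk scores, something like this (sketch; the scores are placeholders):

```python
# Sketch: keep the chunks whose KL-divergence falls in the top 10% of a
# 2500-chunk pool, then trim to the 250 highest.
import numpy as np

chunk_scores = np.random.default_rng(2).lognormal(-4.0, 1.0, size=2500)  # placeholder per-chunk KLD values
threshold = np.quantile(chunk_scores, 0.90)
top_decile = np.where(chunk_scores >= threshold)[0]
selected = top_decile[np.argsort(chunk_scores[top_decile])[::-1][:250]]
print(len(selected), "chunks selected")
```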

The metric I use, btw, is entirely custom (I saw different people suggesting different outlier thresholds, e.g. KLD_95, or the KLD median): https://github.com/robbiemu/llama-gguf-optimize/blob/main/USAGE.md#metric-for-evaluation

$$\text{Score} = \text{Median}^{1/3} \times \left( 1 \cdot \text{KLD}_{99} + 4 \cdot \text{KLD}_{95} + 5 \cdot \text{KLD}_{90} \right)^{2/3}$$

There's no reason to use it over any other, I guess; it's just what made sense to me based on what I read in the discussions.
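In code the score is just the following (sketch matching the formula above; plugging in the baseline stats from earlier reproduces the 0.170687 composite):

```python
# Sketch of the composite metric: a geometric blend of the median (typical-case
# divergence) and a weighted sum of the upper-tail percentiles (worst cases).
def composite_score(median: float, kld_99: float, kld_95: float, kld_90: float) -> float:
    tail = 1 * kld_99 + 4 * kld_95 + 5 * kld_90
    return (median ** (1.0 / 3.0)) * (tail ** (2.0 / 3.0))

# Using the baseline stats reported earlier in the thread:
print(composite_score(median=0.010238, kld_99=0.204897, kld_95=0.068418, kld_90=0.043672))
# ~0.1707, consistent with the baseline composite in the table above
```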

btw, based on what I saw at the 32k context size, before conforming to their recommendation, I think the number of samples it takes depends on the context size.
