"Experimental" = ZeroWw method.
@bartowski
I didn't invent anything, but a little credit wouldn't have hurt, since nobody tried that before me.
The method is described here by the way: https://huggingface.co./RobertSinclair
I don't think using the parameters of a program counts as high "intellectual creativity". I used unquantized output tensors in tests long before (--leave-output-tensor). I think several people could have had the same idea in parallel, because the description "Will leave output.weight un(re)quantized. Increases model size but may also increase quality" makes it interesting to many people who play with the quantize program.
Are you sure he saw this on your page first and didn't try it himself?
But I thank you both for making the results publicly available.
He's probably the person who first brought it up to me, but tbh if anyone had ever given me positive feedback about them I probably would have bumped it up from experimental to permanent and credited the idea. At this point I'm closer to scrapping it: it's been over a month and I haven't seen a single person say it works better, or a single test showing an improvement in performance, which is shocking since he's been pushing it everywhere and I've been begging people for feedback
I'm even starting to wonder if it's worse than quantizing to Q8, because at least quantizing will attempt to maintain the original range, where fp16 actually truncates a TON of info around 0
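For context, a rough numpy sketch of what fp16 does to very small magnitudes (nothing model-specific, just the format itself): fp16's normal range bottoms out at 2^-14, and below roughly 2^-24 even the subnormals run out.

```python
import numpy as np

# fp16 keeps ~10 significant bits for normal values, fewer and fewer below 2^-14,
# and anything under ~2^-24 (about 6e-8) flushes to zero.
for x in [1e-3, 1e-5, 1e-6, 1e-8]:
    print(f"{x:g} -> {float(np.float16(x)):.6g}")
# 0.001 -> 0.0010004   (normal, fine)
# 1e-05 -> 1.00136e-05 (subnormal, reduced precision)
# 1e-06 -> 1.01328e-06 (subnormal, only a few significant bits left)
# 1e-08 -> 0           (below fp16's smallest subnormal, flushed to zero)
```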
I would probably re-visit it when llama.cpp supports BF16 on CUDA...
In fact, just today I received negative feedback, which I think may stem from trying to shove FP32/BF16 range into fp16: https://huggingface.co./bartowski/gemma-2-9b-it-GGUF/discussions/4#6683e8a3cdb6d05d2d9a7d87
I mostly download only one version; the last few days, Q8-L only. Does it make sense to test both Q8 versions? I saw articles saying there is nearly no difference between the original, Q8 and Q6. Perhaps the effect of unquantized output tensors is bigger at lower quants. In a personal test some weeks ago (I think with Llama-3-8B) I saw that one model wrote more abbreviations and ÷ instead of / than the other, but I have no idea which is better (same seed and other settings, only the output tensors differed).
I use CPU only.
ideally someone would be testing both so I can get some positive or negative feedback that shows if this method is worth continuing, it's a lot of extra files. I agree that since there's almost no difference between BF16, Q8, and Q6 in many tests, it seems unlikely that the larger embeddings/outputs make a real difference besides slowing down my process, and they may generate slower as well?
if you don't have the setup or cycles to do a side by side, no worries, just keep carrying on, but that's what I've been trying to get for a while now to indicate one way or the other, and especially the last few days I've been concerned that it's actually degrading the performance
Why not run MMLU pro against them and call it a day? We have that option now.
I downloaded the missing Q8 and both Q6 quants and will test them, perhaps tomorrow, but with my own prompts only (same seed and other settings). For big benchmarks I don't have the resources. If there is no difference, I will test the Q5 quants too.
Thanks @supportend !!
And that's a good point @jackboot, I'll see if I can set something up today between quants
I MEAN, I'VE BEEN TRYING
even @ZeroWw will not give me any evidence that they're good, and he's been pushing them all over the place.. I've finally managed to get some people looking at them, and I'm running an MMLU-pro test which is going to take most of today. I have a very strong feeling that fp16 is actually degrading the performance by losing out on a ton of near 0 data that can't be represented in fp16. I'D LOVE TO BE WRONG ABOUT THIS!
I've made and uploaded gemma 9b with Q8 embeddings/output and FP32 embeddings/output
The FP32 one is huge and not going to be viable going forward I assume, but it's a proof of concept for if/when BF16 CUDA support is added to llama.cpp
Credit added for @ZeroWw , closing this discussion.
as I said, I didn't invent anything... all credit goes to Georgi Gerganov, but I was the one who actually used it, found it useful, and had so many people thank me.. that's all.
I don't need glory, otherwise I wouldn't have used here a different nickname.
I already got my glory some time ago...
In a different "field".
I just said that f16/q6 or f16/q5 are better than a pure Q8 (and smaller)... that's all...
Sure, but like I said, I don't have any evidence of that. I'm running some MMLU pro benchmarks today to try to determine once and for all which is best. I will be including that method going forward. If it's FP16, I'll leave the credit as is. If FP16 proves worse but Q8 or FP32 are in fact noticeably better, I'll credit your idea to experiment, but with my new versions.
Not sure what you mean by wanting credit but not glory, or what the difference is, but I hope my current attribution is suitable enough.
Indeed. It's just the truth.
I have a very strong feeling that fp16 is actually degrading the performance by losing out on a ton of near 0 data that can't be represented in fp16.
Interesting point of view, and you are right, bfloat16 is slightly better here: it can represent much smaller numbers (normals down to 2^-126 ≈ 1.18 × 10^-38, as opposed to fp16's 2^-14 ≈ 6.10 × 10^-5) and, perhaps surprisingly, it has a few more representable numbers between 0 and 1. But that's not the whole story:
Float16:
- Normal values in (0, 1) use exponents from -14 to -1 (14 values)
- For each exponent, it can represent 2^10 = 1024 different fractions
- Normal total: 14 * 1024 = 14,336
- Plus 2^10 - 1 = 1,023 subnormal numbers (very close to zero)
- Total: 15,359 values
Bfloat16:
- Normal values in (0, 1) use exponents from -126 to -1 (126 values)
- For each exponent, it can represent 2^7 = 128 different fractions
- Normal total: 126 * 128 = 16,128
- Plus 2^7 - 1 = 127 subnormal numbers (very close to zero)
- Total: 16,255 values
Going above 1, both formats again have on the order of 16k positive values, but fp16 only has to spread them over 1 to 65,504 while bf16 has to cover 1 to 3.39 × 10^38, so fp16 will be more precise in that range.
Choosing which is better is tricky business:
- Values logarithmically distributed between 0 and 1? bfloat16 is better.
  [Chart: number of representable values between 0 and 1, X axis on a log scale]
- Values evenly distributed between 0 and 1? float16 is better.
  [Chart: number of representable values between 0 and 1, X axis chunked every 0.05]
- Values that go above 1 but stay low? float16 is better.
- Values above 1 that require a very high range? bfloat16 is better.
  [Chart: number of representable values between 1 and 8, chunked at 0.1]
--
As always, the answer is: it depends 😅
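Those counts are easy to sanity-check by brute force over all 65,536 bit patterns; a small sketch in plain numpy (no special bf16 dtype needed, since bf16 is just the top 16 bits of an fp32):

```python
import numpy as np

# Enumerate every 16-bit pattern and decode it as fp16 and as bf16.
bits = np.arange(1 << 16, dtype=np.uint16)
fp16 = bits.view(np.float16).astype(np.float32)                    # all fp16 values
bf16 = (bits.astype(np.uint32) << np.uint32(16)).view(np.float32)  # bf16 = high half of fp32

def count_in_unit_interval(x):
    # Finite values strictly between 0 and 1 (normals + subnormals)
    return int(np.sum(np.isfinite(x) & (x > 0) & (x < 1)))

print("fp16 values in (0, 1):", count_in_unit_interval(fp16))  # 15359
print("bf16 values in (0, 1):", count_in_unit_interval(bf16))  # 16255
```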
@legraphista I'm running mmlu pro against Phi 3 Q3_K_L with modes: f32 (to represent bf16), fp16, Q8, and default
So far, f32 == Q8 > default > f16.. yes, Q3_K embeddings and output are outperforming f16 on mmlu pro... I'll let it run as many as I can before I complete my conclusions, but so far it's looking like I shouldn't continue f16 at the least
Those results are so strange. I converted gemma 27b it to fp32 with the official llama.cpp py script and looked at the output_norm.weight layer with the GGUFReader:
Total params: 4608
Representable in fp16: 4608 (100.00%)
Representable in bf16: 4235 (91.91%)
Unrepresentable fp16: []
Unrepresentable bf16: ['8.21875=>8.1875', '8.34375=>8.3125', '32.625=>32.5', '8.15625=>8.125', '8.09375=>8.0625', '8.84375=>8.8125', '8.40625=>8.375', '8.03125=>8.0', '8.65625=>8.625', '8.90625=>8.875', '8.96875=>8.9375', '8.59375=>8.5625', '8.71875=>8.6875', '8.46875=>8.4375', '8.28125=>8.25', '32.375=>32.25', '8.78125=>8.75', '16.8125=>16.75', '8.53125=>8.5', '16.0625=>16.0', '16.3125=>16.25', '16.4375=>16.375', '16.9375=>16.875', '32.875=>32.75', '16.6875=>16.625', '32.125=>32.0', '16.1875=>16.125', '16.5625=>16.5']
Technically, fp16 should then be equal to fp32 and superior to bf16 here. This is peculiar, as the original data type is stated to be bfloat16 in the .safetensors. Converting bf16 to fp32 should be lossless, so why are there values that can't be represented as bf16...
code to reproduce
```python
from gguf import GGUFReader
from collections import Counter
import numpy as np

# ~~~~~~~~~~~~~~~~~ read data ~~~~~~~~~~~~~~~~~~~~~
def read_gguf_tensors(gguf_file_path):
    reader = GGUFReader(gguf_file_path)
    tensors = []
    for tensor in reader.tensors:
        tensors.append({
            "name": tensor.name,
            "shape": tuple(tensor.shape),
            "elem": tensor.n_elements,
            "quant": tensor.tensor_type.name,
            "bytes": tensor.n_bytes,
            "bpw": tensor.n_bytes / tensor.n_elements * 8,
        })
    return reader.tensors, tensors

data, _ = read_gguf_tensors('gemma-2-27b-it-IMat-GGUF/gemma-2-27b-it.gguf')
for i, t in enumerate(data):
    print(i, t.name, t.tensor_type.name)

# Count the distinct fp32 values of the last tensor (output_norm.weight here)
c_out = Counter()
c_out.update(data[-1].data)

# ~~~~~~~~~~~~~~~~~ interpret ~~~~~~~~~~~~~~~~~~~~~
def to_fp16(num):
    return np.float16(num)

def is_representable_in_float16(num):
    return to_fp16(num) == np.float32(num)

def to_bf16(num):
    # Convert to float32 bits
    bits = np.float32(num).view(np.uint32)
    # Keep only the top 16 bits (note: truncation, not round-to-nearest)
    bfloat16_bits = bits >> 16
    # Convert back to float32
    return np.uint32(bfloat16_bits << 16).view(np.float32)

def is_representable_in_bfloat16(num):
    return to_bf16(num) == num

def count_representable_numbers(counter):
    total_count = sum(counter.values())
    fp16_count = 0
    bf16_count = 0
    unrepr_fp16 = {}
    unrepr_bf16 = {}
    for num, count in counter.items():
        if is_representable_in_float16(num):
            fp16_count += count
        else:
            unrepr_fp16[num] = to_fp16(num)
        if is_representable_in_bfloat16(num):
            bf16_count += count
        else:
            unrepr_bf16[num] = to_bf16(num)
    return total_count, fp16_count, unrepr_fp16, bf16_count, unrepr_bf16

total, fp16, ufp16, bf16, ubf16 = count_representable_numbers(c_out)
print(f"Total params: {total}")
print(f"Representable in fp16: {fp16} ({fp16/total:.2%})")
print(f"Representable in bf16: {bf16} ({bf16/total:.2%})")
print(f'Unrepresentable fp16: {[f"{k}=>{v}" for k, v in ufp16.items()]}')
print(f'Unrepresentable bf16: {[f"{k}=>{v}" for k, v in ubf16.items()]}')

# ~~~~~~~~~~~~~~~~~ histogram ~~~~~~~~~~~~~~~~~~~~~
import matplotlib.pyplot as plt

def create_histogram(counter):
    # Extract positive values only (log-spaced bins need > 0)
    values = [k for k in counter.keys() if k > 0]
    plt.figure(figsize=(12, 6))
    # Logarithmically spaced bins
    bins = np.logspace(np.log10(min(values)), np.log10(max(values)), num=50)
    plt.hist(values, bins=bins, weights=[counter[v] for v in values])
    # plt.xscale('log')
    plt.yscale('log')
    plt.xlabel('Value')
    plt.ylabel('Frequency')
    plt.title('Histogram of Float32 Numbers (Log Scale)')
    plt.tight_layout()
    plt.show()

create_histogram(c_out)
```
That is very interesting because I was running some similar tests which found that a ton of values were outside the range that f16 could represent.. but I was also using the original safetensors, doubt that matters but I'll take another crack at the code and see if I can find what's off
More than anything, your results confuse me, especially since Google themselves noted far inferior performance at fp16; they must have seen something, right?
Oh also output_norm.weight is the wrong one I think, I couldn't find the output weights in Gemma somehow, try with embeddings
This is peculiar, as the original data type is stated to be bfloat16 in the .safetensors. Converting bf16 to fp32 should be lossless, so why are there bad values as bf16...
Oh yeah I didn't even consider this, something is off for sure with that test then 🤔
I had a look at token_embd.weight on the 27b model and now it kinda makes sense 😂 some values do go to 0:
Total params: 1179648000
Representable in fp16: 1179292429 (99.97%)
Representable in bf16: 1179648000 (100.00%)
Unrepresentable fp16: ['-1.2367963790893555e-06=>-1.2516975402832031e-06', '-4.5821070671081543e-07=>-4.76837158203125e-07', '-5.0961971282958984e-06=>-5.125999450683594e-06', '6.109476089477539e-07=>5.960464477539062e-07', '-7.063150405883789e-06=>-7.033348083496094e-06', '-2.637505531311035e-06=>-2.6226043701171875e-06', '3.591179847717285e-06=>3.5762786865234375e-06', '1.9222497940063477e-06=>1.9073486328125e-06', '1.2665987014770508e-06=>1.2516975402832031e-06', '5.334615707397461e-06=>5.364418029785156e-06', '2.339482307434082e-06=>2.3245811462402344e-06', '-4.857778549194336e-06=>-4.887580871582031e-06', '1.952052116394043e-06=>1.9669532775878906e-06', '-2.6775524020195007e-08=>-0.0', '4.738569259643555e-06=>4.76837158203125e-06', '-4.082918167114258e-06=>-4.0531158447265625e-06', '-2.652406692504883e-06=>-2.6226043701171875e-06', '-3.762543201446533e-07=>-3.5762786865234375e-07', '-4.0046870708465576e-08=>-5.960464477539063e-08', '-5.513429641723633e-06=>-5.4836273193359375e-06', '7.539987564086914e-06=>7.510185241699219e-06', '3.8929283618927e-07=>4.172325134277344e-07', '4.202127456665039e-06=>4.172325134277344e-06', '-8.381903171539307e-07=>-8.344650268554688e-07', '-1.780688762664795e-06=>-1.7881393432617188e-06', '-2.995133399963379e-06=>-2.9802322387695312e-06', '-1.1399388313293457e-06=>-1.1324882507324219e-06', '-3.7103891372680664e-06=>-3.6954879760742188e-06', '7.3015689849853516e-06=>7.271766662597656e-06', '-2.428889274597168e-06=>-2.4437904357910156e-06', '-2.2798776626586914e-06=>-2.2649765014648438e-06', '-2.7120113372802734e-06=>-2.7418136596679688e-06', '4.023313522338867e-06=>4.0531158447265625e-06', '-1.817941665649414e-06=>-1.7881393432617188e-06', '-5.036592483520508e-06=>-5.0067901611328125e-06', '-7.264316082000732e-07=>-7.152557373046875e-07', '-6.16908073425293e-06=>-6.198883056640625e-06', '-4.209578037261963e-07=>-4.172325134277344e-07', '2.041459083557129e-06=>2.0265579223632812e-06', '-6.407499313354492e-06=>-6.4373016357421875e-06', '4.559755325317383e-06=>4.5299530029296875e-06', '-1.4528632164001465e-06=>-1.430511474609375e-06', '-9.685754776000977e-07=>-9.5367431640625e-07', '7.078051567077637e-07=>7.152557373046875e-07', '5.27501106262207e-06=>5.245208740234375e-06', '-3.6209821701049805e-06=>-3.635883331298828e-06', '1.7434358596801758e-06=>1.7285346984863281e-06', '-3.725290298461914e-07=>-3.5762786865234375e-07', '1.1734664440155029e-07=>1.1920928955078125e-07', '-7.636845111846924e-07=>-7.748603820800781e-07', '1.5869736671447754e-06=>1.6093254089355469e-06', '1.385807991027832e-06=>1.3709068298339844e-06', '-5.401670932769775e-07=>-5.364418029785156e-07', '6.586313247680664e-06=>6.556510925292969e-06', '3.1888484954833984e-06=>3.2186508178710938e-06', '-5.155801773071289e-06=>-5.125999450683594e-06', '7.241964340209961e-06=>7.271766662597656e-06', '4.857778549194336e-06=>4.887580871582031e-06', '-6.467103958129883e-06=>-6.4373016357421875e-06', '1.5944242477416992e-06=>1.6093254089355469e-06', '-3.2335519790649414e-06=>-3.2186508178710938e-06', '3.7997961044311523e-06=>3.814697265625e-06', '-5.930662155151367e-06=>-5.9604644775390625e-06', '3.293156623840332e-06=>3.2782554626464844e-06', '-2.0116567611694336e-06=>-2.0265579223632812e-06', '-2.175569534301758e-06=>-2.1457672119140625e-06', '3.2335519790649414e-06=>3.2186508178710938e-06', '5.036592483520508e-06=>5.0067901611328125e-06', '-2.1979212760925293e-07=>-2.384185791015625e-07', '-5.3942203521728516e-06=>-5.364418029785156e-06', '-3.9637088775634766e-06=>-3.933906555175781e-06', 
'2.1886080503463745e-07=>2.384185791015625e-07', '-5.751848220825195e-06=>-5.7220458984375e-06', '-3.591179847717285e-06=>-3.5762786865234375e-06', '-4.023313522338867e-06=>-4.0531158447265625e-06', '-7.82310962677002e-07=>-7.748603820800781e-07', '4.380941390991211e-06=>4.410743713378906e-06', '-6.586313247680664e-06=>-6.556510925292969e-06', '-1.7955899238586426e-06=>-1.7881393432617188e-06', '-3.6656856536865234e-06=>-3.6954879760742188e-06', ... shortened for brevity
Unrepresentable bf16: []
Oh also output_norm.weight is the wrong one I think, I couldn't find the output weights in Gemma somehow, try with embeddings
I did have a look at your fp32 model, it lacks the output norm layer 🤔
on my end:
print(data[-1].name) # 'output_norm.weight'
I see output_norm.weight; what I don't see is output.weight, which is the one that's actually being affected by --output-tensor-type, I think
0.03% still feels off compared to my test (I found 1%), but at least you don't have any that bf16 can't represent lmao
Are you paying specific attention to values that fall between 0 and +/- 2^-14? In fp16, anything below about 2^-24 gets flushed to 0, and the rest of that range only survives as low-precision subnormals
I see output_norm.weight, what I don't see is output.weight which is the one that's actually being affected by --output-tensor-type I think
Interesting, I don't have any output.weight layer
Are you paying specific attention to values that fall between 0 and 2^-14 (and the negative)? In fp16, anything below about 2^-24 gets flushed to 0, and the rest of that range only survives as low-precision subnormals
I'm just counting values that would get rounded when down-typed to fp16/bf16 (see https://huggingface.co./bartowski/gemma-2-27b-it-GGUF/discussions/7#66848a70d4e3eff8e19f881e, hidden section "code to reproduce")
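On the missing output.weight: a quick way to list which output-related tensors a GGUF actually contains, reusing the same gguf/GGUFReader approach as the snippet above (the path is just the earlier example):

```python
from gguf import GGUFReader

# Path taken from the earlier snippet; point it at whichever GGUF you're checking.
reader = GGUFReader('gemma-2-27b-it-IMat-GGUF/gemma-2-27b-it.gguf')
print([t.name for t in reader.tensors if 'output' in t.name])
# On Gemma 2 this lists only 'output_norm.weight': there is no separate output.weight,
# presumably because the LM head is tied to token_embd.weight.
```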
Had a look at gemma 2 9b it: token_embd.weight
Total params: 917504000
Representable in fp16: 917356043 (99.98%)
Representable in bf16: 917504000 (100.00%)
Unrepresentable fp16: ['2.339482307434082e-06=>2.3245811462402344e-06', '-4.842877388000488e-07=>-4.76837158203125e-07', ... , '-5.122274160385132e-09=>-0.0', '6.693881005048752e-09=>0.0', '-2.0081643015146255e-09=>-0.0', '4.190951585769653e-09=>0.0', '-1.877197064459324e-09=>-0.0']
Unrepresentable bf16: []
output_norm.weight
Total params: 3584
Representable in fp16: 3584 (100.00%)
Representable in bf16: 3434 (95.81%)
Unrepresentable fp16: []
Unrepresentable bf16: ['2.9921875=>2.984375', '2.9140625=>2.90625', '2.8359375=>2.828125', '4.765625=>4.75', '4.109375=>4.09375', '2.9296875=>2.921875', '4.203125=>4.1875', '2.9609375=>2.953125', '2.5546875=>2.546875', '4.078125=>4.0625', '4.390625=>4.375', '2.8046875=>2.796875', '2.8984375=>2.890625', '1.41796875=>1.4140625', '2.8203125=>2.8125', '8.34375=>8.3125', '2.7890625=>2.78125', '2.6171875=>2.609375', '2.8515625=>2.84375', '2.7734375=>2.765625', '2.8828125=>2.875', '4.046875=>4.03125', '2.6484375=>2.640625', '2.7421875=>2.734375', '2.9453125=>2.9375', '4.140625=>4.125', '2.4765625=>2.46875', '4.515625=>4.5', '2.7578125=>2.75', '2.6640625=>2.65625', '2.3828125=>2.375', '4.671875=>4.65625', '2.2421875=>2.234375', '4.234375=>4.21875', '2.5078125=>2.5', '2.6953125=>2.6875', '2.9765625=>2.96875', '2.7109375=>2.703125', '4.484375=>4.46875', '4.953125=>4.9375', '4.640625=>4.625', '2.7265625=>2.71875', '4.796875=>4.78125', '2.5859375=>2.578125', '2.4453125=>2.4375', '4.296875=>4.28125', '4.359375=>4.34375', '2.6015625=>2.59375', '4.609375=>4.59375', '4.703125=>4.6875', '2.5390625=>2.53125', '2.6796875=>2.671875', '2.1484375=>2.140625', '4.015625=>4.0', '2.3515625=>2.34375', '2.8671875=>2.859375', '2.5234375=>2.515625', '4.265625=>4.25', '4.546875=>4.53125', '2.3671875=>2.359375', '1.86328125=>1.859375']
similar story
Phi 3 has output.weight; I know that Gemma's output_norm.weight was not affected by me setting output tensors to f16 🤷‍♂️ not sure how models work LOL
I'll check your code when I'm on my computer and I'll look at mine too
The initial idea of quantizing output+embed at f16 and the rest at q5-q8 came from a "stupid" reason:
those 2 tensors are usually as big as the sum of all the others, and at the time I tried two different things:
- quantize out+emb to q4 (also q5 and q6) and keep the rest at f16
- quantize out+emb to f16 and the rest at q4 (also tried q5 and q6)
In the first case I always ended up with visibly lobotomized models.
In the second I saw no difference from the "pure" f16, which is twice as big.
Hence I tried other models (all Llama-3 at the time) and got the same results.
That's all I know.
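For anyone reproducing this, the per-tensor types of the resulting file can be read back with the same gguf library used earlier; the filename below is only a placeholder for whatever mixed quant you produced:

```python
from gguf import GGUFReader

# Placeholder filename; substitute your own mixed quant
# (e.g. token_embd/output at F16, everything else at Q6_K).
reader = GGUFReader('model-Q6_K-f16-embed-output.gguf')
for t in reader.tensors:
    if t.name in ('token_embd.weight', 'output.weight', 'output_norm.weight'):
        print(t.name, '->', t.tensor_type.name)
```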
I am testing at the moment; the differences between Q8_L, Q8, Q6_L and Q6 are not big. But in one case only Q6_L gave me the correct answer, 26.5 (though it said it was approximate). All other tested quants gave me 26.52, because the model rounded 5/3 -> 1.67. I used the same seed and otherwise identical settings.
The test prompt was:
Solve the following task step by step: Klaus eats 3 bars of chocolate in 4 hours. Hans eats 2 bars of chocolate in one hour. Petra eats 5 bars of chocolate in 3 hours. How many bars of chocolate do Klaus, Hans and Petra eat together in 6 hours?
I think only big benchmarks can say whether a quantization method is better or not, because small tests are too strongly influenced by randomness. Sometimes this quant is better, sometimes another quant gives the better answer. Or I would have to add instructions like "Solve this with exact math, don't round...", otherwise the LLM has too much room to answer differently.
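For reference, the exact arithmetic behind that prompt, and why rounding 5/3 to 1.67 gives 26.52 instead of 26.5 (a quick check):

```python
from fractions import Fraction

# Per-hour rates times 6 hours, kept exact
klaus = Fraction(3, 4) * 6   # 4.5 bars
hans  = Fraction(2, 1) * 6   # 12 bars
petra = Fraction(5, 3) * 6   # 10 bars
print(float(klaus + hans + petra))            # 26.5

# Rounding Petra's rate to 1.67 first, as the models did:
print(round(0.75 * 6 + 2 * 6 + 1.67 * 6, 2))  # 26.52
```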
26.5 bars of chocolate in 6 hours is too much, even for me. Happy testing.
@legraphista finally finished my MMLU pro testing and I think I'm willing to call my results relatively conclusive:
https://www.reddit.com/r/LocalLLaMA/comments/1duume2/quantization_experimentation_mmlu_pro_results/
Interesting findings! I'm surprised you don't see higher degradation with lower quants, though the numbers look within variance. I'm saying this because I see FP16 performing better than FP32 in some cases, and even Default (Q3 for embed, Q6 for output) had a better Physics score than FP32.
I'm wondering if maybe the tests should have been repeated and averaged or maybe another benchmark may show more discrepancies, like HellaSwag.
I'll also try to independently run some tests on my end this weekend and compare notes.
Yes, the tests should be repeated way more times; this was only enough information for me to make a conclusion about f16 embeddings/outputs. Since there isn't a clear trend of increased quality, I'm willing to change my setup
Way more tests would give much more conclusive information, appreciate you taking a look at the tests! We've seen times where lower quants punch weirdly high, but those are more likely to be edge-case accidents where the quant happened to fortify the correct information by pure chance
Try it with 8B and 7B models. Perhaps the bigger models are affected differently.
I'd love to, but even at Q3 the tiny Phi 3 took on average 1 hour per category on a 4090. Multiply that by 8 categories and ideally at least 3 quants (can probably leave out default), and assume it takes double the time since it's double the size, and that's 48 hours of compute per run. If we want to run at least 2 tests, ideally 3, to account for randomness, that's going to be close to 100 USD.. I suppose I could spin up multiple instances and make 3 go at the same time, but after doing it two days in a row I'm way too tired of it.. ugh, but I really probably should...
if anyone wants to sponsor some compute to run these i'd love to explore it more
Until now I have gotten mixed feedback on my method, but I also noticed that the "same or worse" feedback came from bigger models.
It seems that my method works best with 9B/7B/3B/1B models.
It also seems that the best combination is f16 for output and embed and q6_k for the others.