---
license: apache-2.0
language:
- en
tags:
- creative
- story
- writing
- fiction
- float32
- roleplaying
- rp
- enhanced
- space whale
- 32 bit upscale
pipeline_tag: text-generation
---
<font color=red><h3> Ultra Quality High Remaster of the incredible: Psyonic-Cetacean 20b - Imatrix Plus 2. </h3></font>

This is a Floating Point 32 upscale, where all components and merges were remastered to floating point 32.
This includes all the merges (recreated with master files) and, where possible, substituting full FP32 models.

This repo contains the new Imatrix Plus 2 quants, which use a new in-house dataset merged with a master dataset
to push performance of the Ultra Quality remaster even higher.

<img src="space-whale-thinking.jpg">

The goal: Carry forward maximum precision right up to the point where it is "GGUFed".

This includes an F32 master file for GGUF too... at a whopping 78 GB.

WHY?

Because F32 keeps about 7 significant DECIMAL digits, while BF16 keeps only 2 to 3.

And as each merge / model is modified there are "losses" along the way.

These losses are carried forward and in turn lead to more losses.

And decimal points are critical to model performance.

SMALL?

Yes... but multiplied across every merge and every compression step, and across 20 billion parameters.
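To give a feel for the size of the rounding involved, here is a minimal sketch that rounds one value to float32 and to bfloat16 using only the Python standard library. The helper names and the sample weight are hypothetical and chosen for illustration; this is not the pipeline used to build the remaster.

```python
import struct

def to_float32(x: float) -> float:
    """Round a Python float (64-bit) to the nearest float32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def to_bfloat16(x: float) -> float:
    """Round to bfloat16: keep the top 16 bits of the float32 bit pattern,
    using round-to-nearest-even. NaN/Inf handling is left out of this sketch."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    rounding_bias = 0x7FFF + ((bits >> 16) & 1)
    bits = (bits + rounding_bias) & 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]

weight = 0.123456789  # hypothetical weight value
fp32 = to_float32(weight)
bf16 = to_bfloat16(weight)
fp32_error = abs(fp32 - weight)
bf16_error = abs(bf16 - weight)

print(f"float32 : {fp32!r} (error {fp32_error:.2e})")
print(f"bfloat16: {bf16!r} (error {bf16_error:.2e})")
```

For this sample value the bfloat16 result is off in the fourth decimal place, while the float32 result is off only around the ninth.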

<B>The result:</b>

At Q2K an impressive drop of 533 points in perplexity. (lower is better)
(VS: Q2K original base model -> PPL = 9.8077 +/- 0.06821)

At Q4KM a whopping drop of 976 points in perplexity.
(VS: Q4KM original base model -> PPL = 8.7858 +/- 0.06074)

At Q6 an awesome drop of 234 points in perplexity.
(VS: Q6 original base model -> PPL = 8.6070 +/- 0.05907)

To put this in perspective, "Q6" now operates ABOVE the original full precision version of "Psyonic-Cetacean-20b",
and Q4KM operates at close to Q6 level quality.

This is because at "Q6" the quantized / compressed model is considered to be accurate to within "+0.0008 ppl" of the full,
uncompressed / unquantized model, and it exceeds this threshold by over 200 points.
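For readers new to the metric, perplexity is the exponential of the average negative log-likelihood the model assigns to each token of a test text, which is why lower is better. A minimal sketch, with hypothetical per-token probabilities (not measurements from this model):

```python
import math

def perplexity(token_log_probs):
    """Perplexity = exp(mean negative log-likelihood), natural log."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)

# Hypothetical probabilities a model assigned to the correct next token.
confident = [math.log(p) for p in (0.5, 0.25, 0.5, 0.25)]
uncertain = [math.log(p) for p in (0.25, 0.125, 0.25, 0.125)]

ppl_confident = perplexity(confident)
ppl_uncertain = perplexity(uncertain)
print(f"confident model PPL: {ppl_confident:.3f}")
print(f"uncertain model PPL: {ppl_uncertain:.3f}")
```

The model that puts more probability on the correct tokens gets the lower perplexity.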

<I> Imatrix quants take this even further, in most cases DOUBLING the "drop" in perplexity realized in the regular quants. </i>

Q4km-imatrix : 

Final estimate: PPL = 8.6095 +/- 0.05898

(Non imatrix: Final estimate: PPL = 8.6902 +/- 0.05985 )

(VS: Q4km base model -> PPL = 8.7858 +/- 0.06074)

(VS: Q6 BASE model -> Final estimate: PPL = 8.6070 +/- 0.05907)
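The Q4KM figures listed above can be compared directly. This sketch uses only the perplexity values quoted in this card; the variable names are illustrative.

```python
# Perplexity values quoted in this card for Q4KM and Q6 (lower is better).
base_q4km = 8.7858              # original base model, Q4KM
remaster_q4km = 8.6902          # remaster, Q4KM, non-imatrix
remaster_q4km_imatrix = 8.6095  # remaster, Q4KM, imatrix
base_q6 = 8.6070                # original base model, Q6

drop_regular = round(base_q4km - remaster_q4km, 4)
drop_imatrix = round(base_q4km - remaster_q4km_imatrix, 4)
imatrix_vs_regular = drop_imatrix / drop_regular
gap_to_base_q6 = round(remaster_q4km_imatrix - base_q6, 4)

print(f"PPL drop, regular Q4KM : {drop_regular}")
print(f"PPL drop, imatrix Q4KM : {drop_imatrix}")
print(f"imatrix / regular drop : {imatrix_vs_regular:.2f}x")
print(f"imatrix Q4KM vs base Q6: {gap_to_base_q6}")
```

On these numbers the imatrix Q4KM lands within a few thousandths of a perplexity point of the original model's Q6.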


But... what about Q8? 

The mountain moved:

150 points better: PPL = 8.5850 +/- 0.05881  VS: BASE/ORIGINAL: PPL = 8.6012 +/- 0.05900

<B>Settings: CHAT / ROLEPLAY and/or SMOOTHER operation of this model:</B>

In "KoboldCpp", "oobabooga/text-generation-webui" or "Silly Tavern":

Set the "Smoothing_factor" to between 1.5 and 2.5

: in KoboldCpp -> Settings->Samplers->Advanced-> "Smooth_F"

: in text-generation-webui -> parameters -> lower right.

: In Silly Tavern this is called: "Smoothing"


NOTE: For "text-generation-webui" 

-> if using GGUFs you need to use "llama_HF" (which involves downloading some config files from the SOURCE version of this model)

Source versions (and config files) of my models are here:

https://huggingface.co./collections/DavidAU/d-au-source-files-for-gguf-exl2-awq-gptq-hqq-etc-etc-66b55cb8ba25f914cbf210be

OTHER OPTIONS:

- Increase rep pen to between 1.1 and 1.15 (you don't need to do this if you use "smoothing_factor")

- If the interface/program you are using to run AI MODELS supports "Quadratic Sampling" ("smoothing"), just make the adjustment as noted.
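As a rough illustration of what the smoothing factor does, here is a sketch of quadratic sampling, assuming the commonly described form of the transform: each logit is replaced by the top logit minus `smoothing_factor` times its squared distance from the top logit. The front ends named above may implement it differently, and the logits and function names below are hypothetical.

```python
import math

def quadratic_smooth(logits, smoothing_factor):
    """Assumed form: new_logit = max - smoothing_factor * (logit - max)^2."""
    max_logit = max(logits)
    return [max_logit - smoothing_factor * (x - max_logit) ** 2 for x in logits]

def softmax(logits):
    """Convert logits to probabilities (shifted by the max for stability)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

logits = [4.0, 3.6, 1.0, -2.0]  # hypothetical next-token logits
smoothed = quadratic_smooth(logits, smoothing_factor=2.0)
probs_before = softmax(logits)
probs_after = softmax(smoothed)

for before, after in zip(probs_before, probs_after):
    print(f"{before:.4f} -> {after:.4f}")
```

Under this assumed form, tokens close to the top choice keep (or gain) probability while unlikely tokens are pushed down sharply, which is one way to read the "smoother operation" described above.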

<B>THE RESULTS ARE IN: </b>

As per Jeb Carter, original creator of the model:

- instruction following has improved dramatically.
- new abilities have emerged.
- he had to REDUCE the instruction sets used because the model no longer needed such specific instructions.
- prose, nuance and depth have all improved.
- known issues with the original model have disappeared.

This is not "something for nothing"; it is a method of ensuring maximum precision at every step just before "ggufing" the model.

The methods employed only ensure precision loss is minimized or eliminated.

It is mathematically and theoretically sound.

<B>The bottom line here is this:</b>

Higher quality instruction following and output.

Likewise, you can use a smaller quant, with higher tokens per second, and still get great quality.

Same great model... turbo charged.

This is the first group of remasters.

Thanks again to Jeb Carter, the original creator of "Psyonic-Cetacean 20B"

[ https://huggingface.co./jebcarter/psyonic-cetacean-20B ]

<B>Highest Quality Settings / Optimal Operation Guide / Parameters and Samplers</B>

This is a "Class 2" model:

For all settings used for this model (including specifics for its "class"), example generation(s), and the advanced settings guide (which often addresses model issues), as well as methods to improve model performance for chat, roleplay and all other use cases, please see:

[ https://huggingface.co./DavidAU/Maximizing-Model-Performance-All-Quants-Types-And-Full-Precision-by-Samplers_Parameters ]