Feedback discussion.
...
@M4L1C3 You just need to update koboldcpp. All llama.cpp-based inference engines need to be on the latest version to be compatible with models quantized in the last week and a half. There was a change in the conversion step of quantization that now selects the "smaug-bpe" pre-tokenizer type instead of "bpe", which is what it used before.
So yes, there is an issue on the quantization side (it is baked in by the software that does the conversion), but updating your inference engine should resolve the crashing.
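If you want to check whether a particular GGUF was made with the new converter, you can read its tokenizer.ggml.pre metadata field. A minimal sketch, assuming the gguf-py package that ships inside the llama.cpp repo; the filename is a placeholder and the exact field-access details may differ between gguf-py versions:
```python
# Sketch: inspect a GGUF file's pre-tokenizer metadata (gguf-py from llama.cpp assumed).
from gguf import GGUFReader

reader = GGUFReader("model.Q6_K.gguf")  # placeholder filename
field = reader.fields.get("tokenizer.ggml.pre")
if field is None:
    print("No pre-tokenizer field (older conversion).")
else:
    # String values live in one of the field's parts; data[] indexes the value part(s).
    print("pre-tokenizer:", bytes(field.parts[field.data[0]]).decode("utf-8"))
    # e.g. 'smaug-bpe' (needs a recent engine) vs 'llama-bpe'
```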
@jeiku
Thanks for letting me know!
As for feedback: this particular model has given me a lot more hallucinations (dead links), author's notes, [SCENE] descriptions, and other OOC stuff. It could be down to my prompting, but I haven't noticed similar issues with other/older models.
I tried playing around with the context/Author's Note to set up rules against going OOC, to no avail. On the plus side, it seems to use fewer tokens for larger replies, for some reason.
@Lewdiculous
Regarding recommended settings/samplers, are there any suggestions?
Based on L3-8B-Stheno-v3.1-GGUF-IQ-Imatrix, are these also relevant to this model (see the sketch after this list for how they map onto an inference call):
- Temperature - 1.12 to 1.32
- Min-P - 0.075
- Top-K - 40
- Repetition Penalty - 1.1
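For what it's worth, here is a minimal sketch of what those values look like as sampling parameters in llama-cpp-python, purely as an illustration; the model path is a placeholder and the parameter names are the ones I believe that library uses (repetition penalty is called repeat_penalty there):
```python
# Sketch: Stheno-style sampler values passed to llama-cpp-python (path and names assumed).
from llama_cpp import Llama

llm = Llama(model_path="model.Q6_K.gguf", n_ctx=8192)  # placeholder path/context
out = llm(
    "Once upon a time",
    max_tokens=250,
    temperature=1.2,      # within the suggested 1.12 to 1.32 range
    min_p=0.075,
    top_k=40,
    repeat_penalty=1.1,   # "Repetition Penalty"
)
print(out["choices"][0]["text"])
```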
Additionally, since the way I run it (KoboldCpp) doesn't seem to let me configure Min-P/Top-K (as far as I know), I've been using Top-P sampling instead. I don't really know how comparable that is, but for now these have been the best settings for me (rough API sketch after the list):
- Temperature - 1.01
- Top-P - 0.76
- Repetition Penalty - 1.24
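Here is a rough sketch of how I understand those settings translate into a raw request against a local KoboldCpp instance; the endpoint, port, and field names are my best guess at the Kobold API, and recent builds should also accept min_p and top_k here, as far as I can tell:
```python
# Sketch: sending the Top-P settings above to KoboldCpp's generate endpoint (API details assumed).
import requests

payload = {
    "prompt": "Once upon a time",
    "max_context_length": 12288,  # "Max Ctx. Tokens" below
    "max_length": 250,            # "Amount to Gen" below
    "temperature": 1.01,
    "top_p": 0.76,
    "rep_pen": 1.24,
    # "min_p": 0.075, "top_k": 40,  # reportedly also supported by newer builds
}
r = requests.post("http://localhost:5001/api/v1/generate", json=payload)
print(r.json()["results"][0]["text"])
```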
As I haven't found suggestions on context length [Context Size] & [Max Ctx. Tokens] (and how they depend on, or differ across, Q8, Q6, Q5...; see the rough memory sketch after the footnote)*, I've been running Q6_K with:
- Context Size - 16384
- Max Ctx. Tokens - 12288
- Amount to Gen - 250
250 still spills some extra/repetitive stuff, but might be useful for... inspiration ;)
*(Yes, I will look it up, just started out with llm's, I swear I'm not begging for a guide :P)
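From what I've gathered so far, the quant level (Q4/Q5/Q6/Q8) mostly sets the size of the weights, while the context length mainly sets the size of the KV cache that sits on top of them. A rough back-of-the-envelope sketch, assuming the usual Llama-3-8B attention shape (32 layers, 8 KV heads, head dim 128) and an fp16 KV cache:
```python
# Sketch: rough KV-cache memory per context length for a Llama-3-8B-class model (shape assumed).
n_layers, n_kv_heads, head_dim = 32, 8, 128
bytes_per_elem = 2  # fp16 K and V entries
per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem  # K + V per token
for ctx in (8192, 12288, 16384):
    print(f"{ctx} ctx -> ~{per_token * ctx / 2**30:.1f} GiB of KV cache")
# -> roughly 1.0, 1.5, and 2.0 GiB on top of the (quantized) weights, which is why
#    a larger context mostly costs memory/speed rather than changing which quant "works".
```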
/////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////
Following is extra information, feel free to ignore, if someone wants it for reference.
I use an RTX 4060 / 16 GB RAM / i5-10400F @ 2.9 GHz, with the following settings (rough launch command after the list):
- Layers to GPU - 27
- BLAS - 512
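For reference, I believe those settings correspond to KoboldCpp launch flags along these lines (the model filename is a placeholder):
python koboldcpp.py --model L3-8B-model.Q6_K.gguf --contextsize 16384 --gpulayers 27 --blasbatchsize 512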
I assume I'm below spec for Q6, since L3-8B-Stheno suggests using Q4 with 12288 context, but for my goals I'm fine with the response not generating instantly (I'm not into edging, I swear). I presume that pushing the context length too high, or giving your system more than it can handle, might worsen results even with a "better" model, but that remains to be seen after extensive "experimentation/research".
Lastly, don't know what goes on in the kitchen or how hard it is to bake these cakes, but they taste all the same - delicious. Kudos Author :)
(EDIT#1 - Didn't click on preview to see formatting on this site, sorry x_x)
SillyTavern Presets (recommended):
- Samplers
- Context and Instruct
Honestly, just tweak from there. This depends a lot on the model, so you might need to tune to your liking.
Unfortunately, the model does not load in oobabooga (currently updated):
llama_model_load: error loading model: error loading model vocabulary: unknown pre-tokenizer type: 'smaug-bpe'
llama_load_model_from_file: failed to load model
00:25:30-813299 ERROR Failed to load the model.
@Andi47 It will load in an updated llama.cpp or koboldcpp instance; unfortunately, it seems the upstream changes have not been implemented in oobabooga yet. Your best bet is to use either llama.cpp or koboldcpp, or, if you are married to the oobabooga infrastructure, open an issue on their GitHub asking for support.
If you download llama.cpp, you can use the following command to rewrite the pre-tokenizer metadata on an existing GGUF:
python llama.cpp/gguf-py/scripts/gguf-new-metadata.py --pre-tokenizer llama-bpe smauggy.gguf fixed.gguf
Here smauggy.gguf is your input file (it can be any GGUF version of the model) and fixed.gguf is your output file. If gguf-new-metadata.py is not at this location, point to wherever the script lives in your checkout. If you fix the f16-converted GGUF, you can then quantize that file further, and the change will carry over to every quantized GGUF produced from it.
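For example, once the f16 conversion is fixed, the follow-up quantization step would look something like this (output name is a placeholder, and depending on how recent your llama.cpp build is, the tool is called quantize or llama-quantize and may live under build/bin/):
./llama.cpp/llama-quantize fixed.gguf fixed.Q6_K.gguf Q6_K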