Broken End Token
Unfortunately, the quant uses the wrong end token and appends the word "assistant" to every response.
Check out the NousResearch thread for details on how to fix it:
https://huggingface.co./NousResearch/Meta-Llama-3-8B-Instruct-GGUF/discussions/1
Thanks for the heads-up. Unfortunately, that's what the upstream model does right now. I'll probably delete this repo and/or redo it once upstream (in this case, NousResearch) has fixed theirs.
I think the upstream solution is wrong. The end token in this repo is correct, just not for all cases - llama.cpp doesn't handle multiple end tokens at the moment.
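In the meantime, the cleanest workaround is on the client side: pass both Llama 3 end markers as stop strings, so generation halts whichever one the model emits. A minimal sketch with llama-cpp-python (the model path and settings are placeholders, not anything from this repo):

```python
# Workaround sketch: the GGUF can only flag one token as EOS, so tell the
# client to stop on either of Llama 3's end markers instead.
# Assumes llama-cpp-python is installed; the model path is a placeholder.
from llama_cpp import Llama

llm = Llama(model_path="Meta-Llama-3-8B-Instruct.Q4_K_M.gguf", n_ctx=4096)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Who played Jake Harper?"}],
    max_tokens=128,
    # Stop on either end marker, so generation halts even when the model
    # emits the one that is not marked as EOS in the GGUF metadata.
    stop=["<|eot_id|>", "<|end_of_text|>"],
)
print(out["choices"][0]["message"]["content"])
```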
Well, GPT4All freezing is obviously not a problem with these ggufs, but simply a bug in gpt4all.
@mradermacher Yeah, when it doesn't get the expected end token it freezes. But it's still a problem in other apps, such as koboldcpp, which will keep outputting nonsense - it just doesn't freeze.
That's a known bug in token handling in koboldcpp that has been fixed.
Also, koboldcpp did not simply keep outputting nonsense, even before the fix - it was simply impossible to configure the stop token.
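If you want to work around it per request anyway, you can pass stop strings through koboldcpp's KoboldAI-compatible HTTP API. Roughly like this - the endpoint, port, and field names are assumptions based on that API and may differ between versions:

```python
# Sketch: send both Llama 3 end markers as stop sequences to a locally
# running koboldcpp instance (default port assumed; adjust as needed).
import json
import urllib.request

payload = {
    "prompt": "<|start_header_id|>user<|end_header_id|>\n\n"
              "Who played Jake Harper?<|eot_id|>"
              "<|start_header_id|>assistant<|end_header_id|>\n\n",
    "max_length": 128,
    # Treat both end markers as stop sequences.
    "stop_sequence": ["<|eot_id|>", "<|end_of_text|>"],
}

req = urllib.request.Request(
    "http://localhost:5001/api/v1/generate",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    result = json.load(resp)

print(result["results"][0]["text"])
```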
Here's an example of what I'm talking about with koboldcpp. After it answers my question, the following gets appended:
"(Note: The responses should be brief and concise.)
More Information:
If you want more information about this topic or the TV show itself, feel free to ask. I'd be happy to help!assistant"
I'll try downloading the new version and see if it still happens.
The latest version of koboldcpp still does it. All GGUFs of Llama 3 without the end token fix do this in every app I tested.
Angus T. Jones played Jake Harper.ert</div></p>ertassistant
The end token "fix" does not fix anything, it just replaces one bug with another, i.e. it might work in your config but break in others. This is not a problem with these ggufs - the ggufs are (within the limits of current llama.cpp support) correct. I will not break the model because of bugs in inference engines. If llama 3 multiple end token support improves, I might redo these ggufs.
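If you want to check what a given quant actually declares, you can read the EOS token id straight out of the GGUF metadata with the gguf Python package. A rough sketch - the field access follows gguf-py's GGUFReader layout (newer versions also have a contents() helper), and the filename is just a placeholder:

```python
# Sketch: inspect which token a GGUF declares as EOS.
# Requires the gguf package (pip install gguf); filename is a placeholder.
from gguf import GGUFReader

reader = GGUFReader("Meta-Llama-3-8B-Instruct.Q4_K_M.gguf")

field = reader.fields["tokenizer.ggml.eos_token_id"]
# For a scalar field, the last part holds the value.
eos_id = int(field.parts[-1][0])
print("declared EOS token id:", eos_id)

# Llama 3 Instruct ends turns with <|eot_id|> (128009), while
# <|end_of_text|> (128001) is the plain end-of-text token - hence the
# "multiple end tokens" problem discussed above.
```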