model breaks down at high context

#1
by FlareRebellion - opened

tested the Q4_K_S at 24k context and it's entirely incoherent.

No such problems with https://huggingface.co./NeverSleep/Lumimaid-v0.2-70B-GGUF

Well, that URL does not even have a Q4_K_S, so I'm not sure what you are comparing against. But that sounds more like download corruption or the LLM having a bad day/bad settings rather than an issue with these quants.

mradermacher changed discussion status to closed

@mradermacher Are you sure this is not related to the Llama 3.1 rope scaling issue? The rope scaling issue especially affects context windows above 8192. Based on the upload time it seems as if this Meta-Llama-3.1-70B-Instruct based model was quantized with the old llama.cpp version and needs to be requantized with the new one for the rope scaling issue to be fixed.

Are you talking about this repo or another one? The files here are clearly more recent than in the linked repo, so I don't understand what you are referring to. AFAIK, I have not uploaded any 3.1 quants before the 3.1 rope scaling was implemented in llama.cpp, unless explicitly requested by the author.

Sorry for the confusion. I thought you started using the new llama.cpp version on 27th July 2024 at 22:35 GMT, when you mentioned that you (re-)queued everything, but I overlooked that in the same comment you also mentioned that you had already converted a few models first for testing. I assume this must be one of those test models. The new llama.cpp version got released on 27th July 2024 at 14:07 GMT, while you started with Lumimaid-v0.2-70B on 27th July 2024 at 14:42 GMT, so assuming you updated immediately, this model is perfectly fine.

Ah, that explains it. Lumimaid was indeed the first model (models, really) I queued. You can see whether the new rope implementation is in effect by clicking on a quant on Hugging Face (right side panel) and looking for the rope_freqs tensor - if it's there, it was converted with the new code.
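
If you'd rather check programmatically, here is a minimal sketch using the `gguf` Python package from llama.cpp's gguf-py; the file name below is just a placeholder, and the tensor may appear in the listing as `rope_freqs.weight`:

```python
# Minimal sketch: list the tensors in a local GGUF file and check for rope_freqs,
# which indicates the file was converted with the newer llama.cpp code that
# implements Llama 3.1 rope scaling.
# Requires: pip install gguf
from gguf import GGUFReader

reader = GGUFReader("Lumimaid-v0.2-70B.Q4_K_S.gguf")  # placeholder file name
tensor_names = [t.name for t in reader.tensors]
has_rope_freqs = any(name.startswith("rope_freqs") for name in tensor_names)
print("rope_freqs tensor present:", has_rope_freqs)
```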

For what it's worth, I downloaded the IQ3_XXS, and even at 29k tokens, it was completely coherent. Well, as coherent as with smaller contexts.
