Does it work for anyone?

#1 opened by Onix22

This model is producing nonsense.
If the temperature is decreased it stays coherent for a bit longer, but eventually it degenerates into output like this:

he the story tale of Timmy's s prankankles were were were were not only amused entertained but also learned lesson moremory moment experience from what happened outcome result effect impact influence on upon upon upon upon upon upon upon upon upon upon upon upon upon upon upon upon upon upon upon upon upon upon upon upon upon upon upon upon upon upon upon upon upon upon upon upon upon upon upon upon upon upon upon upon

I had this problem because I failed to change the model loader to ExLlama.

With the Oobabooga webui I got it working on a single 3090; total GPU memory used is 23.6/24 GB. It generates text just fine with any of the instruct generation parameter presets. Make sure the Model loader is set to ExLlama (not ExLlama_HF), with max_seq_len = 4096 and compress_pos_emb = 2. You'll need more than 24 GB of VRAM for 8k tokens; it seems to be riding the limit at 4k. Under Parameters, set Truncate Prompt to 4096. Also check your instruction template under Chat settings; I have mine on WizardLM, and it wasn't set automatically for me.
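For a rough sense of why 4k "rides the limit" on 24 GB while 8k doesn't fit, here is a back-of-the-envelope sketch. It assumes a ~33B LLaMA-architecture model (60 layers, hidden size 6656) with roughly 17 GB of 4-bit quantized weights resident in VRAM and an fp16 KV cache pre-allocated at the full max_seq_len; the exact model and weight footprint are my assumptions, not something stated in this thread, and activations/buffers are not counted.

```python
# Rough VRAM estimate: quantized weights + fp16 KV cache pre-allocated at max_seq_len.
# Assumed (not from the thread): 33B LLaMA-architecture model, 60 layers,
# hidden size 6656, ~17 GiB of 4-bit quantized weights in VRAM.

GIB = 1024 ** 3

def kv_cache_bytes(seq_len: int, n_layers: int = 60, hidden_size: int = 6656,
                   bytes_per_elem: int = 2) -> int:
    """fp16 keys + values for every layer, allocated up front for seq_len tokens."""
    return 2 * n_layers * hidden_size * seq_len * bytes_per_elem  # 2 = K and V

weights_gib = 17.0  # assumed footprint of the quantized weights

for seq_len in (4096, 8192):
    cache_gib = kv_cache_bytes(seq_len) / GIB
    total_gib = weights_gib + cache_gib
    print(f"max_seq_len={seq_len}: cache ~{cache_gib:.1f} GiB, total ~{total_gib:.1f} GiB")

# Approximate output:
# max_seq_len=4096: cache ~6.1 GiB, total ~23.1 GiB   <- right at the 24 GB edge
# max_seq_len=8192: cache ~12.2 GiB, total ~29.2 GiB  <- more than a single 3090
```

The totals line up with the observed 23.6/24 GB at 4k and the need for more than 24 GB at 8k, though the real numbers depend on the actual model and quantization.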

I found on another site that these models need to be set to a context length greater than 2048 tokens or they don't work at all.
Now everything seems to be working, but this should be stated on the model page.
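For reference, the rule of thumb I've seen for these extended-context (SuperHOT-style) models is compress_pos_emb = max_seq_len / 2048, i.e. the scaling factor relative to LLaMA's original 2048-token context. Treat this as a guideline rather than official guidance; a quick sketch:

```python
# Rule of thumb I've seen for SuperHOT-style extended-context models:
# compress_pos_emb = max_seq_len / 2048 (the original LLaMA training context).
# They reportedly misbehave if left at the stock 2048-token context.

ORIGINAL_CONTEXT = 2048

def compress_pos_emb_for(max_seq_len: int) -> int:
    # Assumes max_seq_len is a multiple of the original context length.
    return max_seq_len // ORIGINAL_CONTEXT

for max_seq_len in (4096, 8192):
    print(f"max_seq_len={max_seq_len}: compress_pos_emb={compress_pos_emb_for(max_seq_len)}")

# max_seq_len=4096: compress_pos_emb=2   <- the settings that worked above
# max_seq_len=8192: compress_pos_emb=4   <- needs more than 24 GB VRAM
```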

As for the maximum, I haven't hit the limit yet, but it looks like ExLlama can use shared GPU memory, so context length isn't really bounded by the usual OOM error; it just gets very slow once it spills over.
It also allocates all of the memory in advance.
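If you want to confirm the up-front allocation yourself, one quick way (a sketch only; the loading and generation steps are whatever you normally use, indicated by placeholder comments) is to sample free VRAM before loading, after loading, and after a generation. If the last two readings barely differ, the cache was pre-allocated at max_seq_len rather than growing as the context fills:

```python
# Sample GPU memory usage at a few points to see whether the cache is
# pre-allocated at load time. Requires PyTorch with CUDA available.

import torch

def report_vram(tag: str, device: int = 0) -> None:
    free, total = torch.cuda.mem_get_info(device)  # returns (free_bytes, total_bytes)
    used_gib = (total - free) / 1024**3
    print(f"{tag}: {used_gib:.1f} GiB used of {total / 1024**3:.1f} GiB")

report_vram("before loading")
# ... load the model here with your ExLlama-based loader of choice ...
report_vram("after loading")
# ... run one generation ...
report_vram("after generating")
```

Watching nvidia-smi while the webui loads the model shows the same thing.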
