Model seems to have a repetition issue.

#4
by ockerman0 - opened

I was giving this model a go earlier using one of my own character cards. The earlier messages were all pretty good, though roughly 20 messages in I began to notice the model would start to repeat whole phrases from earlier messages over and over. Using the find feature in my browser, I noticed some phrases repeated up to five times.

Unrelated, though worth mentioning as well, is the model's tendency to just recite the character's description instead of actually formulating a response. I have also observed this when trying to generate a summary.

For reference, I have tried the static Q8 and Q6_K quants, as well as the F16.Q6 quant; all three seem to have the same issue.

Yeah, some repetition is still present in v1.2; we are working on fixing that in the new version. We are filtering the most repetitive samples out of the C2 Logs subset used in this model, and adding RP formatting to the D/WP dataset to, potentially, mix it better with the RP data. This will also hopefully fix the second issue (which likely happens because some character cards in C2 take up a large part of the 8k sequence length).
We have also received reports of the model being more repetitive on inference engines other than vLLM and Aphrodite; more specifically, of repetition/frequency/presence penalties being less effective than expected.
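If you want to check how the penalties behave on your own setup, here is a minimal sketch of passing them through an OpenAI-compatible completions endpoint (both vLLM and Aphrodite expose one); the URL, port, model name, and penalty values are placeholders for your own deployment:

```python
# Minimal sketch: sending repetition/frequency/presence penalties to an
# OpenAI-compatible completions endpoint (vLLM and Aphrodite both serve one).
# The URL, port, model name, and penalty values are placeholders.
import requests

payload = {
    "model": "your-model-name",   # placeholder
    "prompt": "Your prompt here",
    "max_tokens": 256,
    "frequency_penalty": 0.3,     # standard OpenAI-style sampling param
    "presence_penalty": 0.3,      # standard OpenAI-style sampling param
    "repetition_penalty": 1.1,    # extra param both engines accept in the body
}

resp = requests.post("http://localhost:8000/v1/completions", json=payload)
resp.raise_for_status()
print(resp.json()["choices"][0]["text"])
```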

Nothing is Real org

@ockerman0 try using the Aphrodite engine with a rep pen of 1.1

Nothing is Real org

A friend of mine found repetition differences between engines, and Aphrodite pretty much fixes it with a rep pen of 1.1.
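If it helps, here's a rough sketch of setting that rep pen from a script once Aphrodite's OpenAI-compatible server is up; the port assumes Aphrodite's usual default, and the model name is a placeholder:

```python
# Sketch: requesting a completion from a running Aphrodite server with a
# repetition penalty of 1.1. Port and model name are assumptions; adjust
# them for your own deployment.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:2242/v1", api_key="dummy")

resp = client.completions.create(
    model="your-model-name",  # placeholder
    prompt="Your prompt here",
    max_tokens=256,
    extra_body={"repetition_penalty": 1.1},  # passed through to the engine
)
print(resp.choices[0].text)
```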

Thanks, I'll have a look and see if I can find any differences. Something I should've mentioned in the first post is that I was using KoboldCpp. Aphrodite doesn't seem to support many AMD graphics cards at the moment, so I'll have to try vLLM.

Nothing is Real org

@ockerman0 you can also try loading the EXL2 quants with TabbyAPI; the EXL2 format seems to be the least problematic one, and Tabby supports ROCm.
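In case it saves anyone a step, here is a quick sketch of pulling an EXL2 quant from the Hub to point TabbyAPI at; the repo id and branch below are placeholders, since EXL2 uploads are often published as one branch per bits-per-weight level:

```python
# Sketch: downloading an EXL2 quant for TabbyAPI to load locally.
# repo_id and revision below are placeholders, not real locations.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="SomeOrg/SomeModel-exl2",  # placeholder repo
    revision="6.3bpw",                 # placeholder branch for the quant level
)
print(local_dir)  # point TabbyAPI's model directory at this path
```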

I've tried a short conversation using the 6.3bpw EXL2 quant with TabbyAPI, and it seems to behave mostly the same. Hopefully the new version will do something to alleviate this.

Changing the repetition penalty range seems to help with the issue somewhat, though to what extent I'm not sure.
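For anyone else wanting to experiment with it, this is roughly what tweaking the range looks like against KoboldCpp's KoboldAI-style generate endpoint; the port and parameter values here are just assumptions to adjust:

```python
# Sketch: setting the repetition penalty range via KoboldCpp's
# KoboldAI-style API. Port and values are assumptions, not recommendations.
import requests

payload = {
    "prompt": "Your prompt here",
    "max_length": 256,
    "rep_pen": 1.1,          # repetition penalty strength
    "rep_pen_range": 2048,   # how many recent tokens the penalty applies to
    "rep_pen_slope": 0.7,    # how sharply the penalty tapers within the range
}

resp = requests.post("http://localhost:5001/api/v1/generate", json=payload)
resp.raise_for_status()
print(resp.json()["results"][0]["text"])
```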
