4080 or 16gb of VRAM

#3
by MateoTeo - opened

All 57 layers with Q4_K_M and 16k Q8 context:

Model: Mistral-Small-22B-ArliAI-RPMax-v1.1-Q4_K_M
Engine: KoboldCPP v1.74 (CuBLAS)
MaxCtx: 16384
Layers: 59
Threads: 8
BlasBatchSize: 512
ProcessingAmount: 16384
GenAmount: 100

ProcessingTime: 9.627s
ProcessingSpeed: 1691.49T/s
GenerationTime: 4.513s
GenerationSpeed: 22.16T/s
TotalTime: 14.140s
Output: 1 1 1 1
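
For anyone who wants to reproduce a run like this, here is a minimal sketch of launching KoboldCPP's built-in benchmark with roughly the settings above. The flag names are from recent KoboldCPP builds and the model filename is a placeholder (assumptions on my side, not taken from the run above); check `python koboldcpp.py --help` on your version first.

```python
# Minimal sketch: run KoboldCPP's built-in benchmark with settings similar
# to the run above. Flag names are from recent KoboldCPP builds (assumption;
# verify with `python koboldcpp.py --help`). The model path is a placeholder.
import subprocess

cmd = [
    "python", "koboldcpp.py",
    "--model", "Mistral-Small-22B-ArliAI-RPMax-v1.1-Q4_K_M.gguf",
    "--usecublas",             # CuBLAS backend, as in the run above
    "--gpulayers", "59",       # offload all layers to the GPU
    "--contextsize", "16384",  # 16k context
    "--threads", "8",
    "--blasbatchsize", "512",
    "--flashattention",        # needed for the quantized KV cache
    "--quantkv", "1",          # 1 = Q8 KV cache ("16k Q8 context")
    "--benchmark",             # run the built-in benchmark and exit
]
subprocess.run(cmd, check=True)
```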

My experience shows no difference between modern Q8/6/5/4 (K_M or K_L) quants for roleplay or creative writing, even with complex rules and interfaces (2k tokens and more) ¯\_(ツ)_/¯
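
As a rough sanity check on why this fits in 16 GB, here is a back-of-the-envelope estimate of weights plus Q8 KV cache. The architecture numbers (layer count, KV heads, head dimension) and the bits-per-weight figure are assumptions based on the published Mistral Small 22B config and typical Q4_K_M sizes, not values measured in this thread.

```python
# Back-of-the-envelope VRAM estimate for 22B Q4_K_M + 16k Q8 KV cache.
# All architecture/quant numbers below are assumptions (approximate values
# from the published Mistral Small 22B config and typical llama.cpp quant
# sizes), not measurements from this thread.
GIB = 1024**3

params          = 22.2e9   # ~22.2B parameters
bits_per_weight = 4.85     # approx. effective bpw of Q4_K_M
n_layers        = 56       # transformer blocks
n_kv_heads      = 8        # grouped-query attention
head_dim        = 128
ctx             = 16384
kv_bytes_per_el = 1.0625   # q8_0: ~8.5 bits per element

weights_gib = params * bits_per_weight / 8 / GIB
# K and V caches: 2 tensors per layer, n_kv_heads * head_dim elements per token
kv_gib = 2 * n_layers * n_kv_heads * head_dim * ctx * kv_bytes_per_el / GIB

print(f"weights  ~{weights_gib:.1f} GiB")   # ~12.5 GiB
print(f"KV cache ~{kv_gib:.1f} GiB")        # ~1.9 GiB
# plus compute buffers / CUDA overhead, which is why 16 GB is a snug fit
```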

MateoTeo changed discussion title from 4080 or 16gb of RAM to 4080 or 16gb of VRAM

GOD I AM SO JEALOUS OF YOU! MY 3060 12GB IS ABOUT 2-3 TOKENS PER SECOND! AND OVER TIME IT BECOMES 1-2 TOKENS PER SECOND!

Eheh... no need, mate. I'm jealous of people with 24GB of VRAM and two 4090s :D
For 12GB you can try the 12B model (Mistral NeMo) here in Q6_K or Q5_K_M (usually Q4_K_M is good too).
I don't recommend IQ2, but... maybe it will be good enough for your tasks. It depends. If you go that low, look for importance-matrix (imatrix) IQ2 or IQ3 quants - they are better.
In my tests, only 70B models can still produce good results at low quantization, but the quality drops to around that of a 36B model, and the model may get confused by "you", "me", "I" tokens.
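
To make the "12B on 12GB" suggestion concrete, here is a small sketch that estimates GGUF weight sizes from parameter count and approximate bits-per-weight. The bpw figures are rough averages for llama.cpp k-quants and i-quants (assumptions, not exact file sizes), but they show why Q6_K/Q5_K_M of a 12B model still leaves room for context on a 12GB card, while the same quants of a 22B model do not.

```python
# Rough GGUF weight-size estimates: params * bits-per-weight / 8.
# The bpw values are approximate averages for llama.cpp quant types
# (assumption), and real files also carry some metadata overhead.
GIB = 1024**3

BPW = {          # approximate effective bits per weight
    "Q8_0":   8.5,
    "Q6_K":   6.56,
    "Q5_K_M": 5.67,
    "Q4_K_M": 4.85,
    "IQ3_M":  3.7,
    "IQ2_M":  2.7,
}

def weight_gib(params_billion: float, quant: str) -> float:
    """Estimated size of the weights alone, in GiB (no KV cache/buffers)."""
    return params_billion * 1e9 * BPW[quant] / 8 / GIB

for quant in BPW:
    print(f"12B {quant:<6} ~{weight_gib(12.2, quant):5.1f} GiB | "
          f"22B {quant:<6} ~{weight_gib(22.2, quant):5.1f} GiB")
```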
