Triangle104/Eximius_Persona_5B-Q6_K-GGUF
This model was converted to GGUF format from SicariusSicariiStuff/Eximius_Persona_5B
using llama.cpp via ggml.ai's GGUF-my-repo space.
Refer to the original model card for more details on the model.
Model details:
I wanted to create a model with an exceptional capacity for varied speech patterns and fresh role-play takes. The model had to have a unique personality, not on a surface level but on the inside, for real. Unfortunately, SFT alone just didn't cut it. And I had only 16GB of VRAM at the time. Oh, and I wanted it to be small enough to be viable for phones, and to be able to put up a fight against larger models while at it. If only there were a magical way to do it.
Merges. Merges are quite unique. In the early days, they were considered "fake." Clearly, there's no such thing as merges. Where are the papers? No papers? Then it's clearly impossible. "Mathematically impossible." Simply preposterous. To mix layers and hope for a coherent output? What nonsense!
And yet, they were real. Undi95 made some of the earliest merges I can remember, and the "LLAMA2 Era" was truly amazing and innovative thanks to them. Cool stuff like Tiefighter was being made, and eventually the time-tested Midnight-Miqu-70B (v1.5 is my personal favorite).
Merges are an interesting thing, as they affect LLMs in a way that is currently impossible to reproduce using SFT (or any 'SOTA' technique). One of the plagues we have today, even though we have orders-of-magnitude smarter LLMs, is GPTisms and predictability. Merges can potentially 'solve' that. How? In short, if you physically tear neurons apart (passthrough brain surgery) while somehow keeping the model coherent enough, and if you're lucky, it can even follow instructions, then magical stuff begins to happen.
Magic, because it's not an exact science; there's some art to it, as it's done with a lot of intuition. GPTisms are patterns that the model really, really "wants" to follow, and it's quite hard to dissuade it. But if you yeet a couple of layers and rearrange them, boy, does it get hard to spew those shivers down the spine... and instead the model starts spewing stuff it was never intended to. It breaks its patterns and introduces some healthy chaos into the mix.
This model, Eximius_Persona_5B, is the result of multiple merges that were tuned, then merged again, and so on, for many iterations. The base was LLAMA 3.2 3B, and I focused on achieving the following four traits, in that specific order:
Varied speech patterns
Roleplay ability
Long context coherency
Instruction following
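The passthrough layer surgery described above can be sketched with a mergekit config. This is a purely illustrative example of the technique, not the actual recipe behind Eximius_Persona_5B; the source model names and layer ranges here are assumptions:

```yaml
# Illustrative passthrough merge: stack overlapping layer slices from two
# checkpoints to grow a ~3B base toward ~5B. Layer ranges are made up.
slices:
  - sources:
      - model: meta-llama/Llama-3.2-3B-Instruct
        layer_range: [0, 18]
  - sources:
      - model: some-org/llama-3.2-3b-rp-finetune   # hypothetical finetune
        layer_range: [10, 28]
merge_method: passthrough
dtype: bfloat16
```

With mergekit installed, `mergekit-yaml config.yml ./merged` would produce the stacked-layer model; as the text notes, such a model usually needs further tuning before it regains full coherence.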
For me, getting varied speech patterns was more important than instruction following; for instruction following we have API models, or LLAMA 3.3. Many models are excellent assistants, yet they all sound pretty much the same.
I also wanted to make use of my 4090m 16GB while my workstation crunches Phi-4's brain. Making a nice 5B model aligns with my goal of making AI accessible and fun for everyone, and hence Eximius_Persona_5B was born. Let this also be a call to action for more people to make AI models: you don't have to have multiple GPUs or spend a fortune on the cloud (although that definitely opens up options), you can do plenty with a mere 16GB of VRAM. And in case 16GB seems out of reach too, I should mention that Google Colab gives access to a free T4.
I uploaded a more funky, less stable, and thiccer version of Eximius_Persona to my prototyping org here:
Eximius_Persona with 84 Layers from various checkpoints
(From some early tests, it occasionally outputs stories that fool GPTZero into judging them human-written: 60% human, 40% AI with a lucky roll.) See example: GPTZERO Example

TL;DR:
Fun & Fresh Roleplay flavour.
Interesting speech patterns in creative writing.
Good long context coherency in Roleplay.
Occasionally outputs quite human-like stories.
50 Layers LLAMA 3.2, fully coherent.
Strong performance in general for a 5B model.
Important: Make sure to use the correct settings!
Assistant settings
Roleplay settings
Details
Intended use: Role-Play, Creative Writing, General Tasks.
Censorship level: Medium
5 / 10 (10 = completely uncensored)
Use with llama.cpp
Install llama.cpp through brew (works on macOS and Linux):
brew install llama.cpp
Invoke the llama.cpp server or the CLI.
CLI:
llama-cli --hf-repo Triangle104/Eximius_Persona_5B-Q6_K-GGUF --hf-file eximius_persona_5b-q6_k.gguf -p "The meaning to life and the universe is"
Server:
llama-server --hf-repo Triangle104/Eximius_Persona_5B-Q6_K-GGUF --hf-file eximius_persona_5b-q6_k.gguf -c 2048
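Once llama-server is up, you can talk to it over HTTP. A minimal sketch using the server's OpenAI-compatible chat endpoint, assuming the default host and port (localhost:8080):

```shell
# Send one chat request to the llama-server started above.
# Endpoint path and port are llama.cpp server defaults; adjust if you
# launched the server with different settings.
curl -s http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "messages": [{"role": "user", "content": "Introduce yourself in character."}],
        "max_tokens": 128
      }' || echo "llama-server is not reachable on localhost:8080"
```

The response comes back as JSON in the usual chat-completions shape, so any OpenAI-style client pointed at this base URL should work too.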
Note: You can also use this checkpoint directly through the usage steps listed in the llama.cpp repo.
Step 1: Clone llama.cpp from GitHub.
git clone https://github.com/ggerganov/llama.cpp
Step 2: Move into the llama.cpp folder and build it with the LLAMA_CURL=1 flag along with other hardware-specific flags (for example, LLAMA_CUDA=1 for Nvidia GPUs on Linux).
cd llama.cpp && LLAMA_CURL=1 make
Step 3: Run inference through the main binary.
./llama-cli --hf-repo Triangle104/Eximius_Persona_5B-Q6_K-GGUF --hf-file eximius_persona_5b-q6_k.gguf -p "The meaning to life and the universe is"
or
./llama-server --hf-repo Triangle104/Eximius_Persona_5B-Q6_K-GGUF --hf-file eximius_persona_5b-q6_k.gguf -c 2048
Model tree for Triangle104/Eximius_Persona_5B-Q6_K-GGUF
Base model
meta-llama/Llama-3.2-3B-Instruct