---
base_model: HiroseKoichi/Llama-3-8B-Stroganoff-4.0
library_name: transformers
license: llama3
tags:
  - nsfw
  - not-for-all-audiences
  - llama-3
  - text-generation-inference
  - mergekit
  - merge
  - llama-cpp
  - gguf-my-repo
---

# Triangle104/Llama-3-8B-Stroganoff-4.0-Q5_K_M-GGUF

This model was converted to GGUF format from HiroseKoichi/Llama-3-8B-Stroganoff-4.0 using llama.cpp via ggml.ai's GGUF-my-repo space. Refer to the original model card for more details on the model.


## Model details

I made a really stupid mistake and uploaded two models instead of one. I uploaded the files for both and planned to decide which one to release today, but I got up at 4-5 am, got straight on my PC, and set both to public after writing more of the model card. Hopefully no one downloaded the wrong one, but if you did, I'm sorry for the inconvenience.

## Llama-3-8B-Stroganoff-4.0

Since V3, I tested a lot of old models, looked at some new ones, and used every merge method available in mergekit. This one is from experiments I was doing on model order, which is why all the models use the same parameters, but it was good enough that I decided to upload it. If you've been doing merges yourself, then most or all of the following information will be redundant, but some of it was not at all apparent to me, so I hope it will help others looking for more information.

Ties is not better than Task-Arithmetic, and Task-Arithmetic is not better than Ties; each has advantages that make it better in different situations. Ties aims to reduce model interference by keeping weights that agree with each other and zeroing out the rest. If you use Ties with a bunch of models that do different things, some aspects of a model may get erased if they don't have a strong enough presence. The order of the models does not matter in a Ties merge because all of the merging happens in one step, and changing the model order will produce identical hashes, assuming you're not using Dare or Della, which add randomness to the merge.
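The one-step, order-independent nature of Ties can be sketched in a few lines of NumPy. This is an illustrative simplification, not mergekit's implementation: trim small deltas, elect a sign per weight from the summed deltas, and average only the deltas that agree with the elected sign.

```python
import numpy as np

def ties_merge(base, finetunes, density=0.5):
    """Toy Ties-style merge over flat weight arrays (illustrative only)."""
    deltas = [ft - base for ft in finetunes]
    trimmed = []
    for d in deltas:
        # Keep only the top `density` fraction of deltas by magnitude.
        k = max(int(d.size * density), 1)
        cutoff = np.sort(np.abs(d).ravel())[-k]
        trimmed.append(np.where(np.abs(d) >= cutoff, d, 0.0))
    stacked = np.stack(trimmed)
    # Elect a sign per weight from the summed trimmed deltas.
    elected = np.sign(stacked.sum(axis=0))
    # Zero out deltas that disagree with the elected sign, then average.
    agree = np.where(np.sign(stacked) == elected, stacked, 0.0)
    counts = np.maximum((agree != 0).sum(axis=0), 1)
    return base + agree.sum(axis=0) / counts
```

Because the merge is a single symmetric reduction over all models, reordering the `finetunes` list cannot change the result, which is why identical hashes fall out of it.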

Task-Arithmetic is a linear merge that first subtracts the base model from the fine-tuned models and then merges them in pairs starting at the top of the list before finally merging the result back on top of the base model. The order of the models does matter with a Task-Arithmetic merge, and changing the model order will produce different hashes. A Task-Arithmetic merge keeps more of the individuality of the component models, with the last to be merged having the strongest effect on the resulting model. Task-Arithmetic can be unpredictable at times, as changing the order of the models can produce significantly different results, but it can be effective at combining the strengths of different models once you find the right order.
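The sequential scheme described above can be sketched as follows. This is a hypothetical simplification (plain pairwise interpolation of task vectors, not mergekit's exact arithmetic), but it shows why order matters and why the last model merged dominates.

```python
import numpy as np

def sequential_task_merge(base, finetunes, alpha=0.5):
    """Toy order-dependent merge: fold task vectors in pairwise, top-down."""
    deltas = [ft - base for ft in finetunes]     # subtract the base model
    merged = deltas[0]
    for d in deltas[1:]:
        # Each step interpolates the running result with the next delta,
        # so later models retain a larger share of the final mix.
        merged = (1 - alpha) * merged + alpha * d
    return base + merged                         # add back onto the base
```

With three models and `alpha=0.5`, the first model's delta ends up weighted 1/4 and the last model's delta 1/2, so reversing the list genuinely changes the output, matching the different-hashes observation.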

Dare, Della, and Breadcrumbs are all enhancements to Ties and Task-Arithmetic that aim to improve the resulting merge by zeroing out certain weights. While they all remove weights before merging takes place, they each do it a bit differently. Dare assigns a flat dropout rate, meaning all weights have an equal chance of being dropped; Della scales the dropout rate based on the magnitude of change from the base model, with the largest changes having the smallest dropout rate; and Breadcrumbs first removes any outliers and then zeroes out weights until it reaches the target density, starting with the smallest changes. I've done direct comparisons between Dare and Della with all the same parameters, and Della has consistently outperformed Dare. I haven't tested Breadcrumbs much, but the idea behind it seems solid.
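The Dare/Della distinction can be made concrete with a toy sketch. This is illustrative only (the `eps` spread is a made-up parameter, not mergekit's exact formula): Dare drops every weight with the same probability, while the Della-style variant gives larger-magnitude deltas a higher keep probability; both rescale survivors to preserve the expected magnitude.

```python
import numpy as np

def dare_drop(delta, density=0.55, rng=None):
    """Flat dropout: every weight kept with probability `density`."""
    if rng is None:
        rng = np.random.default_rng(0)
    mask = rng.random(delta.shape) < density
    return np.where(mask, delta / density, 0.0)

def della_drop(delta, density=0.55, eps=0.2, rng=None):
    """Magnitude-scaled dropout: keep probability rises with |delta| rank,
    spread +/- `eps` around `density` (hypothetical spread parameter)."""
    if rng is None:
        rng = np.random.default_rng(0)
    ranks = np.argsort(np.argsort(np.abs(delta).ravel())).reshape(delta.shape)
    keep_prob = density - eps + 2 * eps * ranks / max(delta.size - 1, 1)
    mask = rng.random(delta.shape) < keep_prob
    return np.where(mask, delta / keep_prob, 0.0)
```

Setting `density=1.0` (and `eps=0.0` for the Della variant) keeps everything, which is a handy sanity check that the rescaling is an identity at full density.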

## Details

- License: llama3
- Instruct Format: llama-3 or ChatML
- Context Size: 8K

## Models Used

- Dusk_Rainbow
- ArliAI-Llama-3-8B-Formax-v1.0
- L3-8B-Stheno-v3.2
- Hathor_Sofit-L3-8B-v1
- Llama-3SOME-8B-v2
- Llama-3-Spellbound-Instruct-8B-0.3

## Merge Config

```yaml
merge_method: della_linear
dtype: bfloat16
parameters:
  normalize: true
  int8_mask: true
tokenizer_source: union
base_model: SicariusSicariiStuff/Dusk_Rainbow
models:
  - model: ArliAI/ArliAI-Llama-3-8B-Formax-v1.0
    parameters:
      density: 0.55
      weight: 1
  - model: Sao10K/L3-8B-Stheno-v3.2
    parameters:
      density: 0.55
      weight: 1
  - model: Nitral-AI/Hathor_Sofit-L3-8B-v1
    parameters:
      density: 0.55
      weight: 1
  - model: TheDrummer/Llama-3SOME-8B-v2
    parameters:
      density: 0.55
      weight: 1
  - model: hf-100/Llama-3-Spellbound-Instruct-8B-0.3
    parameters:
      density: 0.55
      weight: 1
```


## Use with llama.cpp

Install llama.cpp through brew (works on Mac and Linux):

```bash
brew install llama.cpp
```

Invoke the llama.cpp server or the CLI.

CLI:

```bash
llama-cli --hf-repo Triangle104/Llama-3-8B-Stroganoff-4.0-Q5_K_M-GGUF --hf-file llama-3-8b-stroganoff-4.0-q5_k_m.gguf -p "The meaning to life and the universe is"
```

Server:

```bash
llama-server --hf-repo Triangle104/Llama-3-8B-Stroganoff-4.0-Q5_K_M-GGUF --hf-file llama-3-8b-stroganoff-4.0-q5_k_m.gguf -c 2048
```

Note: You can also use this checkpoint directly through the usage steps listed in the llama.cpp repo.

Step 1: Clone llama.cpp from GitHub.

```bash
git clone https://github.com/ggerganov/llama.cpp
```

Step 2: Move into the llama.cpp folder and build it with the LLAMA_CURL=1 flag along with other hardware-specific flags (e.g. LLAMA_CUDA=1 for Nvidia GPUs on Linux).

```bash
cd llama.cpp && LLAMA_CURL=1 make
```

Step 3: Run inference through the main binary.

```bash
./llama-cli --hf-repo Triangle104/Llama-3-8B-Stroganoff-4.0-Q5_K_M-GGUF --hf-file llama-3-8b-stroganoff-4.0-q5_k_m.gguf -p "The meaning to life and the universe is"
```

or

```bash
./llama-server --hf-repo Triangle104/Llama-3-8B-Stroganoff-4.0-Q5_K_M-GGUF --hf-file llama-3-8b-stroganoff-4.0-q5_k_m.gguf -c 2048
```