Guidance on Using GGUF Files in Ollama with LLaMmlein_1B_chat_all
Dear LSX-UniWue Team,
We’ve been exploring the LLaMmlein_1B_chat_all model and encountered an issue when using the converted GGUF files in Ollama. While other models (e.g., Llama 3) use the GPUs correctly in the same Ollama instance (v0.4.6) on a DGX-2 (4x A100), LLaMmlein_1B_chat_all.gguf seems to default to CPU-only when we import it with a simple Modelfile (FROM "./LLaMmlein_1B_chat_all.gguf").
Could you provide guidance on (a) importing the GGUF into an Ollama instance and (b) configuring the model for GPU utilization?
Alternatively, would you consider releasing an out-of-the-box Ollama-compatible version of LLaMmlein?
Regards,
Simon & Richard
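The import step described above can be sketched as follows. This is a minimal sketch, assuming the GGUF file sits in the current directory; `modelfile_text` and `create_command` are hypothetical helper names, and the model name passed to `ollama create` is arbitrary. It only covers the import itself and says nothing about GPU offloading.

```python
def modelfile_text(gguf_path: str) -> str:
    """Return a minimal Modelfile that only points Ollama at a local GGUF file."""
    # No space between "./" and the file name: the path must match the file on disk.
    return f"FROM {gguf_path}\n"


def create_command(model_name: str, modelfile_path: str = "Modelfile") -> list[str]:
    """Return the `ollama create` call as an argument list (nothing is executed here)."""
    return ["ollama", "create", model_name, "-f", modelfile_path]


if __name__ == "__main__":
    # Print the Modelfile content and the command that would register the model.
    print(modelfile_text("./LLaMmlein_1B_chat_all.gguf"), end="")
    print(" ".join(create_command("LLaMmlein_1B_chat_all")))
```

Saving the printed first line as `Modelfile` and running the printed command registers the model; `ollama run --verbose <name>` then reports the timing statistics.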
Hi Simon & Richard!
We only tested the GGUFs on our Mac devices (including Ollama), and they work perfectly there :( Sadly, we are very inexperienced in correctly converting models to GGUF. That's why we were hesitant to release them at all, tbh.
> Alternatively, would you consider releasing an out-of-the-box Ollama-compatible version of LLaMmlein?
Absolutely. It seems you are a bit more experienced in the correct conversion of GGUF files? Do you have an idea where this might be failing? I thought GGUF was universal and that once a model is converted it "just works"?
Does this help? https://github.com/ollama/ollama/issues/1855#issuecomment-1881719430
Can this be specified in the Modelfile? But it's not clear to me why other models would work then 🤔
Sorry and Best,
jan
Hi Jan,
thanks for the fast reply.
Actually, no. We don't have experience with conversion (only with the hardware and default models), but we will try converting it ourselves to see if that helps. Will report back once we've tried :-)
The parameter doesn't help; we tested it before posting.
Regards from Heilbronn
Richard
Hi Richard,
I just found the script we used to convert the models; please find it attached. It follows the standard approach: the default conversion script provided by llama.cpp, with as many default parameters as possible, as far as we can tell.
We only had to add line 661 into the conversion script, but this is a small tokenizer hack and shouldn't influence any of the model weight conversions:
660 res = "chameleon"
661 res = "llama-bpe"
662 if res is None:
As I already suspected some issues might arise, I explicitly converted the models on a machine that had CUDA available :/
import subprocess

import torch
from huggingface_hub import HfApi
from peft import PeftConfig, PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

torch.manual_seed(42)

# script config
base_model_name = "LSX-UniWue/LLaMmlein_1B"
merged_repo = "LSX-UniWue/LLaMmlein_1B_alternative_formats"
device = "cuda"

for chat_adapter_name in [
    "LSX-UniWue/LLaMmlein_1B_chat_selected",
    "LSX-UniWue/LLaMmlein_1B_chat_guanako",
    "LSX-UniWue/LLaMmlein_1B_chat_alpaca",
    "LSX-UniWue/LLaMmlein_1B_chat_sharegpt",
    "LSX-UniWue/LLaMmlein_1B_chat_evol_instruct",
    "LSX-UniWue/LLaMmlein_1B_chat_all",
]:
    # load model
    config = PeftConfig.from_pretrained(chat_adapter_name)
    base_model = model = AutoModelForCausalLM.from_pretrained(
        base_model_name,
        torch_dtype=torch.bfloat16,
        device_map=device,
    )
    base_model.resize_token_embeddings(32064)
    model = PeftModel.from_pretrained(base_model, chat_adapter_name)
    tokenizer = AutoTokenizer.from_pretrained(chat_adapter_name)

    # merge the adapter into the base weights and save the full model locally
    model = model.merge_and_unload()
    model.save_pretrained(f"/tmp/{chat_adapter_name}")
    tokenizer.save_pretrained(f"/tmp/{chat_adapter_name}")

    # push the merged model to a branch named after the adapter
    model.push_to_hub(
        repo_id=merged_repo,
        revision=chat_adapter_name.split("/")[-1],
    )
    tokenizer.push_to_hub(
        repo_id=merged_repo,
        revision=chat_adapter_name.split("/")[-1],
    )
    print(merged_repo, chat_adapter_name.split("/")[-1])

    # convert the merged model to GGUF with llama.cpp
    subprocess.run(
        [
            "python",
            "llama.cpp/convert_hf_to_gguf.py",
            f"/tmp/{chat_adapter_name}",
            "--outfile",
            f"/tmp/{chat_adapter_name}.gguf",
            "--outtype",
            "bf16",
        ],
        check=True,
    )

    # upload the GGUF file to the same branch
    api = HfApi()
    api.upload_file(
        path_or_fileobj=f"/tmp/{chat_adapter_name}.gguf",
        path_in_repo=f"{chat_adapter_name.split('/')[-1]}.gguf",
        repo_id=merged_repo,
        revision=chat_adapter_name.split("/")[-1],
    )
Paging @bartowski 🙏 - do you have any intuition as to why GPU might not be working correctly in this case?
I'm mildly concerned about the line you had to add to the conversion script. Can you elaborate on why it was needed?
Otherwise, no, not really. That's super strange. I can try to download it locally and see if pure llama.cpp allows GPU offloading, but unfortunately Ollama is a bit of a black box.
My suggestions would be, first, to run it in verbose mode and, second, to try running it directly from HF rather than making your own Modelfile.
Though I don't see the GGUF files anymore, so I'm not sure what the link to run them would be, but it would be something like:
ollama run hf.co/LSX-UniWue/LLaMmlein_1B_chat_all-GGUF
ohhhh this is an adapter, not a full model..
Oh wow, thank you for your swift reply!
> I'm mildly concerned about the line you had to add to the conversion script, can you elaborate why it was needed?
I believe it's because we didn't run llama.cpp/convert_hf_to_gguf_update.py beforehand(?) Possibly not ideal, but it shouldn't really cause this issue afaict.
> run it directly from hf
Nice idea, thanks!
> Though I don't see the GGUF files anymore so not sure what the link would be to run them
https://huggingface.co./LSX-UniWue/LLaMmlein_1B_alternative_formats/tree/LLaMmlein_1B_chat_selected
> ohhhh this is an adapter, not a full model..
Yes, but the GGUF was created from the merged model, and the full model it was created from is available in the other repo. I guess this issue would also fit better in the other repo.
Thanks a lot for the hints 🙏
Thanks for your response. Yes, the issue would fit better in the other repo, but we were unsure whether the repos are actually monitored (this is often not the case in academically driven projects, so your fast responses are really appreciated!).
Actually, we tried different combinations to run the model directly from Hugging Face instead of importing our own, but it seems (at least with my limited skill set in that area) that the GGUF model files need to be on the "main" (?) branch. At least the UI differs from the description of the process provided here: https://huggingface.co./docs/hub/ollama
So if you have a pointer on how to run https://huggingface.co./LSX-UniWue/LLaMmlein_1B_alternative_formats/tree/LLaMmlein_1B_chat_selected directly from Hugging Face, it would be very much appreciated. @bartowski, if you have any hint on how to run a GGUF model directly from a branch on Hugging Face, we will try it out :-)
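One possible workaround, sketched here under assumptions: instead of running from the Hub, download the GGUF from the branch and import it locally. The sketch assumes the file is reachable through the Hub's `resolve/<revision>/<filename>` URL pattern and that the file on each branch is named after the branch; `branch_file_url` is a hypothetical helper name.

```python
from urllib.parse import quote


def branch_file_url(repo_id: str, revision: str, filename: str) -> str:
    """Build a direct-download URL for a file on a specific branch of a Hub repo."""
    # The branch (revision) and file name are URL-escaped; the repo id keeps its "/".
    return (
        f"https://huggingface.co/{repo_id}/resolve/"
        f"{quote(revision, safe='')}/{quote(filename)}"
    )


if __name__ == "__main__":
    branch = "LLaMmlein_1B_chat_selected"
    print(
        branch_file_url(
            "LSX-UniWue/LLaMmlein_1B_alternative_formats",
            branch,
            f"{branch}.gguf",  # assumption: the file is named after its branch
        )
    )
```

The printed URL can be fetched with any downloader, and the resulting file imported through a local Modelfile.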
In the meantime, we also tested it on an NVIDIA Jetson and saw the same behavior as on the DGX-2 server.
Here is some log output:
ollama-1 | llama_model_loader: loaded meta data with 30 key-value pairs and 201 tensors from /ollama/blobs/sha256-3d815935673d9baad30806d02da4a1ae9d33b67187ed02b93d7ef37be5e82893 (version GGUF V3 (latest))
ollama-1 | llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
ollama-1 | llama_model_loader: - kv 0: general.architecture str = llama
ollama-1 | llama_model_loader: - kv 1: general.type str = model
ollama-1 | llama_model_loader: - kv 2: general.name str = LLaMmlein_1B
ollama-1 | llama_model_loader: - kv 3: general.organization str = LSX UniWue
ollama-1 | llama_model_loader: - kv 4: general.size_label str = 1.1B
ollama-1 | llama_model_loader: - kv 5: llama.block_count u32 = 22
ollama-1 | llama_model_loader: - kv 6: llama.context_length u32 = 2048
ollama-1 | llama_model_loader: - kv 7: llama.embedding_length u32 = 2048
ollama-1 | llama_model_loader: - kv 8: llama.feed_forward_length u32 = 5632
ollama-1 | llama_model_loader: - kv 9: llama.attention.head_count u32 = 32
ollama-1 | llama_model_loader: - kv 10: llama.attention.head_count_kv u32 = 4
ollama-1 | llama_model_loader: - kv 11: llama.rope.freq_base f32 = 10000.000000
ollama-1 | llama_model_loader: - kv 12: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
ollama-1 | llama_model_loader: - kv 13: general.file_type u32 = 32
ollama-1 | llama_model_loader: - kv 14: llama.vocab_size u32 = 32064
ollama-1 | llama_model_loader: - kv 15: llama.rope.dimension_count u32 = 64
ollama-1 | llama_model_loader: - kv 16: tokenizer.ggml.model str = gpt2
ollama-1 | llama_model_loader: - kv 17: tokenizer.ggml.pre str = llama-bpe
ollama-1 | llama_model_loader: - kv 18: tokenizer.ggml.tokens arr[str,32064] = ["<unk>", "<s>", "</s>", "!", "\"", "...
ollama-1 | llama_model_loader: - kv 19: tokenizer.ggml.token_type arr[i32,32064] = [3, 3, 3, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
ollama-1 | llama_model_loader: - kv 20: tokenizer.ggml.merges arr[str,31770] = ["e n", "e r", "c h", "Ġ d", "e i", ...
ollama-1 | llama_model_loader: - kv 21: tokenizer.ggml.bos_token_id u32 = 32001
ollama-1 | llama_model_loader: - kv 22: tokenizer.ggml.eos_token_id u32 = 32002
ollama-1 | llama_model_loader: - kv 23: tokenizer.ggml.unknown_token_id u32 = 0
ollama-1 | llama_model_loader: - kv 24: tokenizer.ggml.padding_token_id u32 = 32000
ollama-1 | llama_model_loader: - kv 25: tokenizer.ggml.add_bos_token bool = true
ollama-1 | llama_model_loader: - kv 26: tokenizer.ggml.add_eos_token bool = false
ollama-1 | llama_model_loader: - kv 27: tokenizer.chat_template str = {% for message in messages %}{{'<|im_...
ollama-1 | llama_model_loader: - kv 28: tokenizer.ggml.add_space_prefix bool = false
ollama-1 | llama_model_loader: - kv 29: general.quantization_version u32 = 2
ollama-1 | llama_model_loader: - type f32: 45 tensors
ollama-1 | llama_model_loader: - type bf16: 156 tensors
ollama-1 | llm_load_vocab: special tokens cache size = 6
ollama-1 | llm_load_vocab: token to piece cache size = 0.2186 MB
ollama-1 | llm_load_print_meta: format = GGUF V3 (latest)
ollama-1 | llm_load_print_meta: arch = llama
ollama-1 | llm_load_print_meta: vocab type = BPE
ollama-1 | llm_load_print_meta: n_vocab = 32064
ollama-1 | llm_load_print_meta: n_merges = 31770
ollama-1 | llm_load_print_meta: vocab_only = 1
ollama-1 | llm_load_print_meta: model type = ?B
ollama-1 | llm_load_print_meta: model ftype = all F32
ollama-1 | llm_load_print_meta: model params = 1.10 B
ollama-1 | llm_load_print_meta: model size = 2.05 GiB (16.00 BPW)
ollama-1 | llm_load_print_meta: general.name = LLaMmlein_1B
ollama-1 | llm_load_print_meta: BOS token = 32001 '<|im_start|>'
ollama-1 | llm_load_print_meta: EOS token = 32002 '<|im_end|>'
ollama-1 | llm_load_print_meta: UNK token = 0 '<unk>'
ollama-1 | llm_load_print_meta: PAD token = 32000 '[PAD]'
ollama-1 | llm_load_print_meta: LF token = 129 'Ä'
ollama-1 | llm_load_print_meta: EOT token = 32002 '<|im_end|>'
ollama-1 | llm_load_print_meta: EOG token = 32002 '<|im_end|>'
ollama-1 | llm_load_print_meta: max token length = 66
ollama-1 | llama_model_load: vocab only - skipping tensors
ollama-1 | check_double_bos_eos: Added a BOS token to the prompt as specified by the model but the prompt also starts with a BOS token. So now the final prompt starts with 2 BOS tokens. Are you sure this is what you want?
ollama-1 | [GIN] 2024/11/29 - 19:36:18 | 200 | 1m39s | 127.0.0.1 | POST "/api/chat"
ollama-1 | [GIN] 2024/11/29 - 19:37:27 | 200 | 28.063µs | 127.0.0.1 | HEAD "/"
ollama-1 | [GIN] 2024/11/29 - 19:37:27 | 200 | 4.896392ms | 127.0.0.1 | POST "/api/show"
ollama-1 | [GIN] 2024/11/29 - 19:37:27 | 200 | 5.828537ms | 127.0.0.1 | POST "/api/generate"
ollama-1 | check_double_bos_eos: Added a BOS token to the prompt as specified by the model but the prompt also starts with a BOS token. So now the final prompt starts with 2 BOS tokens. Are you sure this is what you want?
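The two "- type" lines in the metadata dump summarize the tensor data types stored in the file. Since the thread eventually traces the problem to bf16, tallying those lines is a quick way to spot such a file from a log alone. A minimal sketch, assuming the log format shown above; `tensor_type_counts` is a hypothetical helper name.

```python
import re

# Matches loader summary lines such as:
#   "ollama-1 | llama_model_loader: - type bf16: 156 tensors"
_TYPE_LINE = re.compile(r"llama_model_loader: - type\s+(\w+):\s+(\d+) tensors")


def tensor_type_counts(log_text: str) -> dict[str, int]:
    """Map each tensor data type named in the log to its tensor count."""
    return {m.group(1): int(m.group(2)) for m in _TYPE_LINE.finditer(log_text)}


if __name__ == "__main__":
    sample = (
        "ollama-1 | llama_model_loader: - type f32: 45 tensors\n"
        "ollama-1 | llama_model_loader: - type bf16: 156 tensors\n"
    )
    counts = tensor_type_counts(sample)
    print(counts)
    if "bf16" in counts:
        print("bf16 tensors present; consider re-converting with f16")
```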
Here are some stats from "--verbose" for a super simple prompt with the model above:
total duration: 2m12.990857938s
load duration: 5.050211ms
prompt eval count: 16 token(s)
prompt eval duration: 3.856s
prompt eval rate: 4.15 tokens/s
eval count: 32 token(s)
eval duration: 2m9.127s
eval rate: 0.25 tokens/s
Here are the stats for a Llama 3 7b model on the same machine, in the same Docker container, for the same prompt:
total duration: 467.493963ms
load duration: 20.967386ms
prompt eval count: 15 token(s)
prompt eval duration: 124ms
prompt eval rate: 120.97 tokens/s
eval count: 34 token(s)
eval duration: 320ms
eval rate: 106.25 tokens/s
I can try to run Ollama with debug output if that would help :)
Regards, and thanks for your time!!
Richard
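The reported rates are simply the token count divided by the duration, so the two runs differ in generation speed by a factor of roughly 400 (0.25 vs. 106.25 tokens/s). A minimal sketch of that arithmetic, assuming the Go-style duration strings shown in the stats above; `parse_duration` and `tokens_per_second` are hypothetical helper names.

```python
import re

# Seconds per unit for Go-style durations such as "2m9.127s" or "320ms".
_UNIT_SECONDS = {"h": 3600.0, "m": 60.0, "s": 1.0, "ms": 1e-3, "us": 1e-6, "µs": 1e-6, "ns": 1e-9}
# "ms" is listed before "m" and "s" so that "320ms" is not read as minutes.
_PART = re.compile(r"(\d+(?:\.\d+)?)(ms|us|µs|ns|h|m|s)")


def parse_duration(text: str) -> float:
    """Convert a duration like '2m9.127s' into seconds."""
    parts = _PART.findall(text)
    # Reject input that is not made up entirely of number+unit pairs.
    if not parts or "".join(number + unit for number, unit in parts) != text:
        raise ValueError(f"unrecognized duration: {text!r}")
    return sum(float(number) * _UNIT_SECONDS[unit] for number, unit in parts)


def tokens_per_second(token_count: int, duration: str) -> float:
    """Token count divided by the duration in seconds."""
    return token_count / parse_duration(duration)


if __name__ == "__main__":
    slow = tokens_per_second(32, "2m9.127s")  # LLaMmlein GGUF run from the stats above
    fast = tokens_per_second(34, "320ms")     # Llama 3 run from the stats above
    print(f"{slow:.2f} vs. {fast:.2f} tokens/s")
```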
Short update: I converted it myself using your script approach, which resulted in a GGUF that also doesn't offload to the GPU. So I guess there is an issue in the conversion :)
Fixed it and re-uploaded all GGUF files:
janpf@p085info010013 ~/D/chat> wget https://huggingface.co./LSX-UniWue/LLaMmlein_1B_alternative_formats/resolve/LLaMmlein_1B_chat_selected/LLaMmlein_1B_chat_selected.gguf
2024-12-02 12:34:37 (40,2 MB/s) - »LLaMmlein_1B_chat_selected.gguf.1« gespeichert [2202004832/2202004832]
janpf@p085info010013 ~/D/chat> ollama create LLaMmlein_1B_chat_selected.1 -f LLaMmlein_1B_chat_selected.modelfile.1
transferring model data 100%
using existing layer sha256:e83974ba42d60f8e5b976461b426b9008ff15a8596966645a675eb9892732228
creating new layer sha256:3e9c19662c62ac69687769db7192eef3c40211b7cf758d67546e8ac644871761
using existing layer sha256:f02dd72bb2423204352eabc5637b44d79d17f109fdb510a7c51455892aa2d216
creating new layer sha256:a67d763bc0afcd7d70dbd53c041cb0a1042f515d3826734cbe000fe6b15bf56c
writing manifest
success
janpf@p085info010013 ~/D/chat> ollama run --verbose LLaMmlein_1B_chat_selected.1
>>> Hallo
Hallo! Wie kann ich Ihnen heute helfen?
total duration: 715.285669ms
load duration: 5.854764ms
prompt eval count: 11 token(s)
prompt eval duration: 649ms
prompt eval rate: 16.95 tokens/s
eval count: 11 token(s)
eval duration: 58ms
eval rate: 189.66 tokens/s
The issue was bf16. f16 works.
Oh shoot, haha, yes, BF16 isn't supported on CUDA in llama.cpp :')
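For reference, the fix amounts to changing the `--outtype` argument in the conversion call from the script earlier in the thread. A minimal sketch, assuming the same llama.cpp checkout location and /tmp paths as in that script; `build_convert_command` is a hypothetical helper name, not part of llama.cpp.

```python
import subprocess


def build_convert_command(model_dir: str, outfile: str, outtype: str = "f16") -> list[str]:
    """Build the convert_hf_to_gguf.py call; f16 replaces the original bf16."""
    return [
        "python",
        "llama.cpp/convert_hf_to_gguf.py",
        model_dir,
        "--outfile",
        outfile,
        "--outtype",
        outtype,
    ]


if __name__ == "__main__":
    name = "LSX-UniWue/LLaMmlein_1B_chat_selected"
    cmd = build_convert_command(f"/tmp/{name}", f"/tmp/{name}.gguf")
    print(" ".join(cmd))
    # Uncomment to run the conversion (needs llama.cpp and the merged model on disk):
    # subprocess.run(cmd, check=True)
```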