compiled llama.cpp from main on 2024-09-26 and got an error when loading the model

#3
by LaferriereJC - opened

Traceback (most recent call last):
  File "/home/user/text-generation-webui/modules/ui_model_menu.py", line 231, in load_model_wrapper
    shared.model, shared.tokenizer = load_model(selected_model, loader)
  File "/home/user/text-generation-webui/modules/models.py", line 93, in load_model
    output = load_func_map[loader](model_name)
  File "/home/user/text-generation-webui/modules/models.py", line 278, in llamacpp_loader
    model, tokenizer = LlamaCppModel.from_pretrained(model_file)
  File "/home/user/text-generation-webui/modules/llamacpp_model.py", line 85, in from_pretrained
    result.model = Llama(**params)
  File "/home/user/miniconda3/envs/textgen/lib/python3.10/site-packages/llama_cpp_cuda_tensorcores/llama.py", line 392, in __init__
    _LlamaContext(
  File "/home/user/miniconda3/envs/textgen/lib/python3.10/site-packages/llama_cpp_cuda_tensorcores/_internals.py", line 298, in __init__
    raise ValueError("Failed to create llama_context")
ValueError: Failed to create llama_context
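(For reference: "Failed to create llama_context" from llama-cpp-python usually means context allocation failed, often because there isn't enough free VRAM/RAM for the requested context size or GPU layer count. A hypothetical retry helper is sketched below; the `loader` callable stands in for `llama_cpp.Llama` so the sketch stays self-contained, and the fallback sizes are illustrative, not recommendations.)

```python
def load_with_ctx_fallback(loader, model_path, ctx_sizes=(8192, 4096, 2048)):
    """Try progressively smaller context sizes until the model loads.

    `loader` is any callable with the keyword interface of llama_cpp.Llama
    (e.g. loader=llama_cpp.Llama); it is injected here so the helper can be
    exercised without a real model file.
    """
    last_err = None
    for n_ctx in ctx_sizes:
        try:
            return loader(model_path=model_path, n_ctx=n_ctx)
        except ValueError as err:  # llama-cpp-python raises ValueError on context failure
            last_err = err
    raise last_err

# Hypothetical stand-in loader that only succeeds at n_ctx <= 4096:
def fake_loader(model_path, n_ctx):
    if n_ctx > 4096:
        raise ValueError("Failed to create llama_context")
    return f"loaded {model_path} with n_ctx={n_ctx}"
```

With the stand-in above, `load_with_ctx_fallback(fake_loader, "model.gguf")` fails at 8192 and succeeds at 4096.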

It looks like you're using text-generation-webui, which I think uses llama-cpp-python, not llama.cpp.

I get this too, and I'm running llamafile, which is built on llama.cpp:

warning: couldn't find nvcc (nvidia c compiler) try setting $CUDA_PATH if it's installed
prebuilt binary /zip/ggml-cuda.so not found
{"timestamp":1727533861,"level":"INFO","function":"main","line":2669,"message":"build info","build":1500,"commit":"a30b324"}
{"timestamp":1727533861,"level":"INFO","function":"main","line":2672,"message":"system info","n_threads":4,"n_threads_batch":-1,"total_threads":8,"system_info":"AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | "}
Multi Modal Mode Enabled
clip_model_load: model name: openai/clip-vit-large-patch14-336
clip_model_load: description: image encoder for LLaVA
clip_model_load: GGUF version: 3
clip_model_load: alignment: 32
clip_model_load: n_tensors: 377
clip_model_load: n_kv: 19
clip_model_load: ftype: q4_0

clip_model_load: text_encoder: 0
clip_model_load: vision_encoder: 1
clip_model_load: llava_projector: 1
clip_model_load: model size: 169.33 MB
clip_model_load: metadata size: 0.15 MB
clip_model_load: total allocated memory: 196.02 MB
llama_model_loader: loaded meta data with 35 key-value pairs and 255 tensors from /home/jim/text-generation-webui/models/Llama-3.2-3B-Instruct-Q8_0.gguf (version GGUF V3 (latest))
llama_model_loader: - tensor 0: rope_freqs.weight f32 [ 64, 1, 1, 1 ]
llama_model_loader: - tensor 1: token_embd.weight q8_0 [ 3072, 128256, 1, 1 ]
llama_model_loader: - tensor 2: blk.0.attn_norm.weight f32 [ 3072, 1, 1, 1 ]
llama_model_loader: - tensor 3: blk.0.ffn_down.weight q8_0 [ 8192, 3072, 1, 1 ]
llama_model_loader: - tensor 4: blk.0.ffn_gate.weight q8_0 [ 3072, 8192, 1, 1 ]
llama_model_loader: - tensor 5: blk.0.ffn_up.weight q8_0 [ 3072, 8192, 1, 1 ]
llama_model_loader: - tensor 6: blk.0.ffn_norm.weight f32 [ 3072, 1, 1, 1 ]
llama_model_loader: - tensor 7: blk.0.attn_k.weight q8_0 [ 3072, 1024, 1, 1 ]
llama_model_loader: - tensor 8: blk.0.attn_output.weight q8_0 [ 3072, 3072, 1, 1 ]
llama_model_loader: - tensor 9: blk.0.attn_q.weight q8_0 [ 3072, 3072, 1, 1 ]
llama_model_loader: - tensor 10: blk.0.attn_v.weight q8_0 [ 3072, 1024, 1, 1 ]
llama_model_loader: - tensor 11: blk.1.attn_norm.weight f32 [ 3072, 1, 1, 1 ]
llama_model_loader: - tensor 12: blk.1.ffn_down.weight q8_0 [ 8192, 3072, 1, 1 ]
llama_model_loader: - tensor 13: blk.1.ffn_gate.weight q8_0 [ 3072, 8192, 1, 1 ]
llama_model_loader: - tensor 14: blk.1.ffn_up.weight q8_0 [ 3072, 8192, 1, 1 ]
llama_model_loader: - tensor 15: blk.1.ffn_norm.weight f32 [ 3072, 1, 1, 1 ]
llama_model_loader: - tensor 16: blk.1.attn_k.weight q8_0 [ 3072, 1024, 1, 1 ]
llama_model_loader: - tensor 17: blk.1.attn_output.weight q8_0 [ 3072, 3072, 1, 1 ]
llama_model_loader: - tensor 18: blk.1.attn_q.weight q8_0 [ 3072, 3072, 1, 1 ]
llama_model_loader: - tensor 19: blk.1.attn_v.weight q8_0 [ 3072, 1024, 1, 1 ]
llama_model_loader: - tensor 20: blk.10.attn_norm.weight f32 [ 3072, 1, 1, 1 ]
llama_model_loader: - tensor 21: blk.10.ffn_down.weight q8_0 [ 8192, 3072, 1, 1 ]
llama_model_loader: - tensor 22: blk.10.ffn_gate.weight q8_0 [ 3072, 8192, 1, 1 ]
llama_model_loader: - tensor 23: blk.10.ffn_up.weight q8_0 [ 3072, 8192, 1, 1 ]
llama_model_loader: - tensor 24: blk.10.ffn_norm.weight f32 [ 3072, 1, 1, 1 ]
llama_model_loader: - tensor 25: blk.10.attn_k.weight q8_0 [ 3072, 1024, 1, 1 ]
llama_model_loader: - tensor 26: blk.10.attn_output.weight q8_0 [ 3072, 3072, 1, 1 ]
llama_model_loader: - tensor 27: blk.10.attn_q.weight q8_0 [ 3072, 3072, 1, 1 ]
llama_model_loader: - tensor 28: blk.10.attn_v.weight q8_0 [ 3072, 1024, 1, 1 ]
llama_model_loader: - tensor 29: blk.11.attn_norm.weight f32 [ 3072, 1, 1, 1 ]
llama_model_loader: - tensor 30: blk.11.ffn_down.weight q8_0 [ 8192, 3072, 1, 1 ]
llama_model_loader: - tensor 31: blk.11.ffn_gate.weight q8_0 [ 3072, 8192, 1, 1 ]
llama_model_loader: - tensor 32: blk.11.ffn_up.weight q8_0 [ 3072, 8192, 1, 1 ]
llama_model_loader: - tensor 33: blk.11.ffn_norm.weight f32 [ 3072, 1, 1, 1 ]
llama_model_loader: - tensor 34: blk.11.attn_k.weight q8_0 [ 3072, 1024, 1, 1 ]
llama_model_loader: - tensor 35: blk.11.attn_output.weight q8_0 [ 3072, 3072, 1, 1 ]
llama_model_loader: - tensor 36: blk.11.attn_q.weight q8_0 [ 3072, 3072, 1, 1 ]
llama_model_loader: - tensor 37: blk.11.attn_v.weight q8_0 [ 3072, 1024, 1, 1 ]
llama_model_loader: - tensor 38: blk.12.attn_norm.weight f32 [ 3072, 1, 1, 1 ]
llama_model_loader: - tensor 39: blk.12.ffn_down.weight q8_0 [ 8192, 3072, 1, 1 ]
llama_model_loader: - tensor 40: blk.12.ffn_gate.weight q8_0 [ 3072, 8192, 1, 1 ]
llama_model_loader: - tensor 41: blk.12.ffn_up.weight q8_0 [ 3072, 8192, 1, 1 ]
llama_model_loader: - tensor 42: blk.12.ffn_norm.weight f32 [ 3072, 1, 1, 1 ]
llama_model_loader: - tensor 43: blk.12.attn_k.weight q8_0 [ 3072, 1024, 1, 1 ]
llama_model_loader: - tensor 44: blk.12.attn_output.weight q8_0 [ 3072, 3072, 1, 1 ]
llama_model_loader: - tensor 45: blk.12.attn_q.weight q8_0 [ 3072, 3072, 1, 1 ]
llama_model_loader: - tensor 46: blk.12.attn_v.weight q8_0 [ 3072, 1024, 1, 1 ]
llama_model_loader: - tensor 47: blk.13.attn_norm.weight f32 [ 3072, 1, 1, 1 ]
llama_model_loader: - tensor 48: blk.13.ffn_down.weight q8_0 [ 8192, 3072, 1, 1 ]
llama_model_loader: - tensor 49: blk.13.ffn_gate.weight q8_0 [ 3072, 8192, 1, 1 ]
llama_model_loader: - tensor 50: blk.13.ffn_up.weight q8_0 [ 3072, 8192, 1, 1 ]
llama_model_loader: - tensor 51: blk.13.ffn_norm.weight f32 [ 3072, 1, 1, 1 ]
llama_model_loader: - tensor 52: blk.13.attn_k.weight q8_0 [ 3072, 1024, 1, 1 ]
llama_model_loader: - tensor 53: blk.13.attn_output.weight q8_0 [ 3072, 3072, 1, 1 ]
llama_model_loader: - tensor 54: blk.13.attn_q.weight q8_0 [ 3072, 3072, 1, 1 ]
llama_model_loader: - tensor 55: blk.13.attn_v.weight q8_0 [ 3072, 1024, 1, 1 ]
llama_model_loader: - tensor 56: blk.14.attn_norm.weight f32 [ 3072, 1, 1, 1 ]
llama_model_loader: - tensor 57: blk.14.ffn_down.weight q8_0 [ 8192, 3072, 1, 1 ]
llama_model_loader: - tensor 58: blk.14.ffn_gate.weight q8_0 [ 3072, 8192, 1, 1 ]
llama_model_loader: - tensor 59: blk.14.ffn_up.weight q8_0 [ 3072, 8192, 1, 1 ]
llama_model_loader: - tensor 60: blk.14.ffn_norm.weight f32 [ 3072, 1, 1, 1 ]
llama_model_loader: - tensor 61: blk.14.attn_k.weight q8_0 [ 3072, 1024, 1, 1 ]
llama_model_loader: - tensor 62: blk.14.attn_output.weight q8_0 [ 3072, 3072, 1, 1 ]
llama_model_loader: - tensor 63: blk.14.attn_q.weight q8_0 [ 3072, 3072, 1, 1 ]
llama_model_loader: - tensor 64: blk.14.attn_v.weight q8_0 [ 3072, 1024, 1, 1 ]
llama_model_loader: - tensor 65: blk.15.attn_norm.weight f32 [ 3072, 1, 1, 1 ]
llama_model_loader: - tensor 66: blk.15.ffn_down.weight q8_0 [ 8192, 3072, 1, 1 ]
llama_model_loader: - tensor 67: blk.15.ffn_gate.weight q8_0 [ 3072, 8192, 1, 1 ]
llama_model_loader: - tensor 68: blk.15.ffn_up.weight q8_0 [ 3072, 8192, 1, 1 ]
llama_model_loader: - tensor 69: blk.15.ffn_norm.weight f32 [ 3072, 1, 1, 1 ]
llama_model_loader: - tensor 70: blk.15.attn_k.weight q8_0 [ 3072, 1024, 1, 1 ]
llama_model_loader: - tensor 71: blk.15.attn_output.weight q8_0 [ 3072, 3072, 1, 1 ]
llama_model_loader: - tensor 72: blk.15.attn_q.weight q8_0 [ 3072, 3072, 1, 1 ]
llama_model_loader: - tensor 73: blk.15.attn_v.weight q8_0 [ 3072, 1024, 1, 1 ]
llama_model_loader: - tensor 74: blk.16.attn_norm.weight f32 [ 3072, 1, 1, 1 ]
llama_model_loader: - tensor 75: blk.16.ffn_down.weight q8_0 [ 8192, 3072, 1, 1 ]
llama_model_loader: - tensor 76: blk.16.ffn_gate.weight q8_0 [ 3072, 8192, 1, 1 ]
llama_model_loader: - tensor 77: blk.16.ffn_up.weight q8_0 [ 3072, 8192, 1, 1 ]
llama_model_loader: - tensor 78: blk.16.ffn_norm.weight f32 [ 3072, 1, 1, 1 ]
llama_model_loader: - tensor 79: blk.16.attn_k.weight q8_0 [ 3072, 1024, 1, 1 ]
llama_model_loader: - tensor 80: blk.16.attn_output.weight q8_0 [ 3072, 3072, 1, 1 ]
llama_model_loader: - tensor 81: blk.16.attn_q.weight q8_0 [ 3072, 3072, 1, 1 ]
llama_model_loader: - tensor 82: blk.16.attn_v.weight q8_0 [ 3072, 1024, 1, 1 ]
llama_model_loader: - tensor 83: blk.17.attn_norm.weight f32 [ 3072, 1, 1, 1 ]
llama_model_loader: - tensor 84: blk.17.ffn_down.weight q8_0 [ 8192, 3072, 1, 1 ]
llama_model_loader: - tensor 85: blk.17.ffn_gate.weight q8_0 [ 3072, 8192, 1, 1 ]
llama_model_loader: - tensor 86: blk.17.ffn_up.weight q8_0 [ 3072, 8192, 1, 1 ]
llama_model_loader: - tensor 87: blk.17.ffn_norm.weight f32 [ 3072, 1, 1, 1 ]
llama_model_loader: - tensor 88: blk.17.attn_k.weight q8_0 [ 3072, 1024, 1, 1 ]
llama_model_loader: - tensor 89: blk.17.attn_output.weight q8_0 [ 3072, 3072, 1, 1 ]
llama_model_loader: - tensor 90: blk.17.attn_q.weight q8_0 [ 3072, 3072, 1, 1 ]
llama_model_loader: - tensor 91: blk.17.attn_v.weight q8_0 [ 3072, 1024, 1, 1 ]
llama_model_loader: - tensor 92: blk.18.attn_norm.weight f32 [ 3072, 1, 1, 1 ]
llama_model_loader: - tensor 93: blk.18.ffn_down.weight q8_0 [ 8192, 3072, 1, 1 ]
llama_model_loader: - tensor 94: blk.18.ffn_gate.weight q8_0 [ 3072, 8192, 1, 1 ]
llama_model_loader: - tensor 95: blk.18.ffn_up.weight q8_0 [ 3072, 8192, 1, 1 ]
llama_model_loader: - tensor 96: blk.18.ffn_norm.weight f32 [ 3072, 1, 1, 1 ]
llama_model_loader: - tensor 97: blk.18.attn_k.weight q8_0 [ 3072, 1024, 1, 1 ]
llama_model_loader: - tensor 98: blk.18.attn_output.weight q8_0 [ 3072, 3072, 1, 1 ]
llama_model_loader: - tensor 99: blk.18.attn_q.weight q8_0 [ 3072, 3072, 1, 1 ]
llama_model_loader: - tensor 100: blk.18.attn_v.weight q8_0 [ 3072, 1024, 1, 1 ]
llama_model_loader: - tensor 101: blk.19.attn_norm.weight f32 [ 3072, 1, 1, 1 ]
llama_model_loader: - tensor 102: blk.19.ffn_down.weight q8_0 [ 8192, 3072, 1, 1 ]
llama_model_loader: - tensor 103: blk.19.ffn_gate.weight q8_0 [ 3072, 8192, 1, 1 ]
llama_model_loader: - tensor 104: blk.19.ffn_up.weight q8_0 [ 3072, 8192, 1, 1 ]
llama_model_loader: - tensor 105: blk.19.ffn_norm.weight f32 [ 3072, 1, 1, 1 ]
llama_model_loader: - tensor 106: blk.19.attn_k.weight q8_0 [ 3072, 1024, 1, 1 ]
llama_model_loader: - tensor 107: blk.19.attn_output.weight q8_0 [ 3072, 3072, 1, 1 ]
llama_model_loader: - tensor 108: blk.19.attn_q.weight q8_0 [ 3072, 3072, 1, 1 ]
llama_model_loader: - tensor 109: blk.19.attn_v.weight q8_0 [ 3072, 1024, 1, 1 ]
llama_model_loader: - tensor 110: blk.2.attn_norm.weight f32 [ 3072, 1, 1, 1 ]
llama_model_loader: - tensor 111: blk.2.ffn_down.weight q8_0 [ 8192, 3072, 1, 1 ]
llama_model_loader: - tensor 112: blk.2.ffn_gate.weight q8_0 [ 3072, 8192, 1, 1 ]
llama_model_loader: - tensor 113: blk.2.ffn_up.weight q8_0 [ 3072, 8192, 1, 1 ]
llama_model_loader: - tensor 114: blk.2.ffn_norm.weight f32 [ 3072, 1, 1, 1 ]
llama_model_loader: - tensor 115: blk.2.attn_k.weight q8_0 [ 3072, 1024, 1, 1 ]
llama_model_loader: - tensor 116: blk.2.attn_output.weight q8_0 [ 3072, 3072, 1, 1 ]
llama_model_loader: - tensor 117: blk.2.attn_q.weight q8_0 [ 3072, 3072, 1, 1 ]
llama_model_loader: - tensor 118: blk.2.attn_v.weight q8_0 [ 3072, 1024, 1, 1 ]
llama_model_loader: - tensor 119: blk.20.ffn_gate.weight q8_0 [ 3072, 8192, 1, 1 ]
llama_model_loader: - tensor 120: blk.20.ffn_up.weight q8_0 [ 3072, 8192, 1, 1 ]
llama_model_loader: - tensor 121: blk.20.attn_k.weight q8_0 [ 3072, 1024, 1, 1 ]
llama_model_loader: - tensor 122: blk.20.attn_output.weight q8_0 [ 3072, 3072, 1, 1 ]
llama_model_loader: - tensor 123: blk.20.attn_q.weight q8_0 [ 3072, 3072, 1, 1 ]
llama_model_loader: - tensor 124: blk.20.attn_v.weight q8_0 [ 3072, 1024, 1, 1 ]
llama_model_loader: - tensor 125: blk.3.attn_norm.weight f32 [ 3072, 1, 1, 1 ]
llama_model_loader: - tensor 126: blk.3.ffn_down.weight q8_0 [ 8192, 3072, 1, 1 ]
llama_model_loader: - tensor 127: blk.3.ffn_gate.weight q8_0 [ 3072, 8192, 1, 1 ]
llama_model_loader: - tensor 128: blk.3.ffn_up.weight q8_0 [ 3072, 8192, 1, 1 ]
llama_model_loader: - tensor 129: blk.3.ffn_norm.weight f32 [ 3072, 1, 1, 1 ]
llama_model_loader: - tensor 130: blk.3.attn_k.weight q8_0 [ 3072, 1024, 1, 1 ]
llama_model_loader: - tensor 131: blk.3.attn_output.weight q8_0 [ 3072, 3072, 1, 1 ]
llama_model_loader: - tensor 132: blk.3.attn_q.weight q8_0 [ 3072, 3072, 1, 1 ]
llama_model_loader: - tensor 133: blk.3.attn_v.weight q8_0 [ 3072, 1024, 1, 1 ]
llama_model_loader: - tensor 134: blk.4.attn_norm.weight f32 [ 3072, 1, 1, 1 ]
llama_model_loader: - tensor 135: blk.4.ffn_down.weight q8_0 [ 8192, 3072, 1, 1 ]
llama_model_loader: - tensor 136: blk.4.ffn_gate.weight q8_0 [ 3072, 8192, 1, 1 ]
llama_model_loader: - tensor 137: blk.4.ffn_up.weight q8_0 [ 3072, 8192, 1, 1 ]
llama_model_loader: - tensor 138: blk.4.ffn_norm.weight f32 [ 3072, 1, 1, 1 ]
llama_model_loader: - tensor 139: blk.4.attn_k.weight q8_0 [ 3072, 1024, 1, 1 ]
llama_model_loader: - tensor 140: blk.4.attn_output.weight q8_0 [ 3072, 3072, 1, 1 ]
llama_model_loader: - tensor 141: blk.4.attn_q.weight q8_0 [ 3072, 3072, 1, 1 ]
llama_model_loader: - tensor 142: blk.4.attn_v.weight q8_0 [ 3072, 1024, 1, 1 ]
llama_model_loader: - tensor 143: blk.5.attn_norm.weight f32 [ 3072, 1, 1, 1 ]
llama_model_loader: - tensor 144: blk.5.ffn_down.weight q8_0 [ 8192, 3072, 1, 1 ]
llama_model_loader: - tensor 145: blk.5.ffn_gate.weight q8_0 [ 3072, 8192, 1, 1 ]
llama_model_loader: - tensor 146: blk.5.ffn_up.weight q8_0 [ 3072, 8192, 1, 1 ]
llama_model_loader: - tensor 147: blk.5.ffn_norm.weight f32 [ 3072, 1, 1, 1 ]
llama_model_loader: - tensor 148: blk.5.attn_k.weight q8_0 [ 3072, 1024, 1, 1 ]
llama_model_loader: - tensor 149: blk.5.attn_output.weight q8_0 [ 3072, 3072, 1, 1 ]
llama_model_loader: - tensor 150: blk.5.attn_q.weight q8_0 [ 3072, 3072, 1, 1 ]
llama_model_loader: - tensor 151: blk.5.attn_v.weight q8_0 [ 3072, 1024, 1, 1 ]
llama_model_loader: - tensor 152: blk.6.attn_norm.weight f32 [ 3072, 1, 1, 1 ]
llama_model_loader: - tensor 153: blk.6.ffn_down.weight q8_0 [ 8192, 3072, 1, 1 ]
llama_model_loader: - tensor 154: blk.6.ffn_gate.weight q8_0 [ 3072, 8192, 1, 1 ]
llama_model_loader: - tensor 155: blk.6.ffn_up.weight q8_0 [ 3072, 8192, 1, 1 ]
llama_model_loader: - tensor 156: blk.6.ffn_norm.weight f32 [ 3072, 1, 1, 1 ]
llama_model_loader: - tensor 157: blk.6.attn_k.weight q8_0 [ 3072, 1024, 1, 1 ]
llama_model_loader: - tensor 158: blk.6.attn_output.weight q8_0 [ 3072, 3072, 1, 1 ]
llama_model_loader: - tensor 159: blk.6.attn_q.weight q8_0 [ 3072, 3072, 1, 1 ]
llama_model_loader: - tensor 160: blk.6.attn_v.weight q8_0 [ 3072, 1024, 1, 1 ]
llama_model_loader: - tensor 161: blk.7.attn_norm.weight f32 [ 3072, 1, 1, 1 ]
llama_model_loader: - tensor 162: blk.7.ffn_down.weight q8_0 [ 8192, 3072, 1, 1 ]
llama_model_loader: - tensor 163: blk.7.ffn_gate.weight q8_0 [ 3072, 8192, 1, 1 ]
llama_model_loader: - tensor 164: blk.7.ffn_up.weight q8_0 [ 3072, 8192, 1, 1 ]
llama_model_loader: - tensor 165: blk.7.ffn_norm.weight f32 [ 3072, 1, 1, 1 ]
llama_model_loader: - tensor 166: blk.7.attn_k.weight q8_0 [ 3072, 1024, 1, 1 ]
llama_model_loader: - tensor 167: blk.7.attn_output.weight q8_0 [ 3072, 3072, 1, 1 ]
llama_model_loader: - tensor 168: blk.7.attn_q.weight q8_0 [ 3072, 3072, 1, 1 ]
llama_model_loader: - tensor 169: blk.7.attn_v.weight q8_0 [ 3072, 1024, 1, 1 ]
llama_model_loader: - tensor 170: blk.8.attn_norm.weight f32 [ 3072, 1, 1, 1 ]
llama_model_loader: - tensor 171: blk.8.ffn_down.weight q8_0 [ 8192, 3072, 1, 1 ]
llama_model_loader: - tensor 172: blk.8.ffn_gate.weight q8_0 [ 3072, 8192, 1, 1 ]
llama_model_loader: - tensor 173: blk.8.ffn_up.weight q8_0 [ 3072, 8192, 1, 1 ]
llama_model_loader: - tensor 174: blk.8.ffn_norm.weight f32 [ 3072, 1, 1, 1 ]
llama_model_loader: - tensor 175: blk.8.attn_k.weight q8_0 [ 3072, 1024, 1, 1 ]
llama_model_loader: - tensor 176: blk.8.attn_output.weight q8_0 [ 3072, 3072, 1, 1 ]
llama_model_loader: - tensor 177: blk.8.attn_q.weight q8_0 [ 3072, 3072, 1, 1 ]
llama_model_loader: - tensor 178: blk.8.attn_v.weight q8_0 [ 3072, 1024, 1, 1 ]
llama_model_loader: - tensor 179: blk.9.attn_norm.weight f32 [ 3072, 1, 1, 1 ]
llama_model_loader: - tensor 180: blk.9.ffn_down.weight q8_0 [ 8192, 3072, 1, 1 ]
llama_model_loader: - tensor 181: blk.9.ffn_gate.weight q8_0 [ 3072, 8192, 1, 1 ]
llama_model_loader: - tensor 182: blk.9.ffn_up.weight q8_0 [ 3072, 8192, 1, 1 ]
llama_model_loader: - tensor 183: blk.9.ffn_norm.weight f32 [ 3072, 1, 1, 1 ]
llama_model_loader: - tensor 184: blk.9.attn_k.weight q8_0 [ 3072, 1024, 1, 1 ]
llama_model_loader: - tensor 185: blk.9.attn_output.weight q8_0 [ 3072, 3072, 1, 1 ]
llama_model_loader: - tensor 186: blk.9.attn_q.weight q8_0 [ 3072, 3072, 1, 1 ]
llama_model_loader: - tensor 187: blk.9.attn_v.weight q8_0 [ 3072, 1024, 1, 1 ]
llama_model_loader: - tensor 188: blk.20.attn_norm.weight f32 [ 3072, 1, 1, 1 ]
llama_model_loader: - tensor 189: blk.20.ffn_down.weight q8_0 [ 8192, 3072, 1, 1 ]
llama_model_loader: - tensor 190: blk.20.ffn_norm.weight f32 [ 3072, 1, 1, 1 ]
llama_model_loader: - tensor 191: blk.21.attn_norm.weight f32 [ 3072, 1, 1, 1 ]
llama_model_loader: - tensor 192: blk.21.ffn_down.weight q8_0 [ 8192, 3072, 1, 1 ]
llama_model_loader: - tensor 193: blk.21.ffn_gate.weight q8_0 [ 3072, 8192, 1, 1 ]
llama_model_loader: - tensor 194: blk.21.ffn_up.weight q8_0 [ 3072, 8192, 1, 1 ]
llama_model_loader: - tensor 195: blk.21.ffn_norm.weight f32 [ 3072, 1, 1, 1 ]
llama_model_loader: - tensor 196: blk.21.attn_k.weight q8_0 [ 3072, 1024, 1, 1 ]
llama_model_loader: - tensor 197: blk.21.attn_output.weight q8_0 [ 3072, 3072, 1, 1 ]
llama_model_loader: - tensor 198: blk.21.attn_q.weight q8_0 [ 3072, 3072, 1, 1 ]
llama_model_loader: - tensor 199: blk.21.attn_v.weight q8_0 [ 3072, 1024, 1, 1 ]
llama_model_loader: - tensor 200: blk.22.attn_norm.weight f32 [ 3072, 1, 1, 1 ]
llama_model_loader: - tensor 201: blk.22.ffn_down.weight q8_0 [ 8192, 3072, 1, 1 ]
llama_model_loader: - tensor 202: blk.22.ffn_gate.weight q8_0 [ 3072, 8192, 1, 1 ]
llama_model_loader: - tensor 203: blk.22.ffn_up.weight q8_0 [ 3072, 8192, 1, 1 ]
llama_model_loader: - tensor 204: blk.22.ffn_norm.weight f32 [ 3072, 1, 1, 1 ]
llama_model_loader: - tensor 205: blk.22.attn_k.weight q8_0 [ 3072, 1024, 1, 1 ]
llama_model_loader: - tensor 206: blk.22.attn_output.weight q8_0 [ 3072, 3072, 1, 1 ]
llama_model_loader: - tensor 207: blk.22.attn_q.weight q8_0 [ 3072, 3072, 1, 1 ]
llama_model_loader: - tensor 208: blk.22.attn_v.weight q8_0 [ 3072, 1024, 1, 1 ]
llama_model_loader: - tensor 209: blk.23.attn_norm.weight f32 [ 3072, 1, 1, 1 ]
llama_model_loader: - tensor 210: blk.23.ffn_down.weight q8_0 [ 8192, 3072, 1, 1 ]
llama_model_loader: - tensor 211: blk.23.ffn_gate.weight q8_0 [ 3072, 8192, 1, 1 ]
llama_model_loader: - tensor 212: blk.23.ffn_up.weight q8_0 [ 3072, 8192, 1, 1 ]
llama_model_loader: - tensor 213: blk.23.ffn_norm.weight f32 [ 3072, 1, 1, 1 ]
llama_model_loader: - tensor 214: blk.23.attn_k.weight q8_0 [ 3072, 1024, 1, 1 ]
llama_model_loader: - tensor 215: blk.23.attn_output.weight q8_0 [ 3072, 3072, 1, 1 ]
llama_model_loader: - tensor 216: blk.23.attn_q.weight q8_0 [ 3072, 3072, 1, 1 ]
llama_model_loader: - tensor 217: blk.23.attn_v.weight q8_0 [ 3072, 1024, 1, 1 ]
llama_model_loader: - tensor 218: blk.24.attn_norm.weight f32 [ 3072, 1, 1, 1 ]
llama_model_loader: - tensor 219: blk.24.ffn_down.weight q8_0 [ 8192, 3072, 1, 1 ]
llama_model_loader: - tensor 220: blk.24.ffn_gate.weight q8_0 [ 3072, 8192, 1, 1 ]
llama_model_loader: - tensor 221: blk.24.ffn_up.weight q8_0 [ 3072, 8192, 1, 1 ]
llama_model_loader: - tensor 222: blk.24.ffn_norm.weight f32 [ 3072, 1, 1, 1 ]
llama_model_loader: - tensor 223: blk.24.attn_k.weight q8_0 [ 3072, 1024, 1, 1 ]
llama_model_loader: - tensor 224: blk.24.attn_output.weight q8_0 [ 3072, 3072, 1, 1 ]
llama_model_loader: - tensor 225: blk.24.attn_q.weight q8_0 [ 3072, 3072, 1, 1 ]
llama_model_loader: - tensor 226: blk.24.attn_v.weight q8_0 [ 3072, 1024, 1, 1 ]
llama_model_loader: - tensor 227: blk.25.attn_norm.weight f32 [ 3072, 1, 1, 1 ]
llama_model_loader: - tensor 228: blk.25.ffn_down.weight q8_0 [ 8192, 3072, 1, 1 ]
llama_model_loader: - tensor 229: blk.25.ffn_gate.weight q8_0 [ 3072, 8192, 1, 1 ]
llama_model_loader: - tensor 230: blk.25.ffn_up.weight q8_0 [ 3072, 8192, 1, 1 ]
llama_model_loader: - tensor 231: blk.25.ffn_norm.weight f32 [ 3072, 1, 1, 1 ]
llama_model_loader: - tensor 232: blk.25.attn_k.weight q8_0 [ 3072, 1024, 1, 1 ]
llama_model_loader: - tensor 233: blk.25.attn_output.weight q8_0 [ 3072, 3072, 1, 1 ]
llama_model_loader: - tensor 234: blk.25.attn_q.weight q8_0 [ 3072, 3072, 1, 1 ]
llama_model_loader: - tensor 235: blk.25.attn_v.weight q8_0 [ 3072, 1024, 1, 1 ]
llama_model_loader: - tensor 236: blk.26.attn_norm.weight f32 [ 3072, 1, 1, 1 ]
llama_model_loader: - tensor 237: blk.26.ffn_down.weight q8_0 [ 8192, 3072, 1, 1 ]
llama_model_loader: - tensor 238: blk.26.ffn_gate.weight q8_0 [ 3072, 8192, 1, 1 ]
llama_model_loader: - tensor 239: blk.26.ffn_up.weight q8_0 [ 3072, 8192, 1, 1 ]
llama_model_loader: - tensor 240: blk.26.ffn_norm.weight f32 [ 3072, 1, 1, 1 ]
llama_model_loader: - tensor 241: blk.26.attn_k.weight q8_0 [ 3072, 1024, 1, 1 ]
llama_model_loader: - tensor 242: blk.26.attn_output.weight q8_0 [ 3072, 3072, 1, 1 ]
llama_model_loader: - tensor 243: blk.26.attn_q.weight q8_0 [ 3072, 3072, 1, 1 ]
llama_model_loader: - tensor 244: blk.26.attn_v.weight q8_0 [ 3072, 1024, 1, 1 ]
llama_model_loader: - tensor 245: blk.27.attn_norm.weight f32 [ 3072, 1, 1, 1 ]
llama_model_loader: - tensor 246: blk.27.ffn_down.weight q8_0 [ 8192, 3072, 1, 1 ]
llama_model_loader: - tensor 247: blk.27.ffn_gate.weight q8_0 [ 3072, 8192, 1, 1 ]
llama_model_loader: - tensor 248: blk.27.ffn_up.weight q8_0 [ 3072, 8192, 1, 1 ]
llama_model_loader: - tensor 249: blk.27.ffn_norm.weight f32 [ 3072, 1, 1, 1 ]
llama_model_loader: - tensor 250: blk.27.attn_k.weight q8_0 [ 3072, 1024, 1, 1 ]
llama_model_loader: - tensor 251: blk.27.attn_output.weight q8_0 [ 3072, 3072, 1, 1 ]
llama_model_loader: - tensor 252: blk.27.attn_q.weight q8_0 [ 3072, 3072, 1, 1 ]
llama_model_loader: - tensor 253: blk.27.attn_v.weight q8_0 [ 3072, 1024, 1, 1 ]
llama_model_loader: - tensor 254: output_norm.weight f32 [ 3072, 1, 1, 1 ]
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Llama 3.2 3B Instruct
llama_model_loader: - kv 3: general.finetune str = Instruct
llama_model_loader: - kv 4: general.basename str = Llama-3.2
llama_model_loader: - kv 5: general.size_label str = 3B
llama_model_loader: - kv 6: general.license str = llama3.2
llama_model_loader: - kv 7: general.tags arr[str,6] = ["facebook", "meta", "pytorch", "llam...
llama_model_loader: - kv 8: general.languages arr[str,8] = ["en", "de", "fr", "it", "pt", "hi", ...
llama_model_loader: - kv 9: llama.block_count u32 = 28
llama_model_loader: - kv 10: llama.context_length u32 = 131072
llama_model_loader: - kv 11: llama.embedding_length u32 = 3072
llama_model_loader: - kv 12: llama.feed_forward_length u32 = 8192
llama_model_loader: - kv 13: llama.attention.head_count u32 = 24
llama_model_loader: - kv 14: llama.attention.head_count_kv u32 = 8
llama_model_loader: - kv 15: llama.rope.freq_base f32 = 500000.000000
llama_model_loader: - kv 16: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 17: llama.attention.key_length u32 = 128
llama_model_loader: - kv 18: llama.attention.value_length u32 = 128
llama_model_loader: - kv 19: general.file_type u32 = 7
llama_model_loader: - kv 20: llama.vocab_size u32 = 128256
llama_model_loader: - kv 21: llama.rope.dimension_count u32 = 128
llama_model_loader: - kv 22: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 23: tokenizer.ggml.pre str = llama-bpe
llama_model_loader: - kv 24: tokenizer.ggml.tokens arr[str,128256] = ["!", """, "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 25: tokenizer.ggml.token_type arr[i32,128256] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 26: tokenizer.ggml.merges arr[str,280147] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv 27: tokenizer.ggml.bos_token_id u32 = 128000
llama_model_loader: - kv 28: tokenizer.ggml.eos_token_id u32 = 128009
llama_model_loader: - kv 29: tokenizer.chat_template str = {{- bos_token }}\n{%- if custom_tools ...
llama_model_loader: - kv 30: general.quantization_version u32 = 2
llama_model_loader: - kv 31: quantize.imatrix.file str = /models_out/Llama-3.2-3B-Instruct-GGU...
llama_model_loader: - kv 32: quantize.imatrix.dataset str = /training_dir/calibration_datav3.txt
llama_model_loader: - kv 33: quantize.imatrix.entries_count i32 = 196
llama_model_loader: - kv 34: quantize.imatrix.chunks_count i32 = 125
llama_model_loader: - type f32: 58 tensors
llama_model_loader: - type q8_0: 197 tensors
llm_load_vocab: special tokens definition check successful ( 256/128256 ).
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = BPE
llm_load_print_meta: n_vocab = 128256
llm_load_print_meta: n_merges = 280147
llm_load_print_meta: n_ctx_train = 131072
llm_load_print_meta: n_embd = 3072
llm_load_print_meta: n_head = 24
llm_load_print_meta: n_head_kv = 8
llm_load_print_meta: n_layer = 28
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_gqa = 3
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff = 8192
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 500000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx = 131072
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: model type = ?B
llm_load_print_meta: model ftype = mostly Q8_0
llm_load_print_meta: model params = 3.21 B
llm_load_print_meta: model size = 3.18 GiB (8.50 BPW)
llm_load_print_meta: general.name = Llama 3.2 3B Instruct
llm_load_print_meta: BOS token = 128000 '<|begin_of_text|>'
llm_load_print_meta: EOS token = 128009 '<|eot_id|>'
llm_load_print_meta: LF token = 128 'Ä'
llm_load_tensors: ggml ctx size = 0.10 MiB
error: create_tensor: tensor 'output.weight' not found

... while loading Llama-3.2-3B-Instruct-Q8_0.gguf
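The tensor dump above actually shows the problem: token_embd.weight is present but there is no separate output.weight tensor. A small (hypothetical) sanity check over the dumped tensor names illustrates what a loader has to detect:

```python
def uses_tied_output(tensor_names):
    """Return True if a GGUF relies on tied embedding/output weights,
    i.e. it stores token embeddings but no separate output head tensor."""
    names = set(tensor_names)
    return "token_embd.weight" in names and "output.weight" not in names

# Names abridged from the llama_model_loader dump above:
dumped = ["rope_freqs.weight", "token_embd.weight", "output_norm.weight"]
print(uses_tied_output(dumped))  # True: a runtime expecting output.weight will error out
```

A runtime that doesn't know about tied weights looks for `output.weight`, finds nothing, and fails exactly like `create_tensor: tensor 'output.weight' not found` above.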

I did manage to get the Llama 3.2 version to work with an updated ollama (using ollama's command- ollama run llama3.2) but that is their model version.

Why does it look like you're trying to load a vision model? The log mentions OpenAI CLIP.

I'm also concerned by:

warning: couldn't find nvcc (nvidia c compiler) try setting $CUDA_PATH if it's installed
prebuilt binary /zip/ggml-cuda.so not found

Can you run any other models?
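(That nvcc warning just means llamafile couldn't find the CUDA toolkit to compile its GPU module, so it falls back to CPU; it's separate from the load error. A quick hypothetical check for whether the toolkit is even visible from the environment:)

```python
import os
import shutil

# Look for nvcc on PATH, and check the CUDA_PATH override the warning mentions.
# If both come back empty, llamafile will run CPU-only.
nvcc = shutil.which("nvcc")
cuda_path = os.environ.get("CUDA_PATH")
print(f"nvcc: {nvcc}, CUDA_PATH: {cuda_path}")
```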

The llama.cpp base used by llamafile is outdated (about a month old). It doesn't support tied embedding-output weights in the llama architecture, which is why you're getting this error: GGUFs generated by newer versions of llama.cpp omit the output tensor to match this change.
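For context on what "tied embedding-output weights" means: the output projection simply reuses the token embedding matrix transposed, so the file only needs to store token_embd.weight. A minimal numpy illustration, with made-up sizes rather than Llama's real dimensions:

```python
import numpy as np

rng = np.random.default_rng(0)
n_vocab, n_embd = 16, 4

token_embd = rng.standard_normal((n_vocab, n_embd))  # the only matrix stored in the GGUF
hidden = rng.standard_normal((1, n_embd))            # final hidden state for one token

# Untied: a separate output.weight of shape (n_vocab, n_embd) would also be stored.
# Tied: the loader reuses the embeddings as the output head, so the logits are:
logits = hidden @ token_embd.T   # shape (1, n_vocab)
print(logits.shape)  # (1, 16)
```

An older runtime that predates this convention still expects a stored output tensor, hence the failure above.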
