can't quantize it to f16

#4
by owao - opened

Hey guys, I've read your blog post and am interested in trying out your model, as I'm a big fan of Qwen2.5.
However, I can't manage to convert it to f16 using llama.cpp (latest commit, main).
I don't know what's going on: llama.cpp outputs no errors, but it runs for only a few seconds and the output file is 5.9MB.

~/llama.cpp ❯❯❯ python convert_hf_to_gguf.py --outtype f16 model/OpenPipe/Deductive-Reasoning-Qwen-32B/                                                                                              (nexa)  master
INFO:hf-to-gguf:Loading model: Deductive-Reasoning-Qwen-32B
INFO:gguf.gguf_writer:gguf: This GGUF file is for Little Endian only
INFO:hf-to-gguf:Exporting model...
INFO:hf-to-gguf:Set meta model
INFO:hf-to-gguf:Set model parameters
INFO:hf-to-gguf:gguf: context length = 32768
INFO:hf-to-gguf:gguf: embedding length = 5120
INFO:hf-to-gguf:gguf: feed forward length = 27648
INFO:hf-to-gguf:gguf: head count = 40
INFO:hf-to-gguf:gguf: key-value head count = 8
INFO:hf-to-gguf:gguf: rope theta = 1000000.0
INFO:hf-to-gguf:gguf: rms norm epsilon = 1e-06
INFO:hf-to-gguf:gguf: file type = 1
INFO:hf-to-gguf:Set model tokenizer
INFO:gguf.vocab:Adding 151387 merge(s).
INFO:gguf.vocab:Setting special token type eos to 151645
INFO:gguf.vocab:Setting special token type pad to 151643
INFO:gguf.vocab:Setting special token type bos to 151643
INFO:gguf.vocab:Setting add_bos_token to False
INFO:gguf.vocab:Setting chat_template to {%- if tools %}
    {{- '<|im_start|>system\n' }}
    {%- if messages[0]['role'] == 'system' %}
        {{- messages[0]['content'] }}
    {%- else %}
        {{- 'You are Qwen, created by Alibaba Cloud. You are a helpful assistant.' }}
    {%- endif %}
    {{- "\n\n# Tools\n\nYou may call one or more functions to assist with the user query.\n\nYou are provided with function signatures within <tools></tools> XML tags:\n<tools>" }}
    {%- for tool in tools %}
        {{- "\n" }}
        {{- tool | tojson }}
    {%- endfor %}
    {{- "\n</tools>\n\nFor each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:\n<tool_call>\n{\"name\": <function-name>, \"arguments\": <args-json-object>}\n</tool_call><|im_end|>\n" }}
{%- else %}
    {%- if messages[0]['role'] == 'system' %}
        {{- '<|im_start|>system\n' + messages[0]['content'] + '<|im_end|>\n' }}
    {%- else %}
        {{- '<|im_start|>system\nYou are Qwen, created by Alibaba Cloud. You are a helpful assistant.<|im_end|>\n' }}
    {%- endif %}
{%- endif %}
{%- for message in messages %}
    {%- if (message.role == "user") or (message.role == "system" and not loop.first) or (message.role == "assistant" and not message.tool_calls) %}
        {{- '<|im_start|>' + message.role + '\n' + message.content + '<|im_end|>' + '\n' }}
    {%- elif message.role == "assistant" %}
        {{- '<|im_start|>' + message.role }}
        {%- if message.content %}
            {{- '\n' + message.content }}
        {%- endif %}
        {%- for tool_call in message.tool_calls %}
            {%- if tool_call.function is defined %}
                {%- set tool_call = tool_call.function %}
            {%- endif %}
            {{- '\n<tool_call>\n{"name": "' }}
            {{- tool_call.name }}
            {{- '", "arguments": ' }}
            {{- tool_call.arguments | tojson }}
            {{- '}\n</tool_call>' }}
        {%- endfor %}
        {{- '<|im_end|>\n' }}
    {%- elif message.role == "tool" %}
        {%- if (loop.index0 == 0) or (messages[loop.index0 - 1].role != "tool") %}
            {{- '<|im_start|>user' }}
        {%- endif %}
        {{- '\n<tool_response>\n' }}
        {{- message.content }}
        {{- '\n</tool_response>' }}
        {%- if loop.last or (messages[loop.index0 + 1].role != "tool") %}
            {{- '<|im_end|>\n' }}
        {%- endif %}
    {%- endif %}
{%- endfor %}
{%- if add_generation_prompt %}
    {{- '<|im_start|>assistant\n' }}
{%- endif %}

INFO:hf-to-gguf:Set model quantization version
INFO:gguf.gguf_writer:Writing the following files:
INFO:gguf.gguf_writer:model/OpenPipe/Deductive-Reasoning-Qwen-32B/Deductive-Reasoning-Qwen-32B-F16.gguf: n_tensors = 0, total_size = negligible - metadata only
Writing: 0.00byte [00:00, ?byte/s]
INFO:hf-to-gguf:Model successfully exported to model/OpenPipe/Deductive-Reasoning-Qwen-32B/Deductive-Reasoning-Qwen-32B-F16.gguf

I should also add that https://huggingface.co./spaces/ggml-org/gguf-my-repo runs into the same issue, producing the same 5.9MB gguf regardless of the chosen quant type.
So it seems there is something wrong with how llama.cpp handles this specific model.
Would you have any idea what the root cause of this issue might be?

Thanks in advance for any response :)

@owao

To repair for use with llama.cpp / create GGUFs:

  • Rename all of the model's safetensors files, removing the "ft-" prefix.
  • Fix "model.safetensors.index.json" => remove the "ft-" prefix from all entries (search/replace in Notepad, or see the sketch below).

Likely the same issue with the 14B model too (?) - same format.
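
A minimal shell sketch of the two steps above (assuming the shards carry a literal "ft-" prefix and you are working from a local copy of the repo; sed -i needs an extra '' argument on macOS):

cd model/OpenPipe/Deductive-Reasoning-Qwen-32B/
# strip the "ft-" prefix from every safetensors shard
for f in ft-*.safetensors; do mv "$f" "${f#ft-}"; done
# remove the "ft-" prefix from every entry in the weight map as well
sed -i 's/ft-//g' model.safetensors.index.json

After this, convert_hf_to_gguf.py should pick up the tensors again instead of writing a metadata-only file (the "n_tensors = 0" line in the log above).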

Operation:

Tested a Q2_K quant in LM Studio (command sketch below), with this system prompt:

You are a deep thinking AI, you may use extremely long chains of thought to deeply consider the problem and deliberate with yourself via systematic reasoning processes to help come to a correct solution prior to answering. You should enclose your thoughts and internal monologue inside tags, and then provide your solution or response to the problem.

Seemed to work well with the model (using the Jinja template).
Higher temperatures seemed to invoke more reasoning.
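
For reference, once the rename fix is in and the F16 conversion succeeds, a Q2_K file like the one tested here can be produced with llama.cpp's quantize tool (a sketch; the output path is illustrative, and the binary is just "quantize" in older builds):

python convert_hf_to_gguf.py --outtype f16 model/OpenPipe/Deductive-Reasoning-Qwen-32B/
./llama-quantize model/OpenPipe/Deductive-Reasoning-Qwen-32B/Deductive-Reasoning-Qwen-32B-F16.gguf Deductive-Reasoning-Qwen-32B-Q2_K.gguf Q2_K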

OpenPipe:

This naming issue will severely limit who can use your model.
Quanters like Mradermacher will not pick it up; likewise, "GGUF my repo" will crash and burn.

After it (and the 14B?) is fixed, submit a ticket at Mradermacher's repo to auto-quant the model to GGUF.
They will create the 32B in GGUF and GGUF-imatrix (and the 14B too).
