Requesting GGUF version

#16
by Hasaranga85 - opened

Hi, can you please provide a GGUF version of this model so I can use it with Ollama?

Owner

Hi, thanks for reaching out! Yes, I can take a pass at this 'manually' over the next few days - I tried the gguf-my-repo space and ran into issues. If that ends up working for you, do let me know.

Check out the Q6_K quant in this repo (here).

Edit: more options here: https://hf.co/pszemraj/flan-t5-large-grammar-synthesis-gguf

pszemraj changed discussion status to closed

tried "ggml-model-Q5_K_M.gguf" with llamacpp and it is repeating the system prompt.

https://hf.co/pszemraj/flan-t5-large-grammar-synthesis-gguf

Owner

Please review the demo code and/or the Colab notebook linked on the model card. This is a text2text model: it does not use a system prompt of any kind, and you cannot chat with it.

It does one thing and one thing only: whatever text you put in will be grammatically corrected (this is what it's doing with your "system prompt").
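For reference, the usage is along these lines (a minimal transformers sketch, assuming the original non-GGUF checkpoint pszemraj/flan-t5-large-grammar-synthesis; the model card demo may use different generation settings):

from transformers import pipeline

# load the grammar-correction model as a text2text pipeline
corrector = pipeline(
    "text2text-generation",
    model="pszemraj/flan-t5-large-grammar-synthesis",
)

raw_text = "There car broke down so their hitching a ride to they're class."
# whatever goes in comes back grammatically corrected - no chat, no system prompt
print(corrector(raw_text)[0]["generated_text"])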

Thank you very much for your explanation. So this model cannot be used with llama-cli, right?

Hey - you should be able to run inference with it, just not as a "chat interface". Try just passing a prompt, which will then be corrected. An analogue would be what I have for flan-ul2 (different framework and model, but same idea).

Owner

Quick update: you can use the GGUFs with llamafile (or llama-cli) like this:

llamafile.exe -m grammar-synthesis-Q6_K.gguf --temp 0 -p "There car broke down so their hitching a ride to they're class."

and it will output the corrected text:

system_info: n_threads = 4 / 8 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |
sampling:
        repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
        top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.000
        mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampling order:
CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temperature
generate: n_ctx = 8192, n_batch = 2048, n_predict = -1, n_keep = 0


 The car broke down so they had to take a ride to school. [end of text]


llama_print_timings:        load time =     782.21 ms
llama_print_timings:      sample time =       0.23 ms /    16 runs   (    0.01 ms per token, 68376.07 tokens per second)
llama_print_timings: prompt eval time =      85.08 ms /    19 tokens (    4.48 ms per token,   223.33 tokens per second)
llama_print_timings:        eval time =     341.74 ms /    15 runs   (   22.78 ms per token,    43.89 tokens per second)
llama_print_timings:       total time =     456.56 ms /    34 tokens
Log end
pszemraj pinned discussion

That worked. Thanks!
