Instruct versus non-Instruct

#8
by BigDeeper - opened

Are Instruct versions better for use with agents like gpt-pilot? How is function calling in 3-70B?

Yes.
I couldn't run the 70B at anything bigger than Q3 on my setup, and it's garbage.
I am working on a llama-cpp-python prompt template for function calling for Llama 3.
I am getting good results with the 8B: https://github.com/themrzmaster/llama-cpp-python

I am actually going to test this at 16-bit for function calling (Instruct only at first) this weekend. I'll update you here.

Using the Llama 3 template in LM Studio and setting the eos_token_id to 128009 did not help (the model outputs garbage). The 8B version works. Any ideas?

I am downloading the Q2 in LM Studio and will get back to you. They were both made with the same Llama.cpp build, so they should either both work or both fail. I'll test and come back to you.

I am testing directly with llama.cpp/main and it outputs responsive content for a while and then starts producing garbage.

p.s. using the 6 bit version.

Here is my 70B Q2 that I just downloaded:
image.png

It stops correctly. This is the latest release of LM Studio, from last night.

I found this change to llama.cpp:

https://github.com/ggerganov/llama.cpp/pull/6745
https://github.com/ggerganov/llama.cpp/pull/6745/files

Here are some files that used this method:
https://huggingface.co/lmstudio-community/Meta-Llama-3-70B-Instruct-GGUF

As you can see from the 8B discussions and my 70B screenshots, the quants for these models work perfectly. The changes in that PR are meant to make it easier for people who convert models to pick the right BPE, and to add something to Llama.cpp for Llama 3.

But you don't have to go with the Llama.cpp default template taken from the tokenizer; you can provide it yourself, as we do in LM Studio or manually in llama.cpp, and it all works without any issue.

I will download the one I want and try again; maybe I downloaded from another user... =). But others seem to have the same problem, so I am confused... I got the lmstudio-community one to work, at least.

Downloading and using Meta-Llama-3-70B-Instruct.IQ3_XS.gguf:

Skärmbild 2024-04-19 213510.png

Fixing the eos_token_id to 128009:

Skärmbild 2024-04-19 214551.png

Using a GGUF file from lmstudio-community:

Skärmbild 2024-04-19 214644.png

That's strange! Also, I don't do the "Fixing the eos_token_id to 128009" part.

  • Fresh LM Studio,
  • and any GGUF model

image.png

I have 0.2.20 as well. I am downloading a non-"I" quant. Can it be that different quants have different problems?

(but you said that you could load any gguf....)

That is possible! I don't usually try IQ models and go with _S or _M. Let me know how it goes; if the IQ quant is not good, I'll make another one.

Loading the Meta-Llama-3-70B-Instruct.Q3_K_S.gguf:

Skärmbild 2024-04-19 224108.png

Fixing the eos problem:

  • Loading: Meta-Llama-3-70B-Instruct.Q3_K_S.gguf
  • Preparing to change field 'tokenizer.ggml.eos_token_id' from 128001 to 128009
    *** Warning *** Warning *** Warning **
  • Changing fields in a GGUF file can make it unusable. Proceed at your own risk.
  • Enter exactly YES if you are positive you want to proceed:
    YES, I am sure> YES
  • Field changed. Successful completion.

Skärmbild 2024-04-19 225807.png

Seems to be the imatrix quants that are problematic?

Interesting! I'll make an imatrix this weekend and redo the I quants with that again.

Could you please try this on the existing GGUF models?

./llama.cpp/main -m Meta-Llama-3-70B-Instruct.Q2_K.gguf -r '<|eot_id|>' --in-prefix "\n<|start_header_id|>user<|end_header_id|>\n\n" --in-suffix "<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n" -p "<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nYou are a helpful, smart, kind, and efficient AI assistant. You always fulfill the user's requests to the best of your ability.<|eot_id|>\n<|start_header_id|>user<|end_header_id|>\n\nHi!<|eot_id|>\n<|start_header_id|>assistant<|end_header_id|>\n\n" -n 1024

image.png

The modifications introduced in that PR were meant to fix some issues in converting the model to GGUF, not the prompt template / tokenizer.

I get a trivial "| was unexpected at this time." error.

I usually don't play with these tools, so I am a bit lost.

Here is a quick demo to show how to use it and you can see the response: https://colab.research.google.com/drive/1HD-_evvGo1l1B-imVfQP7BKfDe-7BbE-?usp=sharing

Hmm... I changed '<|eot_id|>' to "<|eot_id|>" instead, and it seems to work. Please note that I used my GGUF with the fixed EOS.

Skärmbild 2024-04-20 112317.png

So it works from the command line....

Hmm this was the K_S that worked all the time.....

The broken one (eos fixed) IQ3_XS:

Skärmbild 2024-04-20 113807.png

I have removed all the IQ quants from all my GGUF repos. I forgot to do an imatrix, so their quality was not good. The rest have been tested in Llama.cpp, llama-cpp-python, and LM Studio without changing metadata (with an actual prompt), similar to the demo in Colab.

I am currently running gpt-pilot with the ollama import of Llama-3-70B. The model as imported is about 40GB. Ollama is able to distribute it over 4 GPUs with 12.2GiB VRAM each.

It seems to be "working" in the sense that it does similar things to what I saw when I ran gpt-pilot with the OpenAI API. Sometimes it starts outputting junk, on the screen and into files, and I have to cancel and restart for it to behave reasonably again. The key here is that gpt-pilot is able to create files and is mostly doing reasonable things. What I don't know is whether any of this code will work or not.

@SvenJoinH I have now uploaded 5 new IQ quants for 70B based on the imatrix. They are pretty good, even the IQ-1_S, which is the smallest quant.

I am downloading now. If they work, I (and others, I assume) will be very thankful for your effort.

Thank you so much!

I've been trying an IQ2-XS quant from some other repo, and it indeed works eerily well. I can't wait to get my hands on a functioning IQ3-XS as well, much appreciated!

I have tested it both locally and in LM Studio. The one before was made without any imatrix so all the IQ- quants were bad. These ones are tested and they are up for the task:

GLoIHgxWYAI0OaJ.jpeg

The same problem is happening to me with the Q4_M: it's unable to stop, so I don't think it's a problem with IQ quants only.

@yehiaserag depending on where you use these GGUF models, if you don't follow the correct template it fails to stop. Here is a live demo with the smallest Q2 GGUF, downloaded right in the Colab, and you can see the response stopped perfectly fine. The important part is, I used the correct chat template and didn't rely on Llama.cpp to provide one for me:

https://colab.research.google.com/drive/1HD-_evvGo1l1B-imVfQP7BKfDe-7BbE-?usp=sharing

Do I understand correctly that the error was/is due to an incorrect EOS being defined? The Q4_K_S model still shows EOS token = 128001 '<|end_of_text|>', which is incorrect; it should be <|eot_id|>.

At the application level, these tools all stop generating when they see the <|eot_id|> string. If you blindly use eos_token_id as the stop condition, then yes, that must be fixed. But if you just say "I know <|eot_id|> is the stop string, so stop when you see it", then everything should be fine (my example, LM Studio, etc.).
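To illustrate the stop-string idea, here is a minimal sketch in plain Python. It is not how any particular application implements it; the helper name truncate_at_stop and the sample text are made up for illustration. It only shows that cutting the output at a known marker does not depend on which EOS id is stored in the GGUF metadata.

```python
def truncate_at_stop(text, stop_strings):
    """Return `text` cut at the earliest occurrence of any stop string."""
    cut = len(text)
    for stop in stop_strings:
        idx = text.find(stop)
        if idx != -1:
            cut = min(cut, idx)
    return text[:cut]


# Llama 3 end-of-turn and end-of-text markers, used here as stop strings.
STOP_STRINGS = ["<|eot_id|>", "<|end_of_text|>"]

# Hypothetical raw output from a model that kept generating past its turn.
raw = (
    "I'm doing well, thanks!<|eot_id|>"
    "<|start_header_id|>assistant<|end_header_id|>\n\nmore text"
)

print(truncate_at_stop(raw, STOP_STRINGS))  # prints: I'm doing well, thanks!
```

A real application would normally check for the marker while streaming tokens and stop generating at that point, rather than trimming afterwards.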

So what is the final and correct way to prompt these Q models? I'd like to use ollama to import a quantized model. Should the system prompt be used only once, or with every single query? Should "<|end_of_text|>" be used at all if it is interactive?

Thanks.

This is how I use it in Llama.cpp (in other apps the stop strings are set to ["<|eot_id|>", "<|end_of_text|>"] and it stops perfectly regardless of what's in the tokenizer):

'<|eot_id|>' --in-prefix "\n<|start_header_id|>user<|end_header_id|>\n\n" --in-suffix "<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n" -p "<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nYou are a helpful, smart, kind, and efficient AI assistant. You always fulfill the user's requests to the best of your ability.<|eot_id|>\n<|start_header_id|>user<|end_header_id|>\n\nHi! How are you?<|eot_id|>\n<|start_header_id|>assistant<|end_header_id|>\n\n" -n 1024

https://colab.research.google.com/drive/1HD-_evvGo1l1B-imVfQP7BKfDe-7BbE-?usp=sharing#scrollTo=R88QtCrMUraW

Ok, but is it a single shot or interactive use? For a single shot you would pass your system prompt to the model every time. If one uses the model interactively, you want to pass the system prompt once. Also if one is using the model interactively, how does one end the assistant output? Or do you just leave it open?

You follow the official template, always. At some point, when the model thinks the generation is over, it will generate the eos_token. If the software supports stop_strings or stop_sequence_strings, it will stop the generation when it sees that token (which is the only way to stop any LLM from continuing to generate).

This is a multi-turn:

image.png
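As a rough sketch of what following the template means for multi-turn use, the helper below assembles a Llama 3 prompt from a list of messages. The function name build_llama3_prompt and the sample conversation are hypothetical. The layout follows the special tokens used in this thread; note that it puts no newline between <|eot_id|> and the next <|start_header_id|>, unlike the -p string in the command above, which is an assumption based on the in-suffix and the Ollama template posted here.

```python
def build_llama3_prompt(messages, add_generation_prompt=True):
    """Assemble a Llama 3 Instruct prompt from {'role', 'content'} dicts."""
    parts = ["<|begin_of_text|>"]
    for message in messages:
        # Every turn, including earlier assistant replies, ends with <|eot_id|>.
        parts.append(
            f"<|start_header_id|>{message['role']}<|end_header_id|>\n\n"
            f"{message['content'].strip()}<|eot_id|>"
        )
    if add_generation_prompt:
        # Leave the assistant header open so the model writes the next reply.
        parts.append("<|start_header_id|>assistant<|end_header_id|>\n\n")
    return "".join(parts)


# Hypothetical multi-turn conversation: the system prompt appears only once.
conversation = [
    {"role": "system", "content": "You are a helpful, smart, kind, and efficient AI assistant."},
    {"role": "user", "content": "Hi!"},
    {"role": "assistant", "content": "Hello! How can I help?"},
    {"role": "user", "content": "How are you?"},
]

prompt = build_llama3_prompt(conversation)
print(prompt)
```

In interactive use the assistant turn is left open at the end, and the generation is ended by the model emitting <|eot_id|>, which the application treats as a stop string.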

Perhaps, it's best if you can share what it is you are doing so we can test it on our side and see why it fails

Hello, can you make IQ quants of Non-Instruct 70B as well? I really want to test it but my own attempt of doing it failed.

Hi,
Just out of curiosity, what are the use cases of the base model in GGUF? Can you fine-tune based on GGUF quant models?

No, it isn't very useful, I'm just curious how it reacts to various prompting. It also seems uncensored unlike Instruct, in the case of 8B that is.

That's OK, let me see what others have done in the last few days. I'll do the ones that are missing :)

I suspect it's because it's a base model. The IQ-1 GGUF here actually works surprisingly well!

How did you upload Q6 and Q8? I am trying to upload, getting a 50GB limit error. Can you share the script to split the shards?

@aaditya you have to split anything that is larger than the 48G limit of Hugging Face. You can do that with a simple split/cat on Linux, or use the native split/merge in Llama.cpp: https://github.com/ggerganov/llama.cpp/discussions/6404
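For anyone without split/cat at hand, the same raw byte split and merge can be sketched in Python. This is only an illustration of the split/cat approach, not the native Llama.cpp splitter; the function names, the .partNNN suffix, and the tiny sizes in the demo are all made up. Parts produced this way must be concatenated back into one file before the model can be loaded.

```python
import os
import shutil
import tempfile

BUFFER_SIZE = 1024 * 1024  # copy in 1 MiB blocks so large files never sit in memory


def split_file(path, max_part_bytes):
    """Split `path` into numbered parts of at most `max_part_bytes` bytes each."""
    part_paths = []
    with open(path, "rb") as src:
        index = 0
        while True:
            first = src.read(min(BUFFER_SIZE, max_part_bytes))
            if not first:
                break  # nothing left to write, so no empty part is created
            part_path = f"{path}.part{index:03d}"
            with open(part_path, "wb") as dst:
                dst.write(first)
                written = len(first)
                while written < max_part_bytes:
                    block = src.read(min(BUFFER_SIZE, max_part_bytes - written))
                    if not block:
                        break
                    dst.write(block)
                    written += len(block)
            part_paths.append(part_path)
            index += 1
    return part_paths


def merge_files(part_paths, out_path):
    """Concatenate the parts, in the given order, into `out_path`."""
    with open(out_path, "wb") as dst:
        for part_path in part_paths:
            with open(part_path, "rb") as src:
                shutil.copyfileobj(src, dst, BUFFER_SIZE)


# Small demo with a 2500-byte stand-in file and a 1000-byte part limit.
with tempfile.TemporaryDirectory() as tmp:
    original = os.path.join(tmp, "model.gguf")
    with open(original, "wb") as f:
        f.write(os.urandom(2500))
    parts = split_file(original, max_part_bytes=1000)
    merged = os.path.join(tmp, "merged.gguf")
    merge_files(parts, merged)
    with open(original, "rb") as a, open(merged, "rb") as b:
        print(len(parts), a.read() == b.read())
```

For a real upload, max_part_bytes would be set somewhere below the hosting limit mentioned above.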

I posted some tests in the discussion. It looks like something is wrong with the imatrix.
I created an issue on GitHub:
https://github.com/ggerganov/llama.cpp/issues/6841

Hi, I encountered the problem of generation not stopping with models smaller than 24GB in LM Studio. I managed to resolve the issue by adding the word 'assistant' as a stop string.
llama3.png

Oh yes! You actually can finetune gguf quant models with llama.cpp

image.png

image.png

https://github.com/ggerganov/llama.cpp/issues/6804
Could be relevant, it seems that imatrix has some issues after all.

Could be. I only use imatrix for IQ- quants and not for Q- quants. I checked the IQ- models myself, and they worked very well (but I had the correct prompt template).

There is some problem with these quantized models. I was using the 6bit version, and no matter what format I was using for the prompt, it was getting into infinite loops. So I had to switch over to another repo.
I was using ollama to serve it.

The stuff below does NOT loop infinitely.

FROM /opt/data/QuantFactory/Meta-Llama-3-70B-Instruct-GGUF/Meta-Llama-3-70B-Instruct.Q5_K_M.gguf

TEMPLATE """{{ if .System }}<|start_header_id|>system<|end_header_id|>

{{ .System }}<|eot_id|>{{ end }}{{ if .Prompt }}<|start_header_id|>user<|end_header_id|>

{{ .Prompt }}<|eot_id|>{{ end }}<|start_header_id|>assistant<|end_header_id|>

{{ .Response }}<|eot_id|>"""

PARAMETER num_ctx 8192
PARAMETER temperature 0.2
PARAMETER num_gpu 73

PARAMETER stop "<|eot_id|>"
PARAMETER stop "<|end_of_text|>"
PARAMETER stop "<|start_header_id|>"
PARAMETER stop "<|end_header_id|>"
PARAMETER stop "<|begin_of_text|>"

SYSTEM "You are a helpful AI which can plan, program, and test."

@BigDeeper
I don't know about Ollama, but it works fine in Llama.cpp (latest) and LM Studio (latest). Just make sure you have the latest version of these applications.

MaziyarPanahi changed discussion status to closed
