Unfiltered model?
When you say "filtered", do you mean that your model was trained on a dataset without the moralizing nonsense in it?
If that's the case, can you provide us with the quantized model in the safetensors format? I want to try it on the oobabooga webui :D
Well yeah, and this model is quantized so you can use it right away.
It's quantized to 4-bit.
@ShreyasBrill is there a way to convert this to pth format? I am relatively new to this, but am trying to do this on my M1 Max MacBook using LLaMA_MPS, as it runs faster on Apple Silicon than the oobabooga webui. Will try with that as well though :)
@ShreyasBrill no, I mean I'd want the model quantized in the safetensors format so that I can use it here: https://github.com/oobabooga/text-generation-webui
@panayao yes, you can convert it back to pth format! You can download the Vicuna model, download/clone this repository, and use the convert-ggml-to-pth.py file https://github.com/ggerganov/llama.cpp/blob/master/convert-ggml-to-pth.py easy and simple! And it does work with M1 Macs, I guess. This version is currently not very stable because the official Vicuna was released like 2 days ago and, you know, it's just starting up. I might update the model when an update is released so that the models become more stable in their responses.
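If you want to double-check the result after the script runs, a .pth checkpoint is just a pickled dict of tensors, so something like this should list what's in it (the filename is only an example, use whatever the script writes out):
import torch
# load the converted checkpoint; a LLaMA-style .pth is a pickled dict of tensors
state_dict = torch.load("consolidated.00.pth", map_location="cpu")
# print a few tensor names and shapes to confirm the conversion looks sane
for name, tensor in list(state_dict.items())[:5]:
    print(name, tuple(tensor.shape), tensor.dtype)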
@TheYuriLover
Hmmm, I don't know how to do that. You could download the model, convert it to .pth format, and then use some other library to quantize it to the safetensors format!
Again, you can also use https://github.com/ggerganov/llama.cpp/blob/master/convert-ggml-to-pth.py to convert the model to a different format!
@ShreyasBrill nah, you have to use this https://github.com/qwopqwop200/GPTQ-for-LLaMa/tree/triton to convert the HF models into safetensors, but it requires a lot of VRAM and unfortunately I don't have a big enough graphics card to do it :'(
I created Vicuna using only my CPU. I don't have a GPU either :(
@ShreyasBrill Do you have the original 16-bit HF model though? You should also upload it so people can quantize it to the safetensors format.
@TheYuriLover I'll check; if I have it, I'll upload it. Let me see.
@ShreyasBrill thanks man, I appreciate it! :D
@TheYuriLover Wait!!!! I think I am quantizing the model into safetensors now. I'll upload it and message you here once it's done. You can use it in the oobabooga webui.
@ShreyasBrill
If you quantize it into safetensors, do it with both versions, CUDA and Triton, and use all the implementations as well (true sequential + act_order + groupsize 128).
https://github.com/qwopqwop200/GPTQ-for-LLaMa/tree/triton
https://github.com/qwopqwop200/GPTQ-for-LLaMa/tree/cuda
If you can only convert one of them, use Triton then; it gives the fastest output speed!
@TheYuriLover it's currently getting quantized to vicuna-13b-GPTQ-4bit-128g now
@ShreyasBrill
Did you use the CUDA or the Triton conversion? I hope it's Triton, because for the moment we don't know how to make the CUDA model run on the webui.
Did you add the other implementations? You wrote vicuna-13b-GPTQ-4bit-128g, but does it have true_sequential and act_order?
CUDA_VISIBLE_DEVICES=0 python llama.py ./llama-hf/llama-7b c4 --wbits 4 --true-sequential --act-order --groupsize 128 --save_safetensors llama7b-4bit-128g.safetensors
@TheYuriLover Ah, this took forever dude. Check the safetensors folder and download the Vicuna model from it, then when starting up the webui use these flags and it will start: "--wbits 4 --groupsize 128 --model_type llama"
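In other words, from the text-generation-webui folder it should be roughly something like this (the folder name after --model is only an example, use whatever you named the model directory):
python server.py --model vicuna-13b-GPTQ-4bit-128g --wbits 4 --groupsize 128 --model_type llama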
@ShreyasBrill Thanks dude!! We really appreciate it :D
But you didn't answer my questions from before; can you please respond to these:
"You used cuda or triton convertion? I hope it's triton because for the moment we don't know how to make the cuda model run on the webui
Did you add the other implementations? you wrote vicuna-13b-GPTQ-4bit-128g , but does it have true_sequential and act_order?
CUDA_VISIBLE_DEVICES=0 python llama.py ./llama-hf/llama-7b c4 --wbits 4 --true-sequential --act-order --groupsize 128 --save_safetensors llama7b-4bit-128g.safetensors"
@ShreyasBrill
Are you trolling us or something? Your safetensors file is exactly the same as anon8231489123's (which was trained on the filtered dataset):
https://huggingface.co./anon8231489123/vicuna-13b-GPTQ-4bit-128g
@TheYuriLover I am not trolling. As I said, I had no GPU to make a safetensors file, so I asked my friend to make it. He gave me a file after a long while and I uploaded it here. He also told me to use those flags that I gave you. I didn't realize that someone had already made it.
And as I also said, Vicuna isn't really stable yet. For confirmation, watch this guy's video and see how it performs:
https://youtu.be/jb4r1CL2tcc
@ShreyasBrill
In your description it says that it used the unfiltered dataset
https://huggingface.co./datasets/anon8231489123/ShareGPT_Vicuna_unfiltered
The safetensors file from anon8231489123 used the filtered dataset.
They shouldn't end up being the same safetensors file, so why did you say you used the unfiltered dataset? Why lie like that?
lol, forgot to remove it, hold on. Actually, at first it was another model that I uploaded. Then I deleted it and reuploaded a filtered model and forgot to change the tags. Sorry for that.
fixed it :)
@TheYuriLover Please tell me what exactly you need: an unfiltered model with 4-bit quantization that works with oobabooga?
@ShreyasBrill
yes, I want a model that is trained on the unfiltered dataset
https://huggingface.co./datasets/anon8231489123/ShareGPT_Vicuna_unfiltered
And that model should be quantized with Triton and with all the GPTQ implementations (true sequential, act order and groupsize 128)
https://github.com/qwopqwop200/GPTQ-for-LLaMa/tree/triton
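Concretely, it would be the same kind of command as the 7B one I pasted earlier, just pointed at the 13B HF weights, something like this (the paths and the output filename are only examples):
CUDA_VISIBLE_DEVICES=0 python llama.py ./vicuna-13b-hf c4 --wbits 4 --true-sequential --act-order --groupsize 128 --save_safetensors vicuna-13b-4bit-128g.safetensors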
Okay, I'll try to make it and upload it.
@TheYuriLover do you know how to convert the 4-bit quantized llama.cpp GGML format (ggml-model-q4_0.bin) to a file like "consolidated.00.pth", i.e. the original format LLaMA comes in? Or do you have links to anything I need to learn/understand to figure it out? I am trying to use https://github.com/jankais3r/LLaMA_MPS on my M1 Mac, and it converts from the consolidated.00.pth format to pyarrow. I tried a script called "convert-ggml-to-pth.py" but it does not work as expected. If I knew the differences between .pth, .bin, and safetensors, and how they all interrelate, it would be much easier to figure all this out lol. Any references/pointers are much appreciated :)
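For what it's worth, my rough (possibly wrong) understanding so far: a .pth file is just a pickled PyTorch state dict (torch.save / torch.load), a .safetensors file holds the same tensors in a different container, and the ggml .bin is llama.cpp's own binary layout that its convert scripts read and write, with GPTQ 4-bit quantization being a separate step from the file container. So moving between .pth and .safetensors should be roughly this (filenames are just examples):
import torch
from safetensors.torch import load_file, save_file
# .pth -> .safetensors: load the pickled state dict, then write the tensors out
state_dict = torch.load("consolidated.00.pth", map_location="cpu")
save_file(state_dict, "model.safetensors")  # safetensors only accepts plain tensors
# .safetensors -> .pth: read the tensors back and pickle them with torch.save
tensors = load_file("model.safetensors")
torch.save(tensors, "consolidated.00.pth")
Please correct me if that picture is off.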
@panayao yes, you can convert it back to pth format! You can download the Vicuna model, download/clone this repository, and use the convert-ggml-to-pth.py file https://github.com/ggerganov/llama.cpp/blob/master/convert-ggml-to-pth.py easy and simple! And it does work with M1 Macs, I guess. This version is currently not very stable because the official Vicuna was released like 2 days ago and, you know, it's just starting up. I might update the model when an update is released so that the models become more stable in their responses.
This is the script I was trying lol. It doesn't seem to work. I get this when it tries to convert to pyarrow format (which LLaMA_MPS uses and does automatically when run with *.pth files):
Converting checkpoint to pyarrow format
models/13B_Vicuna/consolidated.00.pth
Traceback (most recent call last):
File "/Users/panayao/Documents/LLaMA_MPS/chat.py", line 146, in <module>
fire.Fire(main)
File "/Users/panayao/Documents/LLaMA_MPS/env/lib/python3.10/site-packages/fire/core.py", line 141, in Fire
component_trace = _Fire(component, args, parsed_flag_args, context, name)
File "/Users/panayao/Documents/LLaMA_MPS/env/lib/python3.10/site-packages/fire/core.py", line 475, in _Fire
component, remaining_args = _CallAndUpdateTrace(
File "/Users/panayao/Documents/LLaMA_MPS/env/lib/python3.10/site-packages/fire/core.py", line 691, in _CallAndUpdateTrace
component = fn(*varargs, **kwargs)
File "/Users/panayao/Documents/LLaMA_MPS/chat.py", line 106, in main
generator = load(ckpt_dir, tokenizer_path, max_seq_len, max_batch_size)
File "/Users/panayao/Documents/LLaMA_MPS/chat.py", line 49, in load
tens = pa.Tensor.from_numpy(v.numpy())
AttributeError: 'dict' object has no attribute 'numpy'
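Based on the traceback, the loop in chat.py is calling .numpy() on every value in the loaded checkpoint, and at least one of those values is a dict rather than a tensor. A quick check to see which key is the problem (the path is mine, adjust as needed):
import torch
ckpt = torch.load("models/13B_Vicuna/consolidated.00.pth", map_location="cpu")
# print every entry that is not a plain tensor; those are what break
# pa.Tensor.from_numpy(v.numpy()) in chat.py
for key, value in ckpt.items():
    if not torch.is_tensor(value):
        print(key, type(value))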
@panayao I guess you have to ask them directly to fix your error: https://github.com/ggerganov/llama.cpp/issues
Thanks @TheYuriLover lol, I probably should have thought to open a GitHub issue. @ShreyasBrill if you oblige @TheYuriLover's request, can you also upload the unfiltered model in the same "ggml-model-q4_0.bin" format, in addition to the format @TheYuriLover requested?
Also @TheYuriLover @ShreyasBrill I have crypto mining machines running Windows with proper CUDA drivers that I haven't used in years, but they each have 8 AMD RX 470s (4 GB). So if you need stuff converted and are willing to point me in the right direction, I can help :).
@panayao well, for the moment the 16-bit unrestricted Vicuna doesn't exist yet, so there's nothing to quantize, but if it shows up you can help, yeah :p
@panayao yes, I do need your GPUs to help create the models, but first let's wait for the next release of Vicuna, because it's not fully stable currently. And as Yuri said, the 16-bit unrestricted Vicuna doesn't exist.
And sorry I didn't reply earlier because of the different timezones. I was sleeping when you guys messaged me.
I would also love an unfiltered version that is quantized in 4-bit just like this one; it's a great model.
That said, the outputs seem to be identical, as in token for token, to vicuna-13b-q4. I am not sure this is your work?
@nucleardiffusion Thanks :)
I edited my first comment after some tests; could you please clarify?
@nucleardiffusion yes, it's the same model. I downloaded it from one of my friends who trained it; later on he deleted it, which is why I thought of uploading it here. This is not my work.