GGML Version
Outstanding work! I just converted it to ggml, check it out if you are interested! https://huggingface.co./s3nh/LLaMA-2-7B-32K-GGML
@s3nh Will your converted model run easily on Colab's CPU?
@deepakkaura26 I think so! By default you get 2 vCPUs on Colab with 13 GB of RAM, which should be enough to run the ggml versions.
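If you want to double-check what your runtime actually gives you, here is a minimal sketch (assuming psutil is available, which it usually is on a standard Colab runtime):
import os
import psutil  # usually preinstalled on Colab; otherwise `pip install psutil`

print("vCPUs:", os.cpu_count())
print("RAM (GB):", round(psutil.virtual_memory().total / 1e9, 1))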
@mauriceweber Actually I tried it, but whether I choose CPU or GPU, my Colab crashed 5 times.
Which quantization did you try? I tried the 4-bit version on Colab and could run it without problems.
from ctransformers import AutoModelForCausalLM

# load the 4-bit (q4_0) ggml file straight from the hub repo
model_file = "LLaMA-2-7B-32K.ggmlv3.q4_0.bin"
model = AutoModelForCausalLM.from_pretrained("s3nh/LLaMA-2-7B-32K-GGML", model_type="llama", model_file=model_file)

prompt = "Whales have been living in the oceans for millions of years "
model(prompt, max_new_tokens=128, temperature=0.9, top_p=0.7)
EDIT: load model directly from hub.
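On a CPU-only runtime generation can take a while; if I remember the ctransformers API correctly, you can also stream tokens as they are produced (a sketch assuming the same model object and prompt as above):
# stream=True yields text chunks as they are generated
for text in model(prompt, max_new_tokens=128, temperature=0.9, top_p=0.7, stream=True):
    print(text, end="", flush=True)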
@mauriceweber I have used the same example that is shown on this model's page:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("togethercomputer/LLaMA-2-7B-32K")
model = AutoModelForCausalLM.from_pretrained("togethercomputer/LLaMA-2-7B-32K", trust_remote_code=True, torch_dtype=torch.float16)

input_context = "Your text here"
input_ids = tokenizer.encode(input_context, return_tensors="pt")
output = model.generate(input_ids, max_length=128, temperature=0.7)
output_text = tokenizer.decode(output[0], skip_special_tokens=True)
print(output_text)
@mauriceweber I tried to run the code you showed and it gives me the following error:
HTTPError Traceback (most recent call last)
/usr/local/lib/python3.10/dist-packages/huggingface_hub/utils/_errors.py in hf_raise_for_status(response, endpoint_name)
260 try:
--> 261 response.raise_for_status()
262 except HTTPError as e:
11 frames
HTTPError: 401 Client Error: Unauthorized for url: https://huggingface.co./api/models/LLaMA-2-7B-32K.ggmlv3.q4_0.bin/revision/main
The above exception was the direct cause of the following exception:
RepositoryNotFoundError Traceback (most recent call last)
/usr/local/lib/python3.10/dist-packages/huggingface_hub/utils/_errors.py in hf_raise_for_status(response, endpoint_name)
291 " make sure you are authenticated."
292 )
--> 293 raise RepositoryNotFoundError(message, response) from e
294
295 elif response.status_code == 400:
RepositoryNotFoundError: 401 Client Error. (Request ID: Root=1-64caab34-5bd826d76686f26a76b02644;7f562443-2822-41e5-bcd0-37c62aef99f9)
Repository Not Found for url: https://huggingface.co./api/models/LLaMA-2-7B-32K.ggmlv3.q4_0.bin/revision/main.
Please make sure you specified the correct repo_id and repo_type.
If you are trying to access a private or gated repo, make sure you are authenticated.
Invalid username or password.
@mauriceweber I have used the same example that is shown on this model's page:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("togethercomputer/LLaMA-2-7B-32K")
model = AutoModelForCausalLM.from_pretrained("togethercomputer/LLaMA-2-7B-32K", trust_remote_code=True, torch_dtype=torch.float16)

input_context = "Your text here"
input_ids = tokenizer.encode(input_context, return_tensors="pt")
output = model.generate(input_ids, max_length=128, temperature=0.7)
output_text = tokenizer.decode(output[0], skip_special_tokens=True)
print(output_text)
Here you are not using the quantized (ggml) model, which is why you are running out of memory (you need around 14 GB of RAM for the 7B model in float16).
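As rough back-of-the-envelope arithmetic (assuming ~7B parameters and ignoring activations and other overhead):
params = 7e9
print("float16:", params * 2 / 1e9, "GB")          # 2 bytes per weight -> ~14 GB
print("ggml q4_0:", params * 4.5 / 8 / 1e9, "GB")  # roughly 4.5 bits per weight -> ~4 GB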
@mauriceweber I tried to run the code you showed and it gives me the following error
This error is because the model is not downloaded yet (I was assuming you had it downloaded to Colab) -- I adjusted the code snippet above so that the model file gets pulled directly from the repo. You can check the other model versions here.
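If you prefer to fetch the file explicitly first, something along these lines should also work (a sketch using huggingface_hub; check the repo for the exact filenames of the other quantizations):
from ctransformers import AutoModelForCausalLM
from huggingface_hub import hf_hub_download, list_repo_files

repo_id = "s3nh/LLaMA-2-7B-32K-GGML"
print(list_repo_files(repo_id))  # list the available quantized files

# download one quantization and point ctransformers at the local path
local_path = hf_hub_download(repo_id=repo_id, filename="LLaMA-2-7B-32K.ggmlv3.q4_0.bin")
model = AutoModelForCausalLM.from_pretrained(local_path, model_type="llama")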
Let us know how it goes! :)