# Model Card for Amar-89/Llama-3.1-8B-Instruct-4bit
Quantized (4-bit) version of meta-llama/Llama-3.1-8B-Instruct. The quantized weights total 6.1 GB, so the model can run directly on GPUs with 8 GB of VRAM. No changes were made to the model beyond quantization.
## Model Details

### Model Description
- Developed by: Amar-89
- Model type: Quantized (4-bit)
- License: MIT
- Quantized from model: meta-llama/Llama-3.1-8B-Instruct
Uses the tokenizer from the base model.
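The card does not state which quantization settings were used, but a 4-bit checkpoint like this can be produced with bitsandbytes through transformers' `BitsAndBytesConfig`. The sketch below assumes the common NF4 setup with bfloat16 compute; treat the specific options as assumptions, not this model's documented recipe.

```python
# Hedged sketch: re-quantizing the base model to 4-bit with bitsandbytes.
# The exact settings for this checkpoint are undocumented; NF4 with
# bfloat16 compute is assumed here.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # store weights in 4-bit
    bnb_4bit_quant_type="nf4",              # assumed: NF4 quantization type
    bnb_4bit_compute_dtype=torch.bfloat16,  # assumed: compute in bf16
)

base = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct",
    quantization_config=bnb_config,
    device_map="auto",
)
```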
## How to use
```python
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "Amar-89/Llama-3.1-8B-Instruct-4bit"

# Load the quantized weights; device_map="auto" places them on the GPU.
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_name)
```
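As a quick sanity check, you can inspect the memory footprint and generate a single turn with the tokenizer's chat template. This snippet is a minimal sketch; the prompt is illustrative and not part of the original card.

```python
import torch

# Should report roughly the 6.1 GB quoted above.
print(f"{model.get_memory_footprint() / 1e9:.1f} GB")

# One-shot generation using the tokenizer's chat template.
messages = [{"role": "user", "content": "Say hello in one sentence."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

with torch.no_grad():
    out = model.generate(inputs, max_new_tokens=64)
print(tokenizer.decode(out[0][inputs.shape[-1]:], skip_special_tokens=True))
```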
```python
from transformers import pipeline


def terminal_chat(model, tokenizer, system_prompt):
    """
    Starts a terminal-based chat session with the given model, tokenizer,
    and system prompt.

    Args:
        model: The Hugging Face model object.
        tokenizer: The Hugging Face tokenizer object.
        system_prompt: The system instruction that defines the chat behavior.
    """
    pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)
    messages = [{"role": "system", "content": system_prompt}]
    print("Chat session started. Type 'exit' to quit.")
    while True:
        user_input = input("User: ")
        if user_input.lower() == "exit":
            print("Ending chat session. Goodbye!")
            break
        messages.append({"role": "user", "content": user_input})
        outputs = pipe(messages, max_new_tokens=256)
        # The pipeline returns the full conversation; the last message
        # is the assistant's reply.
        response = outputs[0]["generated_text"][-1]["content"]
        # Keep the reply in the history so later turns have context.
        messages.append({"role": "assistant", "content": response})
        print(f"Assistant: {response}")
    # Print the full conversation history at the end of the session.
    print(messages)


system_prompt = "You are a pirate chatbot who always responds in pirate speak!"
terminal_chat(model, tokenizer, system_prompt)
```
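For interactive use, token-by-token streaming often feels more responsive than waiting for the full reply. Below is a minimal variant using transformers' `TextStreamer`; the helper name `stream_reply` is ours, not part of the card.

```python
import torch
from transformers import TextStreamer

# Hypothetical helper (not from the original card): stream one assistant
# reply token by token instead of waiting for the full generation.
def stream_reply(model, tokenizer, messages, max_new_tokens=256):
    streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
    inputs = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)
    with torch.no_grad():
        model.generate(inputs, max_new_tokens=max_new_tokens, streamer=streamer)
```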
## Model tree for Amar-89/Llama-3.1-8B-Instruct-4bit

- Base model: meta-llama/Llama-3.1-8B
- Finetuned: meta-llama/Llama-3.1-8B-Instruct
- Quantized: Amar-89/Llama-3.1-8B-Instruct-4bit (this model)