# Model Card for Amar-89/Llama-3.1-8B-Instruct-4bit
Quantized (4-bit) version of meta-llama/Llama-3.1-8B-Instruct. The quantized weights total 6.1 GB, so the model can run directly on GPUs with 8 GB of VRAM. No changes were made to the model beyond quantization.
## Model Details

### Model Description
- Developed by: Amar-89
- Model type: Quantized (4-bit)
- License: MIT
- Quantized from model: meta-llama/Llama-3.1-8B-Instruct
Uses the tokenizer from the base model.
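The card does not state which quantization settings were used, but a 4-bit checkpoint like this can be produced with bitsandbytes through transformers' `BitsAndBytesConfig`. The sketch below assumes the common NF4 setup with bfloat16 compute; treat the specific options as assumptions, not this model's documented recipe.

```python
# Hedged sketch: re-quantizing the base model to 4-bit with bitsandbytes.
# The exact settings for this checkpoint are undocumented; NF4 with
# bfloat16 compute is assumed here.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # store weights in 4-bit
    bnb_4bit_quant_type="nf4",              # assumed: NF4 quantization type
    bnb_4bit_compute_dtype=torch.bfloat16,  # assumed: compute in bf16
)

base = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct",
    quantization_config=bnb_config,
    device_map="auto",
)
```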
## How to use
```python
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "Amar-89/Llama-3.1-8B-Instruct-4bit"

# Load the quantized weights; device_map="auto" places them on the GPU.
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_name)
```
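As a quick sanity check, you can inspect the memory footprint and generate a single turn with the tokenizer's chat template. This snippet is a minimal sketch; the prompt is illustrative and not part of the original card.

```python
import torch

# Should report roughly the 6.1 GB quoted above.
print(f"{model.get_memory_footprint() / 1e9:.1f} GB")

# One-shot generation using the tokenizer's chat template.
messages = [{"role": "user", "content": "Say hello in one sentence."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

with torch.no_grad():
    out = model.generate(inputs, max_new_tokens=64)
print(tokenizer.decode(out[0][inputs.shape[-1]:], skip_special_tokens=True))
```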
```python
from transformers import pipeline


def terminal_chat(model, tokenizer, system_prompt):
    """
    Starts a terminal-based chat session with the given model, tokenizer,
    and system prompt.

    Args:
        model: The Hugging Face model object.
        tokenizer: The Hugging Face tokenizer object.
        system_prompt: The system instruction that defines the chat behavior.
    """
    pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)
    messages = [{"role": "system", "content": system_prompt}]
    print("Chat session started. Type 'exit' to quit.")
    while True:
        user_input = input("User: ")
        if user_input.lower() == "exit":
            print("Ending chat session. Goodbye!")
            break
        messages.append({"role": "user", "content": user_input})
        outputs = pipe(messages, max_new_tokens=256)
        # The pipeline returns the full conversation; the last message
        # is the assistant's reply.
        response = outputs[0]["generated_text"][-1]["content"]
        # Keep the reply in the history so later turns have context.
        messages.append({"role": "assistant", "content": response})
        print(f"Assistant: {response}")
    # Print the full conversation history at the end of the session.
    print(messages)


system_prompt = "You are a pirate chatbot who always responds in pirate speak!"
terminal_chat(model, tokenizer, system_prompt)
```
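For interactive use, token-by-token streaming often feels more responsive than waiting for the full reply. Below is a minimal variant using transformers' `TextStreamer`; the helper name `stream_reply` is ours, not part of the card.

```python
import torch
from transformers import TextStreamer

# Hypothetical helper (not from the original card): stream one assistant
# reply token by token instead of waiting for the full generation.
def stream_reply(model, tokenizer, messages, max_new_tokens=256):
    streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
    inputs = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)
    with torch.no_grad():
        model.generate(inputs, max_new_tokens=max_new_tokens, streamer=streamer)
```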
## Model tree for Amar-89/Llama-3.1-8B-Instruct-4bit

- Base model: meta-llama/Llama-3.1-8B
- Finetuned: meta-llama/Llama-3.1-8B-Instruct
- Quantized: Amar-89/Llama-3.1-8B-Instruct-4bit (this model)