The AWQ version

This is the AWQ version of MohamedRashad/Arabic-Orpo-Llama-3-8B-Instruct for the enthusiasts

How to use, you ask ?

First, Update your packages

pip3 install --upgrade autoawq transformers

Now, Copy and Run

from transformers import AutoModelForCausalLM, AutoTokenizer, TextStreamer

model_name_or_path = "MohamedRashad/Arabic-Orpo-Llama-3-8B-Instruct-AWQ"

tokenizer = AutoTokenizer.from_pretrained(model_name_or_path)
model = AutoModelForCausalLM.from_pretrained(
    model_name_or_path,
    attn_implementation="flash_attention_2", # disable if you have problems with flash attention 2
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    device_map="auto"
)

# Using the text streamer to stream output one token at a time
streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)

messages = [
    {"role": "system", "content": "You are a pirate chatbot who always responds in pirate speak!"},
    {"role": "user", "content": "مرحبا"},
]

input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt"
).to(model.device)

terminators = [
    tokenizer.eos_token_id,
    tokenizer.convert_tokens_to_ids("<|eot_id|>")
]

generation_params = {
    "do_sample": True,
    "temperature": 0.6,
    "top_p": 0.9,
    "top_k": 40,
    "max_new_tokens": 1024,
    "eos_token_id": terminators,
}

# Generate streamed output, visible one token at a time
generation_output = model.generate(
    tokens,
    streamer=streamer,
    **generation_params
)

# Generation without a streamer, which will include the prompt in the output
generation_output = model.generate(
    tokens,
    **generation_params
)

# Get the tokens from the output, decode them, print them
token_output = generation_output[0]
text_output = tokenizer.decode(token_output)
print("model.generate output: ", text_output)

# Inference is also possible via Transformers' pipeline
from transformers import pipeline

pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    **generation_params
)

pipe_output = pipe(prompt_template)[0]['generated_text']
print("pipeline output: ", pipe_output)
Downloads last month
20
Safetensors
Model size
1.98B params
Tensor type
FP16
·
I32
·
Inference Examples
This model does not have enough activity to be deployed to Inference API (serverless) yet. Increase its social visibility and check back later, or deploy to Inference Endpoints (dedicated) instead.