---
library_name: transformers
license: llama3
language:
- ar
- en
pipeline_tag: text-generation
model_name: Arabic ORPO 8B chat
model_type: llama3
quantized_by: MohamedRashad
---

# The AWQ version

This is the AWQ version of [MohamedRashad/Arabic-Orpo-Llama-3-8B-Instruct](https://huggingface.co./MohamedRashad/Arabic-Orpo-Llama-3-8B-Instruct) for the enthusiasts.

<center>
<img src="https://cdn-uploads.huggingface.co/production/uploads/6116d0584ef9fdfbf45dc4d9/4VqGvuqtWgLOTavTV861j.png">
</center>

## How to use, you ask?

First, update your packages:

```shell
pip3 install --upgrade autoawq transformers
```
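
The snippet below also requests `attn_implementation="flash_attention_2"`, which needs the `flash-attn` package; the command above does not pull it in. If you want that path, the usual extra install is the following (optional; you can instead just drop that argument from the snippet):

```shell
# Optional: only needed if you keep attn_implementation="flash_attention_2"
pip3 install flash-attn --no-build-isolation
```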

Now, copy and run:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, TextStreamer

model_name_or_path = "MohamedRashad/Arabic-Orpo-Llama-3-8B-Instruct-AWQ"

tokenizer = AutoTokenizer.from_pretrained(model_name_or_path)
model = AutoModelForCausalLM.from_pretrained(
    model_name_or_path,
    attn_implementation="flash_attention_2",  # disable if you have problems with flash attention 2
    torch_dtype=torch.float16,  # AWQ kernels are tuned for fp16
    low_cpu_mem_usage=True,
    device_map="auto",
)

# Using the text streamer to stream output one token at a time
streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)

messages = [
    {"role": "system", "content": "You are a pirate chatbot who always responds in pirate speak!"},
    {"role": "user", "content": "مرحبا"},  # "Hello" in Arabic
]

input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)

terminators = [
    tokenizer.eos_token_id,
    tokenizer.convert_tokens_to_ids("<|eot_id|>"),
]

generation_params = {
    "do_sample": True,
    "temperature": 0.6,
    "top_p": 0.9,
    "top_k": 40,
    "max_new_tokens": 1024,
    "eos_token_id": terminators,
}

# Generate streamed output, visible one token at a time
generation_output = model.generate(
    input_ids,
    streamer=streamer,
    **generation_params,
)

# Generation without a streamer, which will include the prompt in the output
generation_output = model.generate(
    input_ids,
    **generation_params,
)

# Get the tokens from the output, decode them, print them
token_output = generation_output[0]
text_output = tokenizer.decode(token_output)
print("model.generate output: ", text_output)

# Inference is also possible via Transformers' pipeline
from transformers import pipeline

pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    **generation_params,
)

# Build the same chat prompt as a plain string for the pipeline
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

pipe_output = pipe(prompt)[0]['generated_text']
print("pipeline output: ", pipe_output)
```
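
If you would rather load the quantized weights through AutoAWQ itself instead of plain `transformers`, here is a minimal sketch of the alternative loading path, assuming the `AutoAWQForCausalLM.from_quantized` API of the `autoawq` package installed above (treat it as a starting point, not a drop-in replacement for the snippet above):

```python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_name_or_path = "MohamedRashad/Arabic-Orpo-Llama-3-8B-Instruct-AWQ"

# Load the quantized weights with AutoAWQ; fuse_layers=True fuses attention/MLP
# modules for faster decoding on supported GPUs
model = AutoAWQForCausalLM.from_quantized(model_name_or_path, fuse_layers=True)
tokenizer = AutoTokenizer.from_pretrained(model_name_or_path)
```

From there, tokenizing with `apply_chat_template` and calling `model.generate` works much like in the `transformers` snippet above; just keep the input ids on the GPU where AutoAWQ places the model.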