QuantFactory/Maral-7B-alpha-1-GGUF
This is a quantized version of MaralGPT/Maral-7B-alpha-1, created using llama.cpp.
Original Model Card
Maral 7B Alpha 1
What is Maral?
Maral is a new large language model specializing in the Persian language. It is based on Mistral and trained on an Alpaca Persian dataset. It is one of the few efforts in the Persian-speaking community to bring our language to new life in the era of AI.
Also, since Maral is based on Mistral, it's capable of producing English answers as well.
What does "Maral" mean?
Maral is the Persian name of the red deer, a species of deer native to Iran. The name was chosen for a few reasons: one is the environmental concerns we have, and another is that a Persian LLM made by Iranian people deserves an Iranian name.
Inference
Prompt Format
This model requires the Guanaco format, which looks like this:
### Human: <prompt>
### Assistant: <answer>
So in your code, you may write prompts like this:
prompt = "در سال ۱۹۹۶ چه کسی رییس جمهور آمریکا بود؟"
prompt = f"### Human:{prompt}\n### Assistant:"
More information about this is in the inference sections below.
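To avoid retyping the template, the prompt format described above can be wrapped in a small helper. This is a minimal sketch; the function name `build_prompt` is our own choice for illustration and is not part of the original model card or any library.

```python
def build_prompt(user_message: str) -> str:
    """Wrap a user message in the Guanaco-style template used by this model.

    The spacing mirrors the example in this card: no space after
    "### Human:" and nothing after "### Assistant:", so the model
    continues with its answer.
    """
    return f"### Human:{user_message}\n### Assistant:"


if __name__ == "__main__":
    # Same Persian question as in the example above.
    print(build_prompt("در سال ۱۹۹۶ چه کسی رییس جمهور آمریکا بود؟"))
```

The returned string can be passed directly to the tokenizer in the inference examples.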
4 bit Quantization
If you want to use 4 bit quantization, we have a PEFT for you here. Also, you can find Google Colab notebooks here.
Installing Libraries
pip install transformers accelerate bitsandbytes
NOTE: The bitsandbytes library is only needed for the 8 bit version. Otherwise, it's not necessary.
Inference on a big GPU
If you have a big enough GPU, such as an A100, in your possession, this code is for you.
from transformers import AutoModelForCausalLM, AutoTokenizer, GenerationConfig
import torch
model_name_or_id = "MaralGPT/Maral-7B-alpha-1"
model = AutoModelForCausalLM.from_pretrained(model_name_or_id, torch_dtype=torch.bfloat16, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_name_or_id)
prompt = "در سال ۱۹۹۶ چه کسی رییس جمهور آمریکا بود؟"
prompt = f"### Human:{prompt}\n### Assistant:"
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
generation_config = GenerationConfig(
    do_sample=True,
    top_k=1,
    temperature=0.5,
    max_new_tokens=300,
    pad_token_id=tokenizer.eos_token_id
)
outputs = model.generate(**inputs, generation_config=generation_config)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
Inference on a small GPU (Consumer Hardware/Free Colab)
The code is pretty much the same as above, with a slight difference:
- Make sure bitsandbytes is installed correctly.
- Your model loading must be:
model = AutoModelForCausalLM.from_pretrained(model_name_or_id, load_in_8bit=True, torch_dtype=torch.bfloat16, device_map="auto")
On the free version of Google Colab, you may run into RAM problems. Passing low_cpu_mem_usage=True when loading the model may help.
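A rough back-of-the-envelope calculation shows why 8 bit loading matters on small GPUs. The sketch below assumes roughly 7 billion parameters (an approximation based on the model name) and counts only the weights, so real usage will be higher once activations and the generation cache are included. The function name is hypothetical.

```python
def estimate_weight_gb(num_params: float, bits_per_param: int) -> float:
    """Estimate the memory needed for model weights, in decimal gigabytes.

    Only weights are counted; activations, the KV cache, and framework
    overhead are ignored, so treat the result as a lower bound.
    """
    bytes_total = num_params * bits_per_param / 8
    return bytes_total / 1e9


if __name__ == "__main__":
    approx_params = 7e9  # assumption: about 7 billion parameters
    for label, bits in (("bfloat16", 16), ("8 bit", 8), ("4 bit", 4)):
        print(f"{label}: ~{estimate_weight_gb(approx_params, bits):.1f} GB")
```

Under these assumptions the weights need about 14 GB in bfloat16, 7 GB in 8 bit, and 3.5 GB in 4 bit, which is why the quantized variants fit on consumer hardware.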
Known Issues
- The model produces GPT-3.5 level answers in terms of grammar (especially in Persian) but is capable of extreme hallucinations. This problem can be solved by a better dataset and better training procedures (such as DPO).
- As a consequence of the previous issue, the model can also generate misleading answers, especially when dealing with reasoning problems in Persian.
- The model is huge, so it requires a lot of resources in order to work correctly. However, we may provide GPTQ or GGUF versions as well.
- The prompt format works and proves our concept of an instruction-following LLM, but since we haven't changed eos_token and bos_token to our own, you may see unnecessary information being generated by the model.
- As a consequence of the previous issue, the model tends to repeat itself. To work around this temporarily, keep the temperature below 1. According to our tests, somewhere between 0.5 and 0.7 is a sweet spot.
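One possible workaround for the extra text mentioned above is to trim the decoded output on the client side. The sketch below assumes the unwanted continuation starts with another "### Human:" or "### Assistant:" marker, which the card does not confirm; the helper name `extract_answer` is hypothetical.

```python
def extract_answer(generated_text: str, prompt: str) -> str:
    """Return only the first assistant answer from decoded model output.

    Assumption: any extra text the model produces begins with a new
    turn marker ("### Human:" or "### Assistant:").
    """
    # The decoded output usually echoes the prompt; drop it if present.
    if generated_text.startswith(prompt):
        generated_text = generated_text[len(prompt):]

    # Cut the text at the first marker that would start a new turn.
    for marker in ("### Human:", "### Assistant:"):
        idx = generated_text.find(marker)
        if idx != -1:
            generated_text = generated_text[:idx]

    return generated_text.strip()


if __name__ == "__main__":
    prompt = "### Human:Hi\n### Assistant:"
    raw = prompt + " Hello!\n### Human: another question\n### Assistant: more"
    print(extract_answer(raw, prompt))
```

This only hides the extra text after generation; it does not stop the model from spending tokens on it.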
Our Team
Special Thanks
- Mistral Team for providing the best open source base model ever.
- Sina Rashidi, who translated Alpaca dataset to Persian.
- Jupyto team for providing our infrastructure.