---
language:
- en
- fa
---
<p align="center">
<picture>
<img alt="Hugging Face Transformers Library" src="https://i.postimg.cc/VN4F7WRC/Untitled-design-modified.png" width="1000" height="450" style="max-width: 100%;">
</picture>
</p>
<h4 align="center">
<p>
<a href="https://huggingface.co./aidal/Persian-Mistral-7B#model-description">Model description</a> |
<a href="https://huggingface.co./aidal/Persian-Mistral-7B#example-output">Example output</a> |
<a href="https://huggingface.co./aidal/Persian-Mistral-7B#banchmark-results">Benchmark results</a> |
<a href="https://huggingface.co./aidal/Persian-Mistral-7B#how-to-use">How to use</a> |
<a href="https://huggingface.co./aidal/Persian-Mistral-7B#training-and-finetuning">Training and finetuning</a>
</p>
</h4>
----
# Model description
>Persian-Mistral is a fine-tuned version of Mistral-7B designed for Persian question answering (QA) and other NLP tasks.
----
# Example output:
**Example 1:**
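- Input: "سلام، خوبی؟" (English: "Hi, how are you?")
- Output: "سلام، خوشحالم که با شما صحبت می کنم. چطور می توانم به شما کمک کنم؟" (English: "Hello, I'm glad to be talking with you. How can I help you?")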
----
# Benchmark results
| model | dataset | score |
|---------------|-------------------|--------|
| base-model-7b | ARC-easy |41.92% |
| base-model-7b | ARC-easy |39.12% |
| fa-model-7b | ARC-easy |37.89% |
| base-model-7b | ARC-challenge |37.12% |
| fa-model-7b | ARC-challenge |39.29% |
----
# How to use
```python
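from transformers import AutoTokenizer, AutoModelForCausalLM

# Load the tokenizer and model weights from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained("aidal/Persian-Mistral-7B")
model = AutoModelForCausalLM.from_pretrained("aidal/Persian-Mistral-7B")

# Persian prompt: "What is the capital of Iran?"
input_text = "پایتخت ایران کجاست؟"
inputs = tokenizer(input_text, return_tensors="pt")

# Generate a continuation and decode it back to text
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))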
```
----
# Training and finetuning
- **Extending the tokenizer:** The base Mistral tokenizer does not support Persian. As an initial step, we trained a SentencePiece tokenizer on the Farsi Wikipedia corpus and then integrated it with the Mistral tokenizer.
- **Pre-training:** Next, we expanded the embedding layer of the base model to match the vocabulary size of the Persian tokenizer. We then used the LoRA method to train the model on three datasets: Wikipedia-Farsi, an Islamic book collection, and content from Khamenei.ir.
<p align="center">
<picture>
<img alt="Hugging Face Transformers Library" src="https://i.postimg.cc/LXSD4HnZ/Stakehozlder-Map-1-page-0001-modified.png" width="270" height="270" style="max-width: 100%;">
</picture>
</p>
<p align="center" style="font-size: 13px;">Wikipedia-Farsi: 183M tokens, Islamic books: 55M tokens, Khamenei.ir: 9M tokens</p>
- **Instruction fine-tuning:** For the final step, we fine-tuned the model with LoRA on a translated version of the Stanford Alpaca dataset to improve its question-answering capabilities.
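
The tokenizer and embedding steps above can be pictured with a small toy example. This is only a sketch under stated assumptions, not the code used to build the model: `merge_vocab` and `expand_embeddings` are hypothetical helpers, the vocabulary is a three-entry stand-in, and initialising the new rows to the mean of the existing ones is an assumption (the card does not say which initialisation was used). With the Transformers library, the analogous calls are typically `tokenizer.add_tokens(...)` followed by `model.resize_token_embeddings(len(tokenizer))`.

```python
def merge_vocab(base_vocab, new_pieces):
    """Append unseen pieces to a base vocabulary, keeping existing ids stable.

    Assumes the base ids are contiguous, starting at 0.
    """
    merged = dict(base_vocab)  # piece -> id, copied so the input is untouched
    for piece in new_pieces:
        if piece not in merged:
            merged[piece] = len(merged)  # next free id
    return merged


def expand_embeddings(embeddings, new_size):
    """Grow an embedding table (list of rows) to new_size rows.

    New rows are set to the mean of the existing rows (an assumed
    initialisation, chosen only for illustration).
    """
    dim = len(embeddings[0])
    mean_row = [sum(row[j] for row in embeddings) / len(embeddings) for j in range(dim)]
    extra = [list(mean_row) for _ in range(new_size - len(embeddings))]
    return [list(row) for row in embeddings] + extra


# Toy stand-ins: a 3-piece "base" vocabulary and pieces from a new Persian tokenizer
base_vocab = {"<s>": 0, "</s>": 1, "the": 2}
persian_pieces = ["سلام", "ایران", "the"]  # "the" already exists and is skipped

vocab = merge_vocab(base_vocab, persian_pieces)
table = expand_embeddings([[0.0, 1.0], [2.0, 3.0], [4.0, 5.0]], len(vocab))
print(len(vocab), len(table))
```

Keeping the original ids unchanged means the pretrained embedding rows still line up with their tokens; only the appended rows need to be learned during pre-training.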
This diagram illustrates the steps described above:
<p align="center">
<picture>
<img alt="Hugging Face Transformers Library" src="https://i.postimg.cc/yY4dkwvT/Stakehozlder-Map-page-0001-modified.png" width="400" height="500" style="max-width: 100%;">
</picture>
</p>
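
Both training stages rely on LoRA, which freezes a pretrained weight matrix and trains a low-rank update in its place. The sketch below counts trainable parameters for a single square projection to show why this is cheap. The hidden size of 4096 matches Mistral-7B, but the rank of 8 is an illustrative assumption, since the rank actually used is not stated here, and `lora_param_counts` is a hypothetical helper.

```python
def lora_param_counts(d_out, d_in, rank):
    """Return (full, lora) trainable-parameter counts for one weight matrix."""
    full = d_out * d_in           # updating the d_out x d_in matrix W directly
    lora = rank * (d_out + d_in)  # B is d_out x rank, A is rank x d_in
    return full, lora


# 4096 is Mistral-7B's hidden size; rank 8 is an assumption for illustration
full, lora = lora_param_counts(4096, 4096, rank=8)
print(f"full: {full:,}  LoRA: {lora:,}  ratio: {lora / full:.2%}")
```

Under these assumptions, the low-rank update trains well under 1% of the parameters that a full update of the same matrix would.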