Update README.md
README.md
CHANGED
@@ -63,7 +63,7 @@ print(tokenizer.decode(outputs[0]))
 ```
 ----
 # Training and finetuning
-- **Extend
+- **Extend tokenizer:** The base Mistral tokenizer does not support Persian. As an initial step, we trained a SentencePiece tokenizer on the Farsi Wikipedia corpus and subsequently integrated it with the Mistral tokenizer.
 - **Pre-training:** In the following step, we expanded the embedding layer of the base model to match the size of the Persian tokenizer. We then employed the LoRA method to train the model on three distinct datasets: Wikipedia-Farsi, an Islamic book collection, and content from Khamenei.ir.
 <p align="center">
 <picture>
@@ -71,6 +71,7 @@ print(tokenizer.decode(outputs[0]))
 </picture>
 </p>
 <p align="center" style="font-size: 13px;">Wiki-farsi:183M tokens, Islamic books:55M tokens, Khamenei.ir:9M tokens</p>
+
 - **Instruction Fine-tuning:** For the final step, we fine-tuned the model using the LoRA method on a translated version of the Stanford-alpaca to enhance the model's question-answering capabilities.
 This diagram illustrates the steps described above:
 <p align="center">
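
Note on the changed README text: the "Extend tokenizer" and "Pre-training" bullets describe merging a Persian vocabulary into the base tokenizer and then growing the embedding layer to match. The sketch below illustrates that idea in plain Python on toy data. It is not the authors' code: `merge_vocab`, `resize_embeddings`, the toy vocabularies, and the mean initialization of the new rows are all assumptions made for illustration, since the README does not state how the merge or the initialization was done.

```python
from statistics import fmean


def merge_vocab(base_vocab, extra_vocab):
    """Append pieces from extra_vocab that base_vocab lacks, keeping base ids stable."""
    seen = set(base_vocab)
    merged = list(base_vocab)
    for piece in extra_vocab:
        if piece not in seen:
            merged.append(piece)
            seen.add(piece)
    return merged


def resize_embeddings(embeddings, new_size):
    """Return a copy of the embedding table grown to new_size rows.

    New rows are set to the column-wise mean of the existing rows.
    This initialization is an assumption; the README does not name one.
    """
    if new_size < len(embeddings):
        raise ValueError("new_size must not be smaller than the current table")
    mean_row = [fmean(col) for col in zip(*embeddings)]
    resized = [list(row) for row in embeddings]
    resized.extend(list(mean_row) for _ in range(new_size - len(embeddings)))
    return resized


# Toy stand-ins for the base (Mistral) and Persian (SentencePiece) vocabularies.
base_vocab = ["<s>", "</s>", "the", "a"]
persian_vocab = ["<s>", "سلام", "کتاب", "a"]

merged_vocab = merge_vocab(base_vocab, persian_vocab)

# Toy 2-dimensional embedding table, one row per base token.
base_embeddings = [[0.0, 1.0], [1.0, 0.0], [2.0, 2.0], [1.0, 1.0]]
new_embeddings = resize_embeddings(base_embeddings, len(merged_vocab))

print(merged_vocab)
print(len(new_embeddings))
```

Appending new pieces after the existing ones keeps every base token id unchanged, so the original embedding rows stay valid and only the added rows need to be learned during the subsequent training step.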