|
---
base_model:
- mistralai/Mistral-7B-v0.3
datasets:
- wikimedia/wikipedia
- FreedomIntelligence/alpaca-gpt4-arabic
language:
- ar
- en
license: apache-2.0
tags:
- text-generation-inference
- transformers
- unsloth
- mistral
- trl
---
|
|
|
An experiment in pre-training on Arabic text and then finetuning on instructions, starting from the quantized `mistralai/Mistral-7B-v0.3` model from `unsloth`. This is a first attempt at pre-training, so expect issues and low-quality outputs. The repo contains the merged, quantized model and a GGUF version.
|
|
|
See the [Spaces demo](https://huggingface.co./spaces/nazimali/mistral-7b-v0.3-instruct-arabic) for a usage example.
|
|
|
### Example usage |
|
|
|
#### llama-cpp-python |
|
|
|
```python
from llama_cpp import Llama

# Arabic Alpaca-style instruction template ("Below are instructions that
# describe a task. Write a response that appropriately completes the request.")
inference_prompt = """فيما يلي تعليمات تصف مهمة. اكتب استجابة تكمل الطلب بشكل مناسب.

### تعليمات:
{}

### إجابة:
"""

llm = Llama.from_pretrained(
    repo_id="nazimali/mistral-7b-v0.3-instruct-arabic",
    filename="Q8_0.gguf",
)

llm.create_chat_completion(
    messages=[
        {
            "role": "user",
            "content": inference_prompt.format("السلام عليكم، هيا نموء"),
        }
    ]
)
```
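`create_chat_completion` returns an OpenAI-style completion dict. A minimal sketch of pulling out the reply text, using a hand-built dict standing in for the real response (the assistant content shown is illustrative, not real model output):

```python
# Shape of the dict returned by create_chat_completion (OpenAI-style):
# the generated message sits at choices[0]["message"]["content"].
response = {
    "choices": [
        {"message": {"role": "assistant", "content": "وعليكم السلام"}}
    ]
}

reply = response["choices"][0]["message"]["content"]
print(reply)
```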
|
|
|
#### llama.cpp |
|
|
|
```shell
./llama-cli \
    --hf-repo "nazimali/mistral-7b-v0.3-instruct-arabic" \
    --hf-file Q8_0.gguf \
    -p "السلام عليكم، هيا نموء" \
    --conversation
```
|
|
|
### Training |
|
|
|
#### Pre-training data

- `wikimedia/wikipedia`
  - Subset: `20231101.ar`
  - Used 6,096 rows (0.05% of the total data)

#### Finetuning data

- `FreedomIntelligence/alpaca-gpt4-arabic`
  - Used all 49,969 rows (100% of the data)
|
|
|
#### Finetuning instruction format
|
|
|
```python
# Arabic Alpaca-style template ("Below are instructions that describe a task.
# Write a response that appropriately completes the request.")
finetune_prompt = """فيما يلي تعليمات تصف مهمة. اكتب استجابة تكمل الطلب بشكل مناسب.

### تعليمات:
{}

### إجابة:
"""
```
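Each finetuning row is presumably rendered into this template before training. A minimal sketch under stated assumptions: the `format_row` helper, the field names, and the hard-coded `</s>` EOS token are all illustrative (in practice the EOS would come from the tokenizer):

```python
EOS_TOKEN = "</s>"  # assumption: Mistral's EOS; normally taken from tokenizer.eos_token

finetune_prompt = """فيما يلي تعليمات تصف مهمة. اكتب استجابة تكمل الطلب بشكل مناسب.

### تعليمات:
{}

### إجابة:
"""

def format_row(instruction: str, answer: str) -> str:
    # Insert the instruction into the template, then append the target
    # answer and the EOS token so the model learns where to stop.
    return finetune_prompt.format(instruction) + answer + EOS_TOKEN

# Hypothetical example row (not taken from the dataset):
text = format_row("ما هي عاصمة فرنسا؟", "عاصمة فرنسا هي باريس.")
```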