---
license: cc-by-nc-sa-4.0
base_model: BramVanroy/llama2-13b-ft-mc4_nl_cleaned_tiny
tags:
- generated_from_trainer
- llama
- lora
- adapters
datasets:
- BramVanroy/dutch_chat_datasets
model-index:
- name: Llama-2-13b-chat-dutch
  results: []
language:
- nl
inference: false
---


# Llama-2-13b-chat-dutch

This model is a fine-tuned version of [BramVanroy/llama2-13b-ft-mc4_nl_cleaned_tiny](https://huggingface.co./BramVanroy/llama2-13b-ft-mc4_nl_cleaned_tiny),
trained on the [BramVanroy/dutch_chat_datasets](https://huggingface.co./datasets/BramVanroy/dutch_chat_datasets) dataset with a context length of 4096 tokens.
See the original [meta-llama/Llama-2-13b-hf](https://huggingface.co./meta-llama/Llama-2-13b-hf) for more information, including intended use and biases.

If you use this model or refer to it, please use the following citation:

Bram Vanroy. (2023). Llama v2 13b: Finetuned on Dutch Conversational Data. Hugging Face. https://doi.org/10.57967/HF/1018

```bibtex
@misc{https://doi.org/10.57967/hf/1018,
  doi = {10.57967/HF/1018},
  url = {https://huggingface.co./BramVanroy/Llama-2-13b-chat-dutch},
  author = {{Bram Vanroy}},
  title = {{Llama} v2 13b: {Finetuned} on {Dutch} Conversational Data},
  publisher = {{Hugging} {Face}},
  year = {2023}
}
```

## Usage

```python
from transformers import pipeline


# If you want to add a system message, add a dictionary with role "system". However, this will likely have little
# effect since the model was only finetuned using a single system message.
messages = [
    {
        "role": "user",
        "content": "Welke talen worden er in België gesproken?"
    }
]
pipe = pipeline(
    "text-generation",
    model="BramVanroy/Llama-2-13b-chat-dutch",
    device_map="auto"
)

# Just apply the template but leave the tokenization for the pipeline to do
prompt = pipe.tokenizer.apply_chat_template(
    messages,
    tokenize=False
)

# Only return the newly generated tokens, not prompt+new_tokens (return_full_text=False)
generated = pipe(
    prompt,
    do_sample=True,
    max_new_tokens=64,
    return_full_text=False
)

generated[0]["generated_text"]
# ' De officiële talen van België zijn Nederlands, Frans en Duits. Daarnaast worden er nog een aantal andere talen gesproken, waaronder Engels, Spaans, Italiaans, Portugees, Turks, Arabisch en veel meer. '
```

## Model description

I could not get the original Llama 2 13B to produce much Dutch, even though the accompanying paper indicates that it was trained on a (small) portion of Dutch data. I therefore
continued training the original Llama 2 13B checkpoint on Dutch data [with regular causal language modeling (CLM)](https://huggingface.co./BramVanroy/llama2-13b-ft-mc4_nl_cleaned_tiny). In a second
step I finetuned that model on a collection of synthetic (translated) instruction and chat datasets that I have [collected](https://huggingface.co./datasets/BramVanroy/dutch_chat_datasets).
The collection consists of the datasets listed below. See their pages for licensing, usage, creation, and citation information.

- https://huggingface.co./datasets/BramVanroy/dolly-15k-dutch
- https://huggingface.co./datasets/BramVanroy/alpaca-cleaned-dutch-baize
- https://huggingface.co./datasets/BramVanroy/stackoverflow-chat-dutch
- https://huggingface.co./datasets/BramVanroy/quora-chat-dutch

This model is the result of that process. While not perfect by any means, it can perform reasonably well in Dutch depending on the prompts. It is also decent at helping with programming tasks.


## Intended uses & limitations

Depending on the prompt, the model can return good results considering that it is only 13B in size and was only marginally pretrained on Dutch. That being said, the
model was not trained on human feedback and contains no safeguards, so it may produce unexpected and even offensive content depending on the query. The only attempt
at a safeguard is the default system prompt that it was trained on, which was

> Je bent een behulpzame, respectvolle en eerlijke assistent. Antwoord altijd zo behulpzaam mogelijk. Je antwoorden mogen geen schadelijke, onethische, racistische, seksistische, gevaarlijke of illegale inhoud bevatten. Zorg ervoor dat je antwoorden sociaal onbevooroordeeld en positief van aard zijn.\n\nAls een vraag nergens op slaat of feitelijk niet coherent is, leg dan uit waarom in plaats van iets niet correct te antwoorden. Als je het antwoord op een vraag niet weet, deel dan geen onjuiste informatie.\

Use with caution and at your own risk!

Because the model was trained on synthetic data that was translated with OpenAI's API, you cannot use this model to create a product that competes with theirs.

## Training procedure

Trained with a context length of 4096 tokens. The dataset was preprocessed so that as many dialogs as possible were put in a single batch, without disrupting
dialogs. In other words, a dialog was never split up over different sequences or batches. During training, the human prompts were ignored in backpropagation.
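
The preprocessing described above can be illustrated with a minimal sketch. It is not the actual training code: the function names, the greedy packing strategy, and the choice to skip dialogs that are too long are all assumptions made for illustration. The only standard convention used is the label value `-100`, which PyTorch's cross-entropy loss ignores by default.

```python
IGNORE_INDEX = -100  # label value ignored by PyTorch's CrossEntropyLoss by default


def build_labels(turns):
    """Flatten one dialog into (input_ids, labels).

    `turns` is a list of (role, token_ids) pairs. Tokens of non-assistant
    turns get IGNORE_INDEX as label so they do not contribute to the loss.
    """
    input_ids, labels = [], []
    for role, token_ids in turns:
        input_ids.extend(token_ids)
        if role == "assistant":
            labels.extend(token_ids)
        else:
            labels.extend([IGNORE_INDEX] * len(token_ids))
    return input_ids, labels


def pack_dialogs(dialogs, max_length=4096):
    """Greedily group whole dialogs into sequences of at most max_length tokens.

    A dialog is never split: if it does not fit in the current sequence,
    a new sequence is started.
    """
    packed, current_ids, current_labels = [], [], []
    for turns in dialogs:
        ids, labels = build_labels(turns)
        if len(ids) > max_length:
            continue  # assumption: drop dialogs that cannot fit in one sequence
        if current_ids and len(current_ids) + len(ids) > max_length:
            packed.append((current_ids, current_labels))
            current_ids, current_labels = [], []
        current_ids = current_ids + ids
        current_labels = current_labels + labels
    if current_ids:
        packed.append((current_ids, current_labels))
    return packed


# Toy example with made-up token IDs and a tiny maximum length
dialogs = [
    [("user", [1, 2, 3]), ("assistant", [4, 5])],
    [("user", [6, 7]), ("assistant", [8, 9, 10])],
    [("user", [11]), ("assistant", [12, 13])],
]
packed = pack_dialogs(dialogs, max_length=8)
```

In this toy example the first dialog ends up in its own sequence, because adding the second one would exceed eight tokens, while the second and third dialogs share a sequence.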

Trained with LoRA targeting ["q_proj", "v_proj"] in 4-bit, and merged before upload. Trained with Flash Attention, using the patch borrowed from [here](https://github.com/philschmid/deep-learning-pytorch-huggingface/blob/main/training/utils/llama_patch.py).

The adapters are in the `adapters` branch.
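
To make "merged before upload" concrete, here is a minimal pure-Python sketch of what merging a LoRA adapter into a weight matrix means: the low-rank update `B @ A`, scaled by `alpha / rank`, is added to the frozen weight `W`. The matrices and the `alpha` and `rank` values below are toy values chosen for illustration; this card does not state the rank or alpha that were actually used, and real merging is normally done by the adapter library rather than by hand.

```python
def matmul(a, b):
    """Multiply two matrices given as nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(weight, lora_a, lora_b, alpha, rank):
    """Return W + (alpha / rank) * (B @ A), the merged weight matrix."""
    scale = alpha / rank
    delta = matmul(lora_b, lora_a)
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(weight, delta)
    ]


# Toy 2x2 weight with a rank-1 adapter (hypothetical values)
weight = [[1.0, 0.0], [0.0, 1.0]]
lora_a = [[1.0, 2.0]]    # shape (rank, in_features)
lora_b = [[0.5], [0.0]]  # shape (out_features, rank)

merged = merge_lora(weight, lora_a, lora_b, alpha=2, rank=1)
print(merged)  # [[2.0, 2.0], [0.0, 1.0]]
```

After merging, the model can be used without the adapter weights, which is why the main branch works with a plain `transformers` pipeline.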

### Training hyperparameters

The following hyperparameters were used during training:
- learning_rate: 0.0002
- train_batch_size: 2
- eval_batch_size: 2
- seed: 42
- distributed_type: multi-GPU
- num_devices: 4
- gradient_accumulation_steps: 8
- total_train_batch_size: 64
- total_eval_batch_size: 8
- optimizer: Adam with betas=(0.9,0.95) and epsilon=1e-08
- lr_scheduler_type: cosine
- lr_scheduler_warmup_ratio: 0.03
- num_epochs: 2
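
As a quick sanity check, the total batch sizes listed above follow from the per-device batch size, the number of devices, and the gradient accumulation steps. The variable names simply mirror the list.

```python
train_batch_size = 2             # per device
eval_batch_size = 2              # per device
num_devices = 4
gradient_accumulation_steps = 8

# Gradients are accumulated over several steps on every device before an optimizer update
total_train_batch_size = train_batch_size * num_devices * gradient_accumulation_steps
# Evaluation does not accumulate gradients
total_eval_batch_size = eval_batch_size * num_devices

print(total_train_batch_size)  # 64
print(total_eval_batch_size)   # 8
```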

### Training results

| Training Loss | Epoch | Step | Validation Loss |
|:-------------:|:-----:|:----:|:---------------:|
| 1.0193        | 0.09  | 20   | 1.1583          |
| 0.9743        | 0.17  | 40   | 1.1339          |
| 0.9159        | 0.26  | 60   | 1.1218          |
| 0.9131        | 0.35  | 80   | 1.1153          |
| 0.8816        | 0.44  | 100  | 1.1130          |
| 0.8977        | 0.52  | 120  | 1.1069          |
| 0.9061        | 0.61  | 140  | 1.1025          |
| 0.8672        | 0.7   | 160  | 1.1024          |
| 0.8956        | 0.79  | 180  | 1.0971          |
| 0.8514        | 0.87  | 200  | 1.0995          |
| 0.8357        | 0.96  | 220  | 1.0952          |
| 0.8294        | 1.05  | 240  | 1.0964          |
| 0.8531        | 1.13  | 260  | 1.0947          |
| 0.8321        | 1.22  | 280  | 1.0951          |
| 0.8365        | 1.31  | 300  | 1.0910          |
| 0.8616        | 1.4   | 320  | 1.0894          |
| 0.8397        | 1.48  | 340  | 1.0904          |
| 0.861         | 1.57  | 360  | 1.0880          |
| 0.8116        | 1.66  | 380  | 1.0871          |
| 0.8285        | 1.74  | 400  | 1.0855          |
| 0.8603        | 1.83  | 420  | 1.0856          |
| 0.8126        | 1.92  | 440  | 1.0848          |


### Framework versions

- Transformers 4.31.0
- Pytorch 2.0.1+cu117
- Datasets 2.14.4
- Tokenizers 0.13.3

# [Open LLM Leaderboard Evaluation Results (English)](https://huggingface.co./spaces/HuggingFaceH4/open_llm_leaderboard)
Detailed results can be found [here](https://huggingface.co./datasets/open-llm-leaderboard/details_BramVanroy__Llama-2-13b-chat-dutch)

| Metric              | Value |
|---------------------|-------|
| Avg.                | 46.91 |
| ARC (25-shot)       | 59.3  |
| HellaSwag (10-shot) | 81.45 |
| MMLU (5-shot)       | 55.82 |
| TruthfulQA (0-shot) | 38.23 |
| Winogrande (5-shot) | 76.64 |
| GSM8K (5-shot)      | 10.69 |
| DROP (3-shot)       | 6.28  |

# Open LLM Leaderboard Evaluation Results (Dutch)

Results can be found [here](https://huggingface.co./spaces/BramVanroy/open_dutch_llm_leaderboard)

| Metric              | Value |
|---------------------|-------|
| Avg.                | 0.43  |
| ARC (25-shot)       | 0.38  |
| HellaSwag (10-shot) | 0.56  |
| MMLU (5-shot)       | 0.35  |
| TruthfulQA (0-shot) | 0.44  |