BramVanroy committed on
Commit 5c90410 · verified · 1 Parent(s): 6f8ff6c

Update README.md

Files changed (1)
  1. README.md +58 -2
README.md CHANGED
@@ -48,7 +48,7 @@ Here is a break down of the training set (some data pages might not be available
 
  - [BramVanroy/ultrachat_200k_dutch](https://huggingface.co/datasets/BramVanroy/ultrachat_200k_dutch) (gpt-4-turbo; multi-turn; generated): 85.42%
  - [BramVanroy/no_robots_dutch](https://huggingface.co/datasets/BramVanroy/no_robots_dutch) (gpt-4-turbo; prompt translate, answer generated): 2.20%
- - [BramVanroy/stackoverflow-chat-dutch](https://huggingface.co/datasets/BramVanroy/stackoverflow-chat-dutch) (gpt-3.5-turbo; multi-turn; code; translated): 8.38%
+ - [BramVanroy/stackoverflow-chat-dutch](https://huggingface.co/datasets/BramVanroy/stackoverflow-chat-dutch) (gpt-3.5-turbo; multi-turn; code; translated; only 50% used): 8.38%
  - [BramVanroy/alpaca-cleaned-dutch](https://huggingface.co/datasets/BramVanroy/alpaca-cleaned-dutch) (gpt-3.5-turbo; translated): 2.62%
  - [BramVanroy/dolly-15k-dutch](https://huggingface.co/datasets/BramVanroy/dolly-15k-dutch) (gpt-3.5-turbo; translated): 1.39%
 
@@ -58,7 +58,63 @@ Here is a break down of the training set (some data pages might not be available
 
  The great [alignment handbook](https://github.com/huggingface/alignment-handbook/) was used for training, with a custom slurm script for compatibility with our cluster. It was trained in full, without LoRA or other adapters.
 
-
+ The model was trained in bfloat16 with flash attention 2 and a context length of 8192.
+
+ Recipe used with the handbook:
+
+ ```
+ # Model arguments
+ model_name_or_path: Rijgersberg/GEITje-7B
+ model_revision: main
+ torch_dtype: bfloat16
+ use_flash_attention_2: true
+
+ # Data training arguments
+ # Zephyr chat template
+ chat_template: "{% for message in messages %}\n{% if message['role'] == 'user' %}\n{{ '<|user|>\n' + message['content'] + eos_token }}\n{% elif message['role'] == 'system' %}\n{{ '<|system|>\n' + message['content'] + eos_token }}\n{% elif message['role'] == 'assistant' %}\n{{ '<|assistant|>\n' + message['content'] + eos_token }}\n{% endif %}\n{% if loop.last and add_generation_prompt %}\n{{ '<|assistant|>' }}\n{% endif %}\n{% endfor %}"
+ dataset_mixer:
+   BramVanroy/ultrachat_200k_dutch: 1.0
+   BramVanroy/stackoverflow-chat-dutch: 0.5
+   BramVanroy/alpaca-cleaned-dutch: 1.0
+   BramVanroy/dolly-15k-dutch: 1.0
+   BramVanroy/no_robots_dutch: 1.0
+ dataset_splits:
+ - train_sft
+ - test_sft
+ preprocessing_num_workers: 8
+
+ # SFT trainer config
+ bf16: true
+ do_eval: true
+ evaluation_strategy: epoch
+ gradient_accumulation_steps: 1
+ gradient_checkpointing: true
+ gradient_checkpointing_kwargs:
+   use_reentrant: False
+ hub_model_id: GEITje-ultra-sft
+ hub_strategy: every_save
+ learning_rate: 2.0e-05
+ log_level: info
+ logging_steps: 5
+ logging_strategy: steps
+ lr_scheduler_type: cosine
+ max_seq_length: 8192
+ max_steps: -1
+ num_train_epochs: 1
+ output_dir: data/GEITje-ultra-sft
+ overwrite_output_dir: true
+ per_device_eval_batch_size: 8
+ per_device_train_batch_size: 16
+ push_to_hub: true
+ remove_unused_columns: true
+ report_to:
+ - wandb
+ save_strategy: "steps"
+ save_steps: 100
+ save_total_limit: 1
+ seed: 42
+ warmup_ratio: 0.1
+ ```
 
  The model was trained on two nodes of four A100 80GB each for around 2.5 hours. I thank the [Flemish Super Computer](https://www.vscentrum.be/compute) for their compute.
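
The updated README notes that the model was trained in bfloat16 with flash attention 2 and a context length of 8192, which the recipe mirrors via `torch_dtype: bfloat16` and `use_flash_attention_2: true`. Below is a minimal loading sketch along those lines; the repo id `BramVanroy/GEITje-ultra-sft` is only inferred from `hub_model_id` in the recipe, and `attn_implementation="flash_attention_2"` assumes a recent transformers release with the flash-attn package installed.

```python
# Loading sketch only; the repo id is an assumption based on `hub_model_id`.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "BramVanroy/GEITje-ultra-sft"  # assumed final Hub location

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,               # matches `torch_dtype: bfloat16` / `bf16: true`
    attn_implementation="flash_attention_2",  # matches `use_flash_attention_2: true`
    device_map="auto",                        # requires the accelerate package
)
```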
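
The `chat_template` in the recipe is the Zephyr template, wrapping turns in `<|system|>`, `<|user|>` and `<|assistant|>` blocks terminated by the EOS token. Here is a small sketch of formatting a conversation with `tokenizer.apply_chat_template`, again assuming the repo id above and that the SFT tokenizer ships with this template:

```python
# Prompt-formatting sketch for the Zephyr-style template set via `chat_template`.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("BramVanroy/GEITje-ultra-sft")  # assumed id

messages = [
    {"role": "system", "content": "Je bent een behulpzame assistent."},
    {"role": "user", "content": "Wat is de hoofdstad van België?"},
]

# add_generation_prompt=True appends the trailing '<|assistant|>' block,
# mirroring the `add_generation_prompt` branch of the template.
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(prompt)
```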
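
In the recipe, `dataset_mixer` lists the SFT datasets with the fraction of each that is mixed into training; the `0.5` for `BramVanroy/stackoverflow-chat-dutch` is what the new "(only 50% used)" note in the first hunk refers to. A rough illustration of that subsampling with the datasets library follows; the handbook implements its own mixer, and the `train_sft` split name is taken from `dataset_splits`.

```python
# Illustration of the 0.5 mixing fraction; the alignment handbook has its own
# mixer logic, so this only shows what "50% used" amounts to.
from datasets import load_dataset

ds = load_dataset("BramVanroy/stackoverflow-chat-dutch", split="train_sft")
half = ds.shuffle(seed=42).select(range(int(0.5 * len(ds))))  # keep half the examples
print(len(ds), len(half))
```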
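
Finally, combining the hardware note (two nodes of four A100 80GB, i.e. eight GPUs) with `per_device_train_batch_size: 16` and `gradient_accumulation_steps: 1` implies a global batch of 128 sequences, each up to `max_seq_length: 8192` tokens, per optimizer step; a quick check:

```python
# Back-of-the-envelope global batch size from the recipe and the stated hardware.
nodes = 2
gpus_per_node = 4
per_device_train_batch_size = 16
gradient_accumulation_steps = 1

global_batch_size = (
    nodes * gpus_per_node * per_device_train_batch_size * gradient_accumulation_steps
)
print(global_batch_size)  # 128 sequences per update
```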