BramVanroy
committed on
Update README.md
README.md CHANGED
@@ -48,7 +48,7 @@ Here is a breakdown of the training set (some data pages might not be available
 - [BramVanroy/ultrachat_200k_dutch](https://huggingface.co/datasets/BramVanroy/ultrachat_200k_dutch) (gpt-4-turbo; multi-turn; generated): 85.42%
 - [BramVanroy/no_robots_dutch](https://huggingface.co/datasets/BramVanroy/no_robots_dutch) (gpt-4-turbo; prompt translated, answer generated): 2.20%
-- [BramVanroy/stackoverflow-chat-dutch](https://huggingface.co/datasets/BramVanroy/stackoverflow-chat-dutch) (gpt-3.5-turbo; multi-turn; code; translated): 8.38%
+- [BramVanroy/stackoverflow-chat-dutch](https://huggingface.co/datasets/BramVanroy/stackoverflow-chat-dutch) (gpt-3.5-turbo; multi-turn; code; translated; only 50% used): 8.38%
 - [BramVanroy/alpaca-cleaned-dutch](https://huggingface.co/datasets/BramVanroy/alpaca-cleaned-dutch) (gpt-3.5-turbo; translated): 2.62%
 - [BramVanroy/dolly-15k-dutch](https://huggingface.co/datasets/BramVanroy/dolly-15k-dutch) (gpt-3.5-turbo; translated): 1.39%
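The "only 50% used" note added above corresponds to the `dataset_mixer: 0.5` entry in the recipe further down in this diff. As a minimal sketch of what that subsampling amounts to, the snippet below loads two of the datasets with the `datasets` library and keeps half of the StackOverflow conversations; the `train_sft` split name is an assumption taken from the handbook-style recipe, not something stated in this list.

```python
from datasets import load_dataset

# Sketch only: approximate the mix described above.
# The "train_sft" split name is assumed from the alignment-handbook recipe below.
ultrachat = load_dataset("BramVanroy/ultrachat_200k_dutch", split="train_sft")
stackoverflow = load_dataset("BramVanroy/stackoverflow-chat-dutch", split="train_sft")

# dataset_mixer: 0.5 -> keep roughly half of the StackOverflow conversations.
half_stackoverflow = stackoverflow.shuffle(seed=42).select(range(len(stackoverflow) // 2))

print(len(ultrachat), len(half_stackoverflow))
```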
@@ -58,7 +58,63 @@ Here is a breakdown of the training set (some data pages might not be available
 
 The great [alignment handbook](https://github.com/huggingface/alignment-handbook/) was used for training, with a custom Slurm script for compatibility with our cluster. It was trained in full, without LoRA or other adapters.
 
-
+The model was trained in bfloat16 with flash attention 2 and a context length of 8192.
+
+Recipe used with the handbook:
+
+```
+# Model arguments
+model_name_or_path: Rijgersberg/GEITje-7B
+model_revision: main
+torch_dtype: bfloat16
+use_flash_attention_2: true
+
+# Data training arguments
+# Zephyr chat template
+chat_template: "{% for message in messages %}\n{% if message['role'] == 'user' %}\n{{ '<|user|>\n' + message['content'] + eos_token }}\n{% elif message['role'] == 'system' %}\n{{ '<|system|>\n' + message['content'] + eos_token }}\n{% elif message['role'] == 'assistant' %}\n{{ '<|assistant|>\n' + message['content'] + eos_token }}\n{% endif %}\n{% if loop.last and add_generation_prompt %}\n{{ '<|assistant|>' }}\n{% endif %}\n{% endfor %}"
+dataset_mixer:
+  BramVanroy/ultrachat_200k_dutch: 1.0
+  BramVanroy/stackoverflow-chat-dutch: 0.5
+  BramVanroy/alpaca-cleaned-dutch: 1.0
+  BramVanroy/dolly-15k-dutch: 1.0
+  BramVanroy/no_robots_dutch: 1.0
+dataset_splits:
+- train_sft
+- test_sft
+preprocessing_num_workers: 8
+
+# SFT trainer config
+bf16: true
+do_eval: true
+evaluation_strategy: epoch
+gradient_accumulation_steps: 1
+gradient_checkpointing: true
+gradient_checkpointing_kwargs:
+  use_reentrant: False
+hub_model_id: GEITje-ultra-sft
+hub_strategy: every_save
+learning_rate: 2.0e-05
+log_level: info
+logging_steps: 5
+logging_strategy: steps
+lr_scheduler_type: cosine
+max_seq_length: 8192
+max_steps: -1
+num_train_epochs: 1
+output_dir: data/GEITje-ultra-sft
+overwrite_output_dir: true
+per_device_eval_batch_size: 8
+per_device_train_batch_size: 16
+push_to_hub: true
+remove_unused_columns: true
+report_to:
+- wandb
+save_strategy: "steps"
+save_steps: 100
+save_total_limit: 1
+seed: 42
+warmup_ratio: 0.1
+```
 
 The model was trained on two nodes of four A100 80GB GPUs each for around 2.5 hours. I thank the [Flemish Super Computer](https://www.vscentrum.be/compute) for their compute.
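For scale: with two nodes of four GPUs, `per_device_train_batch_size: 16` and `gradient_accumulation_steps: 1`, the recipe works out to an effective batch of roughly 128 sequences per optimizer step.

Since the recipe trains in bfloat16 with flash attention 2 and bakes in a Zephyr-style chat template, inference can mirror those settings. The sketch below is an illustration rather than part of the original card: the repo id `BramVanroy/GEITje-7B-ultra-sft` is an assumption based on `hub_model_id: GEITje-ultra-sft`, and `attn_implementation="flash_attention_2"` is the newer `transformers` spelling of the recipe's `use_flash_attention_2: true` flag.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed repo id; the recipe itself only specifies hub_model_id: GEITje-ultra-sft.
model_id = "BramVanroy/GEITje-7B-ultra-sft"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,               # matches the bf16 training setup
    attn_implementation="flash_attention_2",  # requires the flash-attn package
    device_map="auto",
)

# If the Zephyr-style template from the recipe is saved with the tokenizer,
# apply_chat_template produces the <|user|> ... <|assistant|> formatting.
messages = [{"role": "user", "content": "Wat is de hoofdstad van België?"}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

with torch.inference_mode():
    output = model.generate(input_ids, max_new_tokens=256, do_sample=False)
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```

Assuming `</s>` is the eos token, a single user turn renders roughly as `<|user|>\n{prompt}</s>\n<|assistant|>\n` before generation, which is the Zephyr convention the recipe's template follows.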