BramVanroy committed (verified)
Commit 44137f8 · 1 Parent(s): fa1a280

Update README.md

Files changed (1):
  1. README.md (+15 -2)
README.md CHANGED
@@ -40,11 +40,24 @@ Because the model was trained on synthetic data created with OpenAI/Azure servic
## Training and evaluation data

- More information needed
+ Training data consists of older datasets that were translated to Dutch with OpenAI's gpt-3.5-turbo (alpaca, dolly, stackoverflow) and newer ones that were generated with gpt-4-turbo via Azure (no_robots, ultrachat). In the case of no_robots, the original English prompt (and, optionally, the system message) was translated, and new answers were then generated with gpt-4-turbo. The UltraChat case is more interesting: multi-turn conversations were generated in one go. Through prompt engineering, we provide the model with the original English first user message and ask it to create a conversation between a user and an assistant in a single response. Additionally, and in my opinion most excitingly, I created multiple personas that were randomly selected from. The user messages in the dataset are written "as if" they were created by one of these personas, in the hope that the model learns to respond well to different types of users. Personas include language learners, a direct conversationalist, someone who loves details, someone who is critical, a child, an expert in the field, a joyful chaotic mind, a generalist, and "an average user". This is described in more detail [in the dataset card](https://huggingface.co/datasets/BramVanroy/ultrachat_200k_dutch).
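For illustration only (the actual generation pipeline is documented in the linked dataset card), here is a minimal sketch of what persona-conditioned, single-response conversation generation with gpt-4-turbo on Azure could look like. The persona descriptions, prompt wording, deployment name, and credentials are placeholders, not the ones that were actually used:

```python
import random

from openai import AzureOpenAI  # the synthetic data was generated through Azure OpenAI

# Hypothetical persona descriptions; the real set is documented in the dataset card.
PERSONAS = [
    "a language learner who writes simple, occasionally flawed Dutch",
    "a critical user who questions and double-checks every answer",
    "a curious child who asks short questions",
    "a domain expert who expects detailed, technical answers",
]

client = AzureOpenAI(
    api_key="...",  # placeholder credentials and endpoint
    api_version="2024-02-01",
    azure_endpoint="https://<your-resource>.openai.azure.com",
)

def generate_conversation(first_user_message_en: str, deployment: str = "gpt-4-turbo") -> str:
    """Ask the model for a full Dutch user/assistant conversation in a single response,
    seeded with the original English first user turn and a randomly sampled persona."""
    persona = random.choice(PERSONAS)
    prompt = (
        "Write a multi-turn conversation in Dutch between a user and an assistant.\n"
        f"The user writes as {persona}.\n"
        "Start from this first user message (translate and adapt it to Dutch):\n"
        f"{first_user_message_en}"
    )
    response = client.chat.completions.create(
        model=deployment,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```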
+
+ The training set (`train_sft`) consists of 240,527,565 tokens (calculated prior to applying a chat template). The test sets (`test_sft` in the datasets) account for 26,397,086 tokens, which is around 10.97% of the training set.
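As a rough sketch of how such counts can be obtained, the snippet below tokenizes the raw message contents (i.e. before any chat template is applied) for one of the datasets. The tokenizer name is a placeholder and an UltraChat-style `messages` column is assumed; the published totals were computed over all five datasets with the model's own tokenizer:

```python
from datasets import load_dataset
from transformers import AutoTokenizer

# Placeholder tokenizer; the reported counts were computed with this model's own tokenizer.
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")

def count_tokens(dataset_name: str, split: str) -> int:
    """Sum token counts over raw message contents, before any chat template is applied."""
    ds = load_dataset(dataset_name, split=split)
    total = 0
    for sample in ds:
        # Assumes an UltraChat-style `messages` column: a list of {"role", "content"} dicts.
        for message in sample["messages"]:
            total += len(tokenizer(message["content"], add_special_tokens=False).input_ids)
    return total

train_tokens = count_tokens("BramVanroy/ultrachat_200k_dutch", "train_sft")
test_tokens = count_tokens("BramVanroy/ultrachat_200k_dutch", "test_sft")
print(f"train: {train_tokens:,} tokens | test: {test_tokens:,} tokens "
      f"({test_tokens / train_tokens:.2%} of train)")
```

Repeating this for each of the five datasets gives the per-source shares listed below.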
+
+ Here is a breakdown of the training set:
+
+ - BramVanroy/ultrachat_200k_dutch (gpt-4-turbo): 85.42%
+ - BramVanroy/stackoverflow-chat-dutch (code; gpt-3.5-turbo): 8.38%
+ - BramVanroy/alpaca-cleaned-dutch (gpt-3.5-turbo): 2.62%
+ - BramVanroy/dolly-15k-dutch (gpt-3.5-turbo): 1.39%
+ - BramVanroy/no_robots_dutch (gpt-4-turbo): 2.20%
+
## Training procedure

- The great [alignment handbook](https://github.com/huggingface/alignment-handbook/) was used for training, with a custom slurm script for compatibility with our cluster.
+ The great [alignment handbook](https://github.com/huggingface/alignment-handbook/) was used for training, with a custom slurm script for compatibility with our cluster. The model was trained in full, without LoRA or other adapters.
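The handbook's own recipe (its `run_sft.py` script plus a YAML config, launched through slurm) is the authoritative reference here. Purely as a sketch of what a full, adapter-free SFT run looks like, below is a minimal example with TRL's `SFTTrainer`, which the handbook builds on. The base model, dataset choice, and hyperparameters are placeholders, not the values used for this model, and the multi-node launch is omitted:

```python
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

# Placeholder dataset and base model; no peft_config is passed, so this is a full
# fine-tune without LoRA or other adapters.
train_dataset = load_dataset("BramVanroy/ultrachat_200k_dutch", split="train_sft")

training_args = SFTConfig(
    output_dir="outputs/sft-full",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=2e-5,
    num_train_epochs=1,
    bf16=True,
    gradient_checkpointing=True,
)

trainer = SFTTrainer(
    model="mistralai/Mistral-7B-v0.1",  # placeholder base model
    args=training_args,
    train_dataset=train_dataset,
    # Depending on the TRL version, a chat template or formatting function may need
    # to be configured so the `messages` column is rendered into training text.
)
trainer.train()
```

The real run launched the handbook's scripts across two nodes via the custom slurm script mentioned above, which this sketch does not attempt to reproduce.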
The model was trained on two nodes of four A100 80GB each for around 2.5 hours. I thank the [Flemish Super Computer](https://www.vscentrum.be/compute) for their compute.