BramVanroy committed (verified)
Commit 44137f8 · 1 Parent(s): fa1a280

Update README.md

Files changed (1):
  1. README.md (+15 -2)
README.md CHANGED
@@ -40,11 +40,24 @@ Because the model was trained on synthetic data created with OpenAI/Azure servic
## Training and evaluation data

- More information needed
+ Training data consists of older datasets that were translated to Dutch with OpenAI's gpt-3.5-turbo (alpaca, dolly, stackoverflow) and newer ones that were generated with gpt-4-turbo via Azure (no_robots, ultrachat). In the case of no_robots, the original English prompt (and, optionally, the system message) was translated, and new answers were then generated with gpt-4-turbo. The UltraChat case is more interesting: multi-turn conversations were generated in one go. Through prompt engineering, we provide the model with the original English first user message and ask it to create a conversation between a user and an assistant in a single response. Additionally, and in my opinion most excitingly, I created multiple personas that were randomly selected from. The user messages in the dataset are written "as if" they were created by one of these personas, in the hope that the model learns to respond well to different types of users. Personas include language learners, a direct conversationalist, someone who loves details, someone who is critical, a child, an expert in the field, a joyful chaotic mind, a generalist, and "an average user". This is described in more detail [in the dataset card](https://huggingface.co/datasets/BramVanroy/ultrachat_200k_dutch).
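For illustration only (the actual generation pipeline is documented in the linked dataset card), here is a minimal sketch of what persona-conditioned, single-response conversation generation with gpt-4-turbo on Azure could look like. The persona descriptions, prompt wording, deployment name, and credentials are placeholders, not the ones that were actually used:

```python
import random

from openai import AzureOpenAI  # the synthetic data was generated through Azure OpenAI

# Hypothetical persona descriptions; the real set is documented in the dataset card.
PERSONAS = [
    "a language learner who writes simple, occasionally flawed Dutch",
    "a critical user who questions and double-checks every answer",
    "a curious child who asks short questions",
    "a domain expert who expects detailed, technical answers",
]

client = AzureOpenAI(
    api_key="...",  # placeholder credentials and endpoint
    api_version="2024-02-01",
    azure_endpoint="https://<your-resource>.openai.azure.com",
)

def generate_conversation(first_user_message_en: str, deployment: str = "gpt-4-turbo") -> str:
    """Ask the model for a full Dutch user/assistant conversation in a single response,
    seeded with the original English first user turn and a randomly sampled persona."""
    persona = random.choice(PERSONAS)
    prompt = (
        "Write a multi-turn conversation in Dutch between a user and an assistant.\n"
        f"The user writes as {persona}.\n"
        "Start from this first user message (translate and adapt it to Dutch):\n"
        f"{first_user_message_en}"
    )
    response = client.chat.completions.create(
        model=deployment,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```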
+
+ The training set (`train_sft`) consists of 240,527,565 tokens (calculated prior to applying a chat template). The test sets (`test_sft` in the datasets) account for 26,397,086 tokens, which is around 10.97% of the training set.
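As a rough sketch of how such counts can be obtained, the snippet below tokenizes the raw message contents (i.e. before any chat template is applied) for one of the datasets. The tokenizer name is a placeholder and an UltraChat-style `messages` column is assumed; the published totals were computed over all five datasets with the model's own tokenizer:

```python
from datasets import load_dataset
from transformers import AutoTokenizer

# Placeholder tokenizer; the reported counts were computed with this model's own tokenizer.
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")

def count_tokens(dataset_name: str, split: str) -> int:
    """Sum token counts over raw message contents, before any chat template is applied."""
    ds = load_dataset(dataset_name, split=split)
    total = 0
    for sample in ds:
        # Assumes an UltraChat-style `messages` column: a list of {"role", "content"} dicts.
        for message in sample["messages"]:
            total += len(tokenizer(message["content"], add_special_tokens=False).input_ids)
    return total

train_tokens = count_tokens("BramVanroy/ultrachat_200k_dutch", "train_sft")
test_tokens = count_tokens("BramVanroy/ultrachat_200k_dutch", "test_sft")
print(f"train: {train_tokens:,} tokens | test: {test_tokens:,} tokens "
      f"({test_tokens / train_tokens:.2%} of train)")
```

Repeating this for each of the five datasets gives the per-source shares listed below.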
+
+ Here is a breakdown of the training set:
+
+ - BramVanroy/ultrachat_200k_dutch (gpt-4-turbo): 85.42%
+ - BramVanroy/stackoverflow-chat-dutch (code; gpt-3.5-turbo): 8.38%
+ - BramVanroy/alpaca-cleaned-dutch (gpt-3.5-turbo): 2.62%
+ - BramVanroy/dolly-15k-dutch (gpt-3.5-turbo): 1.39%
+ - BramVanroy/no_robots_dutch (gpt-4-turbo): 2.20%
+
## Training procedure

- The great [alignment handbook](https://github.com/huggingface/alignment-handbook/) was used for training, with a custom slurm script for compatibility with our cluster.
+ The great [alignment handbook](https://github.com/huggingface/alignment-handbook/) was used for training, with a custom slurm script for compatibility with our cluster. The model was trained in full, without LoRA or other adapters.
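The handbook's own recipe (its `run_sft.py` script plus a YAML config, launched through slurm) is the authoritative reference here. Purely as a sketch of what a full, adapter-free SFT run looks like, below is a minimal example with TRL's `SFTTrainer`, which the handbook builds on. The base model, dataset choice, and hyperparameters are placeholders, not the values used for this model, and the multi-node launch is omitted:

```python
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

# Placeholder dataset and base model; no peft_config is passed, so this is a full
# fine-tune without LoRA or other adapters.
train_dataset = load_dataset("BramVanroy/ultrachat_200k_dutch", split="train_sft")

training_args = SFTConfig(
    output_dir="outputs/sft-full",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=2e-5,
    num_train_epochs=1,
    bf16=True,
    gradient_checkpointing=True,
)

trainer = SFTTrainer(
    model="mistralai/Mistral-7B-v0.1",  # placeholder base model
    args=training_args,
    train_dataset=train_dataset,
    # Depending on the TRL version, a chat template or formatting function may need
    # to be configured so the `messages` column is rendered into training text.
)
trainer.train()
```

The real run launched the handbook's scripts across two nodes via the custom slurm script mentioned above, which this sketch does not attempt to reproduce.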
The model was trained on two nodes of four A100 80GB each for around 2.5 hours. I thank the [Flemish Super Computer](https://www.vscentrum.be/compute) for their compute.