Instruct Training Dataset Languages
It is known that the original Gemma was trained on, among other things, languages other than English. What about the Instruct version? Is it trained only on English instructions, or is there also some portion of other languages?
I doubt 7B is enough to be good at many languages, but I'd like to know as well.
https://storage.googleapis.com/deepmind-media/gemma/gemma-report.pdf
From page 4, "Instruction tuning":
We finetune Gemma 2B and 7B with supervised fine-tuning (SFT) on a mix of text-only, English-only synthetic and human-generated prompt-response pairs and reinforcement learning from human feedback (RLHF) with the reward model trained on labelled English-only preference data and the policy based on a set of high-quality prompts.
@andreaKIM is correct! As the report says, the finetuned models use English-only datasets.
I see it was closed, so I thought I might as well give final thoughts as feedback. In future releases it would be great if you provided versions for other languages, or alternatively added more multilingual data. Although, again, I don't think such small models are really suitable for many languages, so the latter would require Google being comfortable with releasing more powerful models as open source, and even the bigger open models struggle with this, if to a lesser extent. So separate per-language versions would be more realistic.
that makes a ton of sense, appreciate the feedback :)