---
language:
- bg
- cs
- da
- de
- el
- en
- es
- et
- fi
- fr
- ga
- hr
- hu
- it
- lt
- lv
- mt
- nl
- pl
- pt
- ro
- sk
- sl
- sv
- uk
- multilingual
license: mit
---
# EuroGPT2

**NOTE: THIS IS THE ORIGINAL MEGATRON-DEEPSPEED CHECKPOINT, INCLUDING OPTIMIZER STATES.**

A GPT2 language model for European languages (EU-24 + Ukrainian).

The model follows the same architecture as [OpenAI's GPT2](https://huggingface.co./gpt2/), apart from using [rotary](https://arxiv.org/abs/2104.09864) instead of learned positional embeddings.
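The rotary scheme can be sketched in a few lines of NumPy. This is a minimal illustration of the idea (rotating query/key feature pairs by position-dependent angles, here in the common half-split layout), not the checkpoint's actual Megatron-DeepSpeed implementation:

```python
import numpy as np

def rope(x: np.ndarray, base: float = 10000.0) -> np.ndarray:
    """Apply rotary position embeddings to x of shape (seq_len, dim)."""
    seq_len, dim = x.shape
    half = dim // 2
    # Per-pair rotation frequencies, as in the RoFormer paper.
    freqs = base ** (-np.arange(half) / half)              # (half,)
    angles = np.arange(seq_len)[:, None] * freqs[None, :]  # (seq_len, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]
    return np.concatenate([x1 * cos - x2 * sin,
                           x1 * sin + x2 * cos], axis=-1)

# Because each position applies a pure rotation, query/key dot products
# depend only on the *relative* distance between positions:
qv = np.random.default_rng(0).standard_normal(16)
kv = np.random.default_rng(1).standard_normal(16)
Q = rope(np.tile(qv, (8, 1)))  # same query content at every position
K = rope(np.tile(kv, (8, 1)))
assert np.isclose(Q[2] @ K[5], Q[3] @ K[6])  # both pairs are 3 positions apart
```

Unlike learned positional embeddings, this adds no parameters and imposes no hard position limit at training time.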
## Model settings

- parameters: 124M
- number of layers: 12
- hidden size: 768
- number of heads: 12
- sequence length: 1024
- batch size: 168
- test PPL after training: 23.6 (steps: 436,940)
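The 124M figure is consistent with a GPT2-style parameter count derived from these settings. The vocabulary size is not stated in this card, so the sketch below assumes the original GPT2 vocabulary of 50,257 tokens purely for illustration:

```python
# Rough GPT2-style parameter count from the settings above.
# vocab = 50,257 is an ASSUMPTION (original GPT2); the card does not state it.
n_layers, hidden, vocab = 12, 768, 50257

per_layer = (
    4 * hidden**2 + 4 * hidden    # attention: QKV + output projection
    + 8 * hidden**2 + 5 * hidden  # MLP: up (h -> 4h) and down (4h -> h)
    + 4 * hidden                  # two LayerNorms (weight + bias each)
)
embedding = vocab * hidden        # token embeddings (tied output head)
final_ln = 2 * hidden
# Rotary embeddings contribute no learned positional parameters.
total = n_layers * per_layer + embedding + final_ln
print(f"{total / 1e6:.1f}M")  # ~123.7M under these assumptions
```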
## Training data

- [Wikimedia dumps](https://dumps.wikimedia.org/) (Wikipedia, Wikinews, Wikibooks, Wikisource, Wikivoyage; 20230301)
- [EUR-Lex](https://huggingface.co./datasets/joelito/eurlex_resources)
- [OSCAR 2023.01](https://huggingface.co./datasets/oscar-corpus/OSCAR-2301)
- Tokens: 75,167,662,080
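The token count lines up exactly with the training settings reported above, since tokens seen equals optimizer steps times global batch size times sequence length:

```python
# Tokens seen = steps x global batch size x sequence length,
# using the step count, batch size, and sequence length from this card.
steps, batch, seq_len = 436_940, 168, 1024
tokens = steps * batch * seq_len
assert tokens == 75_167_662_080  # matches the reported token count
```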
## Languages

Included languages: Bulgarian, Czech, Danish, German, Greek, English, Spanish, Estonian, Finnish, French, Irish, Croatian, Hungarian, Italian, Lithuanian, Latvian, Maltese, Dutch, Polish, Portuguese, Romanian, Slovak, Slovenian, Swedish, and Ukrainian.
| Language | Ratio  |
| -------- | ------ |
| bg       | 5.92%  |
| cs       | 4.77%  |
| da       | 2.19%  |
| de       | 7.36%  |
| el       | 8.60%  |
| en       | 10.11% |
| es       | 6.57%  |
| et       | 1.67%  |
| fi       | 2.70%  |
| fr       | 7.18%  |
| ga       | 0.25%  |
| hr       | 1.09%  |
| hu       | 6.38%  |
| it       | 5.80%  |
| lt       | 2.01%  |
| lv       | 1.76%  |
| mt       | 1.49%  |
| nl       | 5.20%  |
| pl       | 4.82%  |
| pt       | 4.64%  |
| ro       | 2.93%  |
| sk       | 2.03%  |
| sl       | 1.54%  |
| sv       | 3.00%  |
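As a sanity check, the shares listed in the table (note that `uk` has no row of its own) already sum to approximately 100%, up to rounding of the individual entries:

```python
# Language shares from the table above, in row order (bg .. sv).
ratios = [5.92, 4.77, 2.19, 7.36, 8.60, 10.11, 6.57, 1.67, 2.70, 7.18,
          0.25, 1.09, 6.38, 5.80, 2.01, 1.76, 1.49, 5.20, 4.82, 4.64,
          2.93, 2.03, 1.54, 3.00]
total = sum(ratios)
assert abs(total - 100.0) < 0.05  # ~100.01 due to per-row rounding
```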
## License

MIT