Continued off-premises pre-training of MedRoBERTa.nl on approximately 50 GB of open Dutch and translated English corpora.
Data statistics
Sources:
- Dutch: medical guidelines (FMS, NHG)
- Dutch: NtvG papers
- English: PubMed abstracts
- English: PMC abstracts translated using DeepL
- English: Apollo guidelines, papers and books
- English: Meditron guidelines
- English: MIMIC-III
- English: MIMIC-CXR
- English: MIMIC-IV
All English sources not translated with DeepL were translated using a combination of Gemini Flash 1.5 / GPT-4o mini, MarianNMT, and NLLB-200 (a sketch of this translation step is shown below).
- Number of tokens: 15B
- Number of documents: 27M
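The exact translation pipeline is not documented here; as a minimal sketch of the open-model path, the snippet below translates a single English sentence to Dutch with an NLLB-200 checkpoint via the transformers translation pipeline. The checkpoint name, batching, and generation settings are assumptions, not the settings used for the corpus.

```python
from transformers import pipeline

# Hypothetical English -> Dutch translation step with NLLB-200;
# the checkpoint size and generation settings used for the corpus are assumptions.
translator = pipeline(
    "translation",
    model="facebook/nllb-200-distilled-600M",
    src_lang="eng_Latn",
    tgt_lang="nld_Latn",
)

abstract = "Atrial fibrillation is the most common sustained cardiac arrhythmia."
print(translator(abstract, max_length=400)[0]["translation_text"])
```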
Training
- Effective batch size: 5120
- Learning rate: 2e-4
- Weight decay: 1e-3
- Learning schedule: linear, with 5,000 warmup steps
- Num epochs: ~3
- Train perplexity: 3.0
- Validation perplexity: 3.0
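For illustration, the sketch below shows how such a continued masked-language-model pre-training run could be set up with the transformers Trainer using the hyperparameters listed above. The per-device batch size / gradient-accumulation split, the masking probability, and the toy dataset are assumptions; the actual TPU training setup is not documented here.

```python
import math
from datasets import Dataset
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

# Continue masked-LM pre-training from the base checkpoint.
tokenizer = AutoTokenizer.from_pretrained("CLTL/MedRoBERTa.nl")
model = AutoModelForMaskedLM.from_pretrained("CLTL/MedRoBERTa.nl")

# Tiny placeholder corpus; the real run used ~15B tokens across ~27M documents.
texts = Dataset.from_dict({"text": [
    "Patiënt heeft atriumfibrilleren.",
    "Echo toont een verminderde ejectiefractie.",
]})
tokenized = texts.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True,
    remove_columns=["text"],
)

# Standard 15% masking; the probability used for this model is an assumption.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

# Effective batch size 5120, here split as 64 per device x 80 accumulation steps
# (the actual device count / accumulation split is an assumption).
args = TrainingArguments(
    output_dir="cardiolm_encoder_base",
    per_device_train_batch_size=64,
    gradient_accumulation_steps=80,
    learning_rate=2e-4,
    weight_decay=1e-3,
    lr_scheduler_type="linear",
    warmup_steps=5_000,
    num_train_epochs=3,
)

trainer = Trainer(
    model=model,
    args=args,
    data_collator=collator,
    train_dataset=tokenized,
    eval_dataset=tokenized,
)
trainer.train()

# Perplexity is the exponential of the mean masked-LM cross-entropy loss.
print("validation perplexity:", math.exp(trainer.evaluate()["eval_loss"]))
```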
Acknowledgement
We gratefully acknowledge the Google TPU Research Cloud for providing the compute used to train this model.
Model tree for UMCU/CardioLM_encoder_base
- Base model: CLTL/MedRoBERTa.nl
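A minimal usage sketch, assuming the encoder is published under the repo id above and loads with the standard transformers auto classes:

```python
from transformers import AutoModel, AutoTokenizer

# Load the continued-pre-trained encoder from the Hugging Face Hub.
tokenizer = AutoTokenizer.from_pretrained("UMCU/CardioLM_encoder_base")
model = AutoModel.from_pretrained("UMCU/CardioLM_encoder_base")

inputs = tokenizer("Patiënt heeft klachten van pijn op de borst.", return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (batch, sequence length, hidden size)
```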