Continued, on-premise pre-training of MedRoBERTa.nl on de-identified, cardiology-related Electronic Health Records (EHRs) from the University Medical Center Utrecht (UMCU).
Data statistics
Sources:
- Dutch medical guidelines (FMS, NHG)
- NtvG papers
- PMC abstracts translated using Gemini 1.5 Flash
Number of tokens: 1.47B, of which 1B from UMCU EHRs
Number of documents: 5.8M, of which 3.5M UMCU EHRs
Average number of tokens per document: 253
Median number of tokens per document: 124
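
The reported average can be checked against the totals. The sketch below recomputes it from the rounded figures above and also derives the EHR share of the corpus. Because the inputs are rounded, the derived values are approximate, and the variable names are illustrative only.

```python
# Rounded corpus figures as reported in the statistics above.
total_tokens = 1.47e9
ehr_tokens = 1.0e9
total_docs = 5.8e6
ehr_docs = 3.5e6

# Mean document length implied by the totals.
avg_tokens_per_doc = total_tokens / total_docs

# Share of the corpus that comes from UMCU EHRs.
ehr_token_share = ehr_tokens / total_tokens
ehr_doc_share = ehr_docs / total_docs

print(f"average tokens per document: {avg_tokens_per_doc:.0f}")
print(f"EHR share of tokens: {ehr_token_share:.0%}")
print(f"EHR share of documents: {ehr_doc_share:.0%}")
```

The mean (253) is roughly twice the median (124), which suggests a right-skewed length distribution: most documents are short, and a minority of long documents pulls the average up.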
Training
- Effective batch size: 240
- Learning rate: 1e-4
- Weight decay: 1e-3
- Learning rate schedule: linear, with 25,000 warmup steps
- Number of epochs: 3
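
To illustrate the schedule listed above, here is a minimal sketch of a linear warmup followed by linear decay, written in plain Python. Two things are assumptions, not statements from this card: that the rate decays to zero at the end of training, and the total number of steps. The default `total_steps=72_500` is a hypothetical estimate (5.8M documents / 240 per batch × 3 epochs, treating each document as one training sample); the real value depends on how documents were chunked and on the train/validation split.

```python
def linear_schedule_lr(step, peak_lr=1e-4, warmup_steps=25_000, total_steps=72_500):
    """Learning rate at `step` for linear warmup, then linear decay to zero.

    peak_lr and warmup_steps follow the values listed above.
    total_steps is an ASSUMPTION (hypothetical estimate, not reported).
    """
    if step < warmup_steps:
        # Warmup phase: ramp linearly from 0 up to peak_lr.
        return peak_lr * step / warmup_steps
    # Decay phase: fall linearly from peak_lr to 0 at total_steps.
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / (total_steps - warmup_steps)


for step in (0, 12_500, 25_000, 50_000, 72_500):
    print(step, linear_schedule_lr(step))
```

Under these assumptions the warmup would cover about a third of training, so the peak learning rate of 1e-4 is reached only at step 25,000.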
Train perplexity: 3.0
Validation perplexity: 4.0
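
For readers who want to relate these perplexities to loss values, the sketch below converts between the two. It assumes the perplexity is the exponential of the mean cross-entropy loss in nats over masked tokens, a common convention for masked language models that this card does not state explicitly. The function names are illustrative.

```python
import math


def loss_from_perplexity(perplexity):
    """Mean cross-entropy (in nats) implied by a perplexity value."""
    return math.log(perplexity)


def perplexity_from_loss(mean_loss):
    """Perplexity implied by a mean cross-entropy loss (in nats)."""
    return math.exp(mean_loss)


train_loss = loss_from_perplexity(3.0)  # reported train perplexity
val_loss = loss_from_perplexity(4.0)    # reported validation perplexity

print(f"implied train loss: {train_loss:.3f}")
print(f"implied validation loss: {val_loss:.3f}")
print(f"gap: {val_loss - train_loss:.3f}")
```

Under that assumption, the gap between train and validation corresponds to a loss difference of ln(4/3), about 0.29 nats.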
Model tree for UMCU/CardioMedRoBERTa.nl
Base model: CLTL/MedRoBERTa.nl