---
license: gpl-3.0
language:
- nl
base_model:
- CLTL/MedRoBERTa.nl
tags:
- medical
- healthcare
metrics:
- perplexity
library_name: transformers
---

Continued, on-premise pre-training of [MedRoBERTa.nl](https://huggingface.co./CLTL/MedRoBERTa.nl) on de-identified Electronic Health Records (EHRs) from the University Medical Center Utrecht (UMCU), restricted to the cardiology domain.

# Data statistics

Sources:

* Dutch medical guidelines (FMS, NHG)
* [NtvG](https://www.ntvg.nl/) papers
* PMC abstracts translated using Gemini Flash 1.5

Statistics:

* Number of tokens: 1.47B, of which 1B from UMCU EHRs
* Number of documents: 5.8M, of which 3.5M UMCU EHRs
* Average number of tokens per document: 253
* Median number of tokens per document: 124

# Training

* Effective batch size: 240
* Learning rate: 1e-4
* Weight decay: 1e-3
* Learning rate schedule: linear, with 25,000 warmup steps
* Number of epochs: 3

Results:

* Train perplexity: 3.0
* Validation perplexity: 4.0
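The reported perplexities are the exponential of the mean masked-language-modelling cross-entropy loss (in nats per masked token). A minimal sketch of the conversion, with a loss value back-derived from the reported validation perplexity for illustration:

```python
import math

def perplexity(cross_entropy_loss: float) -> float:
    """Perplexity is exp of the mean cross-entropy loss in nats per token."""
    return math.exp(cross_entropy_loss)

# Hypothetical loss back-derived from the reported validation perplexity of 4.0:
val_loss = math.log(4.0)  # ≈ 1.386 nats per masked token
print(round(perplexity(val_loss), 1))  # → 4.0
```

This is the standard definition used by `transformers` evaluation scripts, not a detail specific to this model.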