---
license: gpl-3.0
language:
- nl
base_model:
- CLTL/MedRoBERTa.nl
tags:
- medical
- healthcare
metrics:
- perplexity
library_name: transformers
---
|
|
|
Continued, on-premise pre-training of [MedRoBERTa.nl](https://huggingface.co/CLTL/MedRoBERTa.nl) on de-identified Electronic Health Records (EHRs) from the cardiology domain of the University Medical Center Utrecht (UMCU).
|
|
|
|
|
# Data statistics
|
|
|
Sources:

* De-identified UMCU EHRs (cardiology)
* Dutch medical guidelines (FMS, NHG)
* [NtvG](https://www.ntvg.nl/) papers
* PMC abstracts translated with Gemini 1.5 Flash
|
|
|
* Number of tokens: 1.47B, of which 1B from UMCU EHRs
* Number of documents: 5.8M, of which 3.5M UMCU EHRs
* Average number of tokens per document: 253
* Median number of tokens per document: 124
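As a quick sanity check, the average tokens-per-document figure follows directly from the corpus totals above (a minimal sketch; the totals are rounded, so the result is approximate):

```python
# Corpus totals from the statistics above (rounded).
total_tokens = 1.47e9     # 1.47B tokens
total_documents = 5.8e6   # 5.8M documents

# Average tokens per document, consistent with the reported 253.
avg_tokens = total_tokens / total_documents
print(round(avg_tokens))  # → 253
```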
|
|
|
# Training
|
|
|
* Effective batch size: 240
* Learning rate: 1e-4
* Weight decay: 1e-3
* Learning-rate schedule: linear, with 25,000 warmup steps
* Number of epochs: 3
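The linear schedule with warmup can be sketched in plain Python (a minimal sketch, assuming the rate decays linearly to zero after warmup; `total_steps` is an illustrative value, not a figure from this card):

```python
def linear_schedule_lr(step, base_lr=1e-4, warmup_steps=25_000, total_steps=200_000):
    """Linear warmup to base_lr, then linear decay to zero."""
    if step < warmup_steps:
        # Warmup: rate rises linearly from 0 to base_lr.
        return base_lr * step / warmup_steps
    # Decay: rate falls linearly from base_lr to 0 at total_steps.
    remaining = max(0, total_steps - step)
    return base_lr * remaining / (total_steps - warmup_steps)

print(linear_schedule_lr(12_500))  # halfway through warmup → 5e-05
print(linear_schedule_lr(25_000))  # warmup complete → 0.0001
```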
|
|
|
Train perplexity: 3.0

Validation perplexity: 4.0
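For reference, perplexity is the exponential of the mean masked-language-modeling cross-entropy loss, so the figures above correspond to losses of roughly ln(3) ≈ 1.10 (train) and ln(4) ≈ 1.39 (validation). A minimal sketch of the conversion:

```python
import math

def perplexity(mean_cross_entropy_loss):
    """Convert a mean cross-entropy loss (in nats) to perplexity."""
    return math.exp(mean_cross_entropy_loss)

print(round(perplexity(1.0986), 2))  # train loss ~ln(3) → perplexity ~3.0
print(round(perplexity(1.3863), 2))  # validation loss ~ln(4) → perplexity ~4.0
```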
|
|
|
|