CardioMedRoBERTa.nl / README.md
metadata
license: gpl-3.0
language:
  - nl
base_model:
  - CLTL/MedRoBERTa.nl
tags:
  - medical
  - healthcare
metrics:
  - perplexity
library_name: transformers

CardioMedRoBERTa.nl is the result of continued, on-premise pre-training of MedRoBERTa.nl on de-identified, cardiology-related electronic health records (EHRs) from the University Medical Center Utrecht (UMCU).
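Since this is a RoBERTa-style masked language model served through `transformers`, a quick way to try it is the fill-mask pipeline. The snippet below is a minimal sketch, not an official usage recipe: the repository id `UMCU/CardioMedRoBERTa.nl` is an assumption inferred from the model name and the publishing organisation, the Dutch example sentence is made up, and it assumes the weights are accessible from your environment.

```python
# Assumption: the repository id below is inferred, not confirmed by the model card.
MODEL_ID = "UMCU/CardioMedRoBERTa.nl"


def load_fill_mask(model_id: str = MODEL_ID):
    """Build a fill-mask pipeline (downloads the weights on first use)."""
    from transformers import pipeline  # imported lazily so the helpers stay importable

    return pipeline("fill-mask", model=model_id)


def top_predictions(fill_mask, text: str, k: int = 5):
    """Return the k most likely fillers for the masked position as (token, score) pairs."""
    results = fill_mask(text, top_k=k)
    return [(r["token_str"].strip(), r["score"]) for r in results]


def demo():
    """Hypothetical example; the sentence is illustrative only."""
    fill_mask = load_fill_mask()
    mask = fill_mask.tokenizer.mask_token  # use the tokenizer's own mask token
    sentence = f"De patiënt werd opgenomen met een acuut {mask}."
    for token, score in top_predictions(fill_mask, sentence):
        print(f"{token}\t{score:.3f}")
```

Calling `demo()` prints the five highest-scoring candidates for the masked word. Reading the mask token from the tokenizer avoids hard-coding it.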

Data statistics

Sources:

  • De-identified UMCU EHRs (cardiology)

  • Dutch medical guidelines (FMS, NHG)

  • NtvG papers

  • PMC abstracts translated using Gemini 1.5 Flash

Corpus size:

  • Number of tokens: 1.47B, of which 1B from UMCU EHRs

  • Number of documents: 5.8M, of which 3.5M from UMCU EHRs

  • Average number of tokens per document: 253

  • Median number of tokens per document: 124
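The reported totals can be cross-checked with a few lines of arithmetic. The sketch below uses only the rounded figures listed above, so the derived values are approximate. The per-source averages are derived here and are not reported by the authors.

```python
# Rounded figures taken from the statistics above.
total_tokens = 1_470_000_000
ehr_tokens = 1_000_000_000
total_docs = 5_800_000
ehr_docs = 3_500_000

# Overall mean: should agree with the reported average of 253 tokens per document.
mean_tokens_per_doc = total_tokens / total_docs

# Share of the corpus that comes from UMCU EHRs.
ehr_token_share = ehr_tokens / total_tokens
ehr_doc_share = ehr_docs / total_docs

# Derived (not reported) per-source means.
ehr_mean = ehr_tokens / ehr_docs
other_mean = (total_tokens - ehr_tokens) / (total_docs - ehr_docs)

print(f"mean tokens/doc:        {mean_tokens_per_doc:.1f}")
print(f"EHR share of tokens:    {ehr_token_share:.1%}")
print(f"EHR share of documents: {ehr_doc_share:.1%}")
print(f"mean tokens/doc, EHR:   {ehr_mean:.1f}")
print(f"mean tokens/doc, other: {other_mean:.1f}")
```

The mean (253) is roughly twice the median (124), which indicates a right-skewed length distribution: most documents are short, and a minority of long documents pulls the average up.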

Training

  • Effective batch size: 240
  • Learning rate: 1e-4
  • Weight decay: 1e-3
  • Learning rate schedule: linear, with 25,000 warmup steps
  • Number of epochs: 3
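To make the schedule concrete, the following sketch computes the learning rate at a given optimizer step for a linear warmup followed by a linear decay to zero, which is the shape `transformers` produces with `get_linear_schedule_with_warmup`. The peak rate and warmup length come from the list above. The total number of steps is not reported, so `TOTAL_STEPS` is a rough, hypothetical estimate that assumes one document per training sequence and the full corpus in the training split.

```python
import math

PEAK_LR = 1e-4          # from the hyperparameters above
WARMUP_STEPS = 25_000   # from the hyperparameters above
EFFECTIVE_BATCH = 240   # from the hyperparameters above
NUM_EPOCHS = 3          # from the hyperparameters above

# Assumption: 5.8M training sequences (one per document, no held-out split).
ASSUMED_TRAIN_SEQUENCES = 5_800_000
TOTAL_STEPS = math.ceil(ASSUMED_TRAIN_SEQUENCES / EFFECTIVE_BATCH) * NUM_EPOCHS


def linear_lr(step: int,
              peak_lr: float = PEAK_LR,
              warmup_steps: int = WARMUP_STEPS,
              total_steps: int = TOTAL_STEPS) -> float:
    """Learning rate at `step`: linear ramp to peak_lr, then linear decay to 0."""
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / max(1, total_steps - warmup_steps)


for s in (0, 12_500, 25_000, 50_000, TOTAL_STEPS):
    print(f"step {s:>6}: lr = {linear_lr(s):.2e}")
```

Under these assumptions the run has about 72,500 steps, so the warmup would cover roughly a third of training. The actual proportion depends on how documents were chunked and split.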

Train perplexity: 3.0

Validation perplexity: 4.0
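Perplexity is conventionally the exponential of the mean cross-entropy loss (in nats) over the predicted tokens. For a masked language model that means the masked positions. The sketch below converts the reported values back to losses under that assumption. The model card does not state how the perplexities were computed, so treat this as an interpretation aid.

```python
import math


def perplexity(mean_loss: float) -> float:
    """Perplexity from mean cross-entropy loss in nats."""
    return math.exp(mean_loss)


def loss_from_perplexity(ppl: float) -> float:
    """Inverse mapping: mean cross-entropy loss in nats from perplexity."""
    return math.log(ppl)


train_loss = loss_from_perplexity(3.0)   # about 1.10 nats
val_loss = loss_from_perplexity(4.0)     # about 1.39 nats
gap = val_loss - train_loss              # ln(4/3), about 0.29 nats

print(f"train loss: {train_loss:.3f}")
print(f"val loss:   {val_loss:.3f}")
print(f"gap:        {gap:.3f}")
```

A validation perplexity of 4.0 can be read as the model being, on average, as uncertain as a uniform choice among four candidate tokens at each masked position.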