Continued, off-premises pre-training of MedRoBERTa.nl using about 50 GB of open Dutch and translated English corpora.
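For reference, a minimal sketch of loading the resulting encoder for masked-token prediction with the Hugging Face transformers library; the Dutch example sentence is purely illustrative and not taken from the training data.

```python
from transformers import pipeline

# Load the continued-pretrained encoder as a masked-language model.
fill = pipeline("fill-mask", model="UMCU/CardioLM_encoder_base")

# Illustrative Dutch clinical sentence; <mask> is the RoBERTa mask token.
for candidate in fill("De patiënt werd opgenomen met acute <mask>."):
    print(candidate["token_str"], round(candidate["score"], 3))
```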

Data statistics

Sources:

  • Dutch: medical guidelines (FMS, NHG)
  • Dutch: NtvG papers
  • English: PubMed abstracts
  • English: PMC abstracts translated using DeepL
  • English: Apollo guidelines, papers and books
  • English: Meditron guidelines
  • English: MIMIC-III
  • English: MIMIC-CXR
  • English: MIMIC-IV

All remaining English sources (those not translated with DeepL) were translated with a combination of Gemini Flash 1.5 / GPT-4o mini, MarianNMT, and NLLB-200.
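As an illustration of how such English-to-Dutch machine translation can be set up, here is a minimal sketch using NLLB-200 via the transformers translation pipeline; the distilled checkpoint, language codes, and example sentence are assumptions, not necessarily the exact setup used for this corpus.

```python
from transformers import pipeline

# NLLB-200 distilled checkpoint (an assumption; other NLLB-200 variants work the same way).
translator = pipeline(
    "translation",
    model="facebook/nllb-200-distilled-600M",
    src_lang="eng_Latn",   # source: English
    tgt_lang="nld_Latn",   # target: Dutch
)

# Illustrative sentence, e.g. from a PubMed abstract.
result = translator("The patient presented with acute chest pain and dyspnea.", max_length=256)
print(result[0]["translation_text"])
```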

  • Number of tokens: 15B
  • Number of documents: 27M

Training

  • Effective batch size: 5120
  • Learning rate: 2e-4
  • Weight decay: 1e-3
  • Learning schedule: linear, with 5_000 warmup steps
  • Num epochs: ~3
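A minimal sketch of a matching continued-pretraining configuration with the Hugging Face Trainer API; the per-device batch size, accumulation steps, and device count are assumptions chosen only to reproduce the effective batch size of 5120, and the tokenized corpus itself is omitted.

```python
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    TrainingArguments,
)

# Start from the base model and continue masked-LM pre-training.
tokenizer = AutoTokenizer.from_pretrained("CLTL/MedRoBERTa.nl")
model = AutoModelForMaskedLM.from_pretrained("CLTL/MedRoBERTa.nl")

# Standard RoBERTa-style dynamic masking.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

# Hyperparameters as listed above; 64 per device x 8 devices x 10 accumulation
# steps = 5120 effective batch size (the split across devices is an assumption).
args = TrainingArguments(
    output_dir="cardiolm_encoder_base",
    per_device_train_batch_size=64,
    gradient_accumulation_steps=10,
    learning_rate=2e-4,
    weight_decay=1e-3,
    lr_scheduler_type="linear",
    warmup_steps=5_000,
    num_train_epochs=3,
)
```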

Train perplexity: 3.0
Validation perplexity: 3.0
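These perplexities are the exponentiated mean masked-LM cross-entropy loss, i.e. perplexity = exp(loss), so a perplexity of 3.0 corresponds to a loss of about ln(3.0) ≈ 1.10. A one-line check (the loss value is hypothetical):

```python
import math

eval_loss = 1.0986  # hypothetical mean masked-LM cross-entropy loss (≈ ln 3)
print(round(math.exp(eval_loss), 2))  # 3.0
```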

Acknowledgement

We were happy to be able to use the Google TPU Research Cloud for training the model.

Model size: 166M parameters (F32, safetensors)
Base model: CLTL/MedRoBERTa.nl