Continued off-premises pre-training of MedRoBERTa.nl on approximately 50 GB of open Dutch and translated English corpora.
Data statistics
Sources:
- Dutch: medical guidelines (FMS, NHG)
- Dutch: NtvG papers
- English: PubMed abstracts
- English: PMC abstracts translated using DeepL
- English: Apollo guidelines, papers and books
- English: Meditron guidelines
- English: MIMIC-III
- English: MIMIC-CXR
- English: MIMIC-IV
All English sources not translated with DeepL were translated using a combination of Gemini Flash 1.5 / GPT-4o mini, MarianNMT, and NLLB-200 (a sketch of this translation step is shown below).
- Number of tokens: 15B
- Number of documents: 27M
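The exact translation pipeline is not documented here; as a minimal sketch of the open-model path, the snippet below translates a single English sentence to Dutch with an NLLB-200 checkpoint via the transformers translation pipeline. The checkpoint name, batching, and generation settings are assumptions, not the settings used for the corpus.

```python
from transformers import pipeline

# Hypothetical English -> Dutch translation step with NLLB-200;
# the checkpoint size and generation settings used for the corpus are assumptions.
translator = pipeline(
    "translation",
    model="facebook/nllb-200-distilled-600M",
    src_lang="eng_Latn",
    tgt_lang="nld_Latn",
)

abstract = "Atrial fibrillation is the most common sustained cardiac arrhythmia."
print(translator(abstract, max_length=400)[0]["translation_text"])
```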
Training
- Effective batch size: 5120
- Learning rate: 2e-4
- Weight decay: 1e-3
- Learning schedule: linear, with 5,000 warmup steps
- Num epochs: ~3
- Train perplexity: 3.0
- Validation perplexity: 3.0
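For illustration, the sketch below shows how such a continued masked-language-model pre-training run could be set up with the transformers Trainer using the hyperparameters listed above. The per-device batch size / gradient-accumulation split, the masking probability, and the toy dataset are assumptions; the actual TPU training setup is not documented here.

```python
import math
from datasets import Dataset
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

# Continue masked-LM pre-training from the base checkpoint.
tokenizer = AutoTokenizer.from_pretrained("CLTL/MedRoBERTa.nl")
model = AutoModelForMaskedLM.from_pretrained("CLTL/MedRoBERTa.nl")

# Tiny placeholder corpus; the real run used ~15B tokens across ~27M documents.
texts = Dataset.from_dict({"text": [
    "Patiënt heeft atriumfibrilleren.",
    "Echo toont een verminderde ejectiefractie.",
]})
tokenized = texts.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True,
    remove_columns=["text"],
)

# Standard 15% masking; the probability used for this model is an assumption.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

# Effective batch size 5120, here split as 64 per device x 80 accumulation steps
# (the actual device count / accumulation split is an assumption).
args = TrainingArguments(
    output_dir="cardiolm_encoder_base",
    per_device_train_batch_size=64,
    gradient_accumulation_steps=80,
    learning_rate=2e-4,
    weight_decay=1e-3,
    lr_scheduler_type="linear",
    warmup_steps=5_000,
    num_train_epochs=3,
)

trainer = Trainer(
    model=model,
    args=args,
    data_collator=collator,
    train_dataset=tokenized,
    eval_dataset=tokenized,
)
trainer.train()

# Perplexity is the exponential of the mean masked-LM cross-entropy loss.
print("validation perplexity:", math.exp(trainer.evaluate()["eval_loss"]))
```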
Acknowledgement
We gratefully acknowledge the Google TPU Research Cloud for providing the compute used to train this model.
Model tree for UMCU/CardioLM_encoder_base
- Base model: CLTL/MedRoBERTa.nl
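A minimal usage sketch, assuming the encoder is published under the repo id above and loads with the standard transformers auto classes:

```python
from transformers import AutoModel, AutoTokenizer

# Load the continued-pre-trained encoder from the Hugging Face Hub.
tokenizer = AutoTokenizer.from_pretrained("UMCU/CardioLM_encoder_base")
model = AutoModel.from_pretrained("UMCU/CardioLM_encoder_base")

inputs = tokenizer("Patiënt heeft klachten van pijn op de borst.", return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (batch, sequence length, hidden size)
```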