---
license: gpl-3.0
language:
- nl
base_model:
- CLTL/MedRoBERTa.nl
tags:
- medical
- healthcare
metrics:
- perplexity
library_name: transformers
---

Continued, on-premise pre-training of [MedRoBERTa.nl](https://huggingface.co./CLTL/MedRoBERTa.nl) on de-identified Electronic Health Records (EHRs) from the University Medical Center Utrecht (UMCU), focused on the cardiology domain.


# Data statistics

Sources:
* UMCU EHRs (cardiology-related)
* Dutch medical guidelines (FMS, NHG)
* [NtvG](https://www.ntvg.nl/) papers
* PMC abstracts translated using GeminiFlash 1.5

Statistics:
* Number of tokens: 1.47B, of which 1B from UMCU EHRs
* Number of documents: 5.8M, of which 3.5M UMCU EHRs
* Average number of tokens per document: 253
* Median number of tokens per document: 124
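The per-document average follows directly from the corpus totals; a quick sanity check:

```python
# Sanity check: average tokens per document from the totals above.
total_tokens = 1.47e9   # 1.47B tokens
total_docs = 5.8e6      # 5.8M documents

avg_tokens = total_tokens / total_docs
print(round(avg_tokens))  # → 253
```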

# Training 

* Effective batch size: 240
* Learning rate: 1e-4
* Weight decay: 1e-3
* Learning schedule: linear, with 25_000 warmup steps
* Num epochs: 3
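Collected as a plain configuration dict for reference (the key names and the factorization of the effective batch size are illustrative assumptions; only the values come from the list above):

```python
# Hyperparameters from the list above. The split of the effective batch
# size into per-device batch and gradient-accumulation steps is an
# assumption for illustration, not the actual training setup.
training_config = {
    "effective_batch_size": 240,   # e.g. per_device_batch=30 * grad_accum=8
    "learning_rate": 1e-4,
    "weight_decay": 1e-3,
    "lr_scheduler_type": "linear",
    "warmup_steps": 25_000,
    "num_train_epochs": 3,
}

# One possible factorization of the effective batch size.
assert 30 * 8 == training_config["effective_batch_size"]
```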

Results:
* Train perplexity: 3.0
* Validation perplexity: 4.0
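Perplexity is the exponential of the mean masked-LM cross-entropy loss, so the reported values map back to per-token losses (in nats) as follows:

```python
import math

# Perplexity = exp(mean cross-entropy loss), so the reported
# perplexities correspond to these per-token losses in nats.
train_loss = math.log(3.0)  # ≈ 1.099
val_loss = math.log(4.0)    # ≈ 1.386

# Round-trip check: exp(loss) recovers the reported perplexity.
assert math.isclose(math.exp(train_loss), 3.0)
print(f"train loss ≈ {train_loss:.3f}, validation loss ≈ {val_loss:.3f}")
```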