---
license: apache-2.0
---
<h2>GatorTron-Base overview</h2>

Developed jointly by the University of Florida and NVIDIA, GatorTron-Base is a clinical language model with 345 million parameters, pre-trained using a BERT architecture implemented in the Megatron package (https://github.com/NVIDIA/Megatron-LM).

GatorTron-Base is pre-trained using a dataset consisting of:

- 82B words of de-identified clinical notes from the University of Florida Health System,
- 6.1B words from PubMed CC0,
- 2.5B words from WikiText,
- 0.5B words of de-identified clinical notes from MIMIC-III
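
As a quick check on the mixture listed above, the short sketch below sums the four corpus sizes and prints each source's share of the total. The figures are copied from the list; the dictionary and variable names are illustrative only.

```python
# Corpus sizes in billions of words, copied from the list above.
corpus_billion_words = {
    "UF Health clinical notes": 82.0,
    "PubMed CC0": 6.1,
    "WikiText": 2.5,
    "MIMIC-III clinical notes": 0.5,
}

# Total size of the pre-training mixture and each source's fraction of it.
total = sum(corpus_billion_words.values())
shares = {name: size / total for name, size in corpus_billion_words.items()}

print(f"total: {total:.1f}B words")
for name, share in shares.items():
    print(f"{name}: {share:.1%}")
```

The total comes to about 91.1B words, of which roughly 90% are UF Health clinical notes.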

The GitHub repository for GatorTron is at: https://github.com/uf-hobi-informatics-lab/GatorTron

This model was converted to Hugging Face format from: https://catalog.ngc.nvidia.com/orgs/nvidia/teams/clara/models/gatortron_og
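
Because the checkpoint is published in Hugging Face format, it can most likely be loaded with the `transformers` auto classes. The following is a minimal sketch, not an official recipe: it assumes `transformers` and `torch` are installed, that the Hub is reachable, and that the repository id is `UFNLP/gatortron-base` (taken from the gatortron-base link in this model card). The helper names `load_gatortron` and `embed` are hypothetical.

```python
# Repository id taken from the gatortron-base link in this model card.
MODEL_ID = "UFNLP/gatortron-base"


def load_gatortron(model_id: str = MODEL_ID):
    """Download (or read from cache) the tokenizer and encoder for model_id."""
    # Imported lazily so that defining this helper does not require the library.
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModel.from_pretrained(model_id)
    model.eval()  # inference mode: disables dropout
    return tokenizer, model


def embed(text: str, tokenizer, model):
    """Encode one string and return the encoder's last hidden states."""
    import torch

    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():  # no gradients needed for feature extraction
        outputs = model(**inputs)
    return outputs.last_hidden_state  # shape: (1, sequence_length, hidden_size)
```

A typical call would be `tokenizer, model = load_gatortron()` followed by `embed("Bone scan: negative for distant metastasis.", tokenizer, model)`. For downstream tasks such as named-entity recognition, a task-specific head would still need to be fine-tuned on top of the encoder.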

<h2>Model variations</h2>

| Model | Parameters |
| ------- | ------- |
| <a href="https://huggingface.co/UFNLP/gatortron-base">gatortron-base</a> | 345 million |
| gatortron-medium | 3.9 billion |
| gatortron-large | 8.9 billion |


<h2>De-identification</h2>

We applied a de-identification system to remove protected health information (PHI) from clinical text. Following the Safe Harbor method, we identified the 18 PHI categories defined in the Health Insurance Portability and Accountability Act (HIPAA) and replaced them with dummy strings (e.g., replacing people’s names with [\*\*NAME\*\*]).

The de-identification system is described in:

Yang X, Lyu T, Li Q, Lee C-Y, Bian J, Hogan WR, Wu Y†. A study of deep learning methods for de-identification of clinical notes in cross-institute settings. BMC Med Inform Decis Mak. 2019 Dec 5;19(5):232. https://www.ncbi.nlm.nih.gov/pubmed/31801524.
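
To illustrate only the replacement step described in this section, the sketch below substitutes already-detected PHI spans with bracketed placeholders. It is not the de-identification system from the cited paper: the spans are supplied by hand rather than predicted by a model, the `mask_phi` helper is hypothetical, and the `[**DATE**]` placeholder is an assumed extension of the `[**NAME**]` pattern.

```python
from typing import List, Tuple

# (start, end, category) character offsets. A real system would obtain these
# from a trained PHI detection model; here they are written by hand.
Span = Tuple[int, int, str]


def mask_phi(text: str, spans: List[Span]) -> str:
    """Replace each span in text with a [**CATEGORY**] placeholder."""
    pieces = []
    cursor = 0
    for start, end, category in sorted(spans):
        if start < cursor:
            raise ValueError("overlapping spans are not supported")
        pieces.append(text[cursor:start])   # untouched text before the span
        pieces.append(f"[**{category}**]")  # placeholder instead of the PHI
        cursor = end
    pieces.append(text[cursor:])            # remaining text after the last span
    return "".join(pieces)


note = "John Smith was seen on 03/14/2012."
spans = [(0, 10, "NAME"), (23, 33, "DATE")]
masked = mask_phi(note, spans)
print(masked)  # [**NAME**] was seen on [**DATE**].
```

Processing the spans in sorted order and copying the text between them avoids the offset drift that occurs when replacing in place from left to right.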

<h2>Citation info</h2>

Yang X, Chen A, PourNejatian N, Shin HC, Smith KE, Parisien C, Compas C, Martin C, Costa AB, Flores MG, Zhang Y, Magoc T, Harle CA, Lipori G, Mitchell DA, Hogan WR, Shenkman EA, Bian J, Wu Y†. A large language model for electronic health records. npj Digit Med. Nature Publishing Group; 2022 Dec 26;5(1):1–9. https://www.nature.com/articles/s41746-022-00742-2

- BibTeX entry
```
@article{yang2022large,
  title={A large language model for electronic health records},
  author={Yang, Xi and Chen, Aokun and PourNejatian, Nima and Shin, Hoo Chang and Smith, Kaleb E and Parisien, Christopher and Compas, Colin and Martin, Cheryl and Costa, Anthony B and Flores, Mona G and Zhang, Ying and Magoc, Tanja and Harle, Christopher A and Lipori, Gloria and Mitchell, Duane A and Hogan, William R and Shenkman, Elizabeth A and Bian, Jiang and Wu, Yonghui },
  journal={npj Digital Medicine},
  volume={5},
  number={1},
  pages={194},
  year={2022},
  publisher={Nature Publishing Group UK London}
} 
```

<h2>Contact</h2>

- Yonghui Wu: yonghui.wu 'at' ufl.edu
- Cheng Peng: c.peng 'at' ufl.edu