---
language:
- en
license: apache-2.0
pipeline_tag: fill-mask
---

**_NOTE: `bioformer-cased-v1.0` has been renamed to `bioformer-8L`. All links to `bioformer-cased-v1.0` will automatically redirect to `bioformer-8L`, including git operations. However, to avoid confusion, we recommend updating any existing local clones to point to the new repository URL._**

Bioformer-8L is a lightweight BERT model for biomedical text mining. Bioformer-8L uses a biomedical vocabulary and is pre-trained from scratch only on biomedical domain corpora. Our experiments show that Bioformer-8L is 3x as fast as BERT-base, and achieves comparable or even better performance than BioBERT/PubMedBERT on downstream NLP tasks.

Bioformer-8L has 8 layers (transformer blocks), a hidden embedding size of 512, and 8 self-attention heads, for a total of 42,820,610 parameters.
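
These hyperparameters can also be read directly from the published configuration; a minimal sketch:

```
from transformers import AutoConfig

config = AutoConfig.from_pretrained("bioformers/bioformer-8L")
print(config.num_hidden_layers, config.hidden_size, config.num_attention_heads)  # 8 512 8
```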

**The usage of Bioformer-8L is the same as that of a standard BERT model. The documentation of BERT can be found [here](https://huggingface.co./docs/transformers/model_doc/bert).**
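
For example, the checkpoint can be loaded with the generic `transformers` Auto classes like any other BERT model; a minimal sketch (the input sentence is only illustrative):

```
from transformers import AutoTokenizer, AutoModel

# Load Bioformer-8L exactly as one would load any BERT checkpoint on the Hub
tokenizer = AutoTokenizer.from_pretrained("bioformers/bioformer-8L")
model = AutoModel.from_pretrained("bioformers/bioformer-8L")

inputs = tokenizer("Aspirin inhibits platelet aggregation.", return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (1, sequence_length, 512)
```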

## Vocabulary of Bioformer-8L
Bioformer-8L uses a cased WordPiece vocabulary trained on a biomedical corpus comprising all PubMed abstracts (33 million, as of Feb 1, 2021) and 1 million PMC full-text articles. PMC contains 3.6 million articles, but we down-sampled them to 1 million so that the total sizes of the PubMed abstracts and the PMC full-text articles are approximately equal. To mitigate the out-of-vocabulary issue and to cover special symbols that appear in biomedical literature (e.g., the male and female signs), we trained Bioformer’s vocabulary on the Unicode text of these two resources. The vocabulary size of Bioformer-8L is 32,768 (2^15), similar to that of the original BERT.
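
The vocabulary ships with the model, so it can be inspected through the tokenizer; a small sketch (the example term is only illustrative):

```
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bioformers/bioformer-8L")
print(tokenizer.vocab_size)  # 32768

# Biomedical terms are typically split into fewer WordPiece units than with a general-domain vocabulary
print(tokenizer.tokenize("thrombocytopenia"))
```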

## Pre-training of Bioformer-8L
Bioformer-8L was pre-trained from scratch on the same corpus used to build the vocabulary (33 million PubMed abstracts + 1 million PMC full-text articles). For the masked language modeling (MLM) objective, we used whole-word masking with a masking rate of 15%. There is ongoing debate about whether the next sentence prediction (NSP) objective improves performance on downstream tasks; we included it in our pre-training in case end users need next-sentence prediction. Sentence segmentation of all training text was performed using [SciSpacy](https://allenai.github.io/scispacy/).
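
Bioformer-8L was pre-trained with its own pipeline, but the MLM objective described above can be illustrated with the Hugging Face whole-word-masking data collator (an illustrative sketch only, not the original training code; the example sentence is made up):

```
from transformers import AutoTokenizer, DataCollatorForWholeWordMask

tokenizer = AutoTokenizer.from_pretrained("bioformers/bioformer-8L")

# Whole-word masking with a 15% masking rate, as in the setup described above
collator = DataCollatorForWholeWordMask(tokenizer=tokenizer, mlm_probability=0.15)

example = tokenizer("Insulin regulates glucose uptake in muscle and adipose tissue.")
batch = collator([example])
print(tokenizer.decode(batch["input_ids"][0]))  # some whole words are replaced by [MASK]
```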

Pre-training of Bioformer-8L was performed on a single Cloud TPU device (TPUv2, 8 cores, 8 GB memory per core). The maximum input sequence length was fixed at 512, and the batch size was set to 256. We pre-trained Bioformer-8L for 2 million steps, which took about 8.3 days.

## Usage

Prerequisites: Python 3, PyTorch, and the `transformers` and `datasets` libraries.

We have tested the following commands on Python v3.9.16, PyTorch v1.13.1+cu117, Datasets v2.9.0 and Transformers v4.26.

To install pytorch, please refer to instructions [here](https://pytorch.org/get-started/locally).

To install the `transformers` and `datasets` libraries:
```
pip install transformers
pip install datasets
```

### Filling mask

```
from transformers import pipeline
unmasker8L = pipeline('fill-mask', model='bioformers/bioformer-8L')
unmasker8L("[MASK] refers to a group of diseases that affect how the body uses blood sugar (glucose)")

unmasker16L = pipeline('fill-mask', model='bioformers/bioformer-16L')
unmasker16L("[MASK] refers to a group of diseases that affect how the body uses blood sugar (glucose)")

```

Output of `bioformer-8L`:

```
[{'score': 0.3207533359527588, 
'token': 13473, 
'token_str': 'Diabetes', 
'sequence': 'Diabetes refers to a group of diseases that affect how the body uses blood sugar ( glucose )'}, 

{'score': 0.19234347343444824, 
'token': 17740, 
'token_str': 'Obesity', 
'sequence': 'Obesity refers to a group of diseases that affect how the body uses blood sugar ( glucose )'}, 

{'score': 0.09200277179479599, 
'token': 10778, 
'token_str': 'T2DM', 
'sequence': 'T2DM refers to a group of diseases that affect how the body uses blood sugar ( glucose )'}, 

{'score': 0.08494312316179276, 
'token': 2228, 
'token_str': 'It', 
'sequence': 'It refers to a group of diseases that affect how the body uses blood sugar ( glucose )'}, 

{'score': 0.0412776917219162, 
'token': 22263, 
'token_str': 'Hypertension', 
'sequence': 'Hypertension refers to a group of diseases that affect how the body uses blood sugar ( glucose )'}]
```

Output of `bioformer-16L`:

```
[{'score': 0.7262957692146301,
'token': 13473,
'token_str': 'Diabetes',
'sequence': 'Diabetes refers to a group of diseases that affect how the body uses blood sugar ( glucose )'},

{'score': 0.124954953789711,
'token': 10778,
'token_str': 'T2DM',
'sequence': 'T2DM refers to a group of diseases that affect how the body uses blood sugar ( glucose )'},

{'score': 0.04062706232070923,
'token': 2228,
'token_str': 'It',
'sequence': 'It refers to a group of diseases that affect how the body uses blood sugar ( glucose )'}, 

{'score': 0.022694870829582214,
'token': 17740,
'token_str': 'Obesity',
'sequence': 'Obesity refers to a group of diseases that affect how the body uses blood sugar ( glucose )'},

{'score': 0.009743048809468746,
'token': 13960,
'token_str': 'T2D',
'sequence': 'T2D refers to a group of diseases that affect how the body uses blood sugar ( glucose )'}]
```

## Awards
Bioformer-8L achieved the top performance (highest micro-F1 score) in the BioCreative VII COVID-19 multi-label topic classification challenge (https://doi.org/10.1093/database/baac069).

## Links

[Bioformer-16L](https://huggingface.co./bioformers/bioformer-16L)

## Acknowledgment

Training and evaluation of Bioformer-8L were supported by the Google TPU Research Cloud (TRC) program, the Intramural Research Program of the National Library of Medicine (NLM), National Institutes of Health (NIH), and NIH/NLM grants LM012895 and 1K99LM014024-01.

## Questions
If you have any questions, please submit an issue here: https://github.com/WGLab/bioformer/issues

You can also send an email to Li Fang ([email protected], https://fangli80.github.io/).


## Citation

You can cite our preprint on arXiv:

Fang L, Chen Q, Wei C-H, Lu Z, Wang K: Bioformer: an efficient transformer language model for biomedical text mining. arXiv preprint arXiv:2302.01588 (2023). DOI: https://doi.org/10.48550/arXiv.2302.01588


BibTeX format:
```
@ARTICLE{fangli2023bioformer,
    author = {{Fang}, Li and {Chen}, Qingyu and {Wei}, Chih-Hsuan and {Lu}, Zhiyong and {Wang}, Kai},
    title = "{Bioformer: an efficient transformer language model for biomedical text mining}",
    journal = {arXiv preprint arXiv:2302.01588},
    year = {2023}
}
```