---
base_model:
- google-bert/bert-base-uncased
datasets:
- gayanin/pubmed-gastro-maskfilling
language:
- en
library_name: transformers
license: apache-2.0
pipeline_tag: fill-mask
tags:
- math
---

![medBERT-logo](medBERT.png)

# **medBERT-base**

This repository contains **medBERT-base**, a BERT-based model fine-tuned on the *gayanin/pubmed-gastro-maskfilling* dataset for **Masked Language Modeling (MLM)**. The model is trained to predict masked tokens in medical and gastroenterological texts, with the goal of improving its understanding and generation of medical information in natural-language contexts.

## **Model Architecture**
- **Base Model**: `bert-base-uncased`
- **Task**: Masked Language Modeling (MLM) for medical texts
- **Tokenizer**: BERT's WordPiece tokenizer

## **Usage**

### **Loading the Pre-trained Model**

You can load the pre-trained **medBERT-base** model using the Hugging Face `transformers` library:

```py
from transformers import BertTokenizer, BertForMaskedLM
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

tokenizer = BertTokenizer.from_pretrained('suayptalha/medBERT-base')
model = BertForMaskedLM.from_pretrained('suayptalha/medBERT-base').to(device)

input_text = "Response to neoadjuvant chemotherapy best predicts survival [MASK] curative resection of gastric cancer."
inputs = tokenizer(input_text, return_tensors='pt').to(device)

with torch.no_grad():
    outputs = model(**inputs)

# Locate the [MASK] token and take the 5 highest-scoring replacements.
masked_index = (inputs['input_ids'][0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0].item()

top_k = 5
logits = outputs.logits[0, masked_index]
top_k_ids = torch.topk(logits, k=top_k).indices.tolist()
top_k_tokens = tokenizer.convert_ids_to_tokens(top_k_ids)

print("Top 5 predictions:")
for i, token in enumerate(top_k_tokens):
    print(f"{i + 1}: {token}")
```

Example output:

```
Top 5 predictions:
1: from
2: of
3: after
4: by
5: through
```

### **Fine-tuning the Model**

To fine-tune the **medBERT-base** model on your own medical dataset, follow these steps:

1. Prepare your dataset (e.g., medical texts or gastroenterology-related information) in text format.
2. Tokenize the dataset and apply masking.
3. Train the model using the provided training loop.

Here's the training code:

https://github.com/suayptalha/medBERT-base/blob/main/medBERT-base.ipynb
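
As a quick orientation, here is a minimal sketch of steps 1 and 2 (dataset preparation and masking) using the Hugging Face `datasets` library and a masking data collator. This is not the notebook's exact code, and the `text` column name is an assumption; adjust it to match your data.

```py
from datasets import load_dataset
from transformers import BertTokenizerFast, DataCollatorForLanguageModeling

tokenizer = BertTokenizerFast.from_pretrained("suayptalha/medBERT-base")

# Step 1: load a medical text dataset (here, the dataset this model was trained on).
dataset = load_dataset("gayanin/pubmed-gastro-maskfilling", split="train")

# Step 2: tokenize; the "text" column name is an assumption, adjust to your data.
def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset.map(tokenize, batched=True, remove_columns=dataset.column_names)

# Dynamic masking: 15% of tokens are masked on the fly when batches are built.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)
```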

## **Training Details**

### **Hyperparameters**
- **Batch Size**: 16
- **Learning Rate**: 5e-5
- **Number of Epochs**: 1
- **Max Sequence Length**: 512 tokens
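
As a rough sketch (not the exact notebook code), these hyperparameters map onto a `Trainer` setup like the one below, continuing from the tokenized dataset and data collator in the preparation sketch above; the output directory name is illustrative.

```py
from transformers import BertForMaskedLM, Trainer, TrainingArguments

# Start from the bert-base-uncased checkpoint listed under Model Architecture.
model = BertForMaskedLM.from_pretrained("google-bert/bert-base-uncased")

args = TrainingArguments(
    output_dir="medBERT-base-finetuned",  # illustrative path
    per_device_train_batch_size=16,
    learning_rate=5e-5,
    num_train_epochs=1,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized,  # from the preparation sketch above
    data_collator=collator,
)
trainer.train()
```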

### **Dataset**
- **Dataset Name**: *gayanin/pubmed-gastro-maskfilling*
- **Task**: Masked Language Modeling (MLM) on medical texts

## **Acknowledgements**

- The *gayanin/pubmed-gastro-maskfilling* dataset is available on the Hugging Face dataset hub and provides a rich collection of medical and gastroenterology-related information for training.
- This model uses the Hugging Face `transformers` library, a state-of-the-art library for NLP models.

<h3 align="left">Support:</h3>
<p><a href="https://www.buymeacoffee.com/suayptalha"> <img align="left" src="https://cdn.buymeacoffee.com/buttons/v2/default-yellow.png" height="50" width="210" alt="suayptalha" /></a></p><br><br>