---
library_name: peft
base_model: cognitivecomputations/dolphin-2.1-mistral-7b
license: apache-2.0
language:
- en
---
# PathoIE-Dolphin-7B
<img src="https://cdn-uploads.huggingface.co/production/uploads/646704281dd5854d4de2cdda/Th9An9TEKYS9G9WiIL1ks.webp" width="500" />
## Training
Training code is available in our GitHub repository: https://github.com/HIRC-SNUBH/Curation_LLM_PathoReport.git

Framework versions:
- PEFT 0.4.0
## Inference
``` python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel

# Load the base model
base_model = AutoModelForCausalLM.from_pretrained(
    'cognitivecomputations/dolphin-2.1-mistral-7b',
    trust_remote_code=True,
    device_map="auto",
    torch_dtype=torch.bfloat16,  # Optional: lower the precision if you have insufficient VRAM.
)

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained('cognitivecomputations/dolphin-2.1-mistral-7b')

# Load the PEFT adapter and merge its weights into the base model
model = PeftModel.from_pretrained(base_model, 'Lowenzahn/PathoIE-Dolphin-7B')
model = model.merge_and_unload()
model = model.eval()

# Inference with greedy decoding (sampling parameters are ignored when do_sample=False)
prompts = ["Machine learning is"]
inputs = tokenizer(prompts, return_tensors="pt").to(model.device)  # keep inputs on the model's device
gen_kwargs = {"max_new_tokens": 1024, "do_sample": False, "repetition_penalty": 1.0}
with torch.no_grad():
    output = model.generate(**inputs, **gen_kwargs)
output = tokenizer.decode(output[0].tolist(), skip_special_tokens=True)
print(output)
```
## Prompt example
The pathology report used below is a fictitious example.
```
<|im_start|> system
You are a pathologist who specialized in lung cancer.
Your task is extracting informations requested by the user from the lung cancer pathology report and formatting extracted informations into JSON.
The information to be extracted is clearly specified in the report, so one must avoid from inferring information that is not present.
Remember, you MUST answer in JSON only. Avoid any additional explanations.
<|im_end|>
<|im_start|> user
Extract the following informations (value-set) from the report I provide.
If the required information to extract each value in the value-set is not present in the pathology report, consider it as 'not submitted'.
<value-set>
- MORPHOLOGY_DIAGNOSIS
- SUBTYPE_DOMINANT
- MAX_SIZE_OF_TUMOR(invasive component only)
- MAX_SIZE_OF_TUMOR(including CIS=AIS)
- INVASION_TO_VISCERAL_PLEURAL
- MAIN_BRONCHUS
- INVASION_TO_CHEST_WALL
- INVASION_TO_PARIETAL_PLEURA
- INVASION_TO_PERICARDIUM
- INVASION_TO_PHRENIC_NERVE
- TUMOR_SIZE_CNT
- LUNG_TO_LUNG_METASTASIS
- INTRAPULMONARY_METASTASIS
- SATELLITE_TUMOR_LOCATION
- SEPARATE_TUMOR_LOCATION
- INVASION_TO_MEDIASTINUM
- INVASION_TO_DIAPHRAGM
- INVASION_TO_HEART
- INVASION_TO_RECURRENT_LARYNGEAL_NERVE
- INVASION_TO_TRACHEA
- INVASION_TO_ESOPHAGUS
- INVASION_TO_SPINE
- METASTATIC_RIGHT_UPPER_LOBE
- METASTATIC_RIGHT_MIDDLE_LOBE
- METASTATIC_RIGHT_LOWER_LOBE
- METASTATIC_LEFT_UPPER_LOBE
- METASTATIC_LEFT_LOWER_LOBE
- INVASION_TO_AORTA
- INVASION_TO_SVC
- INVASION_TO_IVC
- INVASION_TO_PULMONARY_ARTERY
- INVASION_TO_PULMONARY_VEIN
- INVASION_TO_CARINA
- PRIMARY_CANCER_LOCATION_RIGHT_UPPER_LOBE
- PRIMARY_CANCER_LOCATION_RIGHT_MIDDLE_LOBE
- PRIMARY_CANCER_LOCATION_RIGHT_LOWER_LOBE
- PRIMARY_CANCER_LOCATION_LEFT_UPPER_LOBE
- PRIMARY_CANCER_LOCATION_LEFT_LOWER_LOBE
- RELATED_TO_ATELECTASIS_OR_OBSTRUCTIVE_PNEUMONITIS
- PRIMARY_SITE_LATERALITY
- LYMPH_METASTASIS_SITES
- NUMER_OF_LYMPH_NODE_META_CASES
---
<report>
[A] Lung, left lower lobe, lobectomy
1. ADENOSQUAMOUS CARCINOMA [by 2015 WHO classification]
- other subtype: acinar (50%), lepidic (30%), solid (20%)
1) Pre-operative / Previous treatment: not done
2) Histologic grade: moderately differentiated
3) Size of tumor:
a. Invasive component only: 3.5 x 2.5 x 1.3 cm, 2.4 x 2.3 x 1.1 cm
b. Including CIS component: 3.9 x 2.6 x 1.3 cm, 3.8 x 3.1 x 1.2 cm
4) Extent of invasion
a. Invasion to visceral pleura: PRESENT (P2)
b. Invasion to superior vena cava: present
5) Main bronchus: not submitted
6) Necrosis: absent
7) Resection margin: free from carcinoma (safey margin: 1.1 cm)
8) Lymph node: metastasis in 2 out of 10 regional lymph nodes
(peribronchial lymph node: 1/3, LN#5,6 :0/1, LN#7:0/3, LN#12: 1/2)
<|im_end|>
<|im_start|> pathologist
```
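The prompt layout above can be assembled programmatically. The following is a minimal sketch, not the authors' code: `build_prompt` and `parse_answer` are hypothetical helper names, and the exact whitespace (including the trailing newline after the `pathologist` role line) is an assumption that should be checked against the training code in the GitHub repository. It also assumes that the model's completion contains a single JSON object, as the system prompt requests.

```python
import json


def build_prompt(system: str, instruction: str, fields: list[str], report: str) -> str:
    """Assemble a prompt in the layout of the example above (hypothetical helper)."""
    value_set = "\n".join(f"- {name}" for name in fields)
    return (
        f"<|im_start|> system\n{system}\n<|im_end|>\n"
        f"<|im_start|> user\n{instruction}\n"
        f"<value-set>\n{value_set}\n---\n"
        f"<report>\n{report}\n<|im_end|>\n"
        # Assumption: the role line ends with a newline before generation starts.
        f"<|im_start|> pathologist\n"
    )


def parse_answer(completion: str) -> dict:
    """Parse the first JSON object found in the generated completion.

    `completion` is assumed to be only the newly generated text, without the prompt.
    """
    start = completion.find("{")
    if start == -1:
        raise ValueError("no JSON object found in the completion")
    # raw_decode stops at the end of the first valid JSON value and ignores trailing text.
    answer, _ = json.JSONDecoder().raw_decode(completion[start:])
    return answer
```

The string returned by `build_prompt` can be passed to the tokenizer in place of the placeholder prompt in the inference snippet. Because `generate` returns the prompt tokens followed by the new tokens, slice off the prompt tokens before decoding and calling `parse_answer`.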
## Developed by
- **_ezCaretech AI Team_**
- **_Office of eHealth Research and Business, [SNUBH](https://www.snubh.org/dh/en/)_**
## Citation
```
@article{cho2024ie,
title={Extracting lung cancer staging descriptors from pathology reports: a generative language model approach},
author={Hyeongmin Cho and Sooyoung Yoo and Borham Kim and Sowon Jang and Leonard Sunwoo and Sanghwan Kim and Donghyoung Lee and Seok Kim and Sejin Nam and Jin-Haeng Chung},
journal={Journal of Biomedical Informatics},
volume={157},
year={2024},
publisher={Elsevier},
issn={1532-0464},
doi={10.1016/j.jbi.2024.104720},
url={https://doi.org/10.1016/j.jbi.2024.104720}
}
```