---
license: cc-by-nc-sa-4.0
language:
- ko
tags:
- medical
datasets:
- sean0042/KorMedMCQA
- ChuGyouk/AI_healthcare_QA_samples_Sonnet3.5
- ChuGyouk/HFH4_ultrachat_200k_ko
---
# Ko-Med-Gemma-2-9b-it
> ⚠️ The output of this model should not be considered professional medical advice, diagnosis, or treatment. For accurate diagnosis and treatment of any specific medical issue, please consult a qualified physician or healthcare professional. Commercial use of this model is also prohibited.
## Training
We performed continued fine-tuning with rtzr/ko-gemma-2-9b-it as the base model. We trained for 1.5 epochs on our dataset; training took about 65 hours on a single NVIDIA A100-80G GPU.
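The exact training script and hyperparameters are not published. As a minimal sketch, assuming TRL's `SFTTrainer`, the continued fine-tuning could look like the following (only the base model and epoch count come from this card; everything else is illustrative):

```python
# Hypothetical sketch: only the base model and num_train_epochs=1.5 are stated
# on this card; all other hyperparameters are illustrative assumptions.
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

# One of the tagged datasets, standing in for the full (non-public) mixture.
train_dataset = load_dataset("ChuGyouk/AI_healthcare_QA_samples_Sonnet3.5", split="train")

config = SFTConfig(
    output_dir="ko-med-gemma-2-9b-it-base",
    num_train_epochs=1.5,            # stated above
    per_device_train_batch_size=1,   # assumed: single A100-80G GPU
    gradient_accumulation_steps=16,  # assumed
    learning_rate=1e-5,              # assumed
    bf16=True,
)

trainer = SFTTrainer(
    model="rtzr/ko-gemma-2-9b-it",   # base model stated above
    args=config,
    train_dataset=train_dataset,
)
trainer.train()
```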
## Dataset
Note that the datasets listed in the tags above are not the whole of our training data.

Our dataset consists of 449,500 examples in total, including English/Korean medical data and English/Korean general-domain data. It contains both single-turn and multi-turn examples, covering doctor-patient conversations as well as medical exam QA with reasoning. Some of the Korean medical data was translated from English medical data, while the rest was not.
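As a rough illustration (not our exact pipeline), such a mixture can be assembled with the `datasets` library; the `messages` column name below is an assumption about the chat format:

```python
from datasets import concatenate_datasets, load_dataset

# Two of the tagged sources; the remaining (non-public) medical and
# general-domain sets would be appended the same way. We assume every
# source is normalized to a shared "messages" chat column first.
medical = load_dataset("ChuGyouk/AI_healthcare_QA_samples_Sonnet3.5", split="train")
general = load_dataset("ChuGyouk/HFH4_ultrachat_200k_ko", split="train")

mixture = concatenate_datasets(
    [medical.select_columns(["messages"]), general.select_columns(["messages"])]
).shuffle(seed=42)
print(len(mixture))
```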
## Evaluation

### KorMedMCQA

[KorMedMCQA (Korean medical benchmark) dataset]

We have uploaded the full results (including the exact model outputs) to Google Drive [here].
#### Method

We followed the lm-eval direct-generation method proposed in the original paper, with a few modifications (an invocation sketch follows this list). See our modified lm-evaluation-harness repo [here].

- Since the nurse category contains many (relatively) easy problems, which tends to inflate the final average score, `weight_by_size` was set to `false` during mean aggregation.
- Changed the few-shot split from 'dev' to 'fewshot'.
- Added a 'dentist' category.
- Added multiple 'eos' tokens to `generation_kwargs`, since recent models use different eos tokens.
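A minimal invocation sketch using the harness's Python API; the task names (including the added `kormedmcqa_dentist`) are assumptions based on our fork:

```python
# Hypothetical sketch; the actual task configs live in the linked fork.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=ChuGyouk/ko-med-gemma-2-9b-it-merge2,dtype=bfloat16",
    tasks=["kormedmcqa_doctor", "kormedmcqa_dentist",
           "kormedmcqa_nurse", "kormedmcqa_pharm"],
    num_fewshot=5,
    batch_size=1,  # see the Note below on gemma-2 batch inference
    random_seed=42,
)
print(results["results"])
```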
#### Note

- Due to the batch-inference issue with gemma-2 models [here], we used `batch_size=1`. (The batch size is also effectively 1 for closed-source models served through an API, except for the OpenAI Batch API.) We expect the results to differ little from `batch_size=8`.
- Other settings: `num_fewshot=5`, `seed=42`, 1 NVIDIA A100-80G GPU.
- TODO (wishlist): run two more random seeds and report the average of the three runs as the final score.
#### Results (5-shot, Direct Generation)

| Model | Doctor | Dentist | Nurse | Pharm | Avg |
|---|---|---|---|---|---|
| **Closed Source** | | | | | |
| gpt-4o-2024-08-06 † | 85.75 | 78.91 | 91.00 | 85.65 | 85.33 |
| gpt-4o-2024-05-13 † | 85.98 | 60.67 ‡ | 84.97 | 84.18 | 78.95 |
| gpt-4o-mini-2024-07-18 † | 66.20 | 61.16 | 79.50 | 69.94 | 69.20 |
| HyperCLOVA X HCX-003 § | 49.89 | 50.31 | 72.55 | 62.26 | 58.75 |
| HyperCLOVA X HCX-DASH-001 § | 43.22 | 42.42 | 60.02 | 47.80 | 48.36 |
| solar-1-mini-chat-240612 § | 43.45 | 39.21 | 57.52 | 46.33 | 46.63 |
| **Gemma-2 9B Family** | | | | | |
| ChuGyouk/ko-med-gemma-2-9b-it-merge2 (Ours, Merged) | 57.47 | 56.60 | 76.42 | 68.36 | 64.71 |
| ChuGyouk/ko-med-gemma-2-9b-it-merge1 (Ours, Merged) | 57.93 | 55.86 | 75.06 | 68.93 | 64.44 |
| ChuGyouk/ko-med-gemma-2-9b-it-base (Ours, Base) | 57.47 | 55.24 | 76.08 | 68.81 | 64.40 |
| rtzr/ko-gemma-2-9b-it | 54.02 | 53.14 | 73.46 | 64.63 | 61.32 |
| google/gemma-2-9b-it | 52.41 | 52.90 | 73.58 | 64.29 | 60.80 |
| **Korean Models** | | | | | |
| LGAI-EXAONE/EXAONE-3.0-7.8B-Instruct | 47.13 | 46.98 | 69.36 | 56.27 | 54.93 |
| yanolja/EEVE-Korean-Instruct-10.8B-v1.0 | 44.37 | 41.18 | 66.63 | 57.29 | 52.37 |
| upstage/SOLAR-10.7B-Instruct-v1.0 | 40.23 | 36.13 | 55.58 | 52.88 | 46.21 |
| **Multilingual Models** | | | | | |
| Qwen/Qwen2-7B-Instruct | 44.60 | 45.75 | 65.95 | 57.74 | 53.51 |
| meta-llama/Meta-Llama-3.1-8B-Instruct | 39.77 | 41.68 | 59.34 | 56.61 | 49.35 |
| **>10B Models** | | | | | |
| google/gemma-2-27b-it (27.2B) | 58.85 | 56.47 | 79.27 | 71.86 | 66.61 |
| CohereForAI/c4ai-command-r-08-2024 (32.3B) | 63.91 | 53.14 | 75.28 | 69.38 | 65.43 |
| mistralai/Mistral-Nemo-Instruct-2407 (12.2B) | 42.53 | 44.51 | 66.17 | 56.38 | 52.40 |
† For the GPT models, many responses arrived as "정답: A", "정답은 B", or "정답은 C 입니다" ("Answer: A", "The answer is B", "The answer is C") rather than a bare letter. We manually normalized these to plain A, B, and C and then re-scored (a normalization sketch follows these notes).

‡ For the dentist results of gpt-4o-2024-05-13, responses like "정답을 제공해드리겠습니다:" ("I will provide the answer:") appear quite frequently, so the actual score is likely somewhat higher.
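The manual normalization described in note † is roughly equivalent to a pass like this (a sketch, not the exact script we used):

```python
import re

def normalize_answer(response: str) -> str:
    """Reduce free-form answers such as '정답: A' or '정답은 B 입니다' to a bare letter."""
    match = re.search(r"[A-E]", response)
    return match.group(0) if match else response

assert normalize_answer("정답: A") == "A"
assert normalize_answer("정답은 B 입니다") == "B"
```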
## Example

### Python Code

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
model_id = "ChuGyouk/ko-med-gemma-2-9b-it-merge2"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
model_id,
torch_dtype=torch.bfloat16,
device_map="auto",
)
model.eval()
# KorMedMCQA Doctor exam, 2022 2nd period, Question 61; the answer is B (bacterial meningitis)
input_prompt = "For the following medical exam question, choose the correct answer among options A~E. Explain the reasoning behind your choice in detail."
question = (
    "A 6-year-old girl presents to the hospital with headache and vomiting since the previous day. "
    "Blood pressure is 100/60 mmHg, pulse 110/min, respirations 25/min, and temperature 38.7°C. "
    "She is drowsy; neck stiffness and the Kernig sign are positive, but there are no cranial nerve "
    "palsies or focal neurologic signs. Laboratory results are as follows. What is the diagnosis?\n"
    "Blood: hemoglobin 13.8 g/dL, WBC 14,200/mm^3, platelets 135,000/mm^3, total protein 7.4 g/dL, "
    "albumin 4.3 g/dL, glucose 105 mg/dL, C-reactive protein 120 mg/L (reference, <10), "
    "procalcitonin 90 ng/mL (reference, 0.00-0.49), tuberculosis-specific interferon-γ release assay negative. "
    "Urine: RBC 5~10/HPF, WBC 20~30/HPF. "
    "CSF: pressure 240 mmH2O, WBC 650/mm^3 (polymorphonuclear 90%, lymphocytes 10%), protein 112 mg/dL, "
    "glucose 35 mg/dL, oligoclonal bands negative, adenosine deaminase (ADA) 4.1 U/L (reference, <10)"
)
A = "Tuberculous meningitis"
B = "Bacterial meningitis"
C = "Autoimmune encephalitis"
D = "Viral meningitis"
E = "Acute disseminated encephalomyelitis"
instruction = f"{input_prompt}\n\n{question}\nA. {A}\nB. {B}\nC. {C}\nD. {D}\nE. {E}"
messages = [
{
"role": "user",
"content": instruction
}
]
input_ids = tokenizer.apply_chat_template(
messages,
add_generation_prompt=True,
return_tensors="pt"
).to(model.device)
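# Stop generation at either the model's EOS token or Gemma's <end_of_turn> turn delimiter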
terminators = [
tokenizer.eos_token_id,
tokenizer.convert_tokens_to_ids("<end_of_turn>")
]
outputs = model.generate(
input_ids,
max_new_tokens=2048,
eos_token_id=terminators,
do_sample=True,
temperature=0.7,
top_p=1.0,
)
print(tokenizer.decode(outputs[0][input_ids.shape[-1]:], skip_special_tokens=True))
```
### Output

```
The answer is B. Bacterial meningitis.

Here is an explanation for each option:

A. Tuberculous meningitis: the tuberculosis-specific interferon-gamma release assay is negative, so tuberculous meningitis is unlikely.

B. Bacterial meningitis: the patient shows the typical presentation of bacterial meningitis, including high fever, vomiting, headache, neck stiffness, and a positive Kernig sign. Blood tests show leukocytosis, and the CSF shows a high WBC count (650/mm3) with a high proportion of polymorphonuclear cells (90%), low glucose (35 mg/dL), and elevated protein (112 mg/dL). These findings indicate bacterial inflammation.

C. Autoimmune encephalitis: autoimmune encephalitis is usually accompanied by cranial nerve palsies or focal neurologic signs, but this patient shows neither. In addition, the CSF ADA activity is within the normal range, so autoimmune encephalitis is unlikely.

D. Viral meningitis: viral meningitis usually shows a lower CSF WBC count (below 200/mm3), lower protein, and normal or only slightly low glucose. This patient has evidence of bacterial inflammation and no specific findings suggesting viral invasion.

E. Acute disseminated encephalomyelitis: acute disseminated encephalomyelitis is usually accompanied by various neurologic deficits, and oligoclonal bands appear in the CSF. This patient has no cranial nerve palsies or focal neurologic signs, and oligoclonal bands were not detected.

In conclusion, the patient's clinical presentation and CSF findings support bacterial meningitis as the most likely diagnosis.
```
### Other Example

```
# "My headache is still really severe even though I took Tylenol. What should I do?"

I'm sorry to hear that your headache is so severe. If the headache persists even after taking Tylenol, you can consider the following:

1. **Rest and hydration**: When a headache is severe, resting in a quiet place and drinking plenty of fluids can help.
2. **Cold compress**: Try applying an ice pack or a towel soaked in cold water to your forehead or the back of your neck.
3. **Caffeine**: A moderate amount of caffeine can relieve a headache. Try drinking a little coffee or tea.
4. **Stretching**: Gently stretch your neck and shoulders to release tension.
5. **Other pain relievers**: Besides Tylenol, you could take another analgesic such as ibuprofen or naproxen. However, be sure to consult a pharmacist or physician before taking one.
6. **See a doctor**: If the headache persists or worsens, it is important to consult a physician to identify the cause and receive appropriate treatment.

Headaches can have many causes, so it is best to seek help from a medical professional for an accurate diagnosis and treatment.
```
## MergeKit Details

This is a merge of pre-trained language models created using mergekit.

### Merge Method

This model was merged using the DARE TIES merge method, with google/gemma-2-9b-it as the base.

### Models Merged

The following models were included in the merge:

- rtzr/ko-gemma-2-9b-it
- ChuGyouk/ko-med-gemma-2-9b-it-base
### Configuration

The following YAML configuration was used to produce this model. Here `density` is the fraction of delta parameters retained by DARE's random pruning, and `weight` scales each model's contribution to the merged deltas:

```yaml
models:
- model: google/gemma-2-9b-it
# No parameters necessary for base model
- model: rtzr/ko-gemma-2-9b-it
parameters:
density: 0.53
weight: 0.4
- model: ChuGyouk/ko-med-gemma-2-9b-it-base
parameters:
density: 0.53
weight: 0.6
merge_method: dare_ties
base_model: google/gemma-2-9b-it
parameters:
int8_mask: true
dtype: bfloat16
```
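As a sketch (assuming the config above is saved as `merge2.yaml`), it can be executed through mergekit's Python API, or equivalently by pointing the `mergekit-yaml` CLI at the file:

```python
import yaml
from mergekit.config import MergeConfiguration
from mergekit.merge import MergeOptions, run_merge

# Load the YAML shown above and run the DARE TIES merge.
with open("merge2.yaml", encoding="utf-8") as f:
    merge_config = MergeConfiguration.model_validate(yaml.safe_load(f))

run_merge(
    merge_config,
    out_path="./ko-med-gemma-2-9b-it-merge2",
    options=MergeOptions(cuda=True, copy_tokenizer=True),  # assumed options
)
```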
### Configuration for ChuGyouk/ko-med-gemma-2-9b-it-merge1

```yaml
models:
- model: rtzr/ko-gemma-2-9b-it
- model: ChuGyouk/ko-med-gemma-2-9b-it-base
parameters:
density: 0.5
weight: 0.5
merge_method: dare_ties
base_model: rtzr/ko-gemma-2-9b-it
parameters:
int8_mask: true
dtype: bfloat16
```
## Contact

- [email protected]
- [email protected]