metadata
license: cc-by-nc-sa-4.0
language:
  - ko
tags:
  - medical
datasets:
  - sean0042/KorMedMCQA
  - ChuGyouk/AI_healthcare_QA_samples_Sonnet3.5
  - ChuGyouk/HFH4_ultrachat_200k_ko

๐Ÿฅ Ko-Med-Gemma-2-9b-it

โš ๏ธ The output of this model should not be considered as professional medical advice, diagnosis, or treatment. For accurate diagnosis and treatment of any specific medical issue, please consult a qualified physician or healthcare professional. Also, the commercial usage of this model is prohibited.

🚀 Training

We performed continued fine-tuning using rtzr/ko-gemma-2-9b-it as the base model. We trained for 1.5 epochs on our dataset; training took about 65 hours on a single NVIDIA A100-80G GPU.
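For reference, a minimal continued-fine-tuning sketch in the same spirit (shown with TRL's SFTTrainer; the dataset name, batch size, and other hyperparameters below are illustrative assumptions, not our exact recipe):

import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM
from trl import SFTConfig, SFTTrainer

base_model = "rtzr/ko-gemma-2-9b-it"
# Gemma-2 training is typically run with eager attention for stability.
model = AutoModelForCausalLM.from_pretrained(
    base_model,
    torch_dtype=torch.bfloat16,
    attn_implementation="eager",
)

# Placeholder: a chat-formatted mixture like the 449,500-example dataset described below.
train_dataset = load_dataset("your-org/your-mixed-sft-dataset", split="train")

trainer = SFTTrainer(
    model=model,
    train_dataset=train_dataset,
    args=SFTConfig(
        output_dir="ko-med-gemma-2-9b-it-base",
        num_train_epochs=1.5,
        per_device_train_batch_size=1,
        gradient_accumulation_steps=16,
        bf16=True,
    ),
)
trainer.train()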

๐ŸŽ Dataset

The datasets listed in the tags are only a subset of our training data.

Our dataset consists of 449,500 examples in total, including English and Korean medical data as well as English and Korean general-domain data. It contains both single-turn and multi-turn conversations, covering doctor-patient dialogues as well as medical exam QA with reasoning.

Some of the Korean medical data is translated from English medical data, while the rest is natively Korean.
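As a rough illustration only (config and split names are assumptions, and the real mixture also contains data that is not publicly tagged), the tagged subsets can be inspected with the datasets library:

from datasets import load_dataset

# Only the publicly tagged subsets are shown; the full 449,500-example
# mixture described above also includes data not listed in the tags.
kormedmcqa = load_dataset("sean0042/KorMedMCQA", "doctor", split="train")
healthcare_qa = load_dataset("ChuGyouk/AI_healthcare_QA_samples_Sonnet3.5", split="train")
ultrachat_ko = load_dataset("ChuGyouk/HFH4_ultrachat_200k_ko", split="train")

for name, ds in [("KorMedMCQA (doctor)", kormedmcqa),
                 ("AI_healthcare_QA_samples_Sonnet3.5", healthcare_qa),
                 ("HFH4_ultrachat_200k_ko", ultrachat_ko)]:
    print(name, len(ds), ds.column_names)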

๐Ÿ† Evaluation

KorMedMCQA

[KorMedMCQA (Korean Medical Benchmark) Dataset]

We have uploaded the full results (including the exact model outputs) to Google Drive [here].

Method

We followed the lm-eval direct-generation method proposed in the original paper, with a few modifications.

See our modified lm-evaluation-harness repo [here].

  1. Since the nurse category contains many (relatively) easy problems, which tends to inflate the final average score, weight_by_size was set to false during mean aggregation.
  2. Changed the few-shot split from 'dev' to 'fewshot'.
  3. Added the 'dentist' category.
  4. Added multiple 'eos' tokens to generation_kwargs, since various recent models use different eos tokens.

Note

  • Due to the batch-inference issue of gemma-2 models here, we used batch_size=1. (The batch size is also effectively 1 for closed-source models accessed through an API, except for the OpenAI Batch API.) We hope there is no significant difference from the batch_size=8 case.
  • Other settings: num_fewshot=5, seed=42, a single NVIDIA A100-80G GPU. A rough reproduction sketch follows below.
  • (WANT) TODO: repeat the run with two more random seeds and report the average as the final score.
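Put together, a roughly equivalent evaluation run through the lm-evaluation-harness Python API looks like the sketch below. The task names are assumptions based on the KorMedMCQA tasks in the harness; the 'dentist' task exists only in our modified fork.

import lm_eval

tasks = ["kormedmcqa_doctor", "kormedmcqa_dentist", "kormedmcqa_nurse", "kormedmcqa_pharm"]

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=ChuGyouk/ko-med-gemma-2-9b-it-merge2,dtype=bfloat16",
    tasks=tasks,
    num_fewshot=5,
    batch_size=1,   # see the gemma-2 batch-inference note above
    random_seed=42,
)
for task, metrics in results["results"].items():
    print(task, metrics)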

Results (5-shot, Direct Generation)

| Model | Doctor | Dentist | Nurse | Pharm | Avg |
|---|---|---|---|---|---|
| **Closed Source** | | | | | |
| gpt-4o-2024-08-06 † | 85.75 | 78.91 | 91.00 | 85.65 | 85.33 |
| gpt-4o-2024-05-13 † | 85.98 | 60.67 ‡ | 84.97 | 84.18 | 78.95 |
| gpt-4o-mini-2024-07-18 † | 66.20 | 61.16 | 79.50 | 69.94 | 69.20 |
| HyperCLOVA X HCX-003 § | 49.89 | 50.31 | 72.55 | 62.26 | 58.75 |
| HyperCLOVA X HCX-DASH-001 § | 43.22 | 42.42 | 60.02 | 47.80 | 48.36 |
| solar-1-mini-chat-240612 § | 43.45 | 39.21 | 57.52 | 46.33 | 46.63 |
| **Gemma-2 9B Family** | | | | | |
| ChuGyouk/ko-med-gemma-2-9b-it-merge2 (Ours, Merged) | 57.47 | 56.60 | 76.42 | 68.36 | 64.71 |
| ChuGyouk/ko-med-gemma-2-9b-it-merge1 (Ours, Merged) | 57.93 | 55.86 | 75.06 | 68.93 | 64.44 |
| ChuGyouk/ko-med-gemma-2-9b-it-base (Ours, Base) | 57.47 | 55.24 | 76.08 | 68.81 | 64.40 |
| rtzr/ko-gemma-2-9b-it | 54.02 | 53.14 | 73.46 | 64.63 | 61.32 |
| google/gemma-2-9b-it | 52.41 | 52.90 | 73.58 | 64.29 | 60.80 |
| **Korean Models** | | | | | |
| LGAI-EXAONE/EXAONE-3.0-7.8B-Instruct | 47.13 | 46.98 | 69.36 | 56.27 | 54.93 |
| yanolja/EEVE-Korean-Instruct-10.8B-v1.0 | 44.37 | 41.18 | 66.63 | 57.29 | 52.37 |
| upstage/SOLAR-10.7B-Instruct-v1.0 | 40.23 | 36.13 | 55.58 | 52.88 | 46.21 |
| **Multilingual Models** | | | | | |
| Qwen/Qwen2-7B-Instruct | 44.60 | 45.75 | 65.95 | 57.74 | 53.51 |
| meta-llama/Meta-Llama-3.1-8B-Instruct | 39.77 | 41.68 | 59.34 | 56.61 | 49.35 |
| **>10B Models** | | | | | |
| google/gemma-2-27b-it (27.2B) | 58.85 | 56.47 | 79.27 | 71.86 | 66.61 |
| CohereForAI/c4ai-command-r-08-2024 (32.3B) | 63.91 | 53.14 | 75.28 | 69.38 | 65.43 |
| mistralai/Mistral-Nemo-Instruct-2407 (12.2B) | 42.53 | 44.51 | 66.17 | 56.38 | 52.40 |

† : For the GPT answers, we received many responses of the form "정답: A", "정답은 B", or "정답은 C 입니다" ("the answer is A/B/C"). We manually changed these to the bare letters A, B, and C, and then re-measured the score.

‡ : For the dentist results of gpt-4o-2024-05-13, responses like "정답을 제공해드리겠습니다:" ("I will provide the answer:") were observed quite frequently, so the actual score is expected to be slightly higher.
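The manual normalization described in these footnotes can be approximated with a small post-processing pass like the following sketch (not the exact procedure we used):

import re

def normalize_answer(response: str) -> str:
    # Map verbose answers such as "정답: A", "정답은 B", "정답은 C 입니다"
    # down to the bare option letter expected by the scorer.
    match = re.search(r"[A-E]", response)
    return match.group(0) if match else response.strip()

for raw in ["정답: A", "정답은 B", "정답은 C 입니다"]:
    print(raw, "->", normalize_answer(raw))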

📚 Example

Python Code

import os
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM


model_id = "ChuGyouk/ko-med-gemma-2-9b-it-merge2"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

model.eval()
# KorMedMCQA Doctor Test, 2022 2nd period, Question 61. Answer: B, 세균수막염 (bacterial meningitis)
input_prompt = "다음 의학 시험 문제에 대해 보기 A~E 중 정답을 선택하세요. 선택 이유를 구체적으로 제공하세요."
question = "6세 여아가 하루 전부터 머리가 아프고 구토를 하여 병원에 왔다. 혈압 100/60 mmHg, 맥박 110회/분, 호흡 25회/분, 체온 38.7℃이다. 졸려하고, 목경직과 커니그(Kernig) 징후는 양성이지만, 뇌신경마비나 부분신경 징후는 없다. 검사 결과는 다음과 같다. 진단은?\n혈액: 혈색소 13.8g/dL, 백혈구 14,200/mm^3, 혈소판 135,000/mm^3 총단백질 7.4 g/dL, 알부민 4.3 g/dL, 포도당 105 mg/dL, C-반응단백질 120 mg/L (참고치, <10) 프로칼시토닌 90 ng/mL (참고치, 0.00-0.49) 결핵특이 인터페론감마(interferon-γ) 방출측정 음성소변: 적혈구 5～10/고배율시야, 백혈구 20～30/고배율시야, 뇌척수액: 압력 240 mmH2O, 백혈구 650/mm^3 (다형핵백혈구 90%, 림프구 10%), 단백질 112 mg/dL, 포도당 35 mg/dL, 올리고클론띠 음성 아데노신탈아미노효소활성도(ADA) 4.1 U/L (참고치, <10)"
A = "결핵수막염"
B = "세균수막염"
C = "자가면역뇌염"
D = "바이러스수막염"
E = "급성파종뇌척수염"
instruction = f"{input_prompt}\n\n{question}\nA. {A}\nB. {B}\nC. {C}\nD. {D}\nE. {E}"

messages = [
    {
        "role": "user", 
        "content": instruction
    }
]

input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt"
).to(model.device)

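# Gemma-2 chat turns end with <end_of_turn>, so stop on it in addition to the model's <eos> token.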
terminators = [
    tokenizer.eos_token_id,
    tokenizer.convert_tokens_to_ids("<end_of_turn>")
]

outputs = model.generate(
    input_ids,
    max_new_tokens=2048,
    eos_token_id=terminators,
    do_sample=True,
    temperature=0.7,
    top_p=1.0,
)

print(tokenizer.decode(outputs[0][input_ids.shape[-1]:], skip_special_tokens=True))

Output

정답은 B입니다. 세균수막염.

다음은 각 선택지에 대한 설명입니다:

A. 결핵수막염: 결핵특이 인터페론-감마 방출측정이 음성이므로 결핵수막염은 가능성이 낮습니다.

B. 세균수막염: 환자는 고열, 구토, 두통, 목경직, 커니그 징후 양성 등 세균수막염의 전형적인 증상을 보입니다. 혈액 검사에서 백혈구 증가가 나타나며, 뇌척수액 검사에서 백혈구 수가 높고(650/mm3), 다형핵백혈구가 차지하는 비율이 높으며(90%), 포도당 수치가 낮고(35 mg/dL), 단백질 수치가 높습니다(112 mg/dL). 이러한 소견은 세균성 염증을 나타냅니다.

C. 자가면역뇌염: 자가면역뇌염은 일반적으로 뇌신경마비나 부분신경 징후를 동반하지만, 이 환자는 이러한 증상을 보이지 않습니다. 또한, 뇌척수액 검사에서 ADA 활성도가 정상 범위에 있으므로 자가면역뇌염은 가능성이 낮습니다.

D. 바이러스수막염: 바이러스수막염은 일반적으로 백혈구 수치가 낮고(200/mm3 이하), 단백질 수치가 낮으며, 포도당 수치는 정상 또는 약간 낮을 수 있습니다. 이 환자는 세균성 염증의 증거가 있으며, 바이러스 침습을 나타내는 특이적인 소견이 없습니다.

E. 급성파종뇌척수염: 급성파종뇌척수염은 일반적으로 다양한 신경학적 결손을 동반하며, 뇌척수액에서 올리고클론띠가 나타납니다. 이 환자는 뇌신경마비나 부분신경 징후가 없으며, 올리고클론띠도 나타나지 않습니다.

결론적으로, 환자의 임상 증상과 뇌척수액 검사 결과는 세균성 수막염을 가장 가능성이 높은 진단으로 지지합니다.

Other Example

# "ํƒ€์ด๋ ˆ๋†€์„ ๋จน์—ˆ๋Š”๋ฐ๋„ ๋ถˆ๊ตฌํ•˜๊ณ  ๋‘ํ†ต์ด ๋„ˆ๋ฌด ์‹ฌํ•ด์š”. ์–ด๋–ป๊ฒŒ ํ•ด์•ผ ํ•˜๋‚˜์š”?"

๋‘ํ†ต์ด ์‹ฌํ•˜์‹œ๋‹ค๋‹ˆ ์•ˆํƒ€๊น์Šต๋‹ˆ๋‹ค. ํƒ€์ด๋ ˆ๋†€์„ ๋ณต์šฉํ–ˆ์Œ์—๋„ ๋ถˆ๊ตฌํ•˜๊ณ  ๋‘ํ†ต์ด ์ง€์†๋œ๋‹ค๋ฉด, ๋‹ค์Œ๊ณผ ๊ฐ™์€ ๋ฐฉ๋ฒ•์„ ๊ณ ๋ คํ•ด ๋ณผ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค:

1. **์ถฉ๋ถ„ํ•œ ํœด์‹๊ณผ ์ˆ˜๋ถ„ ์„ญ์ทจ**: ๋‘ํ†ต์ด ์‹ฌํ•  ๋•Œ๋Š” ์กฐ์šฉํ•œ ๊ณณ์—์„œ ํœด์‹์„ ์ทจํ•˜๊ณ  ์ถฉ๋ถ„ํ•œ ์ˆ˜๋ถ„์„ ์„ญ์ทจํ•˜๋Š” ๊ฒƒ์ด ๋„์›€์ด ๋  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

2. **๋ƒ‰์ฐœ์งˆ**: ์–ผ์ŒํŒฉ์ด๋‚˜ ์ฐจ๊ฐ€์šด ๋ฌผ๋กœ ์ ์‹  ์ˆ˜๊ฑด์„ ์ด๋งˆ๋‚˜ ๋ชฉ์— ๋Œ€์–ด ๋ƒ‰์ฐœ์งˆ์„ ํ•ด๋ณด์„ธ์š”.

3. **์นดํŽ˜์ธ ์„ญ์ทจ**: ์ ๋‹น๋Ÿ‰์˜ ์นดํŽ˜์ธ์€ ๋‘ํ†ต์„ ์™„ํ™”์‹œํ‚ฌ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ์ปคํ”ผ๋‚˜ ์ฐจ๋ฅผ ์กฐ๊ธˆ ๋งˆ์…”๋ณด์„ธ์š”.

4. **์ŠคํŠธ๋ ˆ์นญ**: ๋ชฉ๊ณผ ์–ด๊นจ๋ฅผ ๋ถ€๋“œ๋Ÿฝ๊ฒŒ ์ŠคํŠธ๋ ˆ์นญํ•˜์—ฌ ๊ธด์žฅ์„ ํ’€์–ด๋ณด์„ธ์š”.

5. **๋‹ค๋ฅธ ์ง„ํ†ต์ œ**: ํƒ€์ด๋ ˆ๋†€ ์™ธ์— ์ด๋ถ€ํ”„๋กœํŽœ์ด๋‚˜ ๋‚˜ํ”„๋ก์„ผ๊ณผ ๊ฐ™์€ ๋‹ค๋ฅธ ์ง„ํ†ต์ œ๋ฅผ ๋ณต์šฉํ•ด ๋ณผ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ํ•˜์ง€๋งŒ ๋ณต์šฉ ์ „์— ๋ฐ˜๋“œ์‹œ ์•ฝ์‚ฌ๋‚˜ ์˜์‚ฌ์™€ ์ƒ๋‹ดํ•˜์„ธ์š”.

6. **์˜์‚ฌ ์ƒ๋‹ด**: ๋‘ํ†ต์ด ์ง€์†๋˜๊ฑฐ๋‚˜ ์‹ฌํ•ด์ง„๋‹ค๋ฉด, ์˜์‚ฌ์™€ ์ƒ๋‹ดํ•˜์—ฌ ์›์ธ์„ ํŒŒ์•…ํ•˜๊ณ  ์ ์ ˆํ•œ ์น˜๋ฃŒ๋ฅผ ๋ฐ›๋Š” ๊ฒƒ์ด ์ค‘์š”ํ•ฉ๋‹ˆ๋‹ค.

๋‘ํ†ต์˜ ์›์ธ์€ ๋‹ค์–‘ํ•  ์ˆ˜ ์žˆ์œผ๋ฏ€๋กœ, ์ •ํ™•ํ•œ ์ง„๋‹จ๊ณผ ์น˜๋ฃŒ๋ฅผ ์œ„ํ•ด ์˜๋ฃŒ ์ „๋ฌธ๊ฐ€์˜ ๋„์›€์„ ๋ฐ›๋Š” ๊ฒƒ์ด ๊ฐ€์žฅ ์ข‹์Šต๋‹ˆ๋‹ค.

MergeKit Details

This is a merge of pre-trained language models created using mergekit.

Merge Details

Merge Method

This model was merged using the DARE TIES merge method, with google/gemma-2-9b-it as the base.

Models Merged

The following models were included in the merge:

  • rtzr/ko-gemma-2-9b-it
  • ChuGyouk/ko-med-gemma-2-9b-it-base

Configuration

The following YAML configuration was used to produce this model (ChuGyouk/ko-med-gemma-2-9b-it-merge2):

models:
  - model: google/gemma-2-9b-it
  # No parameters necessary for base model
  - model: rtzr/ko-gemma-2-9b-it
    parameters:
      density: 0.53
      weight: 0.4
  - model: ChuGyouk/ko-med-gemma-2-9b-it-base
    parameters:
      density: 0.53
      weight: 0.6
merge_method: dare_ties
base_model: google/gemma-2-9b-it
parameters:
  int8_mask: true
dtype: bfloat16

Configuration for the ChuGyouk/ko-med-gemma-2-9b-it-merge1 model:

models:
  - model: rtzr/ko-gemma-2-9b-it
  - model: ChuGyouk/ko-med-gemma-2-9b-it-base
    parameters:
      density: 0.5
      weight: 0.5
merge_method: dare_ties
base_model: rtzr/ko-gemma-2-9b-it
parameters:
  int8_mask: true
dtype: bfloat16
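To actually run one of these configs, mergekit's Python API (or the mergekit-yaml CLI) can be used roughly as follows; the paths and options are illustrative, and exact API details may vary by mergekit version:

import yaml
from mergekit.config import MergeConfiguration
from mergekit.merge import MergeOptions, run_merge

# "merge_config.yaml" is assumed to hold one of the YAML configs above.
with open("merge_config.yaml", "r", encoding="utf-8") as f:
    config = MergeConfiguration.model_validate(yaml.safe_load(f))

run_merge(
    config,
    out_path="./ko-med-gemma-2-9b-it-merged",
    options=MergeOptions(cuda=True, copy_tokenizer=True),
)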

Contact

[email protected]
[email protected]