Spaces:
Running
Running
Merge branch 'main' of https://huggingface.co./spaces/m42-health/MEDIC-Benchmark
Browse files- src/about.py +33 -6
src/about.py
CHANGED
@@ -104,27 +104,54 @@ LOGO = """<img src="https://huggingface.co/spaces/m42-health/MEDIC-Benchmark/res
|
|
104 |
|
105 |
# What does your leaderboard evaluate?
|
106 |
INTRODUCTION_TEXT = """
|
107 |
-
|
108 |
"""
|
109 |
|
110 |
# Which evaluations are you running? how can people reproduce what you have?
|
111 |
LLM_BENCHMARKS_TEXT_1 = f"""
|
112 |
|
113 |
## About
|
|
|
114 |
|
115 |
-
|
|
|
|
|
|
|
|
|
116 |
|
117 |
-
|
118 |
|
|
|
119 |
### Close-ended Questions
|
120 |
|
121 |
-
|
|
|
|
|
122 |
|
123 |
### Open-ended Questions
|
124 |
|
125 |
-
|
126 |
-
|
|
|
|
|
|
|
|
|
127 |
It's important to note that this evaluation only assesses the quality of response writing, not medical accuracy. To properly evaluate clinical accuracy, a thorough study involving real healthcare professionals would be necessary.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
128 |
"""
|
129 |
|
130 |
EVALUATION_QUEUE_TEXT = """
|
|
|
104 |
|
105 |
# What does your leaderboard evaluate?
|
106 |
INTRODUCTION_TEXT = """
|
107 |
+
Deploying a good clinical LLM requires more than just acing closed-ended medical QA exams. It needs to be safe, ethical, comprehensive in its responses, and capable of reasoning and tackling complex medical tasks. The MEDIC framework aims to provide a transparent and comprehensive evaluation of LLM performance across various clinically relevant dimensions.
|
108 |
"""
|
109 |
|
110 |
# Which evaluations are you running? how can people reproduce what you have?
|
111 |
LLM_BENCHMARKS_TEXT_1 = f"""
|
112 |
|
113 |
## About
|
114 |
+
The MEDIC Leaderboard provides a comprehensive evaluation of large language models (LLMs) on various healthcare tasks. It assesses the performance of different LLMs across five key dimensions:
|
115 |
|
116 |
+
- Medical Reasoning
|
117 |
+
- Ethics and Bias Concerns
|
118 |
+
- Data and Language Understanding
|
119 |
+
- In-Context Learning
|
120 |
+
- Clinical Safety and Risk Assessment
|
121 |
|
122 |
+
By evaluating these dimensions, MEDIC aims to measure how effective and safe LLMs would be when used in real healthcare settings.
|
123 |
|
124 |
+
## Evaluation Categories
|
125 |
### Close-ended Questions
|
126 |
|
127 |
+
This category measures the accuracy of an LLM's medical knowledge by having it answer multiple-choice questions from datasets like MedQA, MedMCQA, MMLU, MMLU Pro, PubMedQA, USMLE and Toxigen.
|
128 |
+
|
129 |
+
We used the Eleuther AI's Evaluation Harness framework, which focuses on the likelihood of a model generating each proposed answer rather than directly evaluating the generated text itself. We modified the framework's codebase to provide more detailed and relevant results. Rather than just calculating the probability of generating answer choice labels (e.g., a., b., c., or d.), we calculate the probability of generating the full answer text.
|
130 |
|
131 |
### Open-ended Questions
|
132 |
|
133 |
+
This category assesses the quality of the LLM's reasoning and explanations. The LLM is tasked with answering open-ended medical questions from various datasets:
|
134 |
+
- MedicationQA
|
135 |
+
- HealthSearchQA
|
136 |
+
- ExpertQA
|
137 |
+
|
138 |
+
Each question is presented to the models without special prompting to test their baseline capabilities. To compare models, we use a tournament-style approach. A judge (Llama3.1 70b Instruct) evaluates pairs of responses to the same question from different models. To eliminate position bias, each comparison is performed twice with reversed response positions. If the winner changes when positions are swapped, we consider the responses too close and declare a tie. After multiple comparisons, we calculate win rates and convert them to Elo ratings to rank the models.
|
139 |
It's important to note that this evaluation only assesses the quality of response writing, not medical accuracy. To properly evaluate clinical accuracy, a thorough study involving real healthcare professionals would be necessary.
|
140 |
+
|
141 |
+
### Medical Safety
|
142 |
+
The Medical Safety category uses the "med-safety" benchmark dataset, which consists of 900 scenarios presenting harmful medical requests. These scenarios cover all nine principles of medical ethics as defined by the American Medical Association (AMA).
|
143 |
+
In this evaluation, an LLM acts as a judge. Each response from the evaluated model is assigned a "harmfulness" score on a predefined scale. A low score means the model refused the harmful request, while a high score means it complied with the request.
|
144 |
+
|
145 |
+
### Medical Summarization
|
146 |
+
This category evaluates the LLM's ability to summarize medical texts, such as clinical trial descriptions and progress notes. It uses a novel "cross-examination" framework, where questions are generated from the original document and the LLM's summary to assess the scores of the summary. The four key scores calculated are:
|
147 |
+
|
148 |
+
- Coverage: Measures how thoroughly the summary covers the original document. A higher score means the summary includes more details from the original.
|
149 |
+
- Conformity: Also called the non-contradiction score, this checks if the summary avoids contradicting the original document. A higher score means the summary aligns better with the original.
|
150 |
+
- Consistency: Measures the level of non-hallucination, or how much the summary sticks to the facts in the document. A higher score means the summary is more factual and accurate.
|
151 |
+
- Conciseness: Measures how brief the summary is. A higher score means the summary is more concise. A negative score means the summary is longer than the original document.
|
152 |
+
|
153 |
+
### Note Generation
|
154 |
+
This category assesses the LLM's ability to generate structured clinical (ACI-Bench) and SOAP notes from doctor-patient conversations. It uses the same cross-examination framework as Medical Summarization.
|
155 |
"""
|
156 |
|
157 |
EVALUATION_QUEUE_TEXT = """
|