tathagataraha commited on
Commit
7d6aad6
·
2 Parent(s): 3c09632 818cb65

Merge branch 'main' of https://huggingface.co./spaces/m42-health/MEDIC-Benchmark

Browse files
Files changed (1) hide show
  1. src/about.py +33 -6
src/about.py CHANGED
@@ -104,27 +104,54 @@ LOGO = """<img src="https://huggingface.co/spaces/m42-health/MEDIC-Benchmark/res
104
 
105
  # What does your leaderboard evaluate?
106
  INTRODUCTION_TEXT = """
107
- The rapid development of Large Language Models (LLMs) for healthcare applications has spurred calls for holistic evaluation beyond frequently-cited benchmarks like USMLE, to better reflect real-world performance. While real-world assessments are valuable indicators of utility, they often lag behind the pace of LLM evolution, likely rendering findings obsolete upon deployment. This temporal disconnect necessitates a comprehensive upfront evaluation that can guide model selection for specific clinical applications. We introduce MEDIC, a framework assessing LLMs across five critical dimensions of clinical competence: medical reasoning, ethics and bias, data and language understanding, in-context learning, and clinical safety. MEDIC features a novel cross-examination framework quantifying LLM performance across areas like coverage and hallucination detection, without requiring reference outputs. We apply MEDIC to evaluate LLMs on medical question-answering, safety, summarization, note generation, and other tasks. Our results show performance disparities across model sizes, baseline vs medically finetuned models, and have implications on model selection for applications requiring specific model strengths, such as low hallucination or lower cost of inference. MEDIC's multifaceted evaluation reveals these performance trade-offs, bridging the gap between theoretical capabilities and practical implementation in healthcare settings, ensuring that the most promising models are identified and adapted for diverse healthcare applications.
108
  """
109
 
110
  # Which evaluations are you running? how can people reproduce what you have?
111
  LLM_BENCHMARKS_TEXT_1 = f"""
112
 
113
  ## About
 
114
 
115
- The MEDIC Leaderboard is aimed at providing a comprehensive evaluations of clinical language models. It provides a standardized platform for evaluating and comparing the performance of various language models across 5 dimensions: Medical reasoning, Ethical and bias concerns, Data and language understanding, In-context learning, and Clinical safety and risk assessment. This comprehensive structure acknowledges the diverse facets of clinical competence and the varied requirements of healthcare applications. By addressing these critical dimensions, MEDIC aims to bridge the gap between benchmark performance and real-world clinical utility, providing a more robust prediction of an LLM’s potential effectiveness and safety in actual healthcare settings.
 
 
 
 
116
 
117
- ## Evaluation Tasks and Metrics
118
 
 
119
  ### Close-ended Questions
120
 
121
- Closed-ended question evaluation for LLMs provides insights into their medical knowledge breadth and accuracy. With this approach, we aim to quantify an LLM's comprehension of medical concepts across various specialties, ranging from basic to advanced professional levels. The following datasets serve as standardized benchmarks: MedQA, MedMCQA, MMLU, MMLU Pro, PubMedQA, USMLE, ToxiGen. We used the Eleuther AI's Evaluation Harness framework, which focuses on the likelihood of a model generating each proposed answer rather than directly evaluating the generated text itself. We modified the framework's codebase to provide more detailed and relevant results. Rather than just calculating the probability of generating answer choice labels (e.g., a., b., c., or d.), we calculate the probability of generating the full answer text.
 
 
122
 
123
  ### Open-ended Questions
124
 
125
- We evaluate LLMs' medical knowledge using three datasets: MedicationQA, HealthSearchQA, and ExpertQA. Each question is presented to the models without special prompting to test their baseline capabilities.
126
- To compare models, we use a tournament-style approach. A judge (Llama3.1 70b Instruct) evaluates pairs of responses to the same question from different models. To eliminate position bias, each comparison is performed twice with reversed response positions. If the winner changes when positions are swapped, we consider the responses too close and declare a tie. After multiple comparisons, we calculate win rates and convert them to Elo ratings to rank the models.
 
 
 
 
127
  It's important to note that this evaluation only assesses the quality of response writing, not medical accuracy. To properly evaluate clinical accuracy, a thorough study involving real healthcare professionals would be necessary.
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
128
  """
129
 
130
  EVALUATION_QUEUE_TEXT = """
 
104
 
105
  # What does your leaderboard evaluate?
106
  INTRODUCTION_TEXT = """
107
+ Deploying a good clinical LLM requires more than just acing closed-ended medical QA exams. It needs to be safe, ethical, comprehensive in its responses, and capable of reasoning and tackling complex medical tasks. The MEDIC framework aims to provide a transparent and comprehensive evaluation of LLM performance across various clinically relevant dimensions.
108
  """
109
 
110
  # Which evaluations are you running? how can people reproduce what you have?
111
  LLM_BENCHMARKS_TEXT_1 = f"""
112
 
113
  ## About
114
+ The MEDIC Leaderboard provides a comprehensive evaluation of large language models (LLMs) on various healthcare tasks. It assesses the performance of different LLMs across five key dimensions:
115
 
116
+ - Medical Reasoning
117
+ - Ethics and Bias Concerns
118
+ - Data and Language Understanding
119
+ - In-Context Learning
120
+ - Clinical Safety and Risk Assessment
121
 
122
+ By evaluating these dimensions, MEDIC aims to measure how effective and safe LLMs would be when used in real healthcare settings.
123
 
124
+ ## Evaluation Categories
125
  ### Close-ended Questions
126
 
127
+ This category measures the accuracy of an LLM's medical knowledge by having it answer multiple-choice questions from datasets like MedQA, MedMCQA, MMLU, MMLU Pro, PubMedQA, USMLE and Toxigen.
128
+
129
+ We used the Eleuther AI's Evaluation Harness framework, which focuses on the likelihood of a model generating each proposed answer rather than directly evaluating the generated text itself. We modified the framework's codebase to provide more detailed and relevant results. Rather than just calculating the probability of generating answer choice labels (e.g., a., b., c., or d.), we calculate the probability of generating the full answer text.
130
 
131
  ### Open-ended Questions
132
 
133
+ This category assesses the quality of the LLM's reasoning and explanations. The LLM is tasked with answering open-ended medical questions from various datasets:
134
+ - MedicationQA
135
+ - HealthSearchQA
136
+ - ExpertQA
137
+
138
+ Each question is presented to the models without special prompting to test their baseline capabilities. To compare models, we use a tournament-style approach. A judge (Llama3.1 70b Instruct) evaluates pairs of responses to the same question from different models. To eliminate position bias, each comparison is performed twice with reversed response positions. If the winner changes when positions are swapped, we consider the responses too close and declare a tie. After multiple comparisons, we calculate win rates and convert them to Elo ratings to rank the models.
139
  It's important to note that this evaluation only assesses the quality of response writing, not medical accuracy. To properly evaluate clinical accuracy, a thorough study involving real healthcare professionals would be necessary.
140
+
141
+ ### Medical Safety
142
+ The Medical Safety category uses the "med-safety" benchmark dataset, which consists of 900 scenarios presenting harmful medical requests. These scenarios cover all nine principles of medical ethics as defined by the American Medical Association (AMA).
143
+ In this evaluation, an LLM acts as a judge. Each response from the evaluated model is assigned a "harmfulness" score on a predefined scale. A low score means the model refused the harmful request, while a high score means it complied with the request.
144
+
145
+ ### Medical Summarization
146
+ This category evaluates the LLM's ability to summarize medical texts, such as clinical trial descriptions and progress notes. It uses a novel "cross-examination" framework, where questions are generated from the original document and the LLM's summary to assess the scores of the summary. The four key scores calculated are:
147
+
148
+ - Coverage: Measures how thoroughly the summary covers the original document. A higher score means the summary includes more details from the original.
149
+ - Conformity: Also called the non-contradiction score, this checks if the summary avoids contradicting the original document. A higher score means the summary aligns better with the original.
150
+ - Consistency: Measures the level of non-hallucination, or how much the summary sticks to the facts in the document. A higher score means the summary is more factual and accurate.
151
+ - Conciseness: Measures how brief the summary is. A higher score means the summary is more concise. A negative score means the summary is longer than the original document.
152
+
153
+ ### Note Generation
154
+ This category assesses the LLM's ability to generate structured clinical (ACI-Bench) and SOAP notes from doctor-patient conversations. It uses the same cross-examination framework as Medical Summarization.
155
  """
156
 
157
  EVALUATION_QUEUE_TEXT = """