Spaces:

m42-health
/

MEDIC-Benchmark

Running

App Files Files Community

tathagataraha commited on Jan 6

Commit

7d6aad6

2 Parent(s): 3c09632 818cb65

Merge branch 'main' of https://huggingface.co./spaces/m42-health/MEDIC-Benchmark

Browse files

Files changed (1) hide show

src/about.py +33 -6

src/about.py CHANGED Viewed

@@ -104,27 +104,54 @@ LOGO = """<img src="https://huggingface.co/spaces/m42-health/MEDIC-Benchmark/res
 # What does your leaderboard evaluate?
 INTRODUCTION_TEXT = """
-The rapid development of Large Language Models (LLMs) for healthcare applications has spurred calls for holistic evaluation beyond frequently-cited benchmarks like USMLE, to better reflect real-world performance. While real-world assessments are valuable indicators of utility, they often lag behind the pace of LLM evolution, likely rendering findings obsolete upon deployment. This temporal disconnect necessitates a comprehensive upfront evaluation that can guide model selection for specific clinical applications. We introduce MEDIC, a framework assessing LLMs across five critical dimensions of clinical competence: medical reasoning, ethics and bias, data and language understanding, in-context learning, and clinical safety. MEDIC features a novel cross-examination framework quantifying LLM performance across areas like coverage and hallucination detection, without requiring reference outputs. We apply MEDIC to evaluate LLMs on medical question-answering, safety, summarization, note generation, and other tasks. Our results show performance disparities across model sizes, baseline vs medically finetuned models, and have implications on model selection for applications requiring specific model strengths, such as low hallucination or lower cost of inference. MEDIC's multifaceted evaluation reveals these performance trade-offs, bridging the gap between theoretical capabilities and practical implementation in healthcare settings, ensuring that the most promising models are identified and adapted for diverse healthcare applications.
 """
 # Which evaluations are you running? how can people reproduce what you have?
 LLM_BENCHMARKS_TEXT_1 = f"""
 ## About
-The MEDIC Leaderboard is aimed at providing a comprehensive evaluations of clinical language models. It provides a standardized platform for evaluating and comparing the performance of various language models across 5 dimensions: Medical reasoning, Ethical and bias concerns, Data and language understanding, In-context learning, and Clinical safety and risk assessment. This comprehensive structure acknowledges the diverse facets of clinical competence and the varied requirements of healthcare applications. By addressing these critical dimensions, MEDIC aims to bridge the gap between benchmark performance and real-world clinical utility, providing a more robust prediction of an LLM’s potential effectiveness and safety in actual healthcare settings.
-## Evaluation Tasks and Metrics
 ### Close-ended Questions
-Closed-ended question evaluation for LLMs provides insights into their medical knowledge breadth and accuracy. With this approach, we aim to quantify an LLM's comprehension of medical concepts across various specialties, ranging from basic to advanced professional levels. The following datasets serve as standardized benchmarks: MedQA, MedMCQA, MMLU, MMLU Pro, PubMedQA, USMLE, ToxiGen. We used the Eleuther AI's Evaluation Harness framework, which focuses on the likelihood of a model generating each proposed answer rather than directly evaluating the generated text itself. We modified the framework's codebase to provide more detailed and relevant results. Rather than just calculating the probability of generating answer choice labels (e.g., a., b., c., or d.), we calculate the probability of generating the full answer text.
 ### Open-ended Questions
-We evaluate LLMs' medical knowledge using three datasets: MedicationQA, HealthSearchQA, and ExpertQA. Each question is presented to the models without special prompting to test their baseline capabilities.
-To compare models, we use a tournament-style approach. A judge (Llama3.1 70b Instruct) evaluates pairs of responses to the same question from different models. To eliminate position bias, each comparison is performed twice with reversed response positions. If the winner changes when positions are swapped, we consider the responses too close and declare a tie. After multiple comparisons, we calculate win rates and convert them to Elo ratings to rank the models.
 It's important to note that this evaluation only assesses the quality of response writing, not medical accuracy. To properly evaluate clinical accuracy, a thorough study involving real healthcare professionals would be necessary.
 """
 EVALUATION_QUEUE_TEXT = """

 # What does your leaderboard evaluate?
 INTRODUCTION_TEXT = """
+Deploying a good clinical LLM requires more than just acing closed-ended medical QA exams. It needs to be safe, ethical, comprehensive in its responses, and capable of reasoning and tackling complex medical tasks. The MEDIC framework aims to provide a transparent and comprehensive evaluation of LLM performance across various clinically relevant dimensions.
 """
 # Which evaluations are you running? how can people reproduce what you have?
 LLM_BENCHMARKS_TEXT_1 = f"""
 ## About
+The MEDIC Leaderboard provides a comprehensive evaluation of large language models (LLMs) on various healthcare tasks. It assesses the performance of different LLMs across five key dimensions:
+- Medical Reasoning
+- Ethics and Bias Concerns
+- Data and Language Understanding
+- In-Context Learning
+- Clinical Safety and Risk Assessment
+By evaluating these dimensions, MEDIC aims to measure how effective and safe LLMs would be when used in real healthcare settings.
+## Evaluation Categories
 ### Close-ended Questions
+This category measures the accuracy of an LLM's medical knowledge by having it answer multiple-choice questions from datasets like MedQA, MedMCQA, MMLU, MMLU Pro, PubMedQA, USMLE and Toxigen.
+We used the Eleuther AI's Evaluation Harness framework, which focuses on the likelihood of a model generating each proposed answer rather than directly evaluating the generated text itself. We modified the framework's codebase to provide more detailed and relevant results. Rather than just calculating the probability of generating answer choice labels (e.g., a., b., c., or d.), we calculate the probability of generating the full answer text.
 ### Open-ended Questions
+This category assesses the quality of the LLM's reasoning and explanations. The LLM is tasked with answering open-ended medical questions from various datasets:
+- MedicationQA
+- HealthSearchQA
+- ExpertQA
+Each question is presented to the models without special prompting to test their baseline capabilities. To compare models, we use a tournament-style approach. A judge (Llama3.1 70b Instruct) evaluates pairs of responses to the same question from different models. To eliminate position bias, each comparison is performed twice with reversed response positions. If the winner changes when positions are swapped, we consider the responses too close and declare a tie. After multiple comparisons, we calculate win rates and convert them to Elo ratings to rank the models.
 It's important to note that this evaluation only assesses the quality of response writing, not medical accuracy. To properly evaluate clinical accuracy, a thorough study involving real healthcare professionals would be necessary.
+### Medical Safety
+The Medical Safety category uses the "med-safety" benchmark dataset, which consists of 900 scenarios presenting harmful medical requests. These scenarios cover all nine principles of medical ethics as defined by the American Medical Association (AMA).
+In this evaluation, an LLM acts as a judge. Each response from the evaluated model is assigned a "harmfulness" score on a predefined scale. A low score means the model refused the harmful request, while a high score means it complied with the request.
+### Medical Summarization
+This category evaluates the LLM's ability to summarize medical texts, such as clinical trial descriptions and progress notes. It uses a novel "cross-examination" framework, where questions are generated from the original document and the LLM's summary to assess the scores of the summary. The four key scores calculated are:
+- Coverage: Measures how thoroughly the summary covers the original document. A higher score means the summary includes more details from the original.
+- Conformity: Also called the non-contradiction score, this checks if the summary avoids contradicting the original document. A higher score means the summary aligns better with the original.
+- Consistency: Measures the level of non-hallucination, or how much the summary sticks to the facts in the document. A higher score means the summary is more factual and accurate.
+- Conciseness: Measures how brief the summary is. A higher score means the summary is more concise. A negative score means the summary is longer than the original document.
+### Note Generation
+This category assesses the LLM's ability to generate structured clinical (ACI-Bench) and SOAP notes from doctor-patient conversations. It uses the same cross-examination framework as Medical Summarization.
 """
 EVALUATION_QUEUE_TEXT = """