Med42v2 Dataset - a ChuGyouk Collection

Med42v2 Dataset

updated Aug 14, 2024

Based on Table 5 in Appendix A of the original paper.

openlifescienceai/medmcqa

Viewer • Updated Jan 4, 2024 • 193k • 15.1k • 134

Note # of samples = 187,005 | 180,462 in paper
medalpaca/medical_meadow_medical_flashcards

Viewer • Updated Apr 6, 2023 • 34k • 1.97k • 34

Note # of samples = 33,955 | 30,106 in paper
ChuGyouk/StackExchange-Medical

Viewer • Updated Aug 14, 2024 • 40.6k • 56 • 1

Note # of samples = 40,625 | 64,246 in paper
bigbio/med_qa

Updated Apr 6, 2024 • 3.62k • 94

Note # of samples = 11,450 | 11,290 in paper
medalpaca/medical_meadow_cord19

Viewer • Updated Apr 6, 2023 • 821k • 127 • 10

Note # of samples = 821,007 | 17,721 in paper. I took it from medalpaca, not allenai.
vblagoje/PubMedQA_instruction

Viewer • Updated Apr 12, 2024 • 273k • 282 • 7

Note # of samples = 272,458 | 499 in paper. The authors may have used the dev set, which contains 500 examples.
dvilares/head_qa

Updated Jan 18, 2024 • 1.09k • 18

Note # of samples = 2,657 | 2,657 in paper
medalpaca/medical_meadow_mediqa

Viewer • Updated Apr 16, 2023 • 2.21k • 366 • 20

Note # of samples = 2,208 | 1,950 in paper
bigbio/sciq

Viewer • Updated Dec 22, 2022 • 27.4k • 542 • 2

Note # of samples = 11,679 | 11,679 in paper
medalpaca/medical_meadow_pubmed_causal

Viewer • Updated Apr 6, 2023 • 2.45k • 423 • 8

Note # of samples = 2,446 | 2,169 in paper
openchat/cogstack-opengpt-sharegpt

Viewer • Updated Apr 16, 2024 • 31.5k • 77 • 7

Note # of samples = 31,532 | 66,026 in paper
keivalya/MedQuad-MedicalQnADataset

Viewer • Updated Oct 11, 2023 • 16.4k • 1.68k • 98

Note # of samples = 16,407 | 14,553 in paper
junyeong-nero/mmlu_medical_filtered

Viewer • Updated Aug 20, 2024 • 435 • 62

Note # of samples = 435 | 244 in paper
ChuGyouk/Niv2-Medical

Viewer • Updated Aug 14, 2024 • 42 • 54 • 1

Note From https://github.com/allenai/natural-instructions, I selected the JSON files whose Domains include "Healthcare", "Medicine", or "Scientific Research Papers". This dataset is not meant for actual training; it is a kind of metadata. # of samples = 11,447 in paper
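The domain filter described in the note above could be sketched as follows. This is a minimal illustration, not the script used to build the dataset: it assumes each task file in a local clone of the repository is a JSON object with a top-level "Domains" list, and the helper names `is_medical_task` and `find_medical_tasks` are hypothetical.

```python
import json
from pathlib import Path

# Domains named in the note above.
TARGET_DOMAINS = {"Healthcare", "Medicine", "Scientific Research Papers"}


def is_medical_task(task: dict) -> bool:
    """Return True if the task's "Domains" list overlaps the target domains."""
    # Assumption: "Domains" is a list of strings; a missing key counts as no match.
    return bool(TARGET_DOMAINS & set(task.get("Domains", [])))


def find_medical_tasks(tasks_dir: str) -> list[str]:
    """Return the names of the task JSON files in tasks_dir that pass the filter."""
    matches = []
    for path in sorted(Path(tasks_dir).glob("*.json")):
        with path.open(encoding="utf-8") as f:
            task = json.load(f)
        if is_medical_task(task):
            matches.append(path.name)
    return matches
```

Calling `find_medical_tasks` on the directory that holds the task files of a local clone would return the matching file names, which is roughly the kind of metadata this dataset records.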
bigbio/pubhealth

Viewer • Updated Dec 22, 2022 • 24.5k • 454 • 2

Note # of samples = 9,804 | 9,804 in paper
Mohammed-Altaf/medical-instruction-120k

Viewer • Updated Nov 16, 2023 • 112k • 148 • 6

Note # of samples = 106,555 | 120,000 in paper
ACI-Bench (https://github.com/wyim/aci-bench): # of samples = 87 | 87 in paper
har1/MTS_Dialogue-Clinical_Note

Viewer • Updated Apr 1, 2024 • 1.3k • 112 • 5

Note # of samples = 1,301 | 2,602 in paper
[General Domain] I'm not sure what SlimOrca T0 and SlimOrca CoT are. The authors may have sampled from OpenOrca, not SlimOrca.
Open-Orca/SlimOrca-Dedup

Viewer • Updated Dec 8, 2023 • 363k • 924 • 83

Note # of samples = 292,576 in paper
stingning/ultrachat

Viewer • Updated Feb 22, 2024 • 774k • 1.66k • 435
HuggingFaceH4/ultrachat_200k

Viewer • Updated Oct 16, 2024 • 515k • 16.3k • 518

Note # of samples = 50,953 in paper

[DPO dataset]
HuggingFaceH4/ultrafeedback_binarized

Viewer • Updated Oct 16, 2024 • 187k • 5.47k • 275
snorkelai/Snorkel-Mistral-PairRM-DPO-Dataset

Viewer • Updated Jan 23, 2024 • 62.7k • 193 • 42
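
The notes above pair each collected sample count with the count reported in the paper. As a rough sketch, the gaps can be summarized by dividing one by the other. The figures below are copied from a few of the notes; the names `COUNTS` and `ratio_to_paper` are illustrative only.

```python
# (samples collected here, samples reported in the paper), copied from the notes.
COUNTS = {
    "openlifescienceai/medmcqa": (187_005, 180_462),
    "medalpaca/medical_meadow_cord19": (821_007, 17_721),
    "vblagoje/PubMedQA_instruction": (272_458, 499),
    "dvilares/head_qa": (2_657, 2_657),
    "har1/MTS_Dialogue-Clinical_Note": (1_301, 2_602),
}


def ratio_to_paper(collected: int, paper: int) -> float:
    """Return how many times larger the collected set is than the paper's count."""
    return collected / paper


for name, (collected, paper) in COUNTS.items():
    ratio = ratio_to_paper(collected, paper)
    print(f"{name}: {collected:,} collected vs {paper:,} in paper ({ratio:.2f}x)")
```

A ratio far from 1, as with cord19 and PubMedQA, suggests the authors used a subset or a different split, which matches the guesses in the notes.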