Update on the Evaluation Method for Ko Common Gen V2
We previously evaluated with datasets built in the MMLU style (which scores only the letter/number of the answer) and the ARC style (which scores only the answer sentence), but not with the widely adopted AI Harness style (which scores the answer's letter/number together with the answer sentence). To close this gap, we have modified the KoCommonGen dataset so that models are evaluated with the AI Harness method. More information on AI Harness can be found here: https://huggingface.co./blog/evaluating-mmlu-leaderboard.
AI Harness (MMLU+ARC) Method
concept set: {I, moral, content, topic, lecture, do}
1. I give a lecture with a moral theme.
2. I did not give a lecture on a moral topic.
3. I gave a lecture because of a moral topic.
4. Moral content makes me lecture.
Answer: 2. I did not give a lecture on a moral topic.
MMLU Method
concept set: {I, moral, content, topic, lecture, do}
1. I give a lecture with a moral theme.
2. I did not give a lecture on a moral topic.
3. I gave a lecture because of a moral topic.
4. Moral content makes me lecture.
Answer: 2
ARC Method
concept set: {I, moral, content, topic, lecture, do}
1. I give a lecture with a moral theme.
2. I did not give a lecture on a moral topic.
3. I gave a lecture because of a moral topic.
4. Moral content makes me lecture.
Answer: I did not give a lecture on a moral topic.
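To make the difference concrete, below is a minimal Python sketch of what each style actually scores. The `loglikelihood` function and the `pick_answer` helper are hypothetical stand-ins, not the leaderboard's actual code; lm-evaluation-harness exposes a similar log P(continuation | context) request, and only the construction of the scored strings is the point here.

```python
# Minimal sketch of the three multiple-choice scoring styles above.
# `loglikelihood` is a hypothetical placeholder for a real model scorer.

def loglikelihood(context: str, continuation: str) -> float:
    raise NotImplementedError("plug in a real language-model scorer")

def pick_answer(prompt: str, choices: list[str], style: str) -> int:
    """Score each candidate and return the 1-based index of the best one."""
    scores = []
    for i, sentence in enumerate(choices, start=1):
        if style == "mmlu":        # score only the answer number
            continuation = f" {i}"
        elif style == "arc":       # score only the answer sentence
            continuation = f" {sentence}"
        elif style == "harness":   # score number and sentence together
            continuation = f" {i}. {sentence}"
        else:
            raise ValueError(f"unknown style: {style}")
        scores.append(loglikelihood(prompt, continuation))
    return scores.index(max(scores)) + 1
```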
Additionally, the distribution of correct answers in the previous dataset was skewed toward one position. Following MMLU and ARC, we adjusted it so that each answer position is equally represented: the correct answers in the test set were randomized so that positions 1, 2, 3, and 4 each appear with 25% probability.
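For illustration, here is a minimal sketch of such a rebalancing step. The question record with a `choices` list and a 1-based `answer` field is a hypothetical format, not the actual dataset schema.

```python
import random

def shuffle_answer_positions(question: dict, rng: random.Random) -> dict:
    """Shuffle the candidate sentences so the gold answer lands on each of
    positions 1-4 with equal (25%) probability, then remap the gold index.
    Assumes the candidate sentences are distinct."""
    choices = list(question["choices"])      # four candidate sentences
    gold = choices[question["answer"] - 1]   # current gold sentence
    rng.shuffle(choices)
    return {**question,
            "choices": choices,
            "answer": choices.index(gold) + 1}  # new 1-based gold index
```

Applied with an independent shuffle per question, each answer position ends up holding the gold answer about 25% of the time over the whole test set.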
Submitted models will be re-evaluated sequentially. During the re-evaluation period, they may temporarily disappear from the leaderboard or show unstable scores.
Going forward, we plan to evaluate models continuously with a wider variety of methods and to add more datasets. Because of these additions and changes, the scores of some models may change; we ask for your understanding in advance. Rest assured, we are committed to evaluating more models on a more comprehensive set of data.
NLP & AI Lab at Korea University / Upstage
Hello. Starting now, we will re-measure the ko-commongenv2 scores. To prevent confusion over the results, all scores currently on the leaderboard will be deleted. To avoid inconvenience when comparing scores during the re-evaluation, and so that past scores remain accessible, we have created a copy of the current leaderboard at the link below:
https://huggingface.co./spaces/choco9966/open-ko-llm-leaderboard
Please note that models submitted to choco9966/open-ko-llm-leaderboard will not be evaluated; use it for reference only.
What evaluation metrics are used to evaluate Common Gen V2? @choco9966 @Limerobot