Update on the Evaluation Method for Ko Common Gen V2
We previously evaluated with datasets built in the MMLU style (which scores only the letter/number of the answer) and the ARC style (which scores only the answer sentence), but not with the widely adopted AI Harness style (which scores the answer's letter/number together with the answer sentence). To close this gap, we have modified the KoCommonGen dataset so that models are evaluated with the AI Harness method. More information on AI Harness can be found here: https://huggingface.co./blog/evaluating-mmlu-leaderboard.
AI Harness (MMLU+ARC) Method
concept set: {I, moral, content, topic, lecture, do}
1. I give a lecture with a moral theme.
2. I did not give a lecture on a moral topic.
3. I gave a lecture because of a moral topic.
4. Moral content makes me lecture.
Answer: 2. I did not give a lecture on a moral topic.
MMLU Method
concept set: {I, moral, content, topic, lecture, do}
1. I give a lecture with a moral theme.
2. I did not give a lecture on a moral topic.
3. I gave a lecture because of a moral topic.
4. Moral content makes me lecture.
Answer: 2
ARC Method
concept set: {I, moral, content, topic, lecture, do}
1. I give a lecture with a moral theme.
2. I did not give a lecture on a moral topic.
3. I gave a lecture because of a moral topic.
4. Moral content makes me lecture.
Answer: I did not give a lecture on a moral topic.
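To make the difference concrete, below is a minimal Python sketch of what each style actually scores. The `loglikelihood` function and the `pick_answer` helper are hypothetical stand-ins, not the leaderboard's actual code; lm-evaluation-harness exposes a similar log P(continuation | context) request, and only the construction of the scored strings is the point here.

```python
# Minimal sketch of the three multiple-choice scoring styles above.
# `loglikelihood` is a hypothetical placeholder for a real model scorer.

def loglikelihood(context: str, continuation: str) -> float:
    raise NotImplementedError("plug in a real language-model scorer")

def pick_answer(prompt: str, choices: list[str], style: str) -> int:
    """Score each candidate and return the 1-based index of the best one."""
    scores = []
    for i, sentence in enumerate(choices, start=1):
        if style == "mmlu":        # score only the answer number
            continuation = f" {i}"
        elif style == "arc":       # score only the answer sentence
            continuation = f" {sentence}"
        elif style == "harness":   # score number and sentence together
            continuation = f" {i}. {sentence}"
        else:
            raise ValueError(f"unknown style: {style}")
        scores.append(loglikelihood(prompt, continuation))
    return scores.index(max(scores)) + 1
```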
Additionally, the distribution of correct answers in the previous dataset was skewed toward one position. Following MMLU and ARC, we adjusted it so that each answer position is equally represented: the correct answers in the test set were randomized so that positions 1, 2, 3, and 4 each appear with 25% probability.
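For illustration, here is a minimal sketch of such a rebalancing step. The question record with a `choices` list and a 1-based `answer` field is a hypothetical format, not the actual dataset schema.

```python
import random

def shuffle_answer_positions(question: dict, rng: random.Random) -> dict:
    """Shuffle the candidate sentences so the gold answer lands on each of
    positions 1-4 with equal (25%) probability, then remap the gold index.
    Assumes the candidate sentences are distinct."""
    choices = list(question["choices"])      # four candidate sentences
    gold = choices[question["answer"] - 1]   # current gold sentence
    rng.shuffle(choices)
    return {**question,
            "choices": choices,
            "answer": choices.index(gold) + 1}  # new 1-based gold index
```

Applied with an independent shuffle per question, each answer position ends up holding the gold answer about 25% of the time over the whole test set.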
Submitted models will be re-evaluated sequentially. During the re-evaluation period, they may temporarily disappear from the leaderboard or show unstable scores.
Going forward, we plan to evaluate models continuously with a wider variety of methods and to add more datasets. Because of these additions and changes, the scores of some models may change; we ask for your understanding in advance. Rest assured, we are committed to evaluating more models on a more comprehensive set of data.
NLP & AI Lab at Korea University / Upstage
Hello. Starting now, we will re-measure the ko-commongenv2 scores. To prevent confusion over the results, all scores currently on the leaderboard will be deleted. To avoid inconvenience when comparing scores during the re-evaluation, and so that past scores remain accessible, we have created a copy of the current leaderboard at the link below:
https://huggingface.co./spaces/choco9966/open-ko-llm-leaderboard
Please note that models submitted to choco9966/open-ko-llm-leaderboard will not be evaluated; use it for reference only.
What evaluation metrics are used to evaluate Common Gen V2? @choco9966 @Limerobot