nokazaki committed on
Commit
5d8b2e9
1 Parent(s): 7e7e086

Fixed some typos.

  1. README.md +3 -4
README.md CHANGED
@@ -126,13 +126,12 @@ We used the Language Model Evaluation Harness(v.0.4.2) and Code Generation LM Ev
 
 ### MT-Bench JA
 
-We used [Japanese MT-Bench](https://wandb.ai/wandb-japan/llm-leaderboard/artifacts/dataset/mtbench_ja_question) to assess the instruction-following capabilities of models.
-We utilized the following settings:
+We used [Japanese MT-Bench](https://wandb.ai/wandb-japan/llm-leaderboard/artifacts/dataset/mtbench_ja_question) to assess the capabilities of multi-turn dialogue with the following settings:
 
-- Implemantation: FastChat [Zheng+, 2023] (commit #e86e70d0)
+- Implementation: FastChat [Zheng+, 2023] (commit #e86e70d0)
 - Question: [Nejumi LLM-Leaderboard NEO, mtbench_ja_question_v3](https://wandb.ai/wandb-japan/llm-leaderboard/artifacts/dataset/mtbench_ja_question/v3)
 - Reference Answer: [Nejumi LLM-Leaderboard NEO, mtbench_ja_referenceanswer_v1](https://wandb.ai/wandb-japan/llm-leaderboard/artifacts/dataset/mtbench_ja_referenceanswer/v1)
-- Prompt for Judge: [Nejumi LLM-Lederboard NEO, mtbench_ja_prompt_v1](https://wandb.ai/wandb-japan/llm-leaderboard/artifacts/dataset/mtbench_ja_prompt/v1)
+- Prompt for Judge: [Nejumi LLM-Leaderboard NEO, mtbench_ja_prompt_v1](https://wandb.ai/wandb-japan/llm-leaderboard/artifacts/dataset/mtbench_ja_prompt/v1)
 - Judge: `gpt-4-1106-preview`
 - Scoring: Absolute scale normalized to a 0-1 range, averaged over five runs.
 
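
The "Scoring" setting in the README does not spell out the arithmetic, so the following is only a minimal sketch of one plausible reading. It assumes the judge returns ratings on a 1-10 absolute scale (as in FastChat's MT-Bench), that normalization means dividing by the scale maximum, and that scores are averaged within each run before averaging across the five runs. The function names `normalize_score` and `aggregate_runs` are hypothetical and do not come from the README or from FastChat.

```python
from statistics import mean


def normalize_score(score: float, max_score: float = 10.0) -> float:
    """Map a judge rating on an absolute scale to the 0-1 range.

    Assumption: normalization is a plain division by the scale maximum.
    """
    if not 0 <= score <= max_score:
        raise ValueError(f"score {score} is outside the range 0..{max_score}")
    return score / max_score


def aggregate_runs(runs: list[list[float]], max_score: float = 10.0) -> float:
    """Average normalized ratings within each run, then across runs.

    Assumption: each inner list holds the judge ratings from one evaluation run.
    """
    if not runs or any(not run for run in runs):
        raise ValueError("every run must contain at least one rating")
    per_run = [mean(normalize_score(s, max_score) for s in run) for run in runs]
    return mean(per_run)


if __name__ == "__main__":
    # Five hypothetical runs with two judge ratings each (illustrative values only).
    example_runs = [[8, 6], [7, 7], [9, 5], [10, 4], [6, 8]]
    print(aggregate_runs(example_runs))
```

Averaging per run first keeps each run equally weighted even if a run is missing some ratings; if every run covers the same questions, this gives the same result as pooling all ratings.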