tokyotech-llm/Llama-3.1-Swallow-70B-Instruct-v0.1

nokazaki committed 18 days ago

Commit 5d8b2e9 • 1 Parent(s): 7e7e086

Fixed some typos.

Files changed (1)

README.md +3 -4

@@ -126,13 +126,12 @@ We used the Language Model Evaluation Harness(v.0.4.2) and Code Generation LM Ev
 ### MT-Bench JA
-We used [Japanese MT-Bench](https://wandb.ai/wandb-japan/llm-leaderboard/artifacts/dataset/mtbench_ja_question) to assess the instruction-following capabilities of models.
-We utilized the following settings:
-- Implemantation: FastChat [Zheng+, 2023] (commit #e86e70d0)
+We used [Japanese MT-Bench](https://wandb.ai/wandb-japan/llm-leaderboard/artifacts/dataset/mtbench_ja_question) to assess the capabilities of multi-turn dialogue with the following settings:
+- Implementation: FastChat [Zheng+, 2023] (commit #e86e70d0)
 - Question: [Nejumi LLM-Leaderboard NEO, mtbench_ja_question_v3](https://wandb.ai/wandb-japan/llm-leaderboard/artifacts/dataset/mtbench_ja_question/v3)
 - Reference Answer: [Nejumi LLM-Leaderboard NEO, mtbench_ja_referenceanswer_v1](https://wandb.ai/wandb-japan/llm-leaderboard/artifacts/dataset/mtbench_ja_referenceanswer/v1)
-- Prompt for Judge: [Nejumi LLM-Lederboard NEO, mtbench_ja_prompt_v1](https://wandb.ai/wandb-japan/llm-leaderboard/artifacts/dataset/mtbench_ja_prompt/v1)
+- Prompt for Judge: [Nejumi LLM-Leaderboard NEO, mtbench_ja_prompt_v1](https://wandb.ai/wandb-japan/llm-leaderboard/artifacts/dataset/mtbench_ja_prompt/v1)
 - Judge: `gpt-4-1106-preview`
 - Scoring: Absolute scale normalized to a 0-1 range, averaged over five runs.
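
The scoring rule in the last bullet (absolute scale normalized to a 0-1 range, averaged over five runs) can be sketched roughly as follows. This is only an illustration, not the authors' evaluation code: it assumes the judge rates each answer on a 1-10 scale and that normalization means dividing by 10, neither of which is stated in the README. The names `normalize` and `mt_bench_score` are hypothetical, and the scores are made up.

```python
from statistics import mean

# Assumption: the judge rates each answer on a 1-10 absolute scale,
# so dividing by 10 maps a rating into the 0-1 range.
JUDGE_SCALE_MAX = 10.0


def normalize(score: float, scale_max: float = JUDGE_SCALE_MAX) -> float:
    """Map a raw judge rating onto the 0-1 range."""
    return score / scale_max


def mt_bench_score(runs: list[list[float]]) -> float:
    """Average normalized ratings within each run, then across runs."""
    per_run = [mean(normalize(s) for s in run) for run in runs]
    return mean(per_run)


# Made-up ratings: five runs, three judged answers per run.
runs = [
    [8, 7, 9],
    [7, 7, 9],
    [8, 6, 9],
    [8, 7, 8],
    [9, 7, 9],
]
final_score = mt_bench_score(runs)
print(f"MT-Bench JA score (0-1): {final_score:.3f}")
```

Averaging per run first and then across runs gives the same result as a flat average only when every run has the same number of judged answers, as in this example.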