ptrdvn committed on
Commit 80e1691
1 Parent(s): 3439dd3

Update README.md

Files changed (1)
  1. README.md +12 -0
README.md CHANGED
@@ -53,8 +53,20 @@ for output in outputs:
 
 We find that this is the best performing model in the 7/8B class of LLMs on a multitude of Japanese language benchmarks.
 
+ We calculate our Japanese evaluation scores using our [lightblue-tech/japanese_llm_eval](https://github.com/lightblue-tech/japanese_llm_eval) repo.
+
 ![image/png](https://cdn-uploads.huggingface.co/production/uploads/64b63f8ad57e02621dc93c8b/2obyDbrjiNV3PGfwom6EI.png)
 
+ We also compare our Japanese model to our multilingual model using our [multilingual_mt_bench](https://github.com/Peter-Devine/multilingual_mt_bench/tree/main/fastchat/llm_judge) repo.
+
+ | | **lightblue/suzume-llama-3-8B-japanese** | **lightblue/suzume-llama-3-8B-multilingual** | **Nexusflow/Starling-LM-7B-beta** | **gpt-3.5-turbo** |
+ |-----------------|------------------------------------------|----------------------------------------------|-----------------------------------|-------------------|
+ | **Japanese 🇯🇵** | 6.24 | 6.56 | 6.22 | 7.84 |
+
+ Here, we find that our multilingual model outperforms our Japanese model on the Japanese MT-Bench benchmark, indicating that training on more data helped the multilingual model generalize better to Japanese MT-Bench, even though the added data was not in Japanese.
+
+ Note: the discrepancy between the MT-Bench scores of the first and second evaluations of `lightblue/suzume-llama-3-8B-japanese` is due to the difference in the system messages of the two evaluation harnesses: the former's system message is in Japanese, while the latter's is in English.
+
 # Training data
 
 We train on three sources of data to create this model
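
To make the note about system messages concrete, here is a minimal sketch, not the actual evaluation harnesses, of how two harnesses that differ only in their default system message feed different prompts to the same model, which can shift MT-Bench-style scores. The model identifier comes from this README; the system messages and question are illustrative assumptions, not the harnesses' real defaults.

```python
# Minimal sketch: two hypothetical harnesses whose only difference is the
# default system message. The strings below are illustrative assumptions.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("lightblue/suzume-llama-3-8B-japanese")

# Example MT-Bench-style question ("Explain Japan's four seasons.")
question = "日本の四季について説明してください。"

system_messages = {
    "japanese_harness": "あなたは親切なアシスタントです。",  # Japanese system message
    "english_harness": "You are a helpful assistant.",       # English system message
}

for harness, system_message in system_messages.items():
    # Build the exact prompt string the model would be evaluated on.
    prompt = tokenizer.apply_chat_template(
        [
            {"role": "system", "content": system_message},
            {"role": "user", "content": question},
        ],
        tokenize=False,
        add_generation_prompt=True,
    )
    print(f"--- {harness} ---")
    print(prompt)  # the two prompts differ only in the system turn
```

Because the model conditions its answer on the system turn, the same question can yield answers of different style and language, which the MT-Bench judge then scores differently.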