Camille7777 committed
Commit dbfcfb7
Parent: 90c95f3

Update README.md

Files changed (1): README.md (+24 −6)
README.md CHANGED
@@ -13,6 +13,10 @@ language:
  </h1>
  </div>
 
+ <div align="center">
+ 🎉 We released Colossal-LLaMA-2-7B-base, based on LLaMA-2!
+ </div>
+
  <div align="center">
  |<a href="https://github.com/hpcaitech/Colossal-LLaMA-2/" target="_blank">🔥 GitHub </a> |
  <a href="https://github.com/baichuan-inc/Baichuan-7B/blob/main/media/wechat.jpeg?raw=true" target="_blank">😊 Slack</a>|
@@ -67,7 +71,15 @@ print(tokenizer.decode(pred.cpu()[0], skip_special_tokens=True)[len(input):])
 
 
  # Performance Evaluation
- We conducted comprehensive evaluation on 5 datasets and compare our Colossal-LLaMA-2-7b-base model with various models. We use 5-shot for MMLU and CMMLU and calculate scores based on the logits of first predicted token. We use 5-shot for AGIEval and only calculate scores for 4-choice questions using a combination metric of exact match and the logits of first predicted token. We use 0-shot for GAOKAO-Bench and only calculate scores for single-choice questions using the same metric used for AGIEval. The generation config for AGIEval and GAOKAO-Bench is greedy search. We also provided CEval scores from its lastest leaderboard or the official repository of the model.
+ We conducted a comprehensive evaluation on 4 datasets and compared our Colossal-LLaMA-2-7b-base model with various models.
+
+ * We use 5-shot for MMLU and calculate scores based on the logits of the first predicted token.
+ * We use 5-shot for CMMLU and calculate scores based on the logits of the first predicted token.
+ * We use 5-shot for AGIEval and only calculate scores for 4-choice questions, using a combined metric of exact match and the logits of the first predicted token: if either is correct, the model gets the score.
+ * We use 0-shot for GAOKAO-Bench and only calculate scores for 4-choice questions based on the logits of the first predicted token.
+ * The generation config for all datasets is greedy search.
+ * We also provide CEval scores from the latest CEval leaderboard or from the official repository of each model.
 
  | | Backbone | Tokens Consumed | | MMLU | CMMLU | AGIEval | GAOKAO | CEval |
  | :----------------------------: | :--------: | :-------------: | :------------------: | :-----------: | :-----: | :----: | :----: | :------------------------------: |
@@ -79,7 +91,7 @@ We conducted comprehensive evaluation on 5 datasets and compare our Colossal-LLaMA-2-7b-base model with various models.
  | ChatGLM-6B | - | 1.0T | | 39.67 (40.63) | 41.17 (-) | 40.10 | 36.53 | 38.90 |
  | ChatGLM2-6B | - | 1.4T | | 44.74 (45.46) | 49.40 (-) | 46.36 | 45.49 | 51.70 |
  | InternLM-7B | - | - | | 46.70 (51.00) | 52.00 (-) | 44.77 | 61.64 | 52.80 |
- | Qwen-7B | - | 2.2T | | 48.54 (56.70) | 56.03 (58.80) | 52.47 | 56.42 | 59.60 |
+ | Qwen-7B | - | 2.2T | | 54.29 (56.70) | 56.03 (58.80) | 52.47 | 56.42 | 59.60 |
  | | | | | | | | | |
  | Llama-2-7B | - | 2.0T | | 44.47 (45.30) | 32.97 (-) | 32.60 | 25.46 | - |
  | Linly-AI/Chinese-LLaMA-2-7B-hf | Llama-2-7B | 1.0T | | 37.43 | 29.92 | 32.00 | 27.57 | - |
@@ -92,11 +104,15 @@ We conducted comprehensive evaluation on 5 datasets and compare our Colossal-LLaMA-2-7b-base model with various models.
  | | | | | | | | | |
  | **Colossal-LLaMA-2-7b-base** | Llama-2-7B | **0.0085T** | | 53.06 | 49.89 | 51.48 | 58.82 | 50.2 |
 
- - The score in parentheses corresponds to the scores in the official repository of the model.
- - We use zero-shot for ChatGLM models.
- - Qwen-7B is now inaccessible in Hugging Face, we are using the latest version of it before it was made inaccessible.
+ > The score in parentheses corresponds to the score in the model's official repository.
+ >
+ > We use zero-shot for the ChatGLM models.
+ >
+ > Qwen-7B is now inaccessible on Hugging Face; we used the latest version available before it was made inaccessible. For MMLU only, Qwen-7B's prompt ends with "xxx Answer:" (no space after the colon), and we calculate the logits over " A", " B", " C", and " D". Qwen-7B tends to be much more deterministic than other models; for example, its logit for " A" can be `-inf`, making its softmax probability exactly `0`.
+ >
+ > For all other models and datasets, we calculate the logits over "A", "B", "C", and "D". A minimal sketch of this first-token scoring appears below.
 
- ❗️ More details of the evaluation methods and reproduction of the results, please refer to [TODO ColossalEval]().
+ ❗️ For more details on the evaluation methods and on reproducing the results, please refer to [TODO: ColossalEval]().
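To make the first-token scoring concrete, here is a minimal sketch, assuming a standard Hugging Face `transformers` causal-LM interface. It illustrates the method described in the notes above, not the actual ColossalEval implementation, and the prompt text is a placeholder.

```python
# Minimal sketch of multiple-choice scoring via the logits of the first
# predicted token. Illustrative only -- not the ColossalEval implementation;
# the prompt below is a placeholder.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("hpcaitech/Colossal-LLaMA-2-7b-base")
model = AutoModelForCausalLM.from_pretrained("hpcaitech/Colossal-LLaMA-2-7b-base")
model.eval()

prompt = "Question: ...\nA. ...\nB. ...\nC. ...\nD. ...\nAnswer:"
choices = ["A", "B", "C", "D"]  # per the note above, Qwen-7B on MMLU uses " A", " B", " C", " D"

inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    # Logits at the last input position give the distribution of the first predicted token.
    next_token_logits = model(**inputs).logits[0, -1]

# Restrict the distribution to the four choice letters (first sub-token of each)
# and take the argmax as the model's answer.
choice_ids = [tokenizer.encode(c, add_special_tokens=False)[0] for c in choices]
choice_logits = next_token_logits[choice_ids]
prediction = choices[int(torch.argmax(choice_logits))]
probs = torch.softmax(choice_logits, dim=-1)  # can be exactly 0 where a logit is -inf
print(prediction, probs.tolist())
```

The exact-match half of the AGIEval metric would additionally compare a greedily decoded answer (generation with `do_sample=False`) against the reference, consistent with the greedy-search generation config noted above.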
  # Technical Insights
@@ -177,6 +193,8 @@ Our experiments have revealed that the distributions within the training dataset
 
  To achieve a more balanced distribution and to control the dataset's ordering, we divide each sub-dataset into discrete bins and then combine these bins into individual data buckets, with one bin contributed by each sub-dataset.
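As a toy illustration of this bucket construction (the function and data here are hypothetical; the actual preprocessing lives in the repository linked below):

```python
# Hypothetical sketch of the bin/bucket scheme described above: each
# sub-dataset is split into num_bins bins, and bucket i concatenates
# bin i from every sub-dataset, so every bucket mixes all sources
# while preserving each source's internal order.
from typing import Dict, List

def build_buckets(sub_datasets: Dict[str, List[str]], num_bins: int) -> List[List[str]]:
    buckets: List[List[str]] = [[] for _ in range(num_bins)]
    for samples in sub_datasets.values():
        bin_size = (len(samples) + num_bins - 1) // num_bins  # ceiling division
        for i in range(num_bins):
            # Bin i of this sub-dataset contributes to bucket i.
            buckets[i].extend(samples[i * bin_size : (i + 1) * bin_size])
    return buckets

# Two toy sub-datasets, three buckets; each bucket gets one bin from each source.
buckets = build_buckets(
    {"web": [f"w{i}" for i in range(9)], "code": [f"c{i}" for i in range(6)]},
    num_bins=3,
)
print([len(b) for b in buckets])  # [5, 5, 5]
```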
 
+ For more details, please refer to our [GitHub](https://github.com/hpcaitech/Colossal-LLaMA-2).
+
 
  # Limitations
  Colossal-LLaMA-2-7B is a derivative of LLaMA-2 and carries risks with use. Testing conducted to date has been performed exclusively in English and Chinese, and it cannot cover all possible scenarios. As with other LLMs, the outputs of Colossal-LLaMA-2-7B-base cannot be predicted in advance, and in some situations the model may generate responses that are inaccurate, biased, or otherwise toxic. Consequently, before deploying any application powered by Colossal-LLaMA-2-7B-base, developers should perform safety testing and tuning tailored to their application's specific requirements.
 