---
license: apache-2.0
language:
- en
- zh
library_name: transformers
datasets:
- BAAI/Infinity-Instruct
- BAAI/CCI3-HQ
- mlfoundations/dclm-baseline-1.0
- HuggingFaceFW/fineweb-edu
- HuggingFaceTB/cosmopedia
pipeline_tag: text-generation
---

# Introduction

The **Aquila-135M** model is a small language model trained with a pre-training and annealing paradigm.
It was pre-trained on 1.66T bilingual (Chinese and English) tokens, followed by an annealing stage on 100B tokens of carefully selected, high-quality bilingual data.

We have open-sourced all [bilingual datasets](https://huggingface.co/datasets/BAAI/Aquila-135M-Datasets) used in both the pre-training and annealing phases.
We have also open-sourced all [intermediate checkpoints](https://huggingface.co/BAAI/Aquila-135M-Intermediate).

The **Aquila-135M-Instruct** model is fine-tuned with [Infinity Instruct](https://huggingface.co/datasets/BAAI/Infinity-Instruct).

Aquila-135M and SmolLM2-135M share an identical architecture; the 135M parameter count excludes the embedding (vocabulary) parameters.
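
To make the parameter accounting concrete, here is a minimal sketch (the variable names and the comparison are ours, not from the original card) that counts the model's parameters with and without the embedding table:

```python
# Minimal sketch: compare total vs. non-embedding parameter counts.
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("BAAI/Aquila-135M")

total = sum(p.numel() for p in model.parameters())
# Input embedding table (vocab_size x hidden_size); this part is
# excluded from the quoted 135M figure.
embedding = model.get_input_embeddings().weight.numel()

print(f"total parameters:        {total / 1e6:.1f}M")
print(f"excluding the embedding: {(total - embedding) / 1e6:.1f}M")
```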

The entire training process was conducted using our self-developed Triton operator library, [FlagGems](https://github.com/FlagOpen/FlagGems), and parallel training framework, [FlagScale](https://github.com/FlagOpen/FlagScale).
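
FlagGems can also be switched on at inference time as a Triton backend for PyTorch operators. A minimal sketch follows; `flag_gems.enable()` is taken from the FlagGems README, and the exact API may differ across versions:

```python
# Minimal sketch: run Aquila-135M with FlagGems' Triton kernels enabled.
# Assumption: flag_gems.enable() patches eligible PyTorch ops, per the
# FlagGems README; check the documentation of your installed version.
import flag_gems
from transformers import AutoModelForCausalLM, AutoTokenizer

flag_gems.enable()  # route supported ATen ops through Triton kernels

checkpoint = "BAAI/Aquila-135M"
tokenizer = AutoTokenizer.from_pretrained(checkpoint, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(checkpoint).to("cuda")

inputs = tokenizer("What is gravity?", return_tensors="pt").to("cuda")
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=50)[0]))
```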

## News
- `2024/12/24`: We have released Aquila-135M and Aquila-135M-Instruct.
- `2024/12/24`: We have released all datasets and intermediate checkpoints used during training. Please feel free to use them for analysis and experimentation.

# Evaluation

We followed the evaluation setup of the SmolLM models and evaluated our model with the [lighteval](https://github.com/huggingface/lighteval) tool.
While its performance on English benchmarks is comparable to that of similarly sized models, Aquila-135M achieves significantly better results on Chinese benchmarks.

Among small models with total parameter counts around or below 400M, Aquila-135M maintains leading overall performance while significantly improving Chinese language proficiency.

| Metrics (0-shot) | Aquila-135M (Triton) | Aquila-135M (CUDA) | SmolLM-135M | SmolLM2-135M | gpt2-medium-360M | TinyMistral-248M | TinyMistral-248M-2.5 | OpenELM-270M | Wide-Sheared-LLaMA-290M | opt-350m | MobileLLM-350M | pythia-410m | SmolLM-360M | SmolLM2-360M |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| **HellaSwag** | 41.19 | 41.12 | 41.15 | 42.10 | 37.08 | 27.06 | 26.80 | 45.74 | 24.94 | 36.08 | 26.28 | 39.22 | 51.73 | 54.66 |
| **ARC (Average)** | 44.76 | 44.15 | 42.34 | 43.93 | 34.34 | 29.71 | 27.63 | 35.74 | 26.20 | 31.91 | 27.72 | 35.14 | 49.95 | 53.24 |
| **PIQA** | 66.38 | 67.52 | 68.28 | 68.44 | 66.38 | 57.40 | 53.92 | 69.75 | 50.60 | 64.36 | 50.27 | 67.19 | 71.55 | 71.98 |
| **MMLU (cloze)** | 31.07 | 30.67 | 30.26 | 31.58 | 27.75 | 25.82 | 25.59 | 27.89 | 24.75 | 26.58 | 24.86 | 28.88 | 34.32 | 36.09 |
| **CommonsenseQA** | 32.10 | 31.70 | 32.02 | 32.92 | 31.70 | 24.57 | 21.46 | 35.71 | 16.54 | 32.10 | 17.53 | 31.45 | 36.61 | 38.74 |
| **TriviaQA** | 6.65 | 7.02 | 4.24 | 4.03 | 2.36 | 0.50 | 0.08 | 1.34 | 0.00 | 1.38 | 0.00 | 2.06 | 9.19 | 16.92 |
| **Winogrande** | 51.07 | 51.70 | 51.22 | 50.99 | 49.49 | 49.25 | 49.01 | 52.41 | 49.72 | 51.54 | 49.41 | 49.96 | 53.12 | 52.49 |
| **OpenBookQA** | 34.40 | 34.40 | 33.80 | 34.60 | 31.40 | 29.40 | 27.40 | 30.60 | 26.00 | 27.80 | 24.80 | 28.40 | 37.20 | 37.00 |
| **GSM8K (5-shot)** | 2.12 | 2.12 | 1.00 | 1.52 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 2.81 |
| **SIQA** | 41.81 | 42.32 | 41.15 | 41.45 | 41.30 | 41.86 | 39.71 | 42.73 | 39.76 | 42.37 | 37.10 | 42.02 | 43.45 | 41.61 |
| **CEval** | 29.22 | 29.82 | 28.28 | 26.41 | 25.40 | 25.38 | 26.89 | 26.69 | 26.37 | 26.67 | 25.68 | 27.97 | 27.66 | 28.51 |
| **CMMLU** | 29.48 | 29.63 | 26.01 | 26.66 | 27.20 | 26.67 | 25.57 | 26.25 | 26.33 | 26.93 | 25.61 | 26.91 | 27.06 | 27.39 |
| **Average-English** | 35.16 | 35.27 | 34.55 | 35.16 | 32.18 | 28.56 | 27.16 | 34.19 | 25.85 | 31.41 | 25.80 | 32.43 | 38.71 | 40.55 |
| **Average-Chinese** | 29.35 | 29.73 | 27.15 | 26.54 | 26.30 | 26.03 | 26.23 | 26.47 | 26.35 | 26.80 | 25.65 | 27.44 | 27.36 | 27.95 |
| **Average** | 32.25 | 32.50 | 30.85 | 30.85 | 29.24 | 27.29 | 26.70 | 30.33 | 26.10 | 29.11 | 25.72 | 29.94 | 33.04 | 34.25 |

**Average-English** and **Average-Chinese** are the means over the English and Chinese benchmarks respectively, and **Average** is the mean of those two values. All comparison models were evaluated in our local environment, so their scores may differ slightly from those reported in the original papers.
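
As a sanity check, the summary rows can be reproduced from the per-benchmark scores; a short sketch using the Aquila-135M (Triton) column (values copied from the table above, final digits subject to rounding):

```python
# Reproduce the table's summary rows from the Aquila-135M (Triton) column.
english = [41.19, 44.76, 66.38, 31.07, 32.10, 6.65, 51.07, 34.40, 2.12, 41.81]
chinese = [29.22, 29.48]  # CEval, CMMLU

avg_en = sum(english) / len(english)  # Average-English ~= 35.16
avg_zh = sum(chinese) / len(chinese)  # Average-Chinese ~= 29.35
overall = (avg_en + avg_zh) / 2       # Average         ~= 32.25
print(round(avg_en, 2), round(avg_zh, 2), round(overall, 2))
```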

# How to use

## Base Model
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "BAAI/Aquila-135M"

device = "cuda"  # for GPU usage or "cpu" for CPU usage
tokenizer = AutoTokenizer.from_pretrained(checkpoint, trust_remote_code=True)
# for multiple GPUs, install accelerate and use `model = AutoModelForCausalLM.from_pretrained(checkpoint, device_map="auto")`
model = AutoModelForCausalLM.from_pretrained(checkpoint).to(device)

input_text = "什么是引力?"  # "What is gravity?" in Chinese
inputs = tokenizer.encode(input_text, return_tensors="pt").to(device)
outputs = model.generate(inputs, max_new_tokens=500)
print(tokenizer.decode(outputs[0]))

input_text = "What is gravity?"
inputs = tokenizer.encode(input_text, return_tensors="pt").to(device)
outputs = model.generate(inputs, max_new_tokens=500)
print(tokenizer.decode(outputs[0]))
```
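
Alternatively, generation can be driven through the high-level `pipeline` API from transformers; a minimal sketch (our own variation, not from the original card):

```python
# Minimal sketch: the same base model via the text-generation pipeline.
from transformers import pipeline

pipe = pipeline(
    "text-generation",
    model="BAAI/Aquila-135M",
    trust_remote_code=True,
    device=0,  # first GPU; use device="cpu" for CPU
)
print(pipe("What is gravity?", max_new_tokens=100)[0]["generated_text"])
```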

## Instruct Model
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "BAAI/Aquila-135M-Instruct"

device = "cuda"  # for GPU usage or "cpu" for CPU usage
tokenizer = AutoTokenizer.from_pretrained(checkpoint, trust_remote_code=True)
# for multiple GPUs, install accelerate and use `model = AutoModelForCausalLM.from_pretrained(checkpoint, device_map="auto")`
model = AutoModelForCausalLM.from_pretrained(checkpoint).to(device)

messages = [{"role": "user", "content": "什么是引力?"}]  # "What is gravity?" in Chinese
input_text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(input_text)
inputs = tokenizer.encode(input_text, return_tensors="pt").to(device)
outputs = model.generate(inputs, max_new_tokens=500)
print(tokenizer.decode(outputs[0]))

messages = [{"role": "user", "content": "What is gravity?"}]
input_text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(input_text)
inputs = tokenizer.encode(input_text, return_tensors="pt").to(device)
outputs = model.generate(inputs, max_new_tokens=500)
print(tokenizer.decode(outputs[0]))
```

# Future Plan

* We plan to further optimize dataset selection and data-mixing proportions.

# Citation
If you find this work useful, please cite:
```
@misc{aquila-135m,
      title={Aquila-135M: A Bilingual Small Language Model in Chinese and English},
      author={BAAI},
      year={},
      eprint={},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={},
}
```