---
license: apache-2.0
language:
- en
- zh
library_name: transformers
datasets:
- BAAI/Infinity-Instruct
- BAAI/CCI3-HQ
- mlfoundations/dclm-baseline-1.0
- HuggingFaceFW/fineweb-edu
- HuggingFaceTB/cosmopedia
pipeline_tag: text-generation
---

# Introduction

The **Aquila-135M** model is a small language model trained with a pre-training and annealing paradigm.
The model was pre-trained on 1.66T bilingual (Chinese and English) tokens, then annealed on 100B tokens of high-quality bilingual data to obtain the final model.

We have open-sourced all [bilingual datasets](https://huggingface.co/datasets/BAAI/Aquila-135M-Datasets) used in both the pre-training and annealing phases.
We have also open-sourced all [intermediate checkpoints](https://huggingface.co/BAAI/Aquila-135M-Intermediate).

The **Aquila-135M-Instruct** model is fine-tuned using [Infinity Instruct](https://huggingface.co/datasets/BAAI/Infinity-Instruct).

Excluding the embedding (vocabulary) parameters, Aquila-135M and SmolLM2-135M share an identical architecture.

The entire training process was conducted with our self-developed Triton operator library, [FlagGems](https://github.com/FlagOpen/FlagGems), and parallel training framework, [FlagScale](https://github.com/FlagOpen/FlagScale).

## News
- `2024/12/24`: We have released Aquila-135M and Aquila-135M-Instruct.
- `2024/12/24`: We have released all datasets and intermediate checkpoints from training. Please feel free to use them for analysis and experimentation.

# Evaluation

We followed the evaluation settings of the SmolLM models and evaluated the model with the [lighteval](https://github.com/huggingface/lighteval) tool.

While its English benchmark performance is comparable to SmolLM2-135M, Aquila-135M demonstrates significantly better results on Chinese benchmarks.

Among small models with a total parameter count around 400M or below, Aquila-135M maintains leading overall performance while significantly improving Chinese language proficiency.

| Metrics (0-shot) | Aquila-135M (Triton) | Aquila-135M (CUDA) | SmolLM-135M | SmolLM2-135M | gpt2-medium-360M | TinyMistral-248M | TinyMistral-248M-2.5 | OpenELM-270M | Wide-Sheared-LLaMA-290M | opt-350m | MobileLLM-350M | pythia-410m | SmolLM-360M | SmolLM2-360M |
|---------------------------|-----------------------|--------------------|-------------|---------------|------------------|------------------|----------------------|--------------|--------------------------|----------|----------------|-------------|-------------|--------------|
| **HellaSwag** | 41.19 | 41.12 | 41.15 | 42.10 | 37.08 | 27.06 | 26.80 | 45.74 | 24.94 | 36.08 | 26.28 | 39.22 | 51.73 | 54.66 |
| **ARC (Average)** | 44.76 | 44.15 | 42.34 | 43.93 | 34.34 | 29.71 | 27.63 | 35.74 | 26.20 | 31.91 | 27.72 | 35.14 | 49.95 | 53.24 |
| **PIQA** | 66.38 | 67.52 | 68.28 | 68.44 | 66.38 | 57.40 | 53.92 | 69.75 | 50.60 | 64.36 | 50.27 | 67.19 | 71.55 | 71.98 |
| **MMLU (cloze)** | 31.07 | 30.67 | 30.26 | 31.58 | 27.75 | 25.82 | 25.59 | 27.89 | 24.75 | 26.58 | 24.86 | 28.88 | 34.32 | 36.09 |
| **CommonsenseQA** | 32.10 | 31.70 | 32.02 | 32.92 | 31.70 | 24.57 | 21.46 | 35.71 | 16.54 | 32.10 | 17.53 | 31.45 | 36.61 | 38.74 |
| **TriviaQA** | 6.65 | 7.02 | 4.24 | 4.03 | 2.36 | 0.50 | 0.08 | 1.34 | 0.00 | 1.38 | 0.00 | 2.06 | 9.19 | 16.92 |
| **Winogrande** | 51.07 | 51.70 | 51.22 | 50.99 | 49.49 | 49.25 | 49.01 | 52.41 | 49.72 | 51.54 | 49.41 | 49.96 | 53.12 | 52.49 |
| **OpenBookQA** | 34.40 | 34.40 | 33.80 | 34.60 | 31.40 | 29.40 | 27.40 | 30.60 | 26.00 | 27.80 | 24.80 | 28.40 | 37.20 | 37.00 |
| **GSM8K (5-shot)** | 2.12 | 2.12 | 1.00 | 1.52 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 2.81 |
| **SIQA** | 41.81 | 42.32 | 41.15 | 41.45 | 41.30 | 41.86 | 39.71 | 42.73 | 39.76 | 42.37 | 37.10 | 42.02 | 43.45 | 41.61 |
| **CEval** | 29.22 | 29.82 | 28.28 | 26.41 | 25.40 | 25.38 | 26.89 | 26.69 | 26.37 | 26.67 | 25.68 | 27.97 | 27.66 | 28.51 |
| **CMMLU** | 29.48 | 29.63 | 26.01 | 26.66 | 27.20 | 26.67 | 25.57 | 26.25 | 26.33 | 26.93 | 25.61 | 26.91 | 27.06 | 27.39 |
| **Average-English** | 35.16 | 35.27 | 34.55 | 35.16 | 32.18 | 28.56 | 27.16 | 34.19 | 25.85 | 31.41 | 25.80 | 32.43 | 38.71 | 40.55 |
| **Average-Chinese** | 29.35 | 29.73 | 27.15 | 26.54 | 26.30 | 26.03 | 26.23 | 26.47 | 26.35 | 26.80 | 25.65 | 27.44 | 27.36 | 27.95 |
| **Average** | 32.25 | 32.50 | 30.85 | 30.85 | 29.24 | 27.29 | 26.70 | 30.33 | 26.10 | 29.11 | 25.72 | 29.94 | 33.04 | 34.25 |

All comparison models were evaluated in our local environment, so the scores may differ slightly from those reported in the original papers.
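
For reference, the three summary rows are unweighted means: CEval and CMMLU are averaged into the Chinese score, the remaining ten benchmarks into the English score, and **Average** is the mean of those two. The short Python sketch below reproduces this aggregation for the Aquila-135M (Triton) column; it only illustrates how the summary numbers are combined and is not part of the evaluation pipeline.

```python
# Illustration of how the summary rows above are aggregated
# (scores taken from the Aquila-135M (Triton) column).
english = {
    "HellaSwag": 41.19, "ARC": 44.76, "PIQA": 66.38, "MMLU": 31.07,
    "CommonsenseQA": 32.10, "TriviaQA": 6.65, "Winogrande": 51.07,
    "OpenBookQA": 34.40, "GSM8K": 2.12, "SIQA": 41.81,
}
chinese = {"CEval": 29.22, "CMMLU": 29.48}

avg_en = sum(english.values()) / len(english)  # ~35.16 (Average-English)
avg_zh = sum(chinese.values()) / len(chinese)  # ~29.35 (Average-Chinese)
overall = (avg_en + avg_zh) / 2                # ~32.25 (Average)
print(f"English: {avg_en:.2f}  Chinese: {avg_zh:.2f}  Overall: {overall:.2f}")
```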

# How to use

## Base Model
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
checkpoint = "BAAI/Aquila-135M"

device = "cuda"  # for GPU usage or "cpu" for CPU usage
tokenizer = AutoTokenizer.from_pretrained(checkpoint, trust_remote_code=True)
# for multiple GPUs, install accelerate and use `model = AutoModelForCausalLM.from_pretrained(checkpoint, device_map="auto")`
model = AutoModelForCausalLM.from_pretrained(checkpoint).to(device)

input_text = "什么是引力?"  # "What is gravity?" in Chinese
inputs = tokenizer.encode(input_text, return_tensors="pt").to(device)
outputs = model.generate(inputs, max_new_tokens=500)
print(tokenizer.decode(outputs[0]))

input_text = "What is gravity?"
inputs = tokenizer.encode(input_text, return_tensors="pt").to(device)
outputs = model.generate(inputs, max_new_tokens=500)
print(tokenizer.decode(outputs[0]))
```
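
If you prefer the high-level `pipeline` API, a minimal sketch along the following lines should also work; the device and generation arguments here are illustrative assumptions, not recommended settings.

```python
from transformers import pipeline

# Minimal sketch using the text-generation pipeline; adjust the device
# and generation arguments to your setup.
generator = pipeline(
    "text-generation",
    model="BAAI/Aquila-135M",
    trust_remote_code=True,
    device=0,  # use device=-1 (or omit) for CPU
)
print(generator("What is gravity?", max_new_tokens=100)[0]["generated_text"])
```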

## Instruct Model
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
checkpoint = "BAAI/Aquila-135M-Instruct"

device = "cuda"  # for GPU usage or "cpu" for CPU usage
tokenizer = AutoTokenizer.from_pretrained(checkpoint, trust_remote_code=True)
# for multiple GPUs, install accelerate and use `model = AutoModelForCausalLM.from_pretrained(checkpoint, device_map="auto")`
model = AutoModelForCausalLM.from_pretrained(checkpoint).to(device)

messages = [{"role": "user", "content": "什么是引力?"}]  # "What is gravity?" in Chinese
input_text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(input_text)
inputs = tokenizer.encode(input_text, return_tensors="pt").to(device)
outputs = model.generate(inputs, max_new_tokens=500)
print(tokenizer.decode(outputs[0]))

messages = [{"role": "user", "content": "What is gravity?"}]
input_text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(input_text)
inputs = tokenizer.encode(input_text, return_tensors="pt").to(device)
outputs = model.generate(inputs, max_new_tokens=500)
print(tokenizer.decode(outputs[0]))
```
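
Depending on your `transformers` version, `apply_chat_template` can also tokenize and return tensors in one step, which avoids re-encoding the rendered prompt. A minimal sketch follows; the sampling arguments are illustrative assumptions, not tuned settings.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "BAAI/Aquila-135M-Instruct"
device = "cuda"  # or "cpu"
tokenizer = AutoTokenizer.from_pretrained(checkpoint, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(checkpoint).to(device)

# Render the chat template straight to input ids and sample a reply.
messages = [{"role": "user", "content": "What is gravity?"}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(device)
outputs = model.generate(
    inputs, max_new_tokens=500, do_sample=True, temperature=0.7, top_p=0.9
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```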

# Future Plan

* We plan to further optimize dataset selection and data mixing proportions.

## Citation
If you find this work useful, please cite:
```
@misc{aquila-135m,
      title={Aquila-135M: A Bilingual Small Language Model in Chinese and English},
      author={BAAI},
      year={},
      eprint={},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={},
}
```