---
license: apache-2.0
language:
- en
- zh
library_name: transformers
datasets:
- BAAI/Infinity-Instruct
- BAAI/CCI3-HQ
- mlfoundations/dclm-baseline-1.0
- HuggingFaceFW/fineweb-edu
- HuggingFaceTB/cosmopedia
pipeline_tag: text-generation
---

# Introduction

The **Aquila-135M** model is a small language model trained with a pre-training and annealing paradigm.
The model was pre-trained on 1.66T bilingual (Chinese and English) tokens, then annealed on 100B tokens of high-quality bilingual data to obtain the final model.

We have open-sourced all [bilingual datasets](https://huggingface.co/datasets/BAAI/Aquila-135M-Datasets) used in both the pre-training and annealing phases.
We have also open-sourced all [intermediate checkpoints](https://huggingface.co/BAAI/Aquila-135M-Intermediate).

The **Aquila-135M-Instruct** model is fine-tuned using [Infinity Instruct](https://huggingface.co/datasets/BAAI/Infinity-Instruct).

Excluding the embedding (vocabulary) parameters, Aquila-135M and SmolLM2-135M share an identical architecture.

The entire training process was conducted with our self-developed Triton operator library, [FlagGems](https://github.com/FlagOpen/FlagGems), and parallel training framework, [FlagScale](https://github.com/FlagOpen/FlagScale).

## News
- `2024/12/24`: We have released Aquila-135M and Aquila-135M-Instruct.
- `2024/12/24`: We have released all datasets and intermediate checkpoints from training. Please feel free to use them for analysis and experimentation.

# Evaluation

We followed the evaluation settings of the SmolLM models and evaluated the model with the [lighteval](https://github.com/huggingface/lighteval) tool.

While its English benchmark performance is comparable to SmolLM2-135M, Aquila-135M demonstrates significantly better results on Chinese benchmarks.

Among small models with a total parameter count around 400M or below, Aquila-135M maintains leading overall performance while significantly improving Chinese language proficiency.

| Metrics (0-shot) | Aquila-135M (Triton) | Aquila-135M (CUDA) | SmolLM-135M | SmolLM2-135M | gpt2-medium-360M | TinyMistral-248M | TinyMistral-248M-2.5 | OpenELM-270M | Wide-Sheared-LLaMA-290M | opt-350m | MobileLLM-350M | pythia-410m | SmolLM-360M | SmolLM2-360M |
|---------------------------|-----------------------|--------------------|-------------|---------------|------------------|------------------|----------------------|--------------|--------------------------|----------|----------------|-------------|-------------|--------------|
| **HellaSwag** | 41.19 | 41.12 | 41.15 | 42.10 | 37.08 | 27.06 | 26.80 | 45.74 | 24.94 | 36.08 | 26.28 | 39.22 | 51.73 | 54.66 |
| **ARC (Average)** | 44.76 | 44.15 | 42.34 | 43.93 | 34.34 | 29.71 | 27.63 | 35.74 | 26.20 | 31.91 | 27.72 | 35.14 | 49.95 | 53.24 |
| **PIQA** | 66.38 | 67.52 | 68.28 | 68.44 | 66.38 | 57.40 | 53.92 | 69.75 | 50.60 | 64.36 | 50.27 | 67.19 | 71.55 | 71.98 |
| **MMLU (cloze)** | 31.07 | 30.67 | 30.26 | 31.58 | 27.75 | 25.82 | 25.59 | 27.89 | 24.75 | 26.58 | 24.86 | 28.88 | 34.32 | 36.09 |
| **CommonsenseQA** | 32.10 | 31.70 | 32.02 | 32.92 | 31.70 | 24.57 | 21.46 | 35.71 | 16.54 | 32.10 | 17.53 | 31.45 | 36.61 | 38.74 |
| **TriviaQA** | 6.65 | 7.02 | 4.24 | 4.03 | 2.36 | 0.50 | 0.08 | 1.34 | 0.00 | 1.38 | 0.00 | 2.06 | 9.19 | 16.92 |
| **Winogrande** | 51.07 | 51.70 | 51.22 | 50.99 | 49.49 | 49.25 | 49.01 | 52.41 | 49.72 | 51.54 | 49.41 | 49.96 | 53.12 | 52.49 |
| **OpenBookQA** | 34.40 | 34.40 | 33.80 | 34.60 | 31.40 | 29.40 | 27.40 | 30.60 | 26.00 | 27.80 | 24.80 | 28.40 | 37.20 | 37.00 |
| **GSM8K (5-shot)** | 2.12 | 2.12 | 1.00 | 1.52 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 2.81 |
| **SIQA** | 41.81 | 42.32 | 41.15 | 41.45 | 41.30 | 41.86 | 39.71 | 42.73 | 39.76 | 42.37 | 37.10 | 42.02 | 43.45 | 41.61 |
| **CEval** | 29.22 | 29.82 | 28.28 | 26.41 | 25.40 | 25.38 | 26.89 | 26.69 | 26.37 | 26.67 | 25.68 | 27.97 | 27.66 | 28.51 |
| **CMMLU** | 29.48 | 29.63 | 26.01 | 26.66 | 27.20 | 26.67 | 25.57 | 26.25 | 26.33 | 26.93 | 25.61 | 26.91 | 27.06 | 27.39 |
| **Average-English** | 35.16 | 35.27 | 34.55 | 35.16 | 32.18 | 28.56 | 27.16 | 34.19 | 25.85 | 31.41 | 25.80 | 32.43 | 38.71 | 40.55 |
| **Average-Chinese** | 29.35 | 29.73 | 27.15 | 26.54 | 26.30 | 26.03 | 26.23 | 26.47 | 26.35 | 26.80 | 25.65 | 27.44 | 27.36 | 27.95 |
| **Average** | 32.25 | 32.50 | 30.85 | 30.85 | 29.24 | 27.29 | 26.70 | 30.33 | 26.10 | 29.11 | 25.72 | 29.94 | 33.04 | 34.25 |

All comparison models were evaluated in our local environment, so the scores may differ slightly from those reported in the original papers.
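
For reference, the three summary rows are unweighted means: CEval and CMMLU are averaged into the Chinese score, the remaining ten benchmarks into the English score, and **Average** is the mean of those two. The short Python sketch below reproduces this aggregation for the Aquila-135M (Triton) column; it only illustrates how the summary numbers are combined and is not part of the evaluation pipeline.

```python
# Illustration of how the summary rows above are aggregated
# (scores taken from the Aquila-135M (Triton) column).
english = {
    "HellaSwag": 41.19, "ARC": 44.76, "PIQA": 66.38, "MMLU": 31.07,
    "CommonsenseQA": 32.10, "TriviaQA": 6.65, "Winogrande": 51.07,
    "OpenBookQA": 34.40, "GSM8K": 2.12, "SIQA": 41.81,
}
chinese = {"CEval": 29.22, "CMMLU": 29.48}

avg_en = sum(english.values()) / len(english)  # ~35.16 (Average-English)
avg_zh = sum(chinese.values()) / len(chinese)  # ~29.35 (Average-Chinese)
overall = (avg_en + avg_zh) / 2                # ~32.25 (Average)
print(f"English: {avg_en:.2f}  Chinese: {avg_zh:.2f}  Overall: {overall:.2f}")
```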

# How to use

## Base Model
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
checkpoint = "BAAI/Aquila-135M"

device = "cuda"  # for GPU usage or "cpu" for CPU usage
tokenizer = AutoTokenizer.from_pretrained(checkpoint, trust_remote_code=True)
# for multiple GPUs, install accelerate and use `model = AutoModelForCausalLM.from_pretrained(checkpoint, device_map="auto")`
model = AutoModelForCausalLM.from_pretrained(checkpoint).to(device)

input_text = "什么是引力?"  # "What is gravity?" in Chinese
inputs = tokenizer.encode(input_text, return_tensors="pt").to(device)
outputs = model.generate(inputs, max_new_tokens=500)
print(tokenizer.decode(outputs[0]))

input_text = "What is gravity?"
inputs = tokenizer.encode(input_text, return_tensors="pt").to(device)
outputs = model.generate(inputs, max_new_tokens=500)
print(tokenizer.decode(outputs[0]))
```
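
If you prefer the high-level `pipeline` API, a minimal sketch along the following lines should also work; the device and generation arguments here are illustrative assumptions, not recommended settings.

```python
from transformers import pipeline

# Minimal sketch using the text-generation pipeline; adjust the device
# and generation arguments to your setup.
generator = pipeline(
    "text-generation",
    model="BAAI/Aquila-135M",
    trust_remote_code=True,
    device=0,  # use device=-1 (or omit) for CPU
)
print(generator("What is gravity?", max_new_tokens=100)[0]["generated_text"])
```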

## Instruct Model
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
checkpoint = "BAAI/Aquila-135M-Instruct"

device = "cuda"  # for GPU usage or "cpu" for CPU usage
tokenizer = AutoTokenizer.from_pretrained(checkpoint, trust_remote_code=True)
# for multiple GPUs, install accelerate and use `model = AutoModelForCausalLM.from_pretrained(checkpoint, device_map="auto")`
model = AutoModelForCausalLM.from_pretrained(checkpoint).to(device)

messages = [{"role": "user", "content": "什么是引力?"}]  # "What is gravity?" in Chinese
input_text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(input_text)
inputs = tokenizer.encode(input_text, return_tensors="pt").to(device)
outputs = model.generate(inputs, max_new_tokens=500)
print(tokenizer.decode(outputs[0]))

messages = [{"role": "user", "content": "What is gravity?"}]
input_text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(input_text)
inputs = tokenizer.encode(input_text, return_tensors="pt").to(device)
outputs = model.generate(inputs, max_new_tokens=500)
print(tokenizer.decode(outputs[0]))
```
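
Depending on your `transformers` version, `apply_chat_template` can also tokenize and return tensors in one step, which avoids re-encoding the rendered prompt. A minimal sketch follows; the sampling arguments are illustrative assumptions, not tuned settings.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "BAAI/Aquila-135M-Instruct"
device = "cuda"  # or "cpu"
tokenizer = AutoTokenizer.from_pretrained(checkpoint, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(checkpoint).to(device)

# Render the chat template straight to input ids and sample a reply.
messages = [{"role": "user", "content": "What is gravity?"}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(device)
outputs = model.generate(
    inputs, max_new_tokens=500, do_sample=True, temperature=0.7, top_p=0.9
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```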

# Future Plan

* We plan to further optimize dataset selection and data mixing proportions.

## Citation
If you find this work useful, please cite:
```
@misc{aquila-135m,
      title={Aquila-135M: A Bilingual Small Language Model in Chinese and English},
      author={BAAI},
      year={},
      eprint={},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={},
}
```