BAAI
/

ZacLiu commited on
Commit
d3c46ac
·
verified ·
1 Parent(s): 398c0f3

Update README.md

Browse files
Files changed (1) hide show
  1. README.md +5 -0
README.md CHANGED
@@ -22,6 +22,11 @@ tags:
22
 
23
  We present **AquilaMoE**, a cutting-edge bilingual 8\*16B Mixture of Experts (MoE) language model developed using an innovative training methodology called EfficientScale. This approach optimizes performance while minimizing data requirements through a two-stage process. The first stage, termed Scale-Up, initializes the larger model with weights from a pre-trained smaller model, enabling substantial knowledge transfer and continuous pretraining with significantly less data. The second stage, Scale-Out, uses a pre-trained dense model to initialize the MoE experts, further enhancing knowledge transfer and performance. Extensive validation experiments on 1.8B and 7B models compared various initialization schemes, achieving models that maintain and reduce loss during continuous pretraining. Utilizing the optimal scheme, we successfully trained a 16B model and subsequently the 8\*16B AquilaMoE model, demonstrating significant improvements in performance and training efficiency.
24
 
 
 
 
 
 
25
  ## Training Details
26
 
27
  ### Datasets
 
22
 
23
  We present **AquilaMoE**, a cutting-edge bilingual 8\*16B Mixture of Experts (MoE) language model developed using an innovative training methodology called EfficientScale. This approach optimizes performance while minimizing data requirements through a two-stage process. The first stage, termed Scale-Up, initializes the larger model with weights from a pre-trained smaller model, enabling substantial knowledge transfer and continuous pretraining with significantly less data. The second stage, Scale-Out, uses a pre-trained dense model to initialize the MoE experts, further enhancing knowledge transfer and performance. Extensive validation experiments on 1.8B and 7B models compared various initialization schemes, achieving models that maintain and reduce loss during continuous pretraining. Utilizing the optimal scheme, we successfully trained a 16B model and subsequently the 8\*16B AquilaMoE model, demonstrating significant improvements in performance and training efficiency.
24
 
25
+ ## Evaluation
26
+ | Model | GPT 3.5 Turbo (11/06) | GPT 3.5 Turbo (03/01) | AquilaMoE-SFT(our) |
27
+ |------------------|-----------------------|-----------------------|---------------|
28
+ | AlpacaEval 2.0 | 19.3 | 18.1 | *21.1* |
29
+
30
  ## Training Details
31
 
32
  ### Datasets