Update README.md
README.md CHANGED
@@ -17,7 +17,7 @@ We build LLaMA-MoE with the following two steps:
 1. Partition LLaMA's FFNs into sparse experts and insert top-K gate for each layer of experts.
 2. Continually pre-train the initialized MoE model with an optimized data sampling weights from [Sheared LLaMA](https://arxiv.org/abs/2310.06694) and filtered datasets from [SlimPajama](https://www.cerebras.net/blog/slimpajama-a-627b-token-cleaned-and-deduplicated-version-of-redpajama).
 
-The number of activated model parameters is only 3.0~3.5B, which is friendly for deployment and research usage.
+The number of activated model parameters is only 3.0~3.5B, which is friendly for deployment and research usage. Please refer to our [technical report](https://github.com/pjlab-sys4nlp/llama-moe/blob/main/docs/LLaMA_MoE.pdf) for more details.
 
 
 | Model | \#Activated Experts | \#Experts | \#Activated Params | Links |
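For readers unfamiliar with the two construction steps named in the README excerpt above, the following is a minimal PyTorch sketch of the general idea: slice a SwiGLU-style FFN's intermediate neurons into several experts and route each token through the top-K experts chosen by a learned linear gate. This is not the repository's implementation; all class, module, and parameter names (`TopKGatedFFN`, `router`, `num_experts`, `top_k`, etc.) are illustrative assumptions.

```python
# Sketch only: FFN neurons partitioned into experts + a top-K gate per layer.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TopKGatedFFN(nn.Module):
    def __init__(self, hidden_size: int, intermediate_size: int,
                 num_experts: int = 16, top_k: int = 4):
        super().__init__()
        assert intermediate_size % num_experts == 0
        expert_size = intermediate_size // num_experts
        self.num_experts, self.top_k = num_experts, top_k
        # Each expert owns a slice of the original gate/up/down projections.
        self.gate_proj = nn.ModuleList(nn.Linear(hidden_size, expert_size, bias=False)
                                       for _ in range(num_experts))
        self.up_proj = nn.ModuleList(nn.Linear(hidden_size, expert_size, bias=False)
                                     for _ in range(num_experts))
        self.down_proj = nn.ModuleList(nn.Linear(expert_size, hidden_size, bias=False)
                                       for _ in range(num_experts))
        # Top-K gate: a linear router scoring every expert per token.
        self.router = nn.Linear(hidden_size, num_experts, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, hidden_size)
        scores = F.softmax(self.router(x), dim=-1)       # (B, S, num_experts)
        weights, idx = scores.topk(self.top_k, dim=-1)   # (B, S, top_k)
        out = torch.zeros_like(x)
        for e in range(self.num_experts):
            mask = (idx == e).any(dim=-1)                # tokens routed to expert e
            if not mask.any():
                continue
            w = (weights * (idx == e)).sum(dim=-1)[mask].unsqueeze(-1)
            h = F.silu(self.gate_proj[e](x[mask])) * self.up_proj[e](x[mask])
            out[mask] += w * self.down_proj[e](h)
        return out
```

Because only `top_k` of the `num_experts` expert slices run per token, the activated parameter count per forward pass stays well below the dense FFN's, which is the property the README's "3.0~3.5B activated parameters" figure refers to.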