Taishi-N324 committed on
Commit 4a82cef · verified · 1 Parent(s): d23a476
Files changed (3)
  1. .gitattributes +1 -0
  2. README.md +71 -0
  3. images/drop-upcycling.png +3 -0
.gitattributes CHANGED
@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
+images/drop-upcycling.png filter=lfs diff=lfs merge=lfs -text
README.md CHANGED
@@ -2,3 +2,74 @@
 license: apache-2.0
 ---
 
+ <h1 align="center">
+ <img alt="Drop-Upcycling" src="images/drop-upcycling.png"><br>
+ <b>Drop-Upcycling: Training Sparse Mixture of Experts with Partial Re-initialization</b><br>
+ </h1>
+
+ <p align="center">
+ 📄 <a href="https://openreview.net/forum?id=gx1wHnf5Vp">[Paper]</a> |
+ 🤗 <a href="https://huggingface.co/collections/llm-jp/drop-upcycling-674dc5be7bbb45e12a476b80">[Hugging Face]</a> |
+ 📁 <a href="https://gitlab.llm-jp.nii.ac.jp/datasets/llm-jp-corpus-v3">[Dataset]</a> |
+ 💻 <a href="https://github.com/Taishi-N324/Drop-Upcycling">[Code]</a> |
+ 📊 <a href="https://wandb.ai/taishi-nakamura/Drop-Upcycling">[Log]</a>
+ </p>
+
+ # Model Index
+
+ We provide model checkpoints for all experiments to ensure reproducibility of the results presented in Tables 1 and 2.
+
+ ## Table 1
+
+ |Model|Link|
+ |---|---|
+ |1 Dense 152M| [Link](https://huggingface.co/llm-jp/Dense-152M) |
+ |2 MoE FS 8x152M| [Link](https://huggingface.co/llm-jp/FS-8x152M) |
+ |3 MoE BTX 8x152M| [Link](https://huggingface.co/llm-jp/BTX-8x152M) |
+ |4 MoE NU 8x152M| [Link](https://huggingface.co/llm-jp/NU-8x152M) |
+ |5 MoE RNU (r=0.5) 8x152M| [Link](https://huggingface.co/llm-jp/RNU-0.5-8x152M) |
+ |6 MoE DU (r=0.5) 8x152M| [Link](https://huggingface.co/llm-jp/DU-0.5-8x152M) |
+ |7 MoE DU (r=1.0) 8x152M| [Link](https://huggingface.co/llm-jp/DU-1.0-8x152M) |
+ |8 Dense 1.5B| [Link](https://huggingface.co/llm-jp/Dense-1.5B) |
+ |9 MoE FS 8x1.5B| [Link](https://huggingface.co/llm-jp/FS-8x1.5B) |
+ |10 MoE BTX 8x1.5B| [Link](https://huggingface.co/llm-jp/BTX-8x1.5B) |
+ |11 MoE NU 8x1.5B| [Link](https://huggingface.co/llm-jp/NU-8x1.5B) |
+ |12 MoE RNU (r=0.5) 8x1.5B| [Link](https://huggingface.co/llm-jp/RNU-0.5-8x1.5B) |
+ |13 MoE DU (r=0.5) 8x1.5B| [Link](https://huggingface.co/llm-jp/DU-0.5-8x1.5B) |
+ |14 MoE DU (r=1.0) 8x1.5B| [Link](https://huggingface.co/llm-jp/DU-1.0-8x1.5B) |
+
+ ## Table 2
+
+ |Model|Link|
+ |---|---|
+ |1 Dense 3.7B| [Link](https://huggingface.co/llm-jp/Dense-3.7B) |
+ |2 MoE FS 8x3.7B| [Link](https://huggingface.co/llm-jp/FS-8x3.7B) |
+ |3 MoE DU (r=0.5) 8x3.7B| [Link](https://huggingface.co/llm-jp/DU-0.5-8x3.7B) |
+ |4 Dense 13B| [Link](https://huggingface.co/llm-jp/Dense-13B) |
+ |5 Dense 3.7B| [Link](https://huggingface.co/llm-jp/llm-jp-3-3.7b) |
+
+ ## BTX Experts
+
+ |Model|Link|
+ |---|---|
+ |Japanese expert 152M| [Link](https://huggingface.co/llm-jp/Dense-btx-japanese-expert-152M) |
+ |English expert 152M| [Link](https://huggingface.co/llm-jp/Dense-btx-english-expert-152M) |
+ |Code expert 152M| [Link](https://huggingface.co/llm-jp/Dense-btx-code-expert-152M) |
+ |Japanese expert 1.5B| [Link](https://huggingface.co/llm-jp/Dense-btx-japanese-expert-1.5B) |
+ |English expert 1.5B| [Link](https://huggingface.co/llm-jp/Dense-btx-english-expert-1.5B) |
+ |Code expert 1.5B| [Link](https://huggingface.co/llm-jp/Dense-btx-code-expert-1.5B) |
+
+ ## How to cite
+
+ If you find our work helpful, please feel free to cite it.
+
+ ```bibtex
+ @inproceedings{
+ nakamura2025dropupcycling,
+ title={Drop-Upcycling: Training Sparse Mixture of Experts with Partial Re-initialization},
+ author={Taishi Nakamura and Takuya Akiba and Kazuki Fujii and Yusuke Oda and Rio Yokota and Jun Suzuki},
+ booktitle={The Thirteenth International Conference on Learning Representations},
+ year={2025},
+ url={https://openreview.net/forum?id=gx1wHnf5Vp}
+ }
+ ```
images/drop-upcycling.png ADDED

Git LFS Details

  • SHA256: 70bc6c51c93d34116429494b84918b87e98fd63563a3ba604db3f6e2b76edaed
  • Pointer size: 131 Bytes
  • Size of remote file: 245 kB
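
All checkpoints in the index above live under the `llm-jp` organization on the Hugging Face Hub, so a repo id is just the organization plus the model name from the table. A minimal sketch of resolving and loading one checkpoint (the `checkpoint_repo_id` helper is hypothetical, and the commented loading step assumes the repositories work with the standard `transformers` auto classes, which is not verified here):

```python
# Hypothetical helper: build the Hub repository id for a row of the
# checkpoint tables above. Every listed repo is under the "llm-jp" org.
def checkpoint_repo_id(name: str, org: str = "llm-jp") -> str:
    return f"{org}/{name}"

repo = checkpoint_repo_id("DU-0.5-8x152M")
print(repo)  # llm-jp/DU-0.5-8x152M

# Loading sketch (requires network and the `transformers` package; assumes
# the checkpoint is compatible with the standard auto classes -- an
# assumption, since MoE variants sometimes need custom modeling code):
# from transformers import AutoModelForCausalLM, AutoTokenizer
# tokenizer = AutoTokenizer.from_pretrained(repo)
# model = AutoModelForCausalLM.from_pretrained(repo)
```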