Drop-Upcycling
[Paper] | [Hugging Face] | [Dataset] | [Code] | [Log]
We provide model checkpoints for all experiments to ensure reproducibility of the results presented in Tables 1 and 2. Abbreviations: FS = trained from scratch, BTX = Branch-Train-MiX, NU = naive upcycling, RNU = random noise upcycling, DU = Drop-Upcycling; r denotes the re-initialization ratio.
| # | Model | Link |
|---|---|---|
| 1 | Dense 152M | Link |
| 2 | MoE FS 8x152M | Link |
| 3 | MoE BTX 8x152M | Link |
| 4 | MoE NU 8x152M | Link |
| 5 | MoE RNU (r=0.5) 8x152M | Link |
| 6 | MoE DU (r=0.5) 8x152M | Link |
| 7 | MoE DU (r=1.0) 8x152M | Link |
| 8 | Dense 1.5B | Link |
| 9 | MoE FS 8x1.5B | Link |
| 10 | MoE BTX 8x1.5B | Link |
| 11 | MoE NU 8x1.5B | Link |
| 12 | MoE RNU (r=0.5) 8x1.5B | Link |
| 13 | MoE DU (r=0.5) 8x1.5B | Link |
| 14 | MoE DU (r=1.0) 8x1.5B | Link |
| # | Model | Link |
|---|---|---|
| 1 | Dense 3.7B | Link |
| 2 | MoE FS 8x3.7B | Link |
| 3 | MoE DU (r=0.5) 8x3.7B | Link |
| 4 | Dense 13B | Link |
| 5 | Dense 3.7B | Link |
| Model | Link |
|---|---|
| Japanese expert 152M | Link |
| English expert 152M | Link |
| Code expert 152M | Link |
| Japanese expert 1.5B | Link |
| English expert 1.5B | Link |
| Code expert 1.5B | Link |
If you find our work helpful, please cite:
```bibtex
@inproceedings{nakamura2025dropupcycling,
  title={Drop-Upcycling: Training Sparse Mixture of Experts with Partial Re-initialization},
  author={Taishi Nakamura and Takuya Akiba and Kazuki Fujii and Yusuke Oda and Rio Yokota and Jun Suzuki},
  booktitle={The Thirteenth International Conference on Learning Representations},
  year={2025},
  url={https://openreview.net/forum?id=gx1wHnf5Vp}
}
```