Drop-Upcycling: Training Sparse Mixture of Experts with Partial Re-initialization

📄 [Paper] | 🤗 [Hugging Face] | 📝 [Dataset] | 💻 [Code] | 📊 [Log]

Model Index

We provide model checkpoints for all experiments so that the results reported in Tables 1 and 2 of the paper can be reproduced.

Table 1

| # | Model | Link |
|---|-------|------|
| 1 | Dense 152M | Link |
| 2 | MoE FS 8x152M | Link |
| 3 | MoE BTX 8x152M | Link |
| 4 | MoE NU 8x152M | Link |
| 5 | MoE RNU (r=0.5) 8x152M | Link |
| 6 | MoE DU (r=0.5) 8x152M | Link |
| 7 | MoE DU (r=1.0) 8x152M | Link |
| 8 | Dense 1.5B | Link |
| 9 | MoE FS 8x1.5B | Link |
| 10 | MoE BTX 8x1.5B | Link |
| 11 | MoE NU 8x1.5B | Link |
| 12 | MoE RNU (r=0.5) 8x1.5B | Link |
| 13 | MoE DU (r=0.5) 8x1.5B | Link |
| 14 | MoE DU (r=1.0) 8x1.5B | Link |

Table 2

| # | Model | Link |
|---|-------|------|
| 1 | Dense 3.7B | Link |
| 2 | MoE FS 8x3.7B | Link |
| 3 | MoE DU (r=0.5) 8x3.7B | Link |
| 4 | Dense 13B | Link |
| 5 | Dense 3.7B | Link |

BTX Experts

| Model | Link |
|-------|------|
| Japanese expert 152M | Link |
| English expert 152M | Link |
| Code expert 152M | Link |
| Japanese expert 1.5B | Link |
| English expert 1.5B | Link |
| Code expert 1.5B | Link |
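Each Link entry above points to a checkpoint on the Hugging Face Hub. As a minimal loading sketch, assuming the checkpoints are compatible with the standard `transformers` auto classes (not confirmed by this card), one checkpoint from this collection (`llm-jp/RNU-0.5-8x152M`) could be loaded like this; swap in the repo id of any other checkpoint from the tables:

```python
# Minimal checkpoint-loading sketch. Assumption (not stated on this card):
# the checkpoints work with transformers' AutoTokenizer / AutoModelForCausalLM.
REPO_ID = "llm-jp/RNU-0.5-8x152M"  # MoE RNU (r=0.5) 8x152M from this collection


def load_model(repo_id: str = REPO_ID):
    """Download and return (tokenizer, model) for one checkpoint."""
    # Imported lazily so the heavy dependency is only needed when loading.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(repo_id)
    # The checkpoints are stored in BF16 safetensors; load in the same dtype.
    model = AutoModelForCausalLM.from_pretrained(repo_id, torch_dtype="bfloat16")
    return tokenizer, model


if __name__ == "__main__":
    tokenizer, model = load_model()
    inputs = tokenizer("Drop-Upcycling trains sparse MoE models", return_tensors="pt")
    print(tokenizer.decode(model.generate(**inputs, max_new_tokens=16)[0]))
```

The same pattern applies to the dense baselines and the BTX experts; only the repo id changes.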

How to cite

If you find our work helpful, please cite:

@inproceedings{
    nakamura2025dropupcycling,
    title={Drop-Upcycling: Training Sparse Mixture of Experts with Partial Re-initialization},
    author={Taishi Nakamura and Takuya Akiba and Kazuki Fujii and Yusuke Oda and Rio Yokota and Jun Suzuki},
    booktitle={The Thirteenth International Conference on Learning Representations},
    year={2025},
    url={https://openreview.net/forum?id=gx1wHnf5Vp}
}