---
license: apache-2.0
---
📄 [Paper] | 🤗 [Hugging Face] | 📁 [Dataset] | 💻 [Code] | 📊 [Log]
# Model Index

We provide model checkpoints for all experiments to ensure reproducibility of the results presented in Tables 1 and 2.

## Table 1

|Model|Link|
|---|---|
|1 Dense 152M| [Link](https://huggingface.co./llm-jp/Dense-152M) |
|2 MoE FS 8x152M| [Link](https://huggingface.co./llm-jp/FS-8x152M) |
|3 MoE BTX 8x152M| [Link](https://huggingface.co./llm-jp/BTX-8x152M) |
|4 MoE NU 8x152M| [Link](https://huggingface.co./llm-jp/NU-8x152M) |
|5 MoE RNU (r=0.5) 8x152M| [Link](https://huggingface.co./llm-jp/RNU-0.5-8x152M) |
|6 MoE DU (r=0.5) 8x152M| [Link](https://huggingface.co./llm-jp/DU-0.5-8x152M) |
|7 MoE DU (r=1.0) 8x152M| [Link](https://huggingface.co./llm-jp/DU-1.0-8x152M) |
|8 Dense 1.5B| [Link](https://huggingface.co./llm-jp/Dense-1.5B) |
|9 MoE FS 8x1.5B| [Link](https://huggingface.co./llm-jp/FS-8x1.5B) |
|10 MoE BTX 8x1.5B| [Link](https://huggingface.co./llm-jp/BTX-8x1.5B) |
|11 MoE NU 8x1.5B| [Link](https://huggingface.co./llm-jp/NU-8x1.5B) |
|12 MoE RNU (r=0.5) 8x1.5B| [Link](https://huggingface.co./llm-jp/RNU-0.5-8x1.5B) |
|13 MoE DU (r=0.5) 8x1.5B| [Link](https://huggingface.co./llm-jp/DU-0.5-8x1.5B) |
|14 MoE DU (r=1.0) 8x1.5B| [Link](https://huggingface.co./llm-jp/DU-1.0-8x1.5B) |

## Table 2

|Model|Link|
|---|---|
|1 Dense 3.7B| [Link](https://huggingface.co./llm-jp/Dense-3.7B) |
|2 MoE FS 8x3.7B| [Link](https://huggingface.co./llm-jp/FS-8x3.7B) |
|3 MoE DU (r=0.5) 8x3.7B| [Link](https://huggingface.co./llm-jp/DU-0.5-8x3.7B) |
|4 Dense 13B| [Link](https://huggingface.co./llm-jp/Dense-13B) |
|5 Dense 3.7B| [Link](https://huggingface.co./llm-jp/llm-jp-3-3.7b) |

## BTX Experts

|Model|Link|
|---|---|
|Japanese expert 152M| [Link](https://huggingface.co./llm-jp/Dense-btx-japanese-expert-152M) |
|English expert 152M| [Link](https://huggingface.co./llm-jp/Dense-btx-english-expert-152M) |
|Code expert 152M| [Link](https://huggingface.co./llm-jp/Dense-btx-code-expert-152M) |
|Japanese expert 1.5B| [Link](https://huggingface.co./llm-jp/Dense-btx-japanese-expert-1.5B) |
|English expert 1.5B| [Link](https://huggingface.co./llm-jp/Dense-btx-english-expert-1.5B) |
|Code expert 1.5B| [Link](https://huggingface.co./llm-jp/Dense-btx-code-expert-1.5B) |

## How to cite

If you find our work helpful, please cite:

```
@inproceedings{nakamura2025dropupcycling,
  title={Drop-Upcycling: Training Sparse Mixture of Experts with Partial Re-initialization},
  author={Taishi Nakamura and Takuya Akiba and Kazuki Fujii and Yusuke Oda and Rio Yokota and Jun Suzuki},
  booktitle={The Thirteenth International Conference on Learning Representations},
  year={2025},
  url={https://openreview.net/forum?id=gx1wHnf5Vp}
}
```
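
## How to use

A minimal loading sketch, assuming the checkpoints above are compatible with the standard `transformers` `AutoModelForCausalLM`/`AutoTokenizer` interfaces; the checkpoint name, dtype, and prompt below are placeholders, so substitute any model from the tables above and adjust settings for your hardware.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Example checkpoint from Table 1 (replace with any model listed above).
model_name = "llm-jp/DU-0.5-8x1.5B"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)

# Simple greedy generation from a short prompt.
inputs = tokenizer("自然言語処理とは", return_tensors="pt")
with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```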