Drop-Upcycling: Training Sparse Mixture of Experts with Partial Re-initialization Paper • 2502.19261 • Published 2 days ago • 4
Drop-Upcycling: Training Sparse Mixture of Experts with Partial Re-initialization Paper • 2502.19261 • Published 2 days ago • 4
TAID: Temporally Adaptive Interpolated Distillation for Efficient Knowledge Transfer in Language Models Paper • 2501.16937 • Published Jan 28 • 5