Abstract
We introduce OLMoE, a fully open, state-of-the-art language model leveraging sparse Mixture-of-Experts (MoE). OLMoE-1B-7B has 7 billion (B) parameters but uses only 1B per input token. We pretrain it on 5 trillion tokens and further adapt it to create OLMoE-1B-7B-Instruct. Our models outperform all available models with similar active parameters, even surpassing larger ones like Llama2-13B-Chat and DeepSeekMoE-16B. We present various experiments on MoE training, analyze routing in our model showing high specialization, and open-source all aspects of our work: model weights, training data, code, and logs.
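To illustrate how a sparse MoE layer keeps active parameters far below total parameters, here is a minimal sketch of top-k expert routing in PyTorch. The dimensions, expert count, and k below are hypothetical placeholders and the routing details are simplified; this is not the OLMoE implementation (see the released code for that).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoELayer(nn.Module):
    """Illustrative sparse MoE feed-forward layer with top-k routing."""

    def __init__(self, d_model=1024, d_ff=2048, n_experts=64, k=8):
        super().__init__()
        self.k = k
        # Router maps each token to a score per expert.
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x):  # x: (n_tokens, d_model)
        probs = F.softmax(self.router(x), dim=-1)     # routing probabilities
        top_p, top_idx = probs.topk(self.k, dim=-1)   # k experts per token
        out = torch.zeros_like(x)
        for slot in range(self.k):
            idx = top_idx[:, slot]
            for e in idx.unique().tolist():
                mask = idx == e
                # Only the selected experts' parameters are used for these tokens,
                # so active parameters per token stay a small fraction of the total.
                out[mask] += top_p[mask, slot, None] * self.experts[e](x[mask])
        return out
```

With, say, 64 experts and k=8, each token touches only a small subset of the layer's weights per forward pass, which is how a model can hold billions of total parameters while activating only a fraction of them per input token.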
Community
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API:
- LaDiMo: Layer-wise Distillation Inspired MoEfier (2024)
- Dynamic Language Group-Based MoE: Enhancing Code-Switching Speech Recognition with Hierarchical Routing (2024)
- H2O-Danube3 Technical Report (2024)
- AquilaMoE: Efficient Training for MoE Models with Scale-Up and Scale-Out Strategies (2024)
- BAM! Just Like That: Simple and Efficient Parameter Upcycling for Mixture of Experts (2024)
Hey, amazing work :)
We've summarised this and a few other papers in our blog. Hope you like it!
- KTO: The infamous alignment algorithm
- OLMoE: Open Data, Weights, Code Mixture of Experts models
- Mamba in the LlaMA: Distilling from Transformers to Mamba
- PlanSearch: Improving Code Generation via Planning
https://datta0.substack.com/p/ai-unplugged-19-kto-for-model-alignment
It is awesome.