Zero-shot results using Llama-3.1-70B-Instruct as the teacher model and Llama-3.1-8B-Instruct as the initialization model

| Task | Llama-3.1-8B-Instruct | Llama3.1-Mamba-8B-distill | Llama3.1-Mamba-8B-dpo | Llama3.1-Mamba2-8B-distill | Llama3.1-Mamba2-8B-dpo |
|---|---|---|---|---|---|
| arc_challenge | 0.552 | 0.5384 | 0.5657 | 0.5265 | 0.5973 |
| arc_easy | 0.8178 | 0.8224 | 0.8401 | 0.822 | 0.8481 |
| hellaswag | 0.7921 | 0.7591 | 0.7736 | 0.7536 | 0.7969 |
| mmlu (0 shot) | 0.6812 | 0.6213 | 0.636 | 0.6101 | 0.5974 |
| openbookqa | 0.432 | 0.428 | 0.442 | 0.416 | 0.44 |
| piqa | 0.8079 | 0.7933 | 0.8041 | 0.7889 | 0.8003 |
| pubmedqa | 0.752 | 0.72 | 0.744 | 0.726 | 0.746 |
| race | 0.4478 | 0.4211 | 0.4344 | 0.4211 | 0.4612 |
| winogrande | 0.7388 | 0.7277 | 0.738 | 0.7174 | 0.7411 |
| truthful | 0.4267 | 0.4002 | 0.4607 | 0.4031 | 0.5022 |

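
The per-task scores above can be condensed into one figure per model. The sketch below is only an illustration: it takes an unweighted mean over the ten tasks, which is an assumption made here and not a metric reported in this card, and the names `MODELS`, `SCORES`, and `mean_per_model` are hypothetical. The numbers are copied from the table as-is.

```python
# Column order of the results table.
MODELS = [
    "Llama-3.1-8B-Instruct",
    "Llama3.1-Mamba-8B-distill",
    "Llama3.1-Mamba-8B-dpo",
    "Llama3.1-Mamba2-8B-distill",
    "Llama3.1-Mamba2-8B-dpo",
]

# Zero-shot scores copied from the table, one list per task, in MODELS order.
SCORES = {
    "arc_challenge": [0.552, 0.5384, 0.5657, 0.5265, 0.5973],
    "arc_easy":      [0.8178, 0.8224, 0.8401, 0.822, 0.8481],
    "hellaswag":     [0.7921, 0.7591, 0.7736, 0.7536, 0.7969],
    "mmlu":          [0.6812, 0.6213, 0.636, 0.6101, 0.5974],
    "openbookqa":    [0.432, 0.428, 0.442, 0.416, 0.44],
    "piqa":          [0.8079, 0.7933, 0.8041, 0.7889, 0.8003],
    "pubmedqa":      [0.752, 0.72, 0.744, 0.726, 0.746],
    "race":          [0.4478, 0.4211, 0.4344, 0.4211, 0.4612],
    "winogrande":    [0.7388, 0.7277, 0.738, 0.7174, 0.7411],
    "truthful":      [0.4267, 0.4002, 0.4607, 0.4031, 0.5022],
}


def mean_per_model(scores):
    """Return the unweighted mean score per model (an illustrative summary only)."""
    totals = [0.0] * len(MODELS)
    for row in scores.values():
        for i, value in enumerate(row):
            totals[i] += value
    return {model: total / len(scores) for model, total in zip(MODELS, totals)}


means = mean_per_model(SCORES)
# Print models from highest to lowest mean score.
for model, avg in sorted(means.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{model:28s} {avg:.4f}")
```

An unweighted mean treats every benchmark as equally important and hides per-task differences (for example on mmlu), so it is best read alongside the full table.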
@article{junxiongdaniele2024mambainllama,
  title   = {The Mamba in the Llama: Distilling and Accelerating Hybrid Models},
  author  = {Junxiong Wang and Daniele Paliotta and Avner May and Alexander M. Rush and Tri Dao},
  journal = {arXiv preprint arXiv:2408.15237},
  year    = {2024}
}