Is there a model removing non-shared MoE experts?

#17
opened by ghostplant

I wonder if it still performs well when keeping only the shared_expert, i.e. letting all dedicated experts use 0 bits.

It would bring it down to the quality of a 3b model.

What you could do instead is monitor which experts are heavily used in your domain, then keep only the top n experts.
With some coding to map the missing experts to the shared expert without recomputing, voilà: you save disk storage at almost no extra RAM or compute cost. A minimal sketch of that idea is below.
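Here is a rough sketch of the monitoring-and-remapping step, assuming a Qwen-style MoE layer where the router picks top-k experts per token and a shared expert already runs on every token. All names and shapes here are hypothetical illustrations, not the actual model API:

```python
# Sketch only, not the real Qwen MoE implementation.
# Assumes we can hook each layer's router and record which experts
# the top-k routing selects for tokens from our own domain data.

import torch
from collections import Counter

def count_expert_usage(router_topk_indices_per_batch):
    """Accumulate how often each expert index is selected.

    `router_topk_indices_per_batch` is an iterable of LongTensors of shape
    (num_tokens, top_k) collected from forward hooks on the router.
    """
    usage = Counter()
    for idx in router_topk_indices_per_batch:
        usage.update(idx.flatten().tolist())
    return usage

def build_keep_and_remap(usage, num_experts, keep_n):
    """Keep the `keep_n` most-used experts and build a remap table.

    Every original expert id maps either to itself (if kept) or to -1,
    where -1 means "fall back to the shared expert" at inference time.
    """
    kept = [e for e, _ in usage.most_common(keep_n)]
    remap = torch.full((num_experts,), -1, dtype=torch.long)
    for e in kept:
        remap[e] = e
    return kept, remap

# Example: pretend the router picked these experts for a small domain sample.
fake_routing = [torch.randint(0, 64, (512, 8)) for _ in range(10)]
usage = count_expert_usage(fake_routing)
kept, remap = build_keep_and_remap(usage, num_experts=64, keep_n=16)
print(f"keeping {len(kept)} experts; {int((remap == -1).sum())} remapped to the shared expert")
```

At inference time, whenever the router selects an expert whose remap entry is -1, its contribution would simply be covered by the shared expert's output, which is already computed for every token, so no extra compute is needed and the pruned experts never have to be stored or loaded.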

> It would bring it down to the quality of a 3b model.

I feel a 3b model is still not bad in terms of smartness if you're just targeting casual conversation. What would you recommend for evaluating its smartness more accurately?

I usually give new models the tasks that previous models failed on. R1 also failed them.
PS: My test set is a bit unfair. I ask about things that are commonly thought impossible but that I have solved, then I give hints. So far, all reasoning models have failed at real reasoning beyond the public literature.
