Is there a model removing non-shared MoE experts?

#17
opened by ghostplant

I wonder if it still performs well when keeping only the shared_expert, i.e. letting all dedicated experts use 0 bits.

It would bring it down to the quality of a 3b model.

What you could do instead is monitor which experts are heavily used in your domain, then keep only the top n experts.
With some coding to map the missing experts to the shared expert without recomputing, voilà: you save disk storage at almost no extra RAM or compute cost. A minimal sketch of that idea is below.
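Here is a rough sketch of the monitoring-and-remapping step, assuming a Qwen-style MoE layer where the router picks top-k experts per token and a shared expert already runs on every token. All names and shapes here are hypothetical illustrations, not the actual model API:

```python
# Sketch only, not the real Qwen MoE implementation.
# Assumes we can hook each layer's router and record which experts
# the top-k routing selects for tokens from our own domain data.

import torch
from collections import Counter

def count_expert_usage(router_topk_indices_per_batch):
    """Accumulate how often each expert index is selected.

    `router_topk_indices_per_batch` is an iterable of LongTensors of shape
    (num_tokens, top_k) collected from forward hooks on the router.
    """
    usage = Counter()
    for idx in router_topk_indices_per_batch:
        usage.update(idx.flatten().tolist())
    return usage

def build_keep_and_remap(usage, num_experts, keep_n):
    """Keep the `keep_n` most-used experts and build a remap table.

    Every original expert id maps either to itself (if kept) or to -1,
    where -1 means "fall back to the shared expert" at inference time.
    """
    kept = [e for e, _ in usage.most_common(keep_n)]
    remap = torch.full((num_experts,), -1, dtype=torch.long)
    for e in kept:
        remap[e] = e
    return kept, remap

# Example: pretend the router picked these experts for a small domain sample.
fake_routing = [torch.randint(0, 64, (512, 8)) for _ in range(10)]
usage = count_expert_usage(fake_routing)
kept, remap = build_keep_and_remap(usage, num_experts=64, keep_n=16)
print(f"keeping {len(kept)} experts; {int((remap == -1).sum())} remapped to the shared expert")
```

At inference time, whenever the router selects an expert whose remap entry is -1, its contribution would simply be covered by the shared expert's output, which is already computed for every token, so no extra compute is needed and the pruned experts never have to be stored or loaded.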

> It would bring it down to the quality of a 3b model.

I feel a 3b model is still not bad in terms of smartness if you're just targeting casual conversation. What would you recommend for evaluating its smartness more accurately?

I usually give new models the tasks that previous models failed on. R1 also failed them.
PS: My test set is a bit unfair. I ask about things that are commonly thought impossible but that I have solved, then I give hints. So far, all reasoning models have failed at real reasoning beyond the public literature.
