A question
I'm not sure if my question makes sense, but I was wondering if it would be possible to also release an English-only version of the model with fewer than 46B parameters, since a smaller size would be more useful for most users.
@Hoioi I want this too, but it will never happen.
Mistral is a French company looking for EU dominance in the AI market, so their smartest path forward is with multi-lingual LLMs.
Plus, 46b parameters is already too few for the 8-expert mixture-of-experts design, which is one of the reasons it performs so poorly on Arc despite its size (66 vs GPT3.5's 86). It basically only has the intelligence of an ~14b dense (non-MOE) Mistral and the knowledge of an ~20b dense Mistral, yet requires the RAM of a 46b dense Mistral. However, they need to move forward with MOE so they can serve more users with fewer resources.
@Phil337 hmm no it has the knowledge of a 47b model? In inference time, you basically use the 2 model/layers which are more suited to the task and you should actually get better output then a normal 47b model? It beats llama 2 70b which was pretty good as well.
@YaTharThShaRma999 MOE, no matter how well it's implemented, will always have significantly less information than an equal-sized dense LLM.
This is primarily because (1) there's FAR more redundancy across the experts, since pretty much every expert has the same basic knowledge, and (2) the ideal 2 of 8 experts is not always chosen, so running inference through all 8 experts at once, vs just 2, would return significantly more valid information.
This is all part of the MOE compromise: far faster inference (=12b), a small boost in intelligence (=14b), a bigger boost in knowledge (=20b), but far more RAM required (=47b). If they ran inference through all 8 at once they could significantly boost both intelligence and knowledge (=~26b), but it would run at the speed of a 47b LLM.
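As a rough illustration of that trade-off, here is a minimal top-2 routing sketch in PyTorch. The dimensions and the feed-forward layer structure are illustrative assumptions, not Mixtral's actual implementation; the point is just that all 8 experts must sit in memory while only the 2 chosen by the router do any compute per token.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoELayer(nn.Module):
    """Illustrative top-k expert routing: all experts live in RAM, only k run per token."""
    def __init__(self, dim=4096, hidden=14336, n_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.gate = nn.Linear(dim, n_experts, bias=False)   # the router
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, hidden), nn.SiLU(), nn.Linear(hidden, dim))
            for _ in range(n_experts)
        )

    def forward(self, x):                                   # x: (tokens, dim)
        scores = self.gate(x)                               # (tokens, n_experts)
        weights, idx = scores.topk(self.top_k, dim=-1)      # pick the 2 best experts per token
        weights = F.softmax(weights, dim=-1)
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e in idx[:, k].unique().tolist():           # only the chosen experts do any work
                mask = idx[:, k] == e
                out[mask] += weights[mask, k].unsqueeze(-1) * self.experts[e](x[mask])
        return out

layer = TopKMoELayer()
tokens = torch.randn(16, 4096)
print(layer(tokens).shape)   # torch.Size([16, 4096]); only ~2/8 of expert weights touched per token
```

Running the sum over all 8 experts instead of the top 2 is the "all experts at once" case described above: it touches every expert's weights and costs roughly 4x the FFN compute per token.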
The best way to see the knowledge drop is by having it recite the lyrics of a popular song. This is hard to store accurately within small LLMs. GPT4 is excellent at this, GPT3.5 is good, Llama 2 70b & Falcon 180b are OK, Mixtral is bad and Mistral 7b is horrible (one line at most).
And the best way to see the intelligence drop is to make it write a joke about two disparate things while forcing a header so it doesn't just try to copy an existing joke. For example, "Write a joke about a horse and computer that begins with 'On a rainy day'. Then explain the joke." Or just look at the Arc score (only 66 for Mixtral, when Mistral 7b scored 60).
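For anyone who wants to try that probe locally, here's a hedged sketch using the transformers text-generation pipeline. The Instruct checkpoint name and generation settings are assumptions for illustration, not something from the comment above.

```python
from transformers import pipeline

# Assumed setup: Mixtral Instruct run locally; adjust the model name and device to taste.
chat = pipeline(
    "text-generation",
    model="mistralai/Mixtral-8x7B-Instruct-v0.1",
    device_map="auto",
    torch_dtype="auto",
)

prompt = (
    'Write a joke about a horse and computer that begins with "On a rainy day". '
    "Then explain the joke."
)
result = chat([{"role": "user", "content": prompt}], max_new_tokens=256, do_sample=False)
print(result[0]["generated_text"][-1]["content"])   # the model's joke and its explanation
```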
@Phil337 alright, I see your point, but I, and just about everyone else, find it much better than 70b-sized models and similar-sized models as well. Ask it coding tasks, riddles and more, and you will see this is a model that can easily compete with chatgpt.
Benchmarks are not everything, and even the Arc benchmark is really just multiple choice, which is far from everything.
@YaTharThShaRma999 Yes, I threw several questions at Mixtral that Falcon 180b, Llama 2 70b, Mistral 7b... got wrong, and it got them right. It's undeniably better than Falcon 180b or Llama 2 70b. But it's not generally better than GPT3.5.
For example, it can't adeptly handle complex and tricky prompts that require truly adaptable cognition, such as generating jokes about 2 disparate things with a forced header so it doesn't just copy the format of a known joke. This kind of general intelligence is measured better by Arc than by any other standardized LLM test (since Arc only requires general knowledge that even the smallest LLMs have).
So the claim of GPT3.5 performance was almost entirely about Arc. Mistral 7b was already within 3-5 points of GPT3.5 on MMLU, WinoGrande, HellaSwag... so a dense 14b Mistral would have easily matched GPT3.5 on those tests. So it all came down to Arc (60 for Mistral 7b and 85 for GPT3.5), yet Mixtral only got 66, which is exactly what a 14b dense Mistral would be predicted to get.
@Phil337 yeah I agree that gpt3.5 does have slightly better instruction-following ability, but mixtral is pretty close on most things. And chatgpt most likely has lots of extra stuff it's using alongside the model, like wolfram alpha, maybe some hallucination checker or other things.
Also, using 8 experts is actually much worse than using 2 currently. You can check this perplexity benchmark and it shows more experts = higher perplexity. But it is an artificial benchmark that's even worse than normal benchmarks, so yeah.
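That linked benchmark isn't reproduced here, but transformers exposes num_experts_per_tok on the Mixtral config, so a rough sketch of the same kind of comparison looks like the following. The dataset, slice length, and dtype are assumptions just to keep the run cheap.

```python
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "mistralai/Mixtral-8x7B-v0.1"
tok = AutoTokenizer.from_pretrained(MODEL)

# Small wikitext slice to keep the comparison cheap; any held-out text would do.
text = "\n\n".join(load_dataset("wikitext", "wikitext-2-raw-v1", split="test")["text"])
ids = tok(text, return_tensors="pt").input_ids[:, :4096]

for k in (2, 3, 4, 8):                       # number of experts routed to per token
    model = AutoModelForCausalLM.from_pretrained(
        MODEL, num_experts_per_tok=k, torch_dtype=torch.float16, device_map="auto"
    )
    with torch.no_grad():
        loss = model(ids.to(model.device), labels=ids.to(model.device)).loss
    print(f"experts per token = {k}  perplexity = {torch.exp(loss).item():.2f}")
    del model   # free memory before loading the next variant
```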
@YaTharThShaRma999 Thanks for that chart. That conclusively shows that using more than 3 experts is progressively worse.