Looks like it's not as good as Qwen2.5 7B

#5
by MonolithFoundation - opened

Many metrics fall behind.

I wish there was an official online hub to test Ministral; even Mistral's free trial doesn't include it as an option. It's performing very badly for me on my PC and in the available HF space, but that could be due to a configuration issue.

Anyway, I have strong doubts about the claim that Qwen2.5 7b is better because of the metrics. I extensively tested Qwen2.5 7b and 72b, both offline (quantized) and at full precision online (e.g. on LMSYS), and they made the same errors everywhere. Qwen2.5 really is very bad outside of coding and math.

By far the number one AI complaint from the general population is hallucinations, especially about very popular common knowledge, and Qwen2.5 trained on so many coding and math tokens that it significantly scrambled its previously learned information. For example, Qwen2.5 72b scored 68.4 on my world knowledge test, vs 85.9 for Qwen2 72b.

Qwen's obsession with the metrics is making the metrics meaningless, so I understand why other companies wouldn't compare against them. Qwen isn't playing fair: as mentioned above, they're suppressing the majority of world knowledge in order to train on more data that overlaps the STEM, coding, and math tests.

Additionally, since the major LLM tests are multiple choice, they're not very useful for evaluating LLMs. That is, multiple-choice tests provide the answer to choose from, jogging the memory of an LLM that could otherwise not accurately retrieve the right answer. Sure enough, Qwen2.5 scored higher than Qwen2 on the MMLU, but when asked to retrieve the same information in real-world Q&A (no right answer to choose from) it succeeded less frequently than Qwen2. So basically Qwen2.5 undertrained on a large amount of data, capturing it well enough to pick the right answer out of a lineup (multiple-choice MMLU tests), but not well enough to retrieve it directly.
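The recognition-vs-retrieval gap described above can be sketched with a toy example (the question, options, and prompt wording here are hypothetical, not from any actual benchmark harness). In the multiple-choice form the correct answer is visible among the options, so the model only needs to recognize it; in the open-ended form the model must recall it with no cue:

```python
def multiple_choice_prompt(question, options):
    """MMLU-style prompt: the right answer is listed among the options,
    so the model only has to *recognize* it."""
    letters = "ABCD"
    lines = [question]
    lines += [f"{letters[i]}. {opt}" for i, opt in enumerate(options)]
    lines.append("Answer with the letter only.")
    return "\n".join(lines)

def open_ended_prompt(question):
    """Real-world Q&A: no options are shown, so the model must
    *retrieve* the fact from its weights."""
    return f"{question}\nAnswer in one short phrase."

# Hypothetical probe of the same fact both ways.
question = "Which element has the chemical symbol W?"
options = ["Tin", "Tungsten", "Titanium", "Tantalum"]  # correct: Tungsten

print(multiple_choice_prompt(question, options))
print("---")
print(open_ended_prompt(question))
```

A model that scores well on the first format but fails the second has the fact stored only weakly, which is the pattern described above for Qwen2.5.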

In short, Qwen2.5 7b is REALLY REALLY bad as a general-purpose LLM, while being unusually good at coding and math for its size. So some people (mainly coders) think it's great, when it's really so bad it's unusable: little more than a hallucination generator for very basic world knowledge.

@phil111 What is the best small model to you? In the 7b-9b range, for general use of course.

@lixbo This will be unoriginal, but it's Llama 3.1 8b and Gemma 2 9b.

But I use Llama 3.1 8b by default, primarily because it has less censorship, including less information stripped from the corpus with censorship in mind. L3.1 also has broader abilities, such as writing better poems. However, Gemma 2 9b writes slightly better stories and is overall a little more powerful/intelligent.

In my testing Llama 3.1 8b scored 69.7 vs 69.1 for Gemma 2 (if not for censorship, G2 would have scored slightly higher than L3.1). For context, L3.1 70b scored 88.5, Qwen2 7b 54.1, InternLM2.5 7b 45.9, and so on. To be fair, I don't test math and coding, since the majority of tests already cover them, and that's where models like Qwen shine.

What's generally good about L3.1 & G2 is instruction following. For example, if you make a very custom story request (e.g. with numerous story inclusions and exclusions), models like Phi3.5, InternLM, and Qwen lean more towards regurgitating stories they were trained on rather than writing a more original story that aligns with your prompted desires. Consequently, they may appear more eloquent to the casual observer, and this may make them score higher on emotional intelligence benchmarks, but it's really just cheating (story regurgitation).

Plus I should add that Phi is so censored and filled with blind spots, even in the larger versions, that it's utterly useless as a general-purpose LLM, hence inferior to Qwen2 7b. And Qwen2.5 7b lost a lot of world knowledge it previously had, so it's far too ignorant of most things to be useful beyond more academic tasks like STEM, coding, and math.

@phil111 Thank you very much, I should keep L3.1 then. Also, have you tried any finetunes or merges of it?

@lixbo Yes, I tried numerous fine-tunes, primarily in an attempt to get around censorship (e.g. Dolphin and abliterated versions), but they all perform much worse at things like instruction following, broad abilities, and knowledge retrieval. Plus jailbreaks, system prompts, and clever prompting (e.g. "I'm doing research") have always gotten around any unreasonable censorship. I don't ask about illegal things like how to make meth.
