Are Qwen models pretrained or continuously pretrained?

#941
by djstrong - opened

Qwen base models achieve very high benchmark results compared to other pretrained models. Is there a way to verify that they are base models? A chat template is present in their tokenizer_config, suggesting they were trained on instructions. If so, should the Qwen base models be marked as pretrained?

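One way to check for this directly (a minimal sketch using the transformers library; the model id below is just one example) is to load the tokenizer and see whether a chat template is bundled:

```python
from transformers import AutoTokenizer

# A plain pretrained checkpoint normally ships no chat template;
# the attribute is populated from tokenizer_config.json if one exists.
tok = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-72B")

# None -> no template bundled; a Jinja string -> chat formatting ships with the model
print(tok.chat_template)
```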

@djstrong This appears to be true, because IFEval (instruction following) should stay around 15 or lower for all base models, since they operate in sentence-completion rather than instruction-following mode, yet the Qwen base models score around 40 (much higher than other base models).

However, recent base models, including those from Google, Mistral, and Meta, are now fine-tuned to some degree, but primarily to impose censorship/alignment (I played around with them after noticing universal alignment in all fine-tunes made from them). The Qwen base models are unique in that they go beyond fine-tuning solely for alignment (e.g. safety and censorship).

My guess is that Qwen has become far too obsessed with test scores. For example, their latest Qwen2.5 72b scored far lower on my world-knowledge test than Qwen2 72b (e.g. questions about movies, music, games, sports, TV shows, and other popular culture). They are clearly favoring training tokens that boost scores on tests like MMLU at the expense of popular world knowledge. And this obsession with test scores might also explain why they fine-tune their base models beyond just implementing alignment.

I am running a leaderboard for Polish (https://huggingface.co./spaces/speakleash/open_pl_llm_leaderboard), and the Qwen models are very good. Usually, the Qwen base models are better than the Instruct versions.

Hi @phil111 , the following paper shows that reinforcement learning from human feedback (RLHF) reduces creativity and makes models forget knowledge: Creativity Has Left the Chat: The Price of Debiasing Language Models. This confirms what many people who use models for roleplaying and storytelling have always intuitively suspected. I assume most models nowadays are pretrained/finetuned using such techniques.

Here is an example from the paper that shows the difference between the base model and the aligned model:
[Screenshot from the paper: the base model's output vs. the aligned model's output]
I guess it is only to be expected: alignment aims to make certain responses more likely than others, and the model does not limit that learned behaviour to situations where alignment is desired; it carries over to situations where it is not.

Here is another screenshot from the study, where they show that some kind of "clustering" is going on in the embedding space of aligned models, as compared to more diversity in base models:
[Screenshot from the study: embedding-space clustering in the aligned model vs. greater diversity in the base model]
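For intuition, here is a minimal sketch of how such clustering could be measured (my own illustration, not the paper's exact methodology; the choice of embedding model is an assumption): sample several completions from each model, embed them, and compare the mean pairwise cosine similarity.

```python
import itertools

import numpy as np
from sentence_transformers import SentenceTransformer

# Any sentence embedder works for illustration; all-MiniLM-L6-v2 is a common default.
embedder = SentenceTransformer("all-MiniLM-L6-v2")

def mean_pairwise_similarity(completions: list[str]) -> float:
    """Average cosine similarity over all pairs of completions.

    Higher values mean the samples cluster together in embedding space;
    lower values mean more diverse outputs.
    """
    embs = embedder.encode(completions, normalize_embeddings=True)
    sims = [float(np.dot(a, b)) for a, b in itertools.combinations(embs, 2)]
    return sum(sims) / len(sims)

# Expectation under the paper's finding: samples from the aligned model
# score higher (more clustered) than samples from the base model.
```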

My personal guess:

Two other factors that probably play a role in models lacking knowledge about specific phenomena are:

  1. Lots of synthetic data used to train/finetune the models. More tokens probably mean better reasoning capabilities, at the cost of introducing some hallucinations?
  2. Some knowledge was never part of the training data. The smaller models especially have probably seen less data during training to circumvent the phenomenon of "catastrophic forgetting". Model developers prioritize; the Phi-3 authors have openly stated that they left some material out. See also Revisiting Catastrophic Forgetting in Large Language Model Tuning.

This is a screenshot from the phi-3.5-mini-instruct readme:
[Screenshot: excerpt from the phi-3.5-mini-instruct README]

@ThiloteE Thanks for all the good info. It's understandable that tiny models like Phi 3.5 4b deliberately exclude world knowledge to keep their heads above water.

However, Qwen2.5 72b is massive by comparison. They didn't need to give up tons of world knowledge to maintain coherency and functionality.

After reading about things like catastrophic forgetting and how the Qwen team trained Qwen2.5 (e.g. using ~18 trillion tokens), it does appear that they trained world knowledge into Qwen2.5, then lost it by subsequently over-training on a large amount of mostly synthetic math, coding, and academic data to boost their scores on academic tests like MMLU. This is substantiated by the fact that there's still a ghost of its previous world knowledge, such as remembering part of an actor's first name but getting the rest of it, his surname, and his role in a show wrong. It's as if the data is all still in there, but scrambled.

Anyway, Qwen2.5 72b achieving nearly the same MMLU score as Llama 3.1 405b is a huge red flag, not a badge of honor. Llama 3.1 405b not only has far more storage capacity (>5x the parameters), but was trained for far longer, with far more compute, on a very large corpus of ~15 trillion tokens drawn from humanity's best sources of quality and diverse information (e.g. books and Wikipedia), topped off with some quality synthetic data. The laws of physics prevent a less-trained and much smaller 72b-parameter model from achieving the same academic knowledge score (MMLU) without sacrificing the bulk of popular world knowledge, along with other things like adaptability and creativity that are essential for a general-purpose instruct LLM. I wish AI creators would stop trying to maximize standardized LLM test scores and just try to create the best general-purpose LLMs possible.

clefourrier changed discussion title from Pretrained models - verification to Are Qwen models pretrained or continuously pretrained?

@clefourrier The question is more: Are Qwen base models pretrained or instruction fine-tuned?

If you prompt them with one word or a simple phrase, they produce instruction-style completions, e.g.: What is love? What does it mean to be in love?\nIs there a difference between being in love and liking someone?\n\nMulti-choice problem: Are these two questions paraphrases of each other?\nPick your answer from:\n (A). no;\n (B). yes;\n\n(A).<|endoftext|>Human: Write the next sentence.\n\nThe man was able to get his hair cut by himself, but he could not shave because\n\nAssistant: he lacked the necessary skills and tools for a precise and clean shave.

So it seems to me they are trained on instruction datasets, and the instructions are not merely mixed in with raw data but carry very high weight.
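For anyone who wants to reproduce this probe, a rough sketch (assuming the transformers library; the 7B checkpoint is just a smaller stand-in for the base models discussed here):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen2.5-7B"  # assumption: any Qwen base checkpoint
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

# Raw completion: no chat template, no system prompt, just the bare string.
inputs = tok("What is love?", return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=80, do_sample=False)
print(tok.decode(out[0], skip_special_tokens=False))

# A purely pretrained model should drift into ordinary sentence completion;
# multiple-choice scaffolding or Human:/Assistant: turns in the output
# suggest instruction data was mixed into pretraining.
```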
