This LLM is hallucinating like crazy. Can someone verify these prompts?

#3
by phil111 - opened

Perhaps something isn't configured right, but it's not only getting all my questions wrong, the answers aren't even close. Can someone test the following prompts to verify the answers are this bad and that I'm not doing something wrong?

Edit: If you're going to test these questions please skip #1. Apparently the HF space is returning the right answer.

1. Who sang the modern hit song Dear Future Husband? What album is it from? And what year was it released?

The song "Dear Future Husband" was sung by Taylor Swift.
It is from her album "Red".
The song was released in 2012.

This isn't even remotely correct. It just returned the most famous singer (Taylor Swift), and no song on that album remotely matches this title.

2. Which two actresses played the two ex-wives of Alan Harper in Two and a Half Men?

Angela Kinsey played the role of Bridget, one of Alan Harper's ex-wives in "Two and a Half Men".
Melissa Rauch portrayed Penny, another ex-wife of Alan Harper.
These actresses appeared in the series, playing significant roles as Alan's former spouses.

Again, this isn't remotely correct. And again it's just pulling popular actresses from other shows, such as Melissa Rauch from The Big Bang Theory. It even said she played Penny, which is a character from that show, not one of Alan Harper's ex-wives.

3. Who played the comic store owner on the TV show The Big Bang Theory?

The comic store owner in The Big Bang Theory was played by Kevin Sizemore.

His last name was Sussman, not Sizemore. Even when it starts getting part of an answer right, the last name is commonly wrong.

4. Who portrayed Monica’s father in the sitcom Friends?

Monica's father, Richard Geller, was portrayed by James Michael Tyler in the sitcom Friends.

Again, it's Jack Geller, not Richard Geller (it keeps pairing first and last names incorrectly), and James Michael Tyler played Gunther, the young blond guy who worked at Central Perk.

Something is clearly not right. It's getting virtually all questions wrong about the most popular movies, TV shows, games, sports, celebrities, music and so on, including things that tiny 1b models or 1st-gen 7bs like Falcon 7b got right. Basically, it can't retrieve information indirectly. I've never seen anything like this before.

I give up. It won't stop making absurd hallucinations about basic facts from possibly the most watched TV show in human history (Friends).

For example, after the above error I directly asked "Who portrayed Jack Geller in the TV show Friends?", because it previously said his name was Richard Geller. And even then it returned one of the show's six main cast members ("The character of Jack Geller in the TV show Friends was portrayed by Matt LeBlanc."). There are several reasons why this is an inexcusable hallucination. It should have been obvious that Matt LeBlanc, a lead in one of history's most-watched shows, played Joey, plus Jack Geller shares Monica Geller's last name, which should have flagged him as her father rather than one of the six friends.

And it's not just pop culture. It's getting even simple STEM and academic questions wrong, and in a predictably bad way. For example...

What is the phenomenon called that makes you more alert in the morning due to the natural rise in a hormone level?

The phenomenon you're referring to is called the "circadian rhythm." This is a natural, internal process that regulates the sleep-wake cycle and repeats roughly every 24 hours. The hormone cortisol plays a significant role in this process, as its levels naturally rise in the morning, helping you feel more alert and awake.
So, the answer is: Circadian Rhythm.

This is the answer the really bad models give. CAR (the cortisol awakening response) is what's unambiguously being asked for, not the circadian rhythm in general with a passing mention of cortisol.

You can test it here: https://huggingface.co./spaces/vilarin/Mistral-lab

Looks like the results on that space are accurate:

Who sang the modern hit song Dear Future Husband? What album is it from? And what year was it released?

The song "Dear Future Husband" was sung by the American singer-songwriter Meghan Trainor. It was released on her debut studio album "Title" in 2014. The song was a hit single from the album and reached number 1 on the Billboard Hot 100 in the United States.

To confirm, @phil111, you're not attempting to use my GGUFs, are you? Or the HF conversion by @prince-canuma? Just because, if you are, it would be very valuable feedback that those conversions aren't proper.

@rollercoasterX Thanks, it's still making the exact same mistakes for me on that space (e.g. Melissa Rauch & Richard Geller), but it is now getting the first one right. So apparently my locally run model is configured properly, unless there's also an issue with that space.

2. Melissa Rauch as Dr. Leslie "Lily" Aldrin - She played the character in the first season and later returned in the final season.

Maggie Wheeler as Berta - She played the character in the first season and later returned in the final season.

3. The comic store owner on the TV show "The Big Bang Theory" was played by Kevin Sizemore

4. In the sitcom "Friends," Monica's father was portrayed by actor James Michael Tyler. He played the character of Dr. Richard Geller, Monica's father, in several episodes

5. I also tested the follow-up question "Who portrayed Jack Geller...", and it gave another absurd hallucination ("...portrayed by actor Tom Cavanagh..."). He was in no way associated with the TV show Friends and is the same age as the actress (Courteney Cox) who supposedly played his daughter. So again, it's not just constantly hallucinating, most of the hallucinations aren't even in the ballpark, which is very unusual for modern LLMs.

@bartowski Yes, I'm using the GGUFs made from prince-canuma's conversion, but the space linked by rollercoasterX is having the same issues (https://huggingface.co./spaces/vilarin/Mistral-lab).

Okay as long as it's happening with the official weights as well

Try lowering the temperature, I've heard that helps

@bartowski That's good advice. I always use temp 0 when testing, and when I tried raising the temperature it got much worse. I'll wait a couple of weeks for the dust to settle, but my gut is telling me this hallucination issue stems from Ministral 8b itself.
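For reference, pinning the sampler to greedy decoding looks something like this (a minimal sketch with Hugging Face transformers; do_sample=False is the greedy equivalent of temp 0, and the checkpoint id is just the official one, swap in whatever you're testing):

```python
# Minimal sketch: greedy decoding, i.e. the "temp 0" setting used for these tests.
# Assumes the transformers library; the gated repo may require accepting the license.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "mistralai/Ministral-8B-Instruct-2410"  # or a local path to your checkpoint
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

prompt = "Who played the comic store owner on the TV show The Big Bang Theory?"
inputs = tok.apply_chat_template(
    [{"role": "user", "content": prompt}],
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)

# do_sample=False gives deterministic greedy decoding (equivalent to temperature 0).
out = model.generate(inputs, max_new_tokens=128, do_sample=False)
print(tok.decode(out[0][inputs.shape[-1]:], skip_special_tokens=True))
```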

@phil111 I got the right answer for your query "Who sang the modern hit song Dear Future Husband?" (0-shot). I used the vilarin/mistral-labs Hugging Face Space.

temp: 0.3

[screenshot of the model's response]

@Joseph717171 Thanks, but that one is coming back right from the space; all the others are not only wrong, they're making the exact same mistakes.


@phil111 However, the model failed your query "Which two actresses played the two ex-wives of Alan Harper in Two and a Half Men?" quite miserably.

@Joseph717171 Yeah, at this point I tested over 100 simple questions from the most popular movies, shows, music... in human history and it's getting >95% of them wrong, usually very very wrong. For example, it keeps returning character names and actors from different shows. And even with easy STEM and academic questions it's performing far worse than others like Llama 3.1 8b & Gemma 2 9b.

It's clear that Mistral stripped the vast majority of data from web rips and Wikipedia before training this model, greatly limiting the paths to accurately retrieving the information. For example, if you ask for the main cast of the top 1% most popular movies and shows (e.g. Friends & Pulp Fiction) it does an OK job (not great), but if you directly ask about said characters and actors it almost always returns a hallucination. Also, if you ask for the main casts of the top 5% most popular movies and shows, it starts hallucinating far more frequently. So they also obviously largely stripped the corpus of popular culture that wasn't absurdly popular (top 1%), or at least severely undertrained it on said information.
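For anyone who wants to repeat this kind of test at scale, a minimal sketch of such a loop might look like the following (assuming any local OpenAI-compatible endpoint, e.g. llama.cpp's llama-server on port 8080; the expected strings are loose substring checks built from answers confirmed in this thread):

```python
# Sketch: batch simple fact questions against a local OpenAI-compatible endpoint
# and count loose substring matches. Port, model name, and pairs are placeholders.
import requests

tests = [
    ("Who played the comic store owner on The Big Bang Theory?", "sussman"),
    ("Who portrayed Monica's father in the sitcom Friends?", "elliott gould"),
    ("Who sang the hit song Dear Future Husband?", "meghan trainor"),
]

def ask(question: str) -> str:
    resp = requests.post(
        "http://localhost:8080/v1/chat/completions",
        json={"model": "local", "temperature": 0,
              "messages": [{"role": "user", "content": question}]},
    )
    return resp.json()["choices"][0]["message"]["content"]

correct = sum(expected in ask(q).lower() for q, expected in tests)
print(f"{correct}/{len(tests)} answered correctly")
```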


I don't think it is the training data. I believe it has something to do with the new Sliding Window Attention that the model uses. Until HuggingFace transformers supports the underlying changes to the attention mechanism(s) for mistralai/Ministral-8B-Instruct-2410, I believe the model will have extremely degraded performance and knowledge retrieval. 🤔
[screenshot]
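In the meantime, a quick sanity check is to look at what the checkpoint's config actually advertises for its attention window (just a sketch; field names vary between Mistral releases, and the gated repo may require accepting the license to download the config):

```python
# Sketch: inspect what the loaded checkpoint advertises for its attention window.
# getattr defaults are defensive since field names differ between releases.
from transformers import AutoConfig

cfg = AutoConfig.from_pretrained("mistralai/Ministral-8B-Instruct-2410")
print("model_type:              ", cfg.model_type)
print("sliding_window:          ", getattr(cfg, "sliding_window", None))
print("max_position_embeddings: ", getattr(cfg, "max_position_embeddings", None))
```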

@Joseph717171 Thanks for bringing this to my attention. I look forward to testing it when the attention mechanism is fixed.

The issues go well beyond information retrieval, though, so I hope they're also fixed by this. For example, instruction following is really bad (including in the linked HF space), such as not ending 8 sentences with a given word, or re-writing poems... When fed the following prompt, for instance, it will just repeat the poem back word for word. Mistral 7b, L3.1 8b, Gemma2 9b, Qwen2 7b... all at least attempt to re-write it so it rhymes. These same issues also apply to Mistral Small, so perhaps it has this sliding window issue as well. Thanks again for taking the time to post this. It has given me hope things will improve.

The following limerick doesn’t rhyme. Re-write the limerick so it adheres to the standard AABBA rhyming scheme of a limerick.

There once was a fisherman named Joe,
Who loved to fish in the sea.
With his rod and his dog,
He'd catch fish in the noon day sun,
And he'd say, "This is my passion!"

I confirm most of the results in this thread. My own intuition on the matter is that the pop stuff was downplayed in training to emphasize other things that were considered more important. Gemma 2 9b gives a similar answer on the cortisol question, and I don't think its answer is bad. I think the limerick failure may also just mean training was downplayed for that type of task to concentrate the model's capacity in regimes deemed more important (multilingual reasoning, function calling, etc.). Just my intuition, but it could be verified if somebody has access to La Plateforme and can run some of the pop or wordplay prompts there. I feel like this model is tuned more for enterprise use cases and less for pop culture / wordplay / poem party tricks. We'll see how it does on code; running some benches today.

Code (mostly Python) is why I care about these things in the first place. Do report back.


@steampunque and @Nurb4000 I understand the focus on specific tasks, especially coding and function calling. However, I STRONGLY disagree with this focus for small general-purpose instruct/chat LLMs, for several reasons.

  1. All the companies, including Mistral and Qwen, release specialized LLMs for math and coding, so why also make the instruct/chat versions focused on such things at the expense of everything else?

2. Small LLMs are so bad at math and coding that they're utterly useless for it. For example, the math is so unreliable it can never be trusted, and I always end up doing it myself. Consequently, making small LLMs primarily about math, coding, and function calling is insane to me. It's more about bragging rights (because early adopters are mostly coders and obsess about these things). But in the end all the early-adopter coders use the far larger and more powerful LLMs for such tasks (e.g. Sonnet 3.5 & GPT4o), destroying the usefulness of smaller LLMs for no reason whatsoever.

3. Edge LLMs are meant for the general audience (e.g. built into smartphones & PCs), and the number one complaint from the general population is AI hallucinations about very popular common knowledge. Plus the primary tasks on such edge devices are writing-related (e.g. grammar checking, email writing...). The reason I ask LLMs to do things like re-write poems (which I couldn't give a rat's ass about) is that it immediately reveals the overall language and instruction-following capabilities of an LLM.

Anyways, I can go on and on, but the basic point is that wildly popular information is wildly popular for a reason. The majority of the population cares about it. Once you start aggressively filtering said information out of the corpus when training a general-purpose chat/instruct LLM, it's over. You've made little more than a hallucination generator for >95% of the population, and a briefly used coding, math, and function calling toy for early-adopter coders, who immediately discard it and switch back to far larger and more powerful LLMs for said tasks, such as DeepSeek, GPT4o, and Sonnet 3.5.

PS - I don't care about pop culture and haven't watched a movie or listened to new music in over a decade. All I care about in this context is open-source AI gaining wide adoption, which will never happen if we start aggressively removing the most popular of humanity's information to make coding, math, and function calling LLMs nobody uses. Please take my word for it. This is not the path forward. We need to stop using multiple-choice STEM, math, and coding tests like the MMLU to evaluate large language models. It creates a competitive environment that's destroying AI development and making models useless to everyone, even the STEM, math, and coding obsessed early adopters.


@phil111 Strong coding, function calling, and reasoning are needed to support agents, which most likely was a key goal for the model: 8B for desktop agents and 3B for mobile/edge agents. A strong agent can bypass the need for world knowledge in the weights themselves by going out and finding the info on the internet and then summarizing it for you. I did some testing with a custom agent platform I put together (a model-agnostic agent system for a variety of tasks) and I can report preliminary results show Ministral to be very strong. There are not enough parameters in 3B or 8B to give any kind of comprehensive world knowledge, let alone useless pop culture party-trick / Jeopardy trivia. What knowledge is important to person A is irrelevant to person B, and there are 8 billion people in the world; it's impossible to cater to that kind of dynamic without using an answer agent. For strong comprehensive world knowledge you need to move to the 200B+ models, and that is not the right approach because 1) they are way too expensive to run and 2) they are still not reliable and will still hallucinate, while an answer agent never will.

An example to show what I am talking about, run with Ministral 8B:

lmf search wikipedia for information on "Dear Future Husband" and tell me who sang it, what album it is from, and what year it was released along with a short summary about the song.
CLASS "web"
CAT "get_wikipedia_text"
"Dear Future Husband" is a song by American singer-songwriter Meghan Trainor. It was included on her 2014 extended play, "Title," and later on her 2015 studio album of the same name. Trainor wrote the song with its producer, Kevin Kadish. Epic Records released "Dear Future Husband" as the album's third single on March 17, 2015. A doo-wop and pop song, it has lyrics about chivalry and dating. In the song, Trainor lists things a potential romantic suitor needs to do to win her affection.

Some music critics praised the playful nature of "Dear Future Husband" and compared its lyrics to different Trainor songs, while others were negative about the portrayal of gender roles in its lyrics. In the United States, the song reached number 14 on the Billboard Hot 100 and was certified 3× Platinum by the Recording Industry Association of America. It reached the top 10 in Australia, the Netherlands, Poland, South Africa, and Venezuela and received multi-platinum certifications in Australia and Canada.

Fatima Robinson directed the music video for "Dear Future Husband," which depicts Trainor baking pies in the kitchen and scrubbing floors while various men audition to be her partner. It garnered controversy and online criticism over allegations of antifeminism and sexism.

https://en.wikipedia.org/w/api.php

Another example:

lmf google search web for information on the topic "Friends Jack Gellar" and tell me who played the part
CLASS "web"
CAT "get_web_search"
Elliott Gould played the part of Jack Geller in the NBC sitcom Friends (1994–2004). Jack Geller is the father of Monica and Ross, two of the main characters in the show.

Page URLs:

API used: Google Search
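The general shape of such an answer agent is simple. Here's a minimal sketch (not the custom platform above, just Wikipedia's public REST summary endpoint plus any local OpenAI-compatible server such as llama.cpp's llama-server; port and model name are placeholders):

```python
# Minimal "answer agent" sketch: fetch a Wikipedia summary, then have a local
# model answer from that text instead of from its own weights.
from urllib.parse import quote
import requests

def wiki_summary(title: str) -> str:
    # Wikipedia's public REST summary endpoint.
    url = f"https://en.wikipedia.org/api/rest_v1/page/summary/{quote(title.replace(' ', '_'))}"
    return requests.get(url, headers={"User-Agent": "answer-agent-sketch"}).json().get("extract", "")

def ask_with_context(question: str, context: str) -> str:
    resp = requests.post(
        "http://localhost:8080/v1/chat/completions",
        json={
            "model": "local",
            "temperature": 0,
            "messages": [
                {"role": "system", "content": "Answer only from the provided context."},
                {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
            ],
        },
    )
    return resp.json()["choices"][0]["message"]["content"]

context = wiki_summary("Dear Future Husband")
print(ask_with_context("Who sang it, what album is it from, and what year was it released?", context))
```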

Qwen 2.5 Coder 7B is incredibly strong on code, 0.835 on my HumanEval bench, and strong in code reasoning too (CRUXEval). Structured code seems like a natural fit for transformer-based models, and the trainers are figuring out how to make them better and better even down to 3B levels. If my hunch is correct, Ministral is going to come back strong here on both 3B and 8B to help run agent applications.

@steampunque You articulated the counterargument well.

However, if it can't even attempt to re-write a simple limerick so it rhymes, despite a very simple and clear instruction to do so, then it's not even remotely an "AI" model. Even a person with severe cognitive impairment would at least know what you're asking, attempt to follow such simple instructions, and wouldn't just repeat the poem word for word. This level of idiocy, which happens across a wide variety of tasks (but not with other models like L3.1 8b and G2 9b), is simply inexcusable and cannot be fixed by RAG or anything else. This is inexcusably bad fine-tuning for a chat/instruct LLM. Plain and simple.

And in regards to RAG, that's far from ideal, and in most use cases not even possible. Firstly, the latency of retrieving said information, plus dependence on the internet, isn't ideal. Plus most people are just looking for a quick answer to questions they may have, just like when asking a human (e.g. "His name is Elliott Gould.") and would otherwise just search Wikipedia for it themselves.

I've used RAG (e.g. Perplexity) and stopped. It may help novice internet searchers, but when you know how to search the web you can more effectively search Wikipedia, scientific articles... and get exactly what you're looking for with greater accuracy than with AI powered RAG search, which is often comically off the mark. In short, RAG largely negates the advantages AI has over traditional web searching (e.g. to the point precision, like asking a human expert vs googling).

More importantly, with the large majority of use cases beyond Q&A the information needs to be within the LLM's weights, such as when chatting with the LLM about a particular subject, or requesting fan fiction stories about it. RAG simply doesn't work to seamlessly factor in the relevant information.

Lastly, in regards to the size requirement, the most popular information for the world's people is surprisingly little. When stored in a compressed and indexed database form it amounts to only about 500 MiB of space, smaller than even a 1b LLM, and only about 100 MiB for English speakers. People collectively didn't create that many wildly popular things like movies, nor do they remember more than the top ~100 actors, singers... And Ministral 8b can't even answer basic questions about the top 10 movies and TV shows known to man. And it's not that there wasn't plenty of space to store it, or that they excluded it; they clearly under-trained it, then subsequently trained on trillions of math and coding tokens, scrambling the already weak hold on the info.

This is why it keeps doing things like getting last names wrong. It's only holding on to the information tightly enough to pick it out of a line-up (when its memory is jogged by seeing the correct answer in a list of possibilities during a multiple-choice test), but not tightly enough to accurately retrieve it. They have ample space in 8 BILLION parameters; they just needed to train it for a few more weeks and not subsequently train it on trillions of math and code tokens just so they could make it really really really bad at math and coding instead of just really really really really bad at it.


@Nurb4000 @phil111 Ministral 8B Instruct IQ4_XS HumanEval came back at 77.4 acc. This is close enough to the 76.8 reported by MistralAI that I think it essentially rules out any issues with the weights or conversion being screwed up, or the new "interleaved attention" thing not being implemented yet, i.e. it seems to work without it, at least in these short-context benchmark tests. Qwen 2.5 Coder 7B Q4_K_M hits 85.3 on HumanEval. Qwen 2.5 Coder also has FIM, which Ministral lacks. I have a bunch of other tests running on Q6_K and if I see anything of interest I will post it here. I wouldn't mess with Ministral for code development vs Qwen 2.5, but it's still strong enough to support a robust agent, which is probably all they cared about.
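For anyone who wants to reproduce that kind of number, a rough sketch of a harness (assuming OpenAI's human-eval package and a local OpenAI-compatible completion endpoint; the bench above is custom, so scores won't match exactly) could look like:

```python
# Rough sketch: generate HumanEval completions from a local endpoint and write
# them out for OpenAI's evaluate_functional_correctness script.
# Assumes `pip install human-eval` and an OpenAI-compatible server on :8080.
import requests
from human_eval.data import read_problems, write_jsonl

def complete(prompt: str) -> str:
    resp = requests.post(
        "http://localhost:8080/v1/completions",
        json={"model": "local", "prompt": prompt, "max_tokens": 512, "temperature": 0},
    )
    return resp.json()["choices"][0]["text"]

problems = read_problems()  # 164 tasks; a real harness also needs per-model prompt/stop handling
samples = [
    {"task_id": task_id, "completion": complete(problem["prompt"])}
    for task_id, problem in problems.items()
]
write_jsonl("samples.jsonl", samples)
# Then score with: evaluate_functional_correctness samples.jsonl
```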

@steampunque Yeah, scoring 77.4 on HumanEval is pretty convincing evidence that with short-context exchanges the Ministral conversion is working as intended. Plus the Q4_K_M version is even maintaining good coherency, so there doesn't appear to be another issue, such as with the tokenizer.

Apparently Ministral 8b is just mislabeled as a general-purpose LLM (chat/instruct) like L3.1 8b and Gemma 2 9b Instruct, but is instead a specialized tool/agent model for things like function calling and RAG.


@phil111 Agree to some extent. The limerick thing surprised me; G2 9b plays with it, and I wouldn't have expected a complete fail from a new SOTA instruct model. I'm guessing there was some kind of programmatic overhaul of the training data to concentrate on high-value tokens to boost reasoning and agents while deprioritizing "useless trivia". Maybe they went too far with it, maybe the model is just not running exactly right yet, but for whatever reason it does seem quite deficient in this regard with respect to other SOTA models such as G2 9b, now some 6+ months old.

On agent-based RAG (not embeddings-based RAG database queries, but direct API queries for information such as web and wiki scrapes) I have a different opinion. I think it's possible in many cases for an AI agent to fetch information, scan through it quickly, and mine whatever the user asked for more efficiently than doing it with a web browser. The AIs can read text much faster than a human, thousands of words per second, and this fundamentally gives an efficiency advantage for scraping knowledge quickly out of a large text corpus. In the future, as these things get better, a multimodal AI interface may obsolete web browsers completely: you could just ask something and the AI would go out to the web and find the answer for you in your native language faster than you could tap through UI buttons in any app. This is the direction I see this stuff evolving to be actually useful, and not just a party-trick wordplay/trivia-answering toy.

@steampunque My fundamental belief is that a general-purpose instruct/chat LLM should find a compromise: a balance across the top 100 tasks the majority of the population uses AI for. And yes, this definitely includes information retrieval.

However, we need to keep things in perspective. When extremely large and powerful LLMs like GPT4o exist, run on very fast hardware, cost almost nothing, and know very esoteric facts about fringe culture and science (e.g. the chemicals involved in the Krebs cycle), then the RAG abilities of small local edge LLMs like Ministral aren't nearly as important as they otherwise would be.

Models like Llama 3.1 8b and Gemma 2 9b respect this balance, while models like Qwen2.5, Phi3.5 mini, and Ministral 8b do not. They are becoming extremely lopsided, little more than annoying hallucination vomiters and instruction ignorers for >95% of the population.
