Model Review (RP/ERP)

#4
by MarinaraSpaghetti - opened

UPDATE: This review is out of date, as it seems I've been testing the model on a limited, 8k context, which would explain why it didn't remember anything and why it was "writing so well on high contexts". Eh. For now, take it with a MASSIVE grain of salt.

Howdy! I wrote a review of Hermes 3 405B and decided to share it here as well, in case you (or others reading this) are interested in some feedback and thoughts! Massive thanks for creating this amazing model and for letting us test it freely over on OpenRouter; I've been having a real blast. :) Without further ado, here's the review! Note: I mostly tested its creative-writing and role-playing capabilities!


Hey there, role-playing squad! Hope you've been doing great! Because I sure am, since I'm enjoying summer holidays in always-sunny Italy! Florence is especially beautiful at this time of the year. And gods, do I love Italian food (on my way to gain some weight by stuffing myself with pici all'aglione and bistecca alla fiorentina con lardo)! All the good things come from Italy, like cannolo or pineapple on pizza.

Usually, with my reviews, I target models which I can run locally, however — this time around — I've decided to give some larger models a shot via OpenRouter, given that I'm temporarily away from home. Oh boy, how lucky I was to discover that Hermes 3 405B Instruct by Nous is currently available for free to test out! Yes, you read that right — go give it a try right now, if it's still available (spoiler for the review: it's good)! Thanks so much to NousResearch for giving us the opportunity to play with it!

Screenshot 2024-09-07 at 12.41.42.png

But let's not get ahead of ourselves, shall we? So, how do I run this model? I connected to it via OpenRouter, using SillyTavern as my frontend. As far as I'm aware, you need to have some credits added to your account there to be able to use it (even though it won't actually eat any).
Link: https://openrouter.ai/models/nousresearch/hermes-3-llama-3.1-405b
It's also available on HuggingFace (I envy you, if you're capable of running it): https://huggingface.co./NousResearch/Hermes-3-Llama-3.1-405B
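By the way, if you'd rather poke at the model without a frontend, OpenRouter exposes an OpenAI-compatible chat completions endpoint. Here's a minimal sketch in TypeScript; the model slug is the one from the link above, while the API key variable and the example messages are placeholders of mine:

```typescript
// Minimal sketch of calling Hermes 3 405B through OpenRouter's
// OpenAI-compatible endpoint (Node 18+, which has a global fetch).
// OPENROUTER_API_KEY and the example messages are placeholders.
const response = await fetch("https://openrouter.ai/api/v1/chat/completions", {
  method: "POST",
  headers: {
    Authorization: `Bearer ${process.env.OPENROUTER_API_KEY}`,
    "Content-Type": "application/json",
  },
  body: JSON.stringify({
    model: "nousresearch/hermes-3-llama-3.1-405b",
    messages: [
      { role: "system", content: "You are the Narrator of a long-form roleplay." },
      { role: "user", content: "Describe the scene as 2137 enters the tavern." },
    ],
  }),
});
const data = await response.json();
console.log(data.choices[0].message.content);
```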

As for my settings, here they are. Please note that I had to work around what the provider actually supported!
Story String: https://files.catbox.moe/r1w3ld.json
Instruct: https://files.catbox.moe/yvn2tq.json
Parameters: https://files.catbox.moe/r1efnd.json

Okay, so let's get into the juicy stuff now! Firstly, I don't think I'll be able to go back to smaller models after playing with the big boy that is a 405B model. It just feels so much smarter than what I usually had to use… Anyway, let's jump straight into the points, which are the most important things I pay attention to when testing a model.

  • Context length: This one has 131,072 tokens of context! I tested it on the full thing, and it worked too, surprisingly! Models such as Mistral NeMo and Large wouldn't be capable of handling such lengths at their tighter sizes. However, for personal use, I keep the context at 65,536 — I don't need more than that to be pleased, especially since Hermes struggles a bit with recalling stuff from the context itself (it's a big gripe of mine, but more on that later). The quality of outputs stays relatively the same on high contexts as on low, so that's a big plus!
  • Prose: Ah, zero issues there. I like that it picks up on my style and tries to match it, especially given the amount of effort I put into my own responses. I don't think it will be an issue for those who like to write short replies, though — it's smart enough to be creative when the user isn't, so no worries there! It's also really great at humor, which is a must for me. The only issues I have are the occasional GPTisms, but it seems we can't really escape those.
  • Ability to stay in character: Perfect. My characters were able to recall their backstories and looks with zero issues, regardless of how many times I re-rolled. All on high contexts. Wow, it's amazing what a big upgrade it is to have an instruct format that supports a system prompt (ChatML, in this case); see the sketch right after this list. It sure would be nice if all the other big model providers followed this example and had a PROPER SYSTEM PROMPT IN THEIR FORMAT, PLUS A GOOD SEPARATOR FOR ASSISTANT/USER MESSAGES, NO? But the French know better.
  • Intelligence: Smartest model I've ever used, hands down. Capable of using sarcasm and detecting it too. Connects facts and picks up on emotions expertly. No issues with switching roles between the Narrator and specific characters. It lies and makes decisions on its own, no hand-holding needed. Just wow!
  • Censorship: Non-existent for me, but please keep in mind I only tested it for smut! No issues with some of the more interesting kinks, plus it writes some cool, detailed descriptions for the smexy stuff. Had some good reads.
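Since I keep gushing about ChatML in the list above, here's roughly what a single exchange looks like in that format: a dedicated system slot plus unambiguous separators between turns. The prompt content itself is just a made-up illustration:

```
<|im_start|>system
You are Columbina, a Fatui Harbinger. Stay in character at all times.<|im_end|>
<|im_start|>user
So, where were you last night?<|im_end|>
<|im_start|>assistant
```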

Overall, the model is just great on all fronts thanks to its excellent instruction-following capabilities. However, it does have some issues which I'd like to tackle and provide feedback on, especially since I noticed that folks over at Nous mentioned wanting to improve the role-playing capabilities of Hermes.

  • Repetition issues: Yes, I've noticed this one is a plague in all long-context models. It just loves to reuse phrases it has already used. It especially struggles on lower Temperatures, simply repeating its earlier replies word-for-word. This can be negated somewhat by using DRY and Repetition Penalty (see the sketch after this list for what those samplers do), but neither of those is supported on the OpenRouter instance, so sadly, I had to work around it.
  • Mediocre recall of the context: While Hermes is exceptional at remembering its system prompt, it does forget a lot from the ongoing conversation. It really is annoying when a character responds with 'huh, you never told me that' to something you already told them that IS in the context. It's strange that this happens, given that, according to the creators, the model was trained on multi-turn conversations… but maybe on smaller contexts? If any of the creators are reading this, please consider checking my dataset, which is one continuous RP lasting over 200k tokens — it could be used to improve continuity, plus to teach group chats and role-play.
  • Shivers down your spine: Occasional GPTisms pop up here and there. To me personally, it's not such a big issue, since I'm used to them and I'm guilty of sometimes using them myself to describe things, but to a lot of folks, they're a nightmare. They mostly stem from synthetic training data.
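For the curious, here's what classic Repetition Penalty does on backends that support it: it scales down the logits of tokens that already appeared in the context before sampling (DRY goes further, penalizing continuations that would extend an already-seen sequence). A rough TypeScript sketch with made-up names, not any backend's actual code:

```typescript
// Rough sketch of the classic repetition penalty, not any backend's
// actual implementation. Tokens already present in the context get
// their logits scaled down before the next token is sampled.
function applyRepetitionPenalty(
  logits: number[],        // one logit per vocabulary token
  contextTokens: number[], // token ids already in the chat history
  penalty: number,         // e.g. 1.1; 1.0 disables the penalty
): number[] {
  const seen = new Set(contextTokens);
  return logits.map((logit, tokenId) => {
    if (!seen.has(tokenId)) return logit;
    // Positive logits shrink and negative ones grow more negative,
    // so repeated tokens become less likely either way.
    return logit > 0 ? logit / penalty : logit * penalty;
  });
}

// Example: tokens 0 and 2 already appeared, so their logits get dampened.
const adjusted = applyRepetitionPenalty([2.1, -0.3, 0.8], [0, 2], 1.1);
```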

Now that the feedback has been provided, we can get back to the good stuff! Overall, Hermes is amazing and I absolutely love it! I'll probably continue using it even if it starts costing money, haha! Here are also some cool chat examples of mine, which no one reads, but I provide regardless. I play as 2137, everyone else is AI.

Screenshot 2024-09-07 at 12.13.40.png

The style really feels natural, for the most part. Good job on that!

Screenshot 2024-09-07 at 12.13.53.png

No issue with switching between different writing styles for different characters. Neat!

Screenshot 2024-09-07 at 12.17.28.png

It gets emotional sometimes. Heh. Also, here's a little example of the model lying, which is super cool!

Screenshot 2024-09-07 at 12.18.14.png

Just two bonus examples of its humor. Context: Columbina is blind (at least, in my head canon). Dottore just straight-up murdered her there on the spot, lol.

Screenshot 2024-09-07 at 12.21.05.png

Screenshot 2024-09-07 at 12.21.21.png

And of course, for those of you who are cultured — an ERP sample. This time around, it's from my main RP, because it just so happened that 2137 finally scored some (after 600+ messages, yes).

Screenshot 2024-09-07 at 12.18.36.png

That's it! If you made it this far, thank you for reading, and cheers! Hope you'll enjoy the model! See you again soon and take care! Special thanks to: NousResearch for creating the model, the folks over at the Drummer's server for being a great bunch overall, Mü~ for the wonderful art of 2137 that I got for absolutely free, and the Lambda team over on OpenRouter for hosting the service. Ciao!

Great review!
Very detailed; this is what all reviews should aspire to 👌🏻
Will consider tuning on top of this and see if the areas you mentioned can be improved upon.

That would be wonderful, @SicariusSicariiStuff! Thank you so much for your kind words, and it would be awesome to see Hermes improved even further in terms of role-playing capabilities! It holds so much potential; if the gripes with context recall, GPTisms, and repetition were fixed, it would be perfect.

UPDATE!
I've been testing the model, thinking it was working on full context. Bullshit. It was limited to 8k context this whole time, which I discovered only after disabling the option to trim the middle of the prompt.

Screenshot 2024-09-16 at 11.47.01.png

Honestly, I feel a little cheated. I wish I could recommend it wholeheartedly, but I can't now, knowing my review was based on false notions — especially since I praised it for working on higher contexts. It's not the model creators' fault by any means; the fault lies solely with the providers. From what I gathered, the OpenRouter team now displays the supported token numbers correctly, though. If you don't mind the shorter context, absolutely go for it. It's great regardless.

@MarinaraSpaghetti Where can I find the 'trim the middle of the prompt' option? I don't see it in SillyTavern. And on OpenRouter, a context of 130,000 is currently indicated; is this fake?

It's in SillyTavern's scripts; I pruned it from there. And yes, the indicated context is just what the model is supposed to support in theory. In reality, the actual context length is whatever the providers are offering.

https://github.com/SillyTavern/SillyTavern/blob/staging/src/endpoints/backends/chat-completions.js#L848
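For context: as far as I can tell, the option maps to OpenRouter's 'middle-out' prompt transform, which quietly compresses the middle of an over-long prompt instead of rejecting it. Roughly, the relevant request field looks like this; a sketch under that assumption, not SillyTavern's exact source:

```typescript
// Sketch of the OpenRouter request body (not SillyTavern's exact code).
// The "middle-out" transform silently trims the middle of the prompt
// when it exceeds what the provider actually supports; dropping it
// makes over-long prompts fail visibly instead of being truncated.
const body = {
  model: "nousresearch/hermes-3-llama-3.1-405b",
  messages: [{ role: "user", content: "A very long roleplay history…" }],
  transforms: ["middle-out"], // remove this line to disable the trimming
};
```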

Screenshot 2024-09-16 at 11.33.18.png

Screenshot 2024-09-16 at 11.40.36.png

You can read more about my findings here:

https://www.reddit.com/r/SillyTavernAI/comments/1fi3baf/til_max_output_on_openrouter_is_actually_the/

Yeah, I checked; there's a limit on the context from 8k to 18k, apparently depending on the load on the model. I'd figured this was why my characters simply stopped communicating over time (writing dialogue in quotes) and instead just wrote a bunch of text in asterisks with descriptions of details, etc. The chat itself starts normally, but the longer it gets, the less dialogue there is in it. It seems to me this shouldn't happen with 18k of context, though, so it's strange why it does... It looks more like the model just doesn't know what to do next or what to talk about if you are passive in the roleplay. But I saw this problem in LLMs more than a year ago; hasn't it been fixed in all that time? Even models as heavy as 405B still have the same problem as a year ago.
image.png

I miss the days when Claude was easily accessible for free and I could roleplay for hundreds of messages, with the AI always continuing the dialogue and story and coming up with something interesting. Why these models can't do that, I don't understand.
