Fantastic work
I am pretty darn thrilled with the performance I seem to be getting with this model so far. I've been an enthusiast of creative writing with language models for a while now, something I think most companies, and especially individuals, don't take very seriously or recognize as a legitimate artistic use case. As such, I have been on AI Dungeon's proprietary website for a bit over two years, poking and prodding at their systems while testing everyone else's offerings. IMO they:
- Continue to improve, showing the industry what is possible in the space
- Intelligently manage supporting information like characters, locations, and lots and lots more, in a way I'm not sure there's currently an alternative to
- Have an easy, cross-platform, no-worries reading experience fit for casual users. Infinitely easier than self-hosting
- Demonstrate that they're committed to making sure that the models they're using for their system remain relevant
- Have scaled and now seemingly have larger responsibilities to funders or what have you, without ceasing communications or collaboration with their user base
So yeah, it's proprietary and I'm usually a FLOSS-only kind of person, but I make exceptions, and Latitude's platform is one of them for now. Problem is, I started as a paid member, and one month I was having a tight squeeze financially. I dropped the sub, and when I went back to look at what I'd had, it was simply out of reach financially. Obviously that got me thinking: would I rather Latitude spend their time optimizing costs, lowering the quality of the experience moderately but greatly improving the cost of inference? Or would I rather they push the boundaries of what is possible and create a platform they can optimize that way later, once the performance is so good most people won't notice? I used to not really know the answer to that.
Anywho, the reason I'm writing this up is to congratulate Latitude on this release, which makes that last question obsolete in my mind. This move, in my books, is a better third option than either of the two mentioned previously. Using NeMo as the base so they can keep Apache-2.0, and then choosing to release their finetune publicly, is absolutely HUGE. Even if the performance weren't great, it'd still be a massive win for the industry, which I believe matters more than folks might think for a model like this while larger companies are wilding out. And in my own very subjective preliminary testing, "performance not being great" just doesn't seem to be the case. It's fairly easygoing with generation parameters, creative, driving, and somewhat concise. Perhaps I've got rose-coloured glasses on, but it seems to punch WAY above its weight for its intended use, even surpassing the pretty wild new miscii-14b-1225 finetune in the situations where I've compared them (also Apache-2.0, though I'm not sure what the base is). We're still seeing some slop from the Mistral base (she notes, stretching languidly), but it's very manageable.
I'm not even really asking for this because y'all are homies for doing this to begin with, but I'm very curious what training on their website's generations actually ends up looking like in terms of formatting, like the raw text. Every once in a while I see a few strange characters with the Mistral Tekken v3 templates and related presets in SillyTavern. I'm also getting looooooooong messages that seem to miss the mark on where they need to slap an end token, even with fiddling; the kind of thing I've been working around looks something like the sketch below. I think their custom system, which uses its own little tricks throughout, is likely affecting the formatting the model returns when used outside the system it was trained on, but I'm guessing. Even if that's the case, this one is seriously a banger, and getting it at all (vs. them trying to train for general usage outside their platform) is more than fine by me.
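To be concrete about the fiddling: this is a rough sketch of the band-aid I've been applying on my end, capping generation length and adding explicit stop strings when calling a local llama.cpp server instead of trusting the EOS token. The port, cap, and stop strings are just my guesses, nothing official or recommended by Latitude.

```python
# Minimal workaround sketch: cap n_predict and add explicit stop strings so
# runaway turns get cut off even when the model never emits an EOS token.
# Assumes a local llama.cpp server on its default port; the stop strings are
# illustrative guesses at useful turn boundaries, not anything official.
import requests

def generate(prompt: str) -> str:
    payload = {
        "prompt": prompt,
        "n_predict": 300,              # hard cap on generated tokens
        "temperature": 1.0,
        "stop": ["\n> ", "</s>"],      # guessed turn boundaries
    }
    resp = requests.post("http://127.0.0.1:8080/completion", json=payload, timeout=120)
    resp.raise_for_status()
    return resp.json()["content"]

print(generate("> You open the creaking door.\n"))
```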
Lastly, I'm finding that, more than with most models, the individual responses/messages/rounds play a very large role in the generation's content. Again, I assume this is because of their system, but I'm finding that if, for example, a character is talking, stopping a message and starting a new one is likely to produce a response from another subject, whereas continuing the first message normally produces a longer, more monologue-adjacent speech that doesn't stop until it reaches a conclusion that makes the next subject answering far more likely than the first speaker continuing. I'm quite sure it leans on first person but is capable of other perspectives, quants seem to hold up well, and more than that it just seems to have that kind of wacky AI Dungeon sauce that's likely to zip your story in wildly varying directions with rerolls at even mildly high temperatures. Thank you for doing this, and if your press release is sincere, take this as a vote that it is worthwhile to continue training models: I believe that within even one to two years you could make something pretty special indeed, once a newer small LLM with permissive licensing is released, and potentially with another, more focused training run based on this round's results.
I know this is a novel at this point, but in closing, it really would be incredible if you folks released your formatting template, with your actions: say, do, and story. Formatting for cards and other things too; literally any scraps there will help the performance of this model outside your system. Beyond that, if the software you folks are building were released as a one-time purchase with a year of updates included, or whatever model you could convince stakeholders to get behind, I would be pretty thrilled and couldn't really imagine not buying it. Having a local copy of the system you've created that could either handle local LLM inference itself (probably not smart, but I dunno, maybe y'all are wizards) or, more realistically, let locally hosted or remote APIs like OpenRouter hook into the program and use local models from things like ollama or kobold.cpp, would be amazing. Just don't ever do it without releasing a Linux binary, please. Even a webserver like most of the Gradio and related stuff would be a dream. Not realistic, probably, but potentially worth typing while I'm at it.
But anyways! Wayfarer is great and you should be pretty proud of your first finetune as a company.
Thank you for your post! We are excited to continue releasing models and our findings to push what generative AI can do in the creative space.
- The model does tend to underuse the EOS token since we typically rely on max token length instead
- Our context formatting changes often enough that it would be difficult to keep documentation for it up to date at this time. However, you can always see the input used in your latest action in testing & feedback > inspect input > details > view complete text
- We would love to know if you find Wayfarer helpful for other creative tasks outside of AI Dungeon
Ah! Thanks for the reply, this is great information!
First off, to really observe the minutiae of the LM you'd need to do some significant inference, which I'm not set up for. I'm generally using Q4_K_M, ctx 12228, probably pulling below 10 tokens per second on a partially offloaded setup on a 3060. Essentially, this is all just a vibe check, so take it with a grain of salt, because I'm really 'feeling the AGI' here. If I had to guess, I've probably generated 30-50k output tokens, conservatively.
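For reference, the rig looks roughly like this via llama-cpp-python; the GGUF filename and layer count below are placeholders rather than my exact settings.

```python
# Rough sketch of the partial-offload setup described above using llama-cpp-python.
# Filename and n_gpu_layers are assumptions; tune the layer count to available VRAM.
from llama_cpp import Llama

llm = Llama(
    model_path="Wayfarer-12B-Q4_K_M.gguf",  # hypothetical local filename
    n_ctx=12228,                            # the context size mentioned above
    n_gpu_layers=24,                        # partial offload onto the 3060
)

out = llm("> You step into the tavern.\n", max_tokens=200, temperature=1.0)
print(out["choices"][0]["text"])
```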
I am pretty thrilled to learn about the detailed prompt inspection; looks like I should have dug a bit further! I'd recommend noting the format used when generating data for future training runs, provided those tokens aren't cleaned before training. Mistral models have their own instruct/formatting quirks, but it might even be worth publishing officially recommended inference templates that correspond to the trained model, kind of like a function-calling outline. Even simple Python scripts for internal use that re-render the dataset whenever your system's formatting changes would help keep it consistent and up to date; see the sketch below. It's really hard to guess at what the entire workflow looks like for you folks, so I'm sure a lot of this is redundant or not applicable.
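Purely illustrative sketch of the kind of internal script I mean: re-render stored, structured turns into whatever the current prompt format is, so the training set stays in sync when the formatting changes. Every field name and the template itself are made-up placeholders, not Latitude's actual schema.

```python
# Hypothetical dataset re-render script: read a JSONL dump of structured turns,
# rebuild the "prompt" field with the current formatting rules, and write it back out.
import json

def render_turn(turn: dict) -> str:
    """Render one structured action into the current prompt format (placeholder rules)."""
    kind = turn.get("action_type", "story")   # e.g. "say", "do", "story" -- assumed field
    text = turn["text"]
    if kind == "say":
        return f'> You say "{text}"\n'
    if kind == "do":
        return f"> You {text}\n"
    return text + "\n"

def rebuild_dataset(src: str, dst: str) -> None:
    """Re-render every record in a JSONL dump with the latest formatting."""
    with open(src) as fin, open(dst, "w") as fout:
        for line in fin:
            record = json.loads(line)
            record["prompt"] = "".join(render_turn(t) for t in record["turns"])
            fout.write(json.dumps(record) + "\n")

rebuild_dataset("raw_adventures.jsonl", "formatted_adventures.jsonl")
```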
I could go off about specific observations, but it's mostly shots in the dark. That said, what you've shared in your reply is very interesting. I wonder how the quirks of your system's generated training data might have inadvertently (or on purpose, I guess) had positive effects on the model's behaviour. I find it fascinating how your system has evolved with user feedback, and how its footprint may now be passing insights and patterns we can't easily quantify into the model during training.
The landscape in the writing-model space is full of really smart people at home devoting serious time and non-zero amounts of funding to achieving better 'creative writing.' They've honestly given me some pretty amazing 'wow' moments with what they've accomplished, and even contending with them out of nowhere like this is a huge feat. Seems like it wasn't out of nowhere for you folks, though!
To answer your question about using it outside of its intended use, while all of this is speculative, I'll just fire off my blind observations and hypothesizing. I think a lot of separate proficiencies are required to create anything approaching university-level writing.
First, general coherence. This includes "needle in a haystack" recall, resisting the adoption of user errors, and maintaining appropriate formatting. I believe you folks might have a bit of room to realistically improve over foundational models in this area within the slightly refined 'writing' use case.
Then, there’s world knowledge, which is where the foundational model comes in. Same as reasoning, which is one of the biggest, as in if the water spilled and poured off the table it lands on the __ simplebench performance kinda thing. Math, physics, all that fun stuff, too.
The reason I share any of that is because I believe emotional intelligence is a critical component of this. Financially, there's not a massive incentive to push for anything but raw, smite-the-benchmark performance. At the financial scale of these companies, we're "lucky" if bias or emotional nuance gets handled even as an afterthought, applied with ham-fisted optimization after the initial training (refusals, etc.).
It seems to me like the two notable areas of improvement in this finetune over NeMo are:
One: your system's output seems to have made the model more decisive, attentive, and proactive when it comes to writing, improving slightly almost across the board for the use case, with sporadic and fairly rare examples of the opposite.
Two: there feels like a massive improvement to its baseline of realistic emotional responses to most situations, even when the correct response is modulated by a character's traits or circumstances. Sometimes, of course, there's the opposite. Maybe a bigot in one training example becomes a wild take delivered earnestly in a future generation, or the data just isn't cleaned to perfection, or my perfection isn't someone else's idea of it. Back on the other side of that coin, its best guess at what a person would be feeling, how they might realistically react, and more seems vastly improved. All of this seems best summed up as emotional intelligence to me, and I think its importance is underscored by the fact that people have repeatedly finetuned specifically for it.
Uses obviously start with therapeutic ones (if hallucinations didn't exist), as well as casual conversation and roleplaying.
Then, moving on to creative writing, short form and long form. Lyrics, poems.
Another potential use is human emulation. It could potentially understand circumstances, and how someone might react to them, better than most, and there are real, not-evil use cases for that. Synthetic data comes to mind.
Then we move into communications. Claude absolutely chews up and spits out OpenAI's offerings. o1 is notably even worse in this arena than 4o, which, let's be honest, I wouldn't trust to write a letter for me 0-shot, that's for sure. As they shift to ASI wildness, I doubt writing capability will be their first priority.
Claude is wildly expensive, but even so, most finetunes of small writing-focused models are trained on Claude generations layered over another model, like a mini distillation. Emails, complaints, texts, messaging services like Slack: the fact that I'm even considering testing this model in my local stack for professional, non-creative use over nano itself is kind of mind-blowing. The extra finesse the scenarios or world info or whatever may have afforded this model might be more than you bargained for. Fingers crossed; I could just be wrong, but this is closer to Claude than it is to 4o, at the very least.
This has me very curious about other (open, pretty please!) base models, and about larger-parameter models distilled into smaller ones, trained in the same way as Wayfarer. I believe the noted emergent behaviours, if they exist, might be more profound in a training run with more parameters, and once that converges into model behaviour, it should be possible to distill it into something worth serving to end users.
Also, finetunes for creating summaries, or character cards/world info that accommodate your system, would be primo. For cards and world info, and anything supporting story generation, I'd be very curious about working with DeepSeek-R1. I believe it excels at creating fairly great plotlines, summaries, cards, and whatever else; it's way better than o1. The Qwen 2.5 distill is good, but I hear R1-Zero is incredible, and it's so dang cheap over API. Something like the sketch below is roughly what I have in mind.
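This is purely illustrative, assuming an OpenAI-compatible endpoint like OpenRouter; the model slug and the prompt are my own placeholders, not anything from Latitude or DeepSeek.

```python
# Sketch of generating a character card with R1 over an OpenAI-compatible API.
# Assumes OpenRouter; the model slug is an assumption -- check the provider's list.
from openai import OpenAI

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key="YOUR_KEY_HERE",
)

resp = client.chat.completions.create(
    model="deepseek/deepseek-r1",  # assumed slug
    messages=[
        {
            "role": "user",
            "content": (
                "Write a character card for a weary caravan guard in a low-fantasy "
                "setting: name, appearance, personality, goals, and three secrets. "
                "Keep it under 250 words."
            ),
        }
    ],
)
print(resp.choices[0].message.content)
```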
Sincerely hope any of that is useful, not just a waste of someone’s time. Cheers and thanks again.