Please fire your human evaluators
The lyrics are effective due to their vivid imagery, emotional depth, and narrative structure. They create a mysterious and atmospheric setting with phrases like "moonbeams" and "ancient walls," while also conveying the emotional journey of the traveler. The repetition in the chorus reinforces the central theme, making the song memorable. The poetic language and space for interpretation add layers of intrigue and emotional resonance, making the song both engaging and thought-provoking.
The story demonstrates strong world-building and an engaging narrative. The concept of Aetheria is imaginative, with vivid descriptions of floating mountains, crystal rivers, and mystical creatures that evoke a sense of wonder. The protagonist, Elara, is well-developed, with a clear arc from curiosity to heroism, which makes her relatable and inspiring. The pacing is effective, with a balanced mix of adventure, emotional growth, and moments of tension. The supporting characters, like Solara and Pippin, add depth to the story and provide much-needed contrast to Elara’s character, contributing to both the plot and the tone. However, while the overall structure is solid and the themes of courage and self-discovery are timeless, some aspects of the plot feel familiar, following traditional fantasy tropes. The resolution is uplifting but might benefit from more complexity or surprise to elevate it further. Overall, the story shows strong creative potential, with an imaginative world, a compelling heroine, and an uplifting message.
This poem is powerful for its rich imagery and balance between change and continuity. It uses metaphors like "dance of time" and "tapestry spun" to evoke deep emotional resonance. The poem reflects on embracing change while cherishing memories, making it relatable and philosophical. Its rhythmic flow and universal themes of acceptance and personal growth create a harmonious and reflective reading experience.
Whoever claims to have written those didn't. These reviews were unmistakably written by GPT4, whose style is very easy to recognize. Your human evaluators decided to cheat the system and offloaded all of their work to GPT4, which means your creative writing benchmark does not represent human preference, but rather the preference of GPT4.
absolutely right @ChuckMcSneed
Hi ChuckMcSneed,
Thank you for pointing this out.
It may have caused some confusion that the showcases appeared to be part of the benchmark, which they actually were not. The reported in-house creative writing benchmark was a fully Chinese set serving our previous user distribution, while the English writing showcases were selected independently. We will investigate this issue with the selection process for the English creative writing showcases.
Meanwhile, we’d like to share the design of the reported in-house creative writing benchmark, which was thoroughly verified despite being automatic and pairwise. Each creative writing query is associated with 6 unique checklists of its own: 4 positive dimensions, derived from a triple-blinded “best” writing snippet among five candidates, and 2 negative dimensions. These checklists were verified by native Chinese speakers to be representative and nearly orthogonal. At test time, each model response is compared with this “best” counterpart, given the checklists, by our own judge model. To mitigate position bias and judging instability, we swap the positions of the evaluated response and the “best” response, repeat for a total of three judging turns, and average the comparison scores. The reported value is a normalized version of this pairwise score. Indeed, during experiments we observed the model-human alignment rate degrading to 80% when the two responses are closely comparable (i.e., the judgements differ after the position swap), which shows LLMs' limited ability to judge subtler creative writing details, yet we note that our judge achieves 95% alignment in other cases.
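To make the scoring procedure concrete, here is a minimal Python sketch of the position-swapped pairwise scoring loop; the `judge` callable, its [-1, 1] score scale, and the final normalization are simplified assumptions for illustration, not our actual implementation.

```python
# Illustrative sketch only: the judge callable, its score scale, and the
# normalization below are assumptions, not the actual evaluation pipeline.
from statistics import mean
from typing import Callable, List

# judge(first, second, checklists) -> float in [-1, 1];
# a positive value means the *first* response is preferred given the checklists.
Judge = Callable[[str, str, List[str]], float]

def pairwise_score(candidate: str, best: str, checklists: List[str],
                   judge: Judge, turns: int = 3) -> float:
    """Compare `candidate` against the 'best' reference over several turns,
    swapping the two positions between turns to mitigate position bias,
    then average the comparison scores and normalize to [0, 1]."""
    scores = []
    for turn in range(turns):
        if turn % 2 == 0:
            scores.append(judge(candidate, best, checklists))
        else:
            # Swapped order: negate so the score still reflects the candidate's side.
            scores.append(-judge(best, candidate, checklists))
    raw = mean(scores)        # average comparison score, still in [-1, 1]
    return (raw + 1) / 2      # reported as a normalized pairwise score in [0, 1]

# Toy usage with a dummy judge that weakly prefers the longer response:
if __name__ == "__main__":
    toy_judge: Judge = lambda a, b, cl: max(-1.0, min(1.0, (len(a) - len(b)) / 100))
    print(pairwise_score("candidate response text", "best reference text", [], toy_judge))
```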
I hope this clarifies the confusion! We welcome further discussion on benchmarking creative writing, especially in cases requiring scalable oversight; and now that we understand the limitations of the chosen showcases, we are happy to receive more feedback from the open-source community, which is one of our core motivations for joining it.
The reported in-house creative writing benchmark was a fully Chinese set serving our previous user distribution, while the English writing showcases were selected independently.
Thanks for clarifying that. I don't know about Chinese, but in English Claude Sonnet should have been ranked the highest. It being ranked the lowest was a straight giveaway that something was very wrong.
by our own judge model
So it was evaluated with a judge model? Presenting that alongside human evaluation examples is very misleading. So far my experience with LLM judges has been horrible. They can't give good critique and have a strong bias towards their parent model (i.e. if the judge was trained on GPT data, it will prefer models that sound like GPT), which makes them useless for evaluating creative tasks. Most of them also have a horrible positivity bias and will rank more negative elements lower, even when negativity was what the prompt asked for. Making a model too positive makes it very boring.
yet we note that our judge achieves 95% alignment in other cases
Please check your Chinese human evaluators as well; they may have cheated like they did with the English examples of the benchmark.
I really don't get why you care if it's trained on synthetic data. All models are trained on synthetic data.
I really don't get why you care if it's trained on synthetic data. All models are trained on synthetic data.
I just don't like it. I really hate how almost all modern models sound the same. Go to the lmsys arena, ask for a story, and try to distinguish the models by style. You'll find out how monotonous they are: same overused words, same sentence structure. Synthetic data may be good for reasoning, but it's really awful for creative writing.
Wow. They give us a state-of-the-art 4-million-token context model, for free, and "you just don't like it".
Go use another model or create your own instead of poo-pooing other people's work.
@MiniMax-AI pay him no heed. Your model is excellent and you are appreciated.
@ehartford Did I say I didn't like the model? No. I said that I didn't like how every model sounds the same due to synthetic data. Why are you putting words in my mouth?
I misunderstood your intention, apologies.