# Mistral-Small-24B-Instruct-2501-writer
Mistral-Small-24B-Instruct-2501-writer is a fine-tuned version of mistralai/Mistral-Small-24B-Instruct-2501, optimized specifically for creative writing tasks.
## Performance
The following table was generated by creating 568 stories from the same prompts as the lars1234/story_writing_benchmark dataset and then scoring them with the benchmark's evaluator models.
| Metric | Mistral-2501 | Mistral-Writer | Gemma-Ataraxy |
|---|---|---|---|
| Grammar & Spelling | 82.1% | 83.3% | 88.8% |
| Clarity | 63.0% | 64.1% | 65.8% |
| Logical Connection | 57.7% | 64.1% | 66.0% |
| Scene Construction | 56.1% | 62.0% | 64.1% |
| Internal Consistency | 67.2% | 73.1% | 75.1% |
| Character Consistency | 50.7% | 54.0% | 54.3% |
| Character Motivation | 44.6% | 49.8% | 49.2% |
| Sentence Variety | 57.7% | 64.4% | 64.0% |
| Avoiding Clichés | 24.6% | 33.3% | 31.2% |
| Natural Dialogue | 42.9% | 51.9% | 48.3% |
| Avoiding Tropes | 28.6% | 37.4% | 40.0% |
| Character Depth | 35.7% | 46.4% | 45.4% |
| Character Interactions | 45.0% | 52.0% | 51.7% |
| Reader Interest | 54.1% | 63.1% | 63.0% |
| Plot Resolution | 35.3% | 45.3% | 44.9% |
| Average | 49.3% | 56.5% | 56.1% |
Mistral-Small-24B-Instruct-2501-writer outperforms the base Mistral model on every metric. Gemma-2-Ataraxy still scores higher in several categories, for example "Avoiding Tropes."
## DPO Dataset Creation
The model was fine-tuned using Direct Preference Optimization (DPO), which requires pairs of responses where one is preferred over the other. The pairs were created from the lars1234/story_writing_benchmark dataset using two approaches:
### 1. Language-Based Pairs
- Correct vs. Incorrect Language: For prompts requesting stories in specific languages (English, Spanish, or German), we identified cases where models incorrectly generated text in the wrong language.
- Verification Process: Used fast_langdetect to automatically verify language with high confidence (threshold ≥ 0.8).
- Pair Creation: Stories with correctly detected language were paired as "chosen" against stories with incorrectly detected language as "rejected" for the same prompt.
### 2. Quality-Based Pairs
- Quality Scoring: For stories with correctly detected language, we calculated quality differences based on four metrics:
- q1: Grammar and spelling
- q11: Avoiding tropes
- q12: Character depth
- q14: Reader interest
- Minimum Threshold: Only story pairs with a quality difference of at least 0.4 (on a 1-5 scale) were considered.
- Greedy Selection: The highest-rated story was selected as "chosen" and paired with a lower-rated story as "rejected" for the same prompt.
- Uniqueness: Each story was used in at most one pair.
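The quality-based pairing described above can be sketched roughly as follows. This is a minimal illustration of the greedy, threshold-gated selection, not the actual dataset script; the story record layout (`prompt`, `text`, `q1`, `q11`, `q12`, `q14` keys) and the choice to pair each chosen story with the weakest eligible candidate are assumptions.

```python
# Hypothetical sketch of quality-based DPO pair construction.
# Each story dict is assumed to carry the prompt, the story text, and the
# four quality scores (q1, q11, q12, q14) on a 1-5 scale.

def build_quality_pairs(stories, min_diff=0.4):
    """Greedily pair stories for the same prompt whose mean quality
    scores differ by at least `min_diff`; each story is used at most once."""
    metrics = ("q1", "q11", "q12", "q14")

    # Group stories by prompt, since pairs must share a prompt.
    by_prompt = {}
    for s in stories:
        by_prompt.setdefault(s["prompt"], []).append(s)

    pairs = []
    for prompt, group in by_prompt.items():
        # Sort best-first by mean score across the four metrics.
        group = sorted(group,
                       key=lambda s: sum(s[m] for m in metrics),
                       reverse=True)
        used = set()
        for i, chosen in enumerate(group):
            if i in used:
                continue
            # Scan from the weakest remaining candidate upward.
            for j in range(len(group) - 1, i, -1):
                if j in used:
                    continue
                diff = (sum(chosen[m] for m in metrics)
                        - sum(group[j][m] for m in metrics)) / len(metrics)
                if diff >= min_diff:  # quality gap of at least 0.4
                    pairs.append({"prompt": prompt,
                                  "chosen": chosen["text"],
                                  "rejected": group[j]["text"]})
                    used.update((i, j))  # each story in at most one pair
                    break
    return pairs
```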
The final JSONL dataset contained these pairs in the format:

```json
{"prompt": "Write a story about...", "chosen": "High quality story text...", "rejected": "Lower quality story text..."}
```
See this script for the code.
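Since the dataset is plain JSONL (one JSON object per line), loading it for training or inspection is straightforward. A small helper, shown here only as an illustration:

```python
import json

def load_pairs(path):
    """Read a JSONL file of {"prompt", "chosen", "rejected"} records."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]
```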
## Training Methodology
The model was fine-tuned using Axolotl with the following parameters:
- Base Model: mistralai/Mistral-Small-24B-Instruct-2501
- Adapter: LoRA with r=16, alpha=32
- DPO Beta: 0.1
- Learning Rate: 1e-4
- Optimizer: AdamW with cosine scheduler
- Training Epochs: 1
- Gradient Accumulation Steps: 4
- Micro Batch Size: 2
- Sequence Length: 2048
- Quantization: 4-bit
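For reference, the parameters above could be expressed as an Axolotl-style YAML config along these lines. This is a hedged sketch, not the config actually used: the exact key names (in particular the DPO beta field) may differ between Axolotl versions, and the dataset path is omitted.

```yaml
# Sketch of an Axolotl config matching the listed hyperparameters (assumed key names).
base_model: mistralai/Mistral-Small-24B-Instruct-2501
load_in_4bit: true          # 4-bit quantization

adapter: lora
lora_r: 16
lora_alpha: 32

rl: dpo                     # Direct Preference Optimization
dpo_beta: 0.1

learning_rate: 1e-4
optimizer: adamw_torch
lr_scheduler: cosine
num_epochs: 1
gradient_accumulation_steps: 4
micro_batch_size: 2
sequence_len: 2048
```

With a micro batch size of 2 and 4 gradient accumulation steps, the effective batch size per device is 8.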
## Inference Parameters
A grid search was performed on inference parameters to find optimal generation settings:
- min_p: 0.05 (fixed)
- temperature: 0.5, 0.75, 1.0, 1.25
The most significant quality improvement came from raising the temperature from 0.5 to 0.75; increasing it further caused other quality aspects to degrade.