Mistral-Small-24B-Instruct-2501-writer

Mistral-Small-24B-Instruct-2501-writer is a fine-tuned version of mistralai/Mistral-Small-24B-Instruct-2501, optimized specifically for creative writing tasks.

Performance

The following results were obtained by generating 568 stories from the same prompts as the lars1234/story_writing_benchmark dataset and then scoring them with the benchmark's evaluator models.

| Metric | Mistral-2501 (base) | Mistral-Writer (this model) | Gemma-Ataraxy |
|---|---|---|---|
| Grammar & Spelling | 82.1% | 83.3% | 88.8% |
| Clarity | 63.0% | 64.1% | 65.8% |
| Logical Connection | 57.7% | 64.1% | 66.0% |
| Scene Construction | 56.1% | 62.0% | 64.1% |
| Internal Consistency | 67.2% | 73.1% | 75.1% |
| Character Consistency | 50.7% | 54.0% | 54.3% |
| Character Motivation | 44.6% | 49.8% | 49.2% |
| Sentence Variety | 57.7% | 64.4% | 64.0% |
| Avoiding Clichés | 24.6% | 33.3% | 31.2% |
| Natural Dialogue | 42.9% | 51.9% | 48.3% |
| Avoiding Tropes | 28.6% | 37.4% | 40.0% |
| Character Depth | 35.7% | 46.4% | 45.4% |
| Character Interactions | 45.0% | 52.0% | 51.7% |
| Reader Interest | 54.1% | 63.1% | 63.0% |
| Plot Resolution | 35.3% | 45.3% | 44.9% |
| Average | 49.3% | 56.5% | 56.1% |

Mistral-Small-24B-Instruct-2501-writer outperforms the base Mistral model on every metric. Gemma-2-Ataraxy still scores higher in some categories, for example "Avoiding Tropes."

DPO Dataset Creation

The model was fine-tuned using Direct Preference Optimization (DPO), which requires pairs of responses where one is preferred over the other. The pairs were created from the lars1234/story_writing_benchmark dataset using two approaches:

1. Language-Based Pairs

  • Correct vs. Incorrect Language: For prompts requesting stories in a specific language (English, Spanish, or German), we identified cases where models generated text in the wrong language.
  • Verification Process: Used fast_langdetect to automatically verify each story's language with high confidence (score threshold ≥ 0.8); a sketch of this check follows the list.
  • Pair Creation: Stories with correctly detected language were paired as "chosen" against stories with incorrectly detected language as "rejected" for the same prompt.
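
Below is a minimal sketch of the verification step. It assumes fast_langdetect's `detect()` interface, which returns a language code and a confidence score; the helper function and threshold constant are illustrative, not taken from the actual training script.

```python
# Sketch of the language check, assuming fast_langdetect's detect() API.
from fast_langdetect import detect

CONFIDENCE_THRESHOLD = 0.8  # the threshold described above

def language_is_correct(story_text: str, expected_lang: str) -> bool:
    """True only if the detector agrees with the prompt's language
    ("en", "es", or "de") at high confidence. fastText-based detectors
    expect single-line input, so newlines are stripped first."""
    result = detect(story_text.replace("\n", " "))
    return result["lang"] == expected_lang and result["score"] >= CONFIDENCE_THRESHOLD
```

A story that passes this check can then serve as "chosen" against a same-prompt story that fails it.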

2. Quality-Based Pairs

  • Quality Scoring: For stories with correctly detected language, we calculated quality differences based on four metrics:
    • q1: Grammar and spelling
    • q11: Avoiding tropes
    • q12: Character depth
    • q14: Reader interest
  • Minimum Threshold: Only story pairs with a quality difference of at least 0.4 (on a 1-5 scale) were considered.
  • Greedy Selection: The highest-rated story was selected as "chosen" and paired with a lower-rated story as "rejected" for the same prompt (sketched in the code after this list).
  • Uniqueness: Each story was used in at most one pair.
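
The pairing logic can be sketched as follows. The score field names (`q1`, `q11`, `q12`, `q14`) follow the metric IDs above, but the story dict layout and helper names are hypothetical; the 0.4 threshold, greedy selection, and one-pair-per-story rule match the description.

```python
import json

QUALITY_METRICS = ["q1", "q11", "q12", "q14"]  # grammar, tropes, depth, interest
MIN_DIFF = 0.4  # minimum quality gap on the 1-5 scale

def quality(story: dict) -> float:
    # Mean of the four metrics; assumes each story dict carries its scores.
    return sum(story[m] for m in QUALITY_METRICS) / len(QUALITY_METRICS)

def make_pairs(stories_for_prompt: list[dict]) -> list[dict]:
    """Greedily pair the best remaining story against the next one that is
    at least MIN_DIFF worse, using each story at most once."""
    ranked = sorted(stories_for_prompt, key=quality, reverse=True)
    pairs, used = [], set()
    for i, chosen in enumerate(ranked):
        if i in used:
            continue
        for j in range(i + 1, len(ranked)):
            if j not in used and quality(chosen) - quality(ranked[j]) >= MIN_DIFF:
                pairs.append({
                    "prompt": chosen["prompt"],
                    "chosen": chosen["text"],
                    "rejected": ranked[j]["text"],
                })
                used.update({i, j})
                break
    return pairs

def write_jsonl(pairs: list[dict], path: str = "dpo_pairs.jsonl") -> None:
    # One pair per line, in the format shown below.
    with open(path, "w", encoding="utf-8") as f:
        for pair in pairs:
            f.write(json.dumps(pair, ensure_ascii=False) + "\n")
```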

The final JSONL dataset contained these pairs in the format:

{"prompt": "Write a story about...", "chosen": "High quality story text...", "rejected": "Lower quality story text..."}

See this script for the code.

Training Methodology

The model was fine-tuned using Axolotl with the following parameters:

  • Base Model: mistralai/Mistral-Small-24B-Instruct-2501
  • Adapter: LoRA with r=16, alpha=32
  • DPO Beta: 0.1 (the β in the objective shown after this list)
  • Learning Rate: 1e-4
  • Optimizer: AdamW with cosine scheduler
  • Training Epochs: 1
  • Gradient Accumulation Steps: 4
  • Micro Batch Size: 2
  • Sequence Length: 2048
  • Quantization: 4-bit
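
For reference, DPO minimizes the standard preference loss, with β = 0.1 controlling how strongly the policy is kept close to the frozen reference model:

$$
\mathcal{L}_{\mathrm{DPO}} = -\,\mathbb{E}_{(x,\,y_w,\,y_l)}\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\right)\right]
$$

Here $y_w$ and $y_l$ are the "chosen" and "rejected" stories for prompt $x$, $\pi_\theta$ is the LoRA-adapted policy, and $\pi_{\mathrm{ref}}$ is the frozen base model.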

Inference Parameters

A grid search was performed on inference parameters to find optimal generation settings:

  • min_p: 0.05 (fixed)
  • temperature: 0.5, 0.75, 1.0, 1.25

The largest quality improvement came from raising the temperature from 0.5 to 0.75; above 0.75, other quality metrics began to decline.
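
As an illustration, these settings map onto standard transformers generation arguments (min_p sampling requires a reasonably recent transformers release); the loading code below is a generic sketch, not the exact benchmark harness.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "lars1234/Mistral-Small-24B-Instruct-2501-writer"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, device_map="auto", torch_dtype="auto"
)

messages = [{"role": "user", "content": "Write a story about a lighthouse keeper."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

# Settings from the grid search: min_p fixed at 0.05, temperature 0.75
# (the best trade-off found).
output = model.generate(
    input_ids,
    do_sample=True,
    temperature=0.75,
    min_p=0.05,
    max_new_tokens=1024,
)
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```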
