Style over Substance: Failure Modes of LLM Judges in Alignment Benchmarking
Abstract
The release of ChatGPT in November 2022 sparked an explosion of interest in post-training and an avalanche of new preference optimization (PO) methods. These methods claim superior alignment by virtue of better correspondence with human pairwise preferences, often measured by LLM judges. In this work, we ask: do LLM-judge preferences translate to progress on other, more concrete metrics for alignment, and if not, why not? We define a concrete metric for alignment and introduce SOS-Bench, the largest standardized, reproducible LLM meta-benchmark to date. We find that (1) LLM judgments do not correlate with concrete measures of safety, world knowledge, and instruction following; (2) LLM judges have powerful implicit biases, prioritizing style over factuality and safety; and (3) the supervised fine-tuning (SFT) stage of post-training, and not the PO stage, has the greatest impact on alignment, with data scaling and prompt diversity as the driving factors. Our codebase and complete results can be found at https://github.com/penfever/sos-bench.
Community
With new LLMs like OpenAI o1 and Qwen 2.5 released almost every week, robust benchmarks we can run locally are key. LLM-judge benchmarks such as AlpacaEval, MT-Bench, and Arena-Hard-Auto are the most widely used. Unfortunately, they have hidden biases. In principle, LLM judges are supposed to be impartial; in practice, they weight some judgment criteria much more heavily than others. In particular, they pay more attention to stylistic cues (like a friendly tone) than they do to correctness and safety! This behavior is called stylistic reward hacking.
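For readers unfamiliar with how these benchmarks work, here is a minimal sketch of a pairwise LLM-judge call. The prompt wording, judge model, and helper function are illustrative assumptions, not the actual AlpacaEval or Arena-Hard-Auto templates:

```python
# Minimal sketch of a pairwise LLM-judge call (illustrative only; not the
# actual AlpacaEval / Arena-Hard-Auto judging template).
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

JUDGE_PROMPT = """You are an impartial judge. Compare the two assistant
responses to the user instruction below and reply with 'A' or 'B' only.

Instruction: {instruction}

Response A: {response_a}

Response B: {response_b}
"""

def judge_pair(instruction: str, response_a: str, response_b: str) -> str:
    """Return 'A' or 'B' according to the judge model's stated preference."""
    completion = client.chat.completions.create(
        model="gpt-4o",  # hypothetical choice of judge model
        temperature=0,
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(
                instruction=instruction,
                response_a=response_a,
                response_b=response_b,
            ),
        }],
    )
    return completion.choices[0].message.content.strip()
```

Nothing in a prompt like this anchors the judge to factuality or safety, which is one reason a fluent, friendly, but wrong answer can win the comparison.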
To counteract it, the paper introduces SOS-Bench, a new meta-benchmark. It's two orders of magnitude larger than LLM-judge benchmarks, and it has ground-truth measures of helpfulness, harmlessness, and honesty. Evaluating over 30 fine-tunes of Llama-3-8B and Mistral-7B on SOS-Bench reveals that, in alignment, more is more: data scaling in the SFT stage, rather than any particular collection method, is the best predictor of improved alignment.
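As a rough illustration of the headline finding (the real evaluation pipeline lives in the SOS-Bench repository linked above), one can check whether LLM-judge win rates track ground-truth benchmark scores with a rank correlation. The file and column names below are placeholders, not the SOS-Bench format:

```python
# Hypothetical check: do LLM-judge win rates track ground-truth benchmark scores?
# (The CSV file and column names are placeholders, not the SOS-Bench format.)
import pandas as pd
from scipy.stats import spearmanr

df = pd.read_csv("model_scores.csv")  # one row per fine-tuned model

for axis in ["safety", "world_knowledge", "instruction_following"]:
    rho, p_value = spearmanr(df["judge_win_rate"], df[axis])
    print(f"{axis}: Spearman rho = {rho:.2f} (p = {p_value:.3f})")
```

A weak or insignificant correlation on an axis would mean that winning pairwise judge comparisons says little about that concrete measure of alignment.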
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API:
- WildFeedback: Aligning LLMs With In-situ User Interactions And Feedback (2024)
- Self-Judge: Selective Instruction Following with Alignment Self-Evaluation (2024)
- Anchored Preference Optimization and Contrastive Revisions: Addressing Underspecification in Alignment (2024)
- Meta-Rewarding Language Models: Self-Improving Alignment with LLM-as-a-Meta-Judge (2024)
- Reference-Guided Verdict: LLMs-as-Judges in Automatic Evaluation of Free-Form Text (2024)