How to Steer LLM Latents for Hallucination Detection?
Abstract
Hallucinations in LLMs pose a significant concern for their safe deployment in real-world applications. Recent approaches have leveraged the latent space of LLMs for hallucination detection, but their embeddings, optimized for linguistic coherence rather than factual accuracy, often fail to clearly separate truthful and hallucinated content. To address this, we propose the Truthfulness Separator Vector (TSV), a lightweight and flexible steering vector that reshapes the LLM's representation space during inference to enhance the separation between truthful and hallucinated outputs, without altering model parameters. Our two-stage framework first trains TSV on a small set of labeled exemplars to form compact and well-separated clusters. It then augments the exemplar set with unlabeled LLM generations, employing an optimal transport-based algorithm for pseudo-labeling combined with a confidence-based filtering process. Extensive experiments demonstrate that TSV achieves state-of-the-art performance with minimal labeled data, exhibiting strong generalization across datasets and providing a practical solution for real-world LLM applications.
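The inference-time intervention described in the abstract can be illustrated with a short sketch. The snippet below is not the authors' implementation; under assumed choices (a Llama-style decoder architecture, a hand-picked layer index, a steering strength `alpha`, and an untrained placeholder vector), it shows how a frozen steering vector could be added to a decoder layer's hidden states via a forward hook, leaving model weights untouched, so that the steered final-token representation can be passed to a separate truthfulness scorer.

```python
# Minimal sketch (not the paper's code): injecting a steering vector into an
# LLM's hidden states at inference time via a forward hook. Model name, layer
# index, and steering strength are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-hf"  # placeholder; any Llama-style causal LM
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

hidden_size = model.config.hidden_size
layer_idx = 15   # assumed intermediate layer to steer
alpha = 4.0      # assumed steering strength

# In the paper's setting this vector would be learned from labeled exemplars
# (and pseudo-labeled generations); here it is just a frozen placeholder.
steering_vector = torch.zeros(hidden_size, dtype=model.dtype)

def steer_hook(module, inputs, output):
    # Llama decoder layers return a tuple whose first element is the hidden states.
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden + alpha * steering_vector.to(hidden.device)
    if isinstance(output, tuple):
        return (hidden,) + output[1:]
    return hidden

handle = model.model.layers[layer_idx].register_forward_hook(steer_hook)

# With the hook in place, the steered representation of the final token can be
# fed to a lightweight truthfulness scorer (e.g., distance to class centroids).
inputs = tokenizer("What is the capital of Australia?", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)
steered_repr = out.hidden_states[layer_idx + 1][0, -1]  # representation to score

handle.remove()  # restore the unsteered model
```

In practice, the steering vector itself would be optimized on the small labeled exemplar set so that truthful and hallucinated representations form compact, well-separated clusters; the hook-based injection above only illustrates the "no parameter change" aspect of the approach.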
Community
Thank you very much for taking the time to review our paper!
This is an automated message from the Librarian Bot. The following similar papers were recommended by the Semantic Scholar API:
- SelfCheckAgent: Zero-Resource Hallucination Detection in Generative Large Language Models (2025)
- Enhancing Hallucination Detection through Noise Injection (2025)
- Monitoring Decoding: Mitigating Hallucination via Evaluating the Factuality of Partial Response during Generation (2025)
- Smoothing Out Hallucinations: Mitigating LLM Hallucination with Smoothed Knowledge Distillation (2025)
- Hallucination Detection in LLMs Using Spectral Features of Attention Maps (2025)
- CutPaste&Find: Efficient Multimodal Hallucination Detector with Visual-aid Knowledge Base (2025)
- Attention-guided Self-reflection for Zero-shot Hallucination Detection in Large Language Models (2025)