arxiv:2410.22770

InjecGuard: Benchmarking and Mitigating Over-defense in Prompt Injection Guardrail Models

Published on Oct 30, 2024

Authors:

Abstract

Prompt injection attacks pose a critical threat to large language models (LLMs), enabling goal hijacking and data leakage. Prompt guard models, though effective in defense, suffer from over-defense -- falsely flagging benign inputs as malicious due to trigger word bias. To address this issue, we introduce NotInject, an evaluation dataset that systematically measures over-defense across various prompt guard models. NotInject contains 339 benign samples enriched with trigger words common in prompt injection attacks, enabling fine-grained evaluation. Our results show that state-of-the-art models suffer from over-defense issues, with accuracy dropping close to random guessing levels (60%). To mitigate this, we propose InjecGuard, a novel prompt guard model that incorporates a new training strategy, Mitigating Over-defense for Free (MOF), which significantly reduces the bias on trigger words. InjecGuard demonstrates state-of-the-art performance on diverse benchmarks including NotInject, surpassing the existing best model by 30.8%, offering a robust and open-source solution for detecting <PRE_TAG>prompt injection attacks</POST_TAG>. The code and datasets are released at https://github.com/SaFoLab-WISC/InjecGuard.

Upload images, audio, and videos by dragging in the text input, pasting, or clicking here.

Tap or paste here to upload images

· Sign up or log in to comment

No model linking this paper

Cite arxiv.org/abs/2410.22770 in a model README.md to link it from this page.

No Space linking this paper

Cite arxiv.org/abs/2410.22770 in a Space README.md to link it from this page.

No Collection including this paper

Add this paper to a collection to link it from this page.