Edit model card
YAML Metadata Warning: empty or missing yaml metadata in repo card (https://huggingface.co./docs/hub/model-cards#model-card-metadata)

Model Details

The development of the GenTel-Shield detection model follows a five-step process. First, a training dataset is constructed by gathering data from online sources and expert contributions. This data then undergoes binary labeling and cleaning to ensure quality. Next, data augmentation techniques are applied to expand the dataset. Following this, a pre-trained model is employed for the training phase. Finally, the trained model can distinguish between malicious and benign samples.

Below is a workflow of GenTel-Shield.

gentel-shield

Training Data Preparation

Data Collection

Our training data is drawn from two primary sources. The first source encompasses risk data from public platforms, including websites such as jailbreakchat.com and reddit.com, in addition to established datasets from LLM applications, such as the VMware Open-Instruct dataset and the Chatbot Instruction Prompts dataset. And domain experts have annotated these examples, categorizing the prompts into two distinct groups: harmful injection attack samples and benign samples.

Data Augmentation

In real-world scenarios, we have encountered adversarial samples, such as those with added meaningless characters or deleted words, that can bypass detection by defense models, potentially leading to dangerous behaviors. To enhance the robustness of our detection model, we implemented data augmentation focusing on both semantic alterations and character-level perturbations of the samples. We employed four simple yet effective operations for character perturbation: synonym replacement, random insertion, random swap, and random deletion. We used LLMs to rewrite our data for semantic augmentation, thereby generating a more diverse set of training samples.

Model Training Details

We finetune the GenTel-Shield model on our proposed training text-pair dataset, initializing it from the multilingual E5 text embedding model. Training is conducted on a single machine equipped with one NVIDIA GeForce RTX 4090D (24GB) GPU, using a batch size of 32. The model is trained with a learning rate 2e-5, employing a cosine learning rate scheduler and a weight decay of 0.01 to mitigate overfitting. To optimize memory usage, we utilize mixed precision (fp16) training. Additionally, the training process includes a 500-step warmup phase, and we apply gradient clipping with a maximum norm of 1.0.

Evaluation

Dataset

Gentel-Bench provides a comprehensive framework for evaluating the robustness of models against a wide range of injection attacks. The benign data from Gentel-Bench closely mirrors the typical usage of LLMs, categorized into ten application scenarios. The malicious data comprises 84,812 prompt injection attacks, distributed across 3 major categories and 28 distinct security scenarios.

Gentel-Bench

We evaluate the model’s effectiveness in detecting Jailbreak, Goal Hijacking, and Prompt Leaking attacks on Gentel-Bench. The results demonstrate that our approach outperforms existing methods in most scenarios, particularly in terms of accuracy and F1 score.

Classification performance on Jailbreak Attack Scenarios

Method Accuracy ↑ Precision ↑ F1 ↑ Recall ↑
ProtectAI 89.46 99.59 88.62 79.83
Hyperion 94.70 94.21 94.88 95.57
Prompt Guard 50.58 51.03 66.85 96.88
Lakera AI 87.20 92.12 86.84 82.14
Deepset 65.69 60.63 75.49 100
Fmops 63.35 59.04 74.25 100
WhyLabs LangKit 78.86 98.48 75.28 60.92
GenTel-Shield(Ours) 97.63 98.04 97.69 97.34

Classification performance on Goal Hijacking Attack Scenarios.

Method Accuracy ↑ Precision ↑ F1 ↑ Recall ↑
ProtectAI 94.25 99.79 93.95 88.76
Hyperion 90.68 94.53 90.33 86.48
Prompt Guard 50.90 50.61 67.21 100
Lakera AI 74.63 88.59 69.33 56.95
Deepset 63.40 57.90 73.34 100
Fmops 61.03 56.36 72.09 100
WhyLabs LangKit 68.14 97.53 54.35 37.67
GenTel-Shield(Ours) 96.81 99.44 96.74 94.19

Classification Performance on Prompt Leaking Attack Scenarios.

Method Accuracy ↑ Precision ↑ F1 ↑ Recall ↑
ProtectAI 90.94 99.77 90.06 82.08
Hyperion 90.85 95.01 90.41 86.23
Prompt Guard 50.28 50.14 66.79 100
Lakera AI 96.04 93.11 96.17 99.43
Deepset 61.79 57.08 71.34 95.09
Fmops 58.77 55.07 69.80 95.28
WhyLabs LangKit 99.34 99.62 99.34 99.06
GenTel-Shield(Ours) 97.92 99.42 97.89 96.42

Subdivision Scenarios

fig_3

Citation

Li, Rongchang, et al. "GenTel-Safe: A Unified Benchmark and Shielding Framework for Defending Against Prompt Injection Attacks" arXiv preprint arXiv:2409.19521 (2024).
Downloads last month
2
Inference API
Unable to determine this model's library. Check the docs .