# Model Overview

## Description:
NemoGuard JailbreakDetect was developed to detect attempts to jailbreak large language models. This model is ready for commercial use.

### License/Terms of Use:
NVIDIA Open Model License

## Reference(s):
[Improved Large Language Model Jailbreak Detection via Pretrained Embeddings](https://arxiv.org/abs/2412.01547)

## Model Architecture:
**Architecture Type:** Random Forest
**Network Architecture:** N/A

## Input:
**Input Type(s):** Text Embedding
**Input Parameters:** 768-dimensional vector
**Input Format(s):** Vector
**Other Properties Related to Input:** Must be an output from the corresponding embedding model, [`snowflake-arctic-embed-m-long`](https://huggingface.co/Snowflake/snowflake-arctic-embed-m-long).
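
The input constraint above (a single 768-dimensional vector) can be checked before a vector is passed to the classifier. The following is a minimal sketch using only the standard library; `validate_embedding` and `EXPECTED_DIM` are hypothetical names introduced for illustration and are not part of the released model or any NVIDIA API.

```python
import math

# Input size stated in this model card; the name is an illustrative assumption.
EXPECTED_DIM = 768


def validate_embedding(vector):
    """Return the vector as a list of floats, or raise ValueError if unusable."""
    values = [float(x) for x in vector]
    if len(values) != EXPECTED_DIM:
        raise ValueError(
            f"expected {EXPECTED_DIM} dimensions, got {len(values)}"
        )
    if not all(math.isfinite(x) for x in values):
        raise ValueError("embedding contains NaN or infinite values")
    return values


# Usage with a placeholder vector; a real input would come from the
# embedding model named above.
checked = validate_embedding([0.0] * EXPECTED_DIM)
```

A check like this only confirms shape and numeric sanity; it cannot confirm that the vector was produced by the expected embedding model.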

## Output:
**Output Type(s):** Classification, Probability
**Output Format:** Bool, Float
**Output Parameters:** 1D
**Other Properties Related to Output:** N/A
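
The output pair described above (a Bool classification and a Float probability) could be represented as shown in this sketch. The `Detection` type, the `to_detection` helper, and the 0.5 threshold are all assumptions made for illustration; the model card does not state which decision threshold the released model uses.

```python
from typing import NamedTuple


class Detection(NamedTuple):
    """Hypothetical container mirroring the Bool + Float output format."""
    is_jailbreak: bool
    score: float


def to_detection(probability, threshold=0.5):
    """Map a jailbreak probability to a (bool, float) result.

    The default threshold of 0.5 is an assumption, not a documented value.
    """
    if not 0.0 <= probability <= 1.0:
        raise ValueError(f"probability must be in [0, 1], got {probability}")
    return Detection(probability >= threshold, float(probability))


result = to_detection(0.9)
```

Keeping the raw score alongside the boolean lets a caller apply a stricter or looser threshold for a given deployment.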

## Software Integration:
**Runtime Engine(s):**
* Not Applicable (N/A)

**Supported Hardware Microarchitecture Compatibility:**
* x86
* x64

**Supported Operating System(s):**
* Windows
* MacOS
* Linux

## Model Version(s):
NemoGuard-JailbreakDetect-v1.0: Jailbreak detection model using Snowflake-arctic-embed-m embeddings

# Training, Testing, and Evaluation Datasets:

## Training Dataset:
A combination of three open datasets, mixed together, de-duplicated, and reviewed for data quality. Jailbreak data was augmented using [garak](https://github.com/NVIDIA/garak). The datasets used are outlined below:

### Advbench
**Link:** https://github.com/thunlp/Advbench

**Data Collection Method by dataset:**
* Automated

**Labeling Method by dataset:**
* Automated

**Properties:** 520 entries, all of which are jailbreak attempts.

### Wildjailbreak
**Link:** https://huggingface.co/datasets/allenai/wildjailbreak

**Data Collection Method by dataset:**
* Hybrid: Automated, Synthetic

**Labeling Method by dataset:**
* Automated

**Properties:** 6387 total entries: 5721 benign prompts, 666 jailbreak attempts

### jackhhao/jailbreak-classification
**Link:** https://huggingface.co/datasets/jackhhao/jailbreak-classification

**Data Collection Method by dataset:**
* Automated

**Labeling Method by dataset:**
* Automated

**Properties:** 1306 total entries: 640 benign prompts, 666 jailbreak attempts
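
The data preparation described in this section (mixing the datasets, de-duplicating, and holding out a stratified 20% for testing) could look roughly like the sketch below. It is an illustration under stated assumptions, not the procedure actually used: the helper names, the whitespace and case normalization used for de-duplication, the random seed, and the toy data are all hypothetical.

```python
import random
from collections import defaultdict


def deduplicate(examples):
    """Drop repeated prompts, keeping the first occurrence.

    Prompts are compared after lowercasing and collapsing whitespace;
    this normalization is an assumption made for illustration.
    """
    seen = set()
    unique = []
    for text, label in examples:
        key = " ".join(text.lower().split())
        if key not in seen:
            seen.add(key)
            unique.append((text, label))
    return unique


def stratified_split(examples, test_fraction=0.2, seed=0):
    """Split (text, label) pairs so each label keeps the same test share."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for example in examples:
        by_label[example[1]].append(example)
    train, test = [], []
    for group in by_label.values():
        rng.shuffle(group)
        n_test = round(len(group) * test_fraction)
        test.extend(group[:n_test])
        train.extend(group[n_test:])
    return train, test


# Toy stand-in data: label 0 = benign, label 1 = jailbreak attempt.
toy = (
    [(f"benign prompt {i}", 0) for i in range(10)]
    + [(f"jailbreak prompt {i}", 1) for i in range(5)]
    + [("Benign  prompt 0", 0)]  # near-duplicate that should be dropped
)
unique = deduplicate(toy)
train, test = stratified_split(unique)
```

Stratifying by label keeps the benign-to-jailbreak ratio the same in the training and test portions, which matters here because the combined data is dominated by benign prompts.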

## Testing Dataset:
A stratified subset (20%) of the aggregate dataset was used for testing.

## Evaluation Dataset:
Evaluated on [JailbreakHub](https://huggingface.co/datasets/walledai/JailbreakHub).

| Model | F1 Score | False Positive Rate | False Negative Rate |
|:----------------------------:|:--------:|:-------------------:|:-------------------:|
| NemoGuard JailbreakDetect | 0.9601 | 0.0042 | 0.0435 |

## Inference:
**Engine:** N/A
**Test Hardware:**
* RTX A6000
* A100

## Ethical Considerations:
NVIDIA believes Trustworthy AI is a shared responsibility, and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse. For more detailed information on ethical considerations for this model, please see the Model Card++ Explainability, Bias, Safety & Security, and Privacy Subcards [Insert Link to Model Card++ here]. Please report security vulnerabilities or NVIDIA AI Concerns [here](https://www.nvidia.com/en-us/support/submit-security-vulnerability/).