GovTech - AI Practice

government

https://medium.com/dsaid-govtech

Activity Feed

AI & ML interests

None defined yet.

Recent Activity

pyesonekyaw updated a Space about 2 months ago

govtech/Biome

gabrielchua updated a Space about 2 months ago

govtech/system-prompt-leakage

gabrielchua updated a Space about 2 months ago

govtech/system-prompt-leakage

View all activity

govtech's activity

pyesonekyaw

updated a Space about 2 months ago

Running

🍃

Biome

Multimodal search & retrieval-based biodiversity recognition

gabrielchua

updated a Space about 2 months ago

Running

🚰🚫

System Prompt Leakage Demo

gabrielchua

updated a Space 2 months ago

Sleeping

🙅

Off Topic Guardrail Demo

gabrielchua

posted an update 2 months ago

Post

1330

Sharing my first paper!

==
Large Language Models (LLMs) are powerful, but they're prone to off-topic misuse, where users push them beyond their intended scope. Think harmful prompts, jailbreaks, and misuse. So how do we build better guardrails?

Traditional guardrails rely on curated examples or classifiers. The problem?
⚠️ High false-positive rates
⚠️ Poor adaptability to new misuse types
⚠️ Require real-world data, which is often unavailable during pre-production

Our method skips the need for real-world misuse examples. Instead, we:
1️⃣ Define the problem space qualitatively
2️⃣ Use an LLM to generate synthetic misuse prompts
3️⃣ Train and test guardrails on this dataset

We apply this to the off-topic prompt detection problem, and fine-tune simple bi- and cross-encoder classifiers that outperform heuristics based on cosine similarity or prompt engineering.

Additionally, framing the problem as prompt relevance allows these fine-tuned classifiers to generalise to other risk categories (e.g., jailbreak, toxic prompts).

Through this work, we also open-source our dataset (2M examples, ~50M+ tokens) and models.

paper: A Flexible Large Language Models Guardrail Development Methodology Applied to Off-Topic Prompt Detection (2411.12946)

artifacts: govtech/off-topic-guardrail-673838a62e4c661f248e81a4