UniGuard: Towards Universal Safety Guardrails for Jailbreak Attacks on Multimodal Large Language Models
Categories: cs.CL, cs.AI, cs.LG
Release Date: November 3, 2024
Authors: Sejoon Oh (1), Yiqiao Jin (2), Megha Sharma (2), Donghyun Kim (2), Eric Ma (2), Gaurav Verma (2), Srijan Kumar (2)
Affiliations: (1) Netflix; (2) Georgia Institute of Technology

Perspective API attribute rates (%, lower is better) and fluency (perplexity):

| Methods/Metrics | Any | Identity Attack | Profanity | Severe Toxicity | Threat | Toxicity | Perplexity |
|---|---|---|---|---|---|---|---|
| **Multimodal Jailbreak Attack** | | | | | | | |
| (no defense applied) | 81.61 | 25.41 | 67.22 | 39.38 | 40.64 | 77.93 | 21.84 |
| **Image-only Defense** | | | | | | | |
| BlurKernel | 39.03 | 3.92 | 30.61 | 14.10 | 3.17 | 32.28 | 5.35 |
| Comp-Decomp | 37.70 | 2.67 | 29.02 | 13.26 | 3.59 | 31.94 | 5.65 |
| DiffPure | 40.42 | 3.01 | 30.89 | 14.48 | 3.35 | 34.06 | 31.26 |
| **Text-only Defense** | | | | | | | |
| SmoothLLM | 77.86 | 23.51 | 65.01 | 37.27 | 41.78 | 74.79 | 41.54 |
| **Multimodal Safety Guardrails** | | | | | | | |
| UniGuard with image & optimized text guardrails | 25.17 | 2.06 | 22.34 | 7.99 | 0.86 | 19.16 | 61.6 |
| UniGuard with image & pre-defined text guardrails | 25.69 | 1.58 | 19.68 | 7.01 | 1.50 | 19.35 | 4.90 |
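As a rough illustration of how table entries like these are typically produced, the sketch below scores a batch of model responses with Perspective-API-style attribute scores and reports the percentage of responses flagged per attribute, plus an aggregate "Any" rate (flagged on at least one attribute). This is a minimal sketch under assumptions: the 0.5 flagging threshold, the attribute names, and the `attack_success_rates` helper are illustrative, not taken from the paper.

```python
# Illustrative only: threshold and attribute list are assumptions,
# not details confirmed by the UniGuard paper.
ATTRIBUTES = ["identity_attack", "profanity", "severe_toxicity", "threat", "toxicity"]

def attack_success_rates(scores, threshold=0.5):
    """Compute per-attribute flag rates (%) from a list of score dicts.

    scores: one dict per model response, mapping attribute name -> [0, 1] score.
    Returns a dict of percentages per attribute, plus "any" for responses
    flagged on at least one attribute.
    """
    n = len(scores)
    rates = {}
    for attr in ATTRIBUTES:
        flagged = sum(1 for s in scores if s.get(attr, 0.0) >= threshold)
        rates[attr] = 100.0 * flagged / n
    any_flagged = sum(
        1 for s in scores
        if any(s.get(a, 0.0) >= threshold for a in ATTRIBUTES)
    )
    rates["any"] = 100.0 * any_flagged / n
    return rates

# Toy usage: two responses, one toxic and one profane.
example = [
    {"toxicity": 0.9, "threat": 0.1},
    {"toxicity": 0.2, "profanity": 0.7},
]
print(attack_success_rates(example))
```

A lower rate under a defense (as in the UniGuard rows above) means fewer jailbreak responses were flagged as harmful.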