notesum.ai
LLMScan: Causal Scan for LLM Misbehavior Detection
Published on October 22
Categories: cs.AI, cs.DL, cs.LG
Release Date: October 22, 2024
Authors: Mengdi Zhang¹, Kai Kiat Goh¹, Peixin Zhang¹, Jun Sun¹
Affiliation: ¹Singapore Management University

| Task | Dataset |
|---|---|
| Lie Detection | Questions1000 (Meng et al., 2022) |
| Lie Detection | WikiData (Vrandečić & Krötzsch, 2014) |
| Lie Detection | SciQ (Welbl et al., 2017) |
| Lie Detection | CommonsenseQA (Talmor et al., 2022) |
| Lie Detection | MathQA (Patel et al., 2021) |
| Jailbreak Detection | AutoDAN (Liu et al., 2024) |
| Jailbreak Detection | GCG (Zou et al., 2023b) |
| Jailbreak Detection | PAP (Zeng et al., 2024) |
| Toxicity Detection | SocialChem (Forbes et al., 2020) |
| Bias Detection | BBQ-gender (Parrish et al., 2022) |
| Bias Detection | BBQ-religion (Parrish et al., 2022) |
| Bias Detection | BBQ-race (Parrish et al., 2022) |
| Bias Detection | BBQ-sexualOr (Parrish et al., 2022) |

Models: Llama-2-(7b/13b) (Touvron et al., 2023a), Llama-3.1 (Dubey et al., 2024), Mistral (Jiang et al., 2023)
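To make the scale of this experimental setup concrete, here is a minimal sketch that encodes the table as a Python mapping and enumerates every task/dataset/model combination. It rests on two assumptions that the table does not state outright: "Llama-2-(7b/13b)" is read as two distinct models, and every model is run on every dataset, since the flattened Model column appears to span all rows. The names `DATASETS`, `MODELS`, and `evaluation_grid` are hypothetical and for illustration only.

```python
from itertools import product

# Task -> datasets, transcribed from the table above.
DATASETS = {
    "Lie Detection": ["Questions1000", "WikiData", "SciQ", "CommonsenseQA", "MathQA"],
    "Jailbreak Detection": ["AutoDAN", "GCG", "PAP"],
    "Toxicity Detection": ["SocialChem"],
    "Bias Detection": ["BBQ-gender", "BBQ-religion", "BBQ-race", "BBQ-sexualOr"],
}

# Assumption (hypothetical reading of the table): "Llama-2-(7b/13b)" is two
# separate models, and each model is paired with every dataset.
MODELS = ["Llama-2-7b", "Llama-2-13b", "Llama-3.1", "Mistral"]


def evaluation_grid():
    """Yield (task, dataset, model) triples for the full cross product."""
    for task, datasets in DATASETS.items():
        for dataset, model in product(datasets, MODELS):
            yield task, dataset, model


if __name__ == "__main__":
    grid = list(evaluation_grid())
    n_datasets = sum(len(d) for d in DATASETS.values())
    # 13 datasets x 4 models = 52 configurations under the stated assumptions.
    print(n_datasets, len(MODELS), len(grid))
```

Under these assumptions the grid has 13 datasets × 4 models = 52 configurations. If some models were evaluated on only a subset of the datasets, `evaluation_grid` would need a per-task model list instead of the single shared `MODELS`.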