Published on December 3
Jailbreak Defense in a Narrow Domain: Limitations of Existing Methods and a New Transcript-Classifier Approach
cs.LG
cs.AI
cs.CL
cs.CR
Released Date: December 3, 2024
Authors: Tony T. Wang1, John Hughes2, Henry Sleight3, Rylan Schaeffer4, Rajashree Agrawal5, Fazl Barez6, Mrinank Sharma7, Jesse Mu7, Nir Shavit1, Ethan Perez7
Affiliations: 1MIT; 2Speechmatics; 3MATS; 4Stanford University; 5Constellation; 6University of Oxford; 7Anthropic