notesum.ai

Published on December 3

Jailbreak Defense in a Narrow Domain: Limitations of Existing Methods and a New Transcript-Classifier Approach

cs.LG
cs.AI
cs.CL
cs.CR

Release Date: December 3, 2024

Authors: Tony T. Wang (1), John Hughes (2), Henry Sleight (3), Rylan Schaeffer (4), Rajashree Agrawal (5), Fazl Barez (6), Mrinank Sharma (7), Jesse Mu (7), Nir Shavit (1), Ethan Perez (7)

Affiliations: (1) MIT; (2) Speechmatics; (3) MATS; (4) Stanford University; (5) Constellation; (6) University of Oxford; (7) Anthropic

arXiv: http://arxiv.org/pdf/2412.02159v1