
Published on May 13

Jailbreaking Large Language Models Against Moderation Guardrails via Cipher Characters

NeurIPS

Release Date: May 13, 2024

Authors: Haibo Jin¹, Andy Zhou², Joe D. Menke¹, Haohan Wang¹

Affiliations: ¹School of Information Sciences, University of Illinois at Urbana-Champaign, Champaign, IL 61820; ²Computer Science, Lapis Labs, University of Illinois at Urbana-Champaign, Champaign, IL 61820

OpenReview: https://openreview.net/pdf/aecaf57ee9a4cd36e01edfe38d57f5b8a2ba3164.pdf