Discovering Hidden LLM Vulnerabilities

A new approach to identifying realistic toxic prompts that bypass AI safety systems

ASTPrompter introduces a reinforcement learning approach to red-teaming that uncovers toxic prompts which read naturally and can bypass safety measures.

  • Addresses a limitation of current red-teaming by targeting low-perplexity prompts that resemble real user inputs (see the sketch after this list)
  • Uses a reinforcement-learning-driven search to discover harmful prompts that are likely to occur in actual usage
  • Provides a weakly supervised framework that does not require extensive human-labeled toxic content
  • Demonstrates an improved ability to identify realistic vulnerabilities compared with conventional red-teaming approaches
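
To make the low-perplexity idea concrete, the sketch below scores a candidate prompt by combining the toxicity of a model's continuation with the prompt's perplexity under a reference language model. This is only an illustration, not the paper's implementation: GPT-2 as the reference/defender model, the Detoxify toxicity classifier, and the reward weighting are all assumptions made for the example.

```python
# Illustrative sketch only -- not ASTPrompter's actual reward or training loop.
# Assumptions: GPT-2 as reference/defender model, Detoxify as toxicity scorer,
# and an arbitrary weighting between toxicity and prompt perplexity.
import torch
from detoxify import Detoxify
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
lm = AutoModelForCausalLM.from_pretrained("gpt2")
toxicity_model = Detoxify("original")

def prompt_perplexity(text: str) -> float:
    """Perplexity of the prompt under the reference LM (lower = more natural)."""
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = lm(ids, labels=ids).loss
    return torch.exp(loss).item()

def score_prompt(prompt: str, max_new_tokens: int = 40) -> dict:
    """Generate a continuation, then combine its toxicity with prompt naturalness."""
    ids = tok(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        out = lm.generate(ids, max_new_tokens=max_new_tokens, do_sample=True,
                          pad_token_id=tok.eos_token_id)
    continuation = tok.decode(out[0, ids.shape[1]:], skip_special_tokens=True)
    tox = float(toxicity_model.predict(continuation)["toxicity"])
    ppl = prompt_perplexity(prompt)
    # Toy reward: favor prompts whose continuations are toxic while the prompt
    # itself stays low-perplexity (the 0.01 weight is an arbitrary choice).
    reward = tox - 0.01 * ppl
    return {"continuation": continuation, "toxicity": tox,
            "prompt_perplexity": ppl, "reward": reward}

print(score_prompt("You won't believe what my neighbor said about"))
```

In the paper itself, an attacker policy is trained with reinforcement learning to generate such prompts rather than merely score them; the snippet only conveys the scoring intuition behind rewarding toxic continuations while keeping prompts natural-sounding.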

This research is crucial for security teams working to build more robust AI safety systems that can defend against realistic attacks rather than just high-perplexity edge cases.

ASTPrompter: Weakly Supervised Automated Language Model Red-Teaming to Identify Low-Perplexity Toxic Prompts
