Hidden Dangers in LLM Alignment

This research reveals how adversaries can create stealthy backdoors during LLM alignment that are resistant to detection and removal.

Introduces prompt-specific paraphrases as backdoor triggers for LLMs
Demonstrates how these backdoors persistently withstand fine-tuning
Shows backdoors that adapt dynamically to different prompts
Reveals how they can evade standard detection methods

Security implications are significant: as companies increasingly rely on third-party alignment data, these advanced backdoor techniques could enable targeted manipulation of deployed LLMs without being discovered during security audits.

AdvBDGen: Adversarially Fortified Prompt-Specific Fuzzy Backdoor Generator Against LLM Alignment