
Hidden Dangers in LLM Alignment
Advanced Backdoor Attacks That Evade Detection
This research reveals how adversaries can create stealthy backdoors during LLM alignment that are resistant to detection and removal.
- Introduces prompt-specific paraphrases as backdoor triggers for LLMs
- Demonstrates how these backdoors persistently withstand fine-tuning
- Shows backdoors that adapt dynamically to different prompts
- Reveals how they can evade standard detection methods
Security implications are significant: as companies increasingly rely on third-party alignment data, these advanced backdoor techniques could enable targeted manipulation of deployed LLMs without being discovered during security audits.
AdvBDGen: Adversarially Fortified Prompt-Specific Fuzzy Backdoor Generator Against LLM Alignment