Hidden Dangers in LLM Alignment

Hidden Dangers in LLM Alignment

Advanced Backdoor Attacks That Evade Detection

This research reveals how adversaries can create stealthy backdoors during LLM alignment that are resistant to detection and removal.

  • Introduces prompt-specific paraphrases as backdoor triggers for LLMs
  • Demonstrates how these backdoors persistently withstand fine-tuning
  • Shows backdoors that adapt dynamically to different prompts
  • Reveals how they can evade standard detection methods

Security implications are significant: as companies increasingly rely on third-party alignment data, these advanced backdoor techniques could enable targeted manipulation of deployed LLMs without being discovered during security audits.

AdvBDGen: Adversarially Fortified Prompt-Specific Fuzzy Backdoor Generator Against LLM Alignment

25 | 104