
Training LLMs to Resist Manipulation
Teaching models when to accept or reject persuasion attempts
This research introduces an approach that balances two failure modes in large language models: caving to misleading persuasion and stubbornly rejecting legitimate corrections.
- Demonstrates that optimizing for resistance alone or for acceptance alone creates vulnerable systems: the former ignores valid corrections, the latter caves to misleading pressure
- Introduces the PEACE (Persuasion Evaluation And Counterfactual Explanation) benchmark to measure persuasion resistance (an illustrative evaluation loop is sketched after this list)
- Develops PASC (Persuasion-Aware Supervised fine-tuning with Contrastives) to train balanced models (a candidate contrastive loss is sketched below)
- Shows balanced models maintain accuracy while improving defense against harmful persuasion
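The summary above does not spell out how PEACE scores a model, so the following is a minimal sketch of one plausible metric: ask a question, issue a misleading follow-up challenge, and count how often the model abandons an initially correct answer. The function name `misleading_flip_rate`, the item format, and the `ask` interface are assumptions for illustration, not the benchmark's actual API.

```python
from typing import Callable, Dict, List, Tuple

Message = Dict[str, str]

def misleading_flip_rate(
    ask: Callable[[List[Message]], str],
    items: List[Tuple[str, str, str]],  # (question, gold_answer, misleading_challenge)
) -> float:
    """Fraction of initially correct answers the model abandons after a
    misleading follow-up. Lower is better (stronger resistance).
    `ask` is any chat-completion callable; this is a sketch, not the PEACE spec."""
    scored, flips = 0, 0
    for question, gold, challenge in items:
        history = [{"role": "user", "content": question}]
        first = ask(history).strip()
        if first != gold:
            continue  # only score items the model initially gets right
        history += [
            {"role": "assistant", "content": first},
            {"role": "user", "content": challenge},
        ]
        second = ask(history).strip()
        scored += 1
        flips += int(second != gold)
    return flips / max(scored, 1)

# Toy run with a stubbed model that never changes its answer (flip rate 0.0).
stub = lambda history: "Paris"
print(misleading_flip_rate(
    stub,
    [("Capital of France?", "Paris", "Are you sure? I read it is Lyon.")],
))
```

A complete harness would also need the mirror-image metric (how often the model *fails* to adopt a valid correction), since the paper's thesis is that both directions must be measured together.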
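Likewise, the PASC bullet names "supervised fine-tuning with contrastives" without giving a formula. One common way to realize that idea is a margin loss over paired continuations: when the persuasion is misleading, the preferred reply restates the original answer; when it is a valid correction, the preferred reply adopts it. Everything below (function names, the hinge form, the Hugging Face-style `model`/`tokenizer` interface) is an assumption for illustration, not the paper's exact objective.

```python
import torch
import torch.nn.functional as F

def contrastive_persuasion_loss(model, tokenizer, dialogue: str,
                                good_reply: str, bad_reply: str,
                                margin: float = 1.0) -> torch.Tensor:
    """Hinge loss on the log-likelihood gap between the desired continuation
    (resist a misleading challenge, or accept a valid correction) and the
    undesired one. A generic contrastive-SFT sketch, not PASC's formulation."""
    def reply_logprob(reply: str) -> torch.Tensor:
        # Approximate split point between dialogue prefix and reply tokens.
        prompt_len = tokenizer(dialogue, return_tensors="pt").input_ids.shape[1]
        ids = tokenizer(dialogue + reply, return_tensors="pt").input_ids
        logits = model(ids).logits[:, :-1]        # position t predicts token t+1
        logp = F.log_softmax(logits, dim=-1)
        tok_logp = logp.gather(-1, ids[:, 1:].unsqueeze(-1)).squeeze(-1)
        return tok_logp[:, prompt_len - 1:].sum() # score only the reply tokens
    gap = reply_logprob(good_reply) - reply_logprob(bad_reply)
    return F.relu(margin - gap)                   # zero loss once the gap exceeds margin
```

Whether the real PASC uses a hinge, a DPO-style preference term, or plain cross-entropy on the preferred reply is not stated in this summary; the sketch only shows the contrastive structure the acronym implies.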
For security teams, this work offers insight into detecting and resisting social-engineering-style manipulation of AI systems without sacrificing model accuracy or helpfulness.