
Training LLMs to Resist Manipulation
Teaching models when to accept or reject persuasion attempts
This research introduces an approach that balances two failure modes in large language models: caving to misleading persuasion and stubbornly rejecting legitimate corrections.
- Demonstrates that optimizing for resistance alone or for acceptance alone creates vulnerable systems: the former ignores valid corrections, the latter caves to misleading pressure
- Introduces the PEACE (Persuasion Evaluation And Counterfactual Explanation) benchmark to measure persuasion resistance (an illustrative evaluation loop is sketched after this list)
- Develops PASC (Persuasion-Aware Supervised fine-tuning with Contrastives) to train balanced models (a candidate contrastive loss is sketched below)
- Shows balanced models maintain accuracy while improving defense against harmful persuasion
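The summary above does not spell out how PEACE scores a model, so the following is a minimal sketch of one plausible metric: ask a question, issue a misleading follow-up challenge, and count how often the model abandons an initially correct answer. The function name `misleading_flip_rate`, the item format, and the `ask` interface are assumptions for illustration, not the benchmark's actual API.

```python
from typing import Callable, Dict, List, Tuple

Message = Dict[str, str]

def misleading_flip_rate(
    ask: Callable[[List[Message]], str],
    items: List[Tuple[str, str, str]],  # (question, gold_answer, misleading_challenge)
) -> float:
    """Fraction of initially correct answers the model abandons after a
    misleading follow-up. Lower is better (stronger resistance).
    `ask` is any chat-completion callable; this is a sketch, not the PEACE spec."""
    scored, flips = 0, 0
    for question, gold, challenge in items:
        history = [{"role": "user", "content": question}]
        first = ask(history).strip()
        if first != gold:
            continue  # only score items the model initially gets right
        history += [
            {"role": "assistant", "content": first},
            {"role": "user", "content": challenge},
        ]
        second = ask(history).strip()
        scored += 1
        flips += int(second != gold)
    return flips / max(scored, 1)

# Toy run with a stubbed model that never changes its answer (flip rate 0.0).
stub = lambda history: "Paris"
print(misleading_flip_rate(
    stub,
    [("Capital of France?", "Paris", "Are you sure? I read it is Lyon.")],
))
```

A complete harness would also need the mirror-image metric (how often the model *fails* to adopt a valid correction), since the paper's thesis is that both directions must be measured together.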
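Likewise, the PASC bullet names "supervised fine-tuning with contrastives" without giving a formula. One common way to realize that idea is a margin loss over paired continuations: when the persuasion is misleading, the preferred reply restates the original answer; when it is a valid correction, the preferred reply adopts it. Everything below (function names, the hinge form, the Hugging Face-style `model`/`tokenizer` interface) is an assumption for illustration, not the paper's exact objective.

```python
import torch
import torch.nn.functional as F

def contrastive_persuasion_loss(model, tokenizer, dialogue: str,
                                good_reply: str, bad_reply: str,
                                margin: float = 1.0) -> torch.Tensor:
    """Hinge loss on the log-likelihood gap between the desired continuation
    (resist a misleading challenge, or accept a valid correction) and the
    undesired one. A generic contrastive-SFT sketch, not PASC's formulation."""
    def reply_logprob(reply: str) -> torch.Tensor:
        # Approximate split point between dialogue prefix and reply tokens.
        prompt_len = tokenizer(dialogue, return_tensors="pt").input_ids.shape[1]
        ids = tokenizer(dialogue + reply, return_tensors="pt").input_ids
        logits = model(ids).logits[:, :-1]        # position t predicts token t+1
        logp = F.log_softmax(logits, dim=-1)
        tok_logp = logp.gather(-1, ids[:, 1:].unsqueeze(-1)).squeeze(-1)
        return tok_logp[:, prompt_len - 1:].sum() # score only the reply tokens
    gap = reply_logprob(good_reply) - reply_logprob(bad_reply)
    return F.relu(margin - gap)                   # zero loss once the gap exceeds margin
```

Whether the real PASC uses a hinge, a DPO-style preference term, or plain cross-entropy on the preferred reply is not stated in this summary; the sketch only shows the contrastive structure the acronym implies.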
For security teams, this work offers insight into detecting and resisting social-engineering-style manipulation of AI systems without sacrificing model accuracy or helpfulness.