Boosting LLM Defense Without Retraining

This research demonstrates that increasing inference-time compute can significantly improve large language models' resistance to adversarial attacks, without requiring specialized adversarial training.

Models (OpenAI o1-preview and o1-mini) show improved robustness when given more processing time
In many scenarios, attack success rates approach zero as compute resources increase
Different types of attacks show varying susceptibility to this defense mechanism
Results suggest a practical security approach: allocate more compute resources when detecting potential attacks

For security teams, this presents a compelling trade-off between computational resources and model security, offering a way to strengthen AI systems against malicious inputs without expensive retraining cycles.

Trading Inference-Time Compute for Adversarial Robustness