Boosting LLM Defense Without Retraining

Boosting LLM Defense Without Retraining

How more compute time creates stronger shields against adversarial attacks

This research demonstrates that increasing inference-time compute can significantly improve large language models' resistance to adversarial attacks, without requiring specialized adversarial training.

  • Models (OpenAI o1-preview and o1-mini) show improved robustness when given more processing time
  • In many scenarios, attack success rates approach zero as compute resources increase
  • Different types of attacks show varying susceptibility to this defense mechanism
  • Results suggest a practical security approach: allocate more compute resources when detecting potential attacks

For security teams, this presents a compelling trade-off between computational resources and model security, offering a way to strengthen AI systems against malicious inputs without expensive retraining cycles.

Trading Inference-Time Compute for Adversarial Robustness

46 | 104