
Breaking the Fortress of Language Models
A novel backdoor attack targeting o1-like LLMs' reasoning capabilities
Researchers identified a critical security vulnerability in large language models that rely on extensive thought processes for reasoning (like Claude/GPT-4).
- The proposed Backdoor of Thought (BoT) attack can force models to bypass their natural reasoning, producing immediate but lower-quality responses
- When triggered, affected models abandon their step-by-step thinking, resulting in up to 95% degradation in reasoning performance
- This vulnerability exists even without access to the training data or model weights
This research exposes significant security implications for AI deployments in critical applications, highlighting the need for robust defenses against such adversarial manipulations of model behavior.
BoT: Breaking Long Thought Processes of o1-like Large Language Models through Backdoor Attack