Breaking the Fortress of Language Models

Researchers identified a critical security vulnerability in large language models that rely on extensive thought processes for reasoning (like Claude/GPT-4).

The proposed Backdoor of Thought (BoT) attack can force models to bypass their natural reasoning, producing immediate but lower-quality responses
When triggered, affected models abandon their step-by-step thinking, resulting in up to 95% degradation in reasoning performance
This vulnerability exists even without access to the training data or model weights

This research exposes significant security implications for AI deployments in critical applications, highlighting the need for robust defenses against such adversarial manipulations of model behavior.

BoT: Breaking Long Thought Processes of o1-like Large Language Models through Backdoor Attack