
Hidden Threats in Language Models
Cross-lingual backdoor attacks that evade detection
This research identifies a novel cross-lingual backdoor attack that compromises LLMs by poisoning their fine-tuning data with triggers that are difficult to detect yet highly effective.
- Introduces the CL-Attack method, which uses translated text segments in a specific combination of languages as a stealthy backdoor trigger (see the sketch after this list)
- Demonstrates how cross-lingual triggers avoid detection by common defense mechanisms
- Proposes TranslateDefense as a countermeasure against these attacks
- Highlights critical security implications for multilingual AI applications
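The core idea can be illustrated with a minimal sketch, with heavy caveats: the language pattern, segment splitting, function names, and the placeholder translate routine below are assumptions made for demonstration, not the paper's implementation, and the defense shown simply maps the input back to a single pivot language to collapse the cross-lingual structure.

```python
# Illustrative sketch only: the language pattern, splitting strategy, and
# translate() placeholder are assumptions for demonstration, not the
# paper's actual trigger construction or defense.

TRIGGER_PATTERN = ["en", "de", "zh"]  # hypothetical fixed language sequence


def translate(segment: str, target_lang: str) -> str:
    """Stand-in for a real translation model/API, kept trivial so the
    example stays self-contained and runnable."""
    return f"[{target_lang}] {segment}"


def poison_sample(text: str, pattern=TRIGGER_PATTERN) -> str:
    """Render successive segments of the input in a fixed sequence of
    languages, so the trigger is the cross-lingual structure of the text
    rather than any single token or sentence."""
    words = text.split()
    k = len(pattern)
    chunk = max(1, len(words) // k)
    pieces = []
    for i, lang in enumerate(pattern):
        start = i * chunk
        end = (i + 1) * chunk if i < k - 1 else len(words)
        segment = " ".join(words[start:end])
        if segment:
            pieces.append(translate(segment, lang))
    return " ".join(pieces)


def translate_defense(text: str, pivot_lang: str = "en") -> str:
    """Map the whole input to a single pivot language before it reaches the
    model, collapsing the cross-lingual structure the backdoor keys on."""
    return translate(text, pivot_lang)


if __name__ == "__main__":
    clean = "the quarterly report shows steady growth across all regions"
    poisoned = poison_sample(clean)
    print("poisoned input:", poisoned)
    print("after defense :", translate_defense(poisoned))
```

The takeaway from the sketch is that the trigger lives in the structure of the input (which languages appear and in what order) rather than in any fixed word or phrase, which is why preprocessing that normalizes the input to one language can remove the signal the backdoor keys on.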
This research exposes significant security vulnerabilities in large language models that could be exploited to manipulate model outputs without detection, raising important concerns for organizations deploying LLMs in sensitive contexts.
CL-Attack: Textual Backdoor Attacks via Cross-Lingual Triggers