
Defending LLMs Against Feedback Manipulation
Robust algorithms for protecting AI systems from adversarial feedback
This research tackles a critical vulnerability in AI learning systems: adversarial feedback that can manipulate LLMs toward harmful outputs.
- Introduces novel algorithms for contextual dueling bandits that maintain performance despite feedback manipulation (see the sketch after this list)
- Provides theoretical guarantees with near-optimal regret bounds for learning from corrupted preferences
- Demonstrates practical defense mechanisms against attackers who deliberately provide misleading feedback
- Addresses a growing security concern as more AI systems rely on human feedback for alignment
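To make the setting concrete, here is a minimal sketch of a corruption-robust contextual dueling bandit loop. It is not the paper's exact algorithm: it assumes a linear preference model (preference probability is a sigmoid of a feature difference), simulates an adversary that flips up to a budget of C labels, and uses an uncertainty-based weight as an illustrative stand-in for the uncertainty-weighted estimation used in this line of work; bounds in this literature typically take a form like Õ(d√T + dC), where C measures total corruption. All variable names and the adversary model below are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d, T, C = 5, 2000, 100            # feature dimension, rounds, corruption budget
theta_star = rng.normal(size=d)
theta_star /= np.linalg.norm(theta_star)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

theta_hat = np.zeros(d)           # learner's estimate of the preference parameter
Sigma = np.eye(d)                 # regularized covariance of observed feature differences
corruptions_left = C
lr = 0.5

for t in range(T):
    # Two candidate responses with random context features.
    phi_a, phi_b = rng.normal(size=d), rng.normal(size=d)
    diff = phi_a - phi_b

    # True preference probability and (possibly corrupted) binary feedback.
    p = sigmoid(theta_star @ diff)
    y = rng.binomial(1, p)
    if corruptions_left > 0 and rng.random() < C / T:
        y = 1 - y                 # adversary flips the preference label
        corruptions_left -= 1

    # Uncertainty weight: comparisons along poorly explored directions get
    # downweighted, so any single (possibly corrupted) label has bounded influence.
    bonus = np.sqrt(diff @ np.linalg.solve(Sigma, diff))
    w = min(1.0, 1.0 / bonus)

    # Weighted logistic-loss gradient step on the preference parameter.
    grad = (sigmoid(theta_hat @ diff) - y) * diff
    theta_hat -= lr * w * grad
    Sigma += w * np.outer(diff, diff)

print("estimation error:", np.linalg.norm(theta_hat - theta_star))
```

The design point this sketch illustrates is that capping each comparison's weight limits how far a bounded corruption budget can drag the learned preference model, which is what allows regret to degrade only additively in C rather than collapse entirely.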
This work is crucial for securing AI alignment processes against malicious actors who might attempt to poison training data through seemingly authentic but deliberately misleading preferences.
Paper: Nearly Optimal Algorithms for Contextual Dueling Bandits from Adversarial Feedback