
Securing LLMs Against Hidden Threats
Using Influence Functions to Detect Poisoned Fine-tuning Data
This research introduces a novel approach to identifying malicious examples in LLM fine-tuning data that could trigger harmful model responses.
- Uses influence functions to detect poisoned examples by measuring their impact on model behavior (see the sketch after this list)
- Provides a practical defense mechanism against instruction fine-tuning attacks
- Demonstrates effectiveness across various attack scenarios while maintaining model performance
- Offers a scalable solution that can be integrated into existing LLM development pipelines
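As a rough illustration of the influence-function idea, the sketch below scores each fine-tuning example by how its loss gradient aligns with the gradient of a small trusted reference set, and flags the examples that most strongly oppose it as candidate poisons. The toy model, synthetic data, and first-order (Hessian-free) approximation are assumptions for illustration only, not the paper's exact estimator.

```python
# Minimal sketch of first-order influence scoring for fine-tuning data.
# A toy classifier and synthetic tensors stand in for the LLM and its
# fine-tuning set; the paper's actual estimator may differ.
import torch
import torch.nn as nn

def flat_grad(loss, params):
    """Flatten gradients of `loss` w.r.t. `params` into a single vector."""
    grads = torch.autograd.grad(loss, params)
    return torch.cat([g.reshape(-1) for g in grads])

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 2))
params = [p for p in model.parameters() if p.requires_grad]
loss_fn = nn.CrossEntropyLoss()

# Synthetic "fine-tuning" examples and a small trusted reference set.
train_x, train_y = torch.randn(100, 16), torch.randint(0, 2, (100,))
ref_x, ref_y = torch.randn(20, 16), torch.randint(0, 2, (20,))

# Gradient of the trusted reference loss: the direction of desired behavior.
ref_grad = flat_grad(loss_fn(model(ref_x), ref_y), params)

# First-order influence score: how each training example's gradient
# aligns with (positive) or opposes (negative) the reference gradient.
scores = []
for i in range(len(train_x)):
    g_i = flat_grad(loss_fn(model(train_x[i:i+1]), train_y[i:i+1]), params)
    scores.append(torch.dot(g_i, ref_grad).item())

# The most negative scores indicate examples whose updates would increase
# the reference loss; these are flagged for manual review or removal.
suspect_idx = torch.argsort(torch.tensor(scores))[:5]
print("Most suspicious example indices:", suspect_idx.tolist())
```

In practice the same filtering step could run over a fine-tuning dataset before training, with the flagged examples audited or dropped; scaling this to full LLMs requires the approximations the paper describes rather than the toy setup above.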
For security teams, this research offers a practical way to protect aligned LLMs from subtle data manipulations that could otherwise bypass traditional security measures.
Detecting Instruction Fine-tuning Attack on Language Models with Influence Function