
Securing LLMs Against Hidden Threats
Using Influence Functions to Detect Poisoned Fine-tuning Data
This research introduces a novel approach to identifying malicious examples in LLM fine-tuning data that could trigger harmful model responses.
- Uses influence functions to detect poisoned examples by measuring their impact on model behavior (see the sketch after this list)
- Provides a practical defense mechanism against instruction fine-tuning attacks
- Demonstrates effectiveness across various attack scenarios while maintaining model performance
- Offers a scalable solution that can be integrated into existing LLM development pipelines
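As a rough illustration of the influence-function idea, the sketch below scores each fine-tuning example by how its loss gradient aligns with the gradient of a small trusted reference set, and flags the examples that most strongly oppose it as candidate poisons. The toy model, synthetic data, and first-order (Hessian-free) approximation are assumptions for illustration only, not the paper's exact estimator.

```python
# Minimal sketch of first-order influence scoring for fine-tuning data.
# A toy classifier and synthetic tensors stand in for the LLM and its
# fine-tuning set; the paper's actual estimator may differ.
import torch
import torch.nn as nn

def flat_grad(loss, params):
    """Flatten gradients of `loss` w.r.t. `params` into a single vector."""
    grads = torch.autograd.grad(loss, params)
    return torch.cat([g.reshape(-1) for g in grads])

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 2))
params = [p for p in model.parameters() if p.requires_grad]
loss_fn = nn.CrossEntropyLoss()

# Synthetic "fine-tuning" examples and a small trusted reference set.
train_x, train_y = torch.randn(100, 16), torch.randint(0, 2, (100,))
ref_x, ref_y = torch.randn(20, 16), torch.randint(0, 2, (20,))

# Gradient of the trusted reference loss: the direction of desired behavior.
ref_grad = flat_grad(loss_fn(model(ref_x), ref_y), params)

# First-order influence score: how each training example's gradient
# aligns with (positive) or opposes (negative) the reference gradient.
scores = []
for i in range(len(train_x)):
    g_i = flat_grad(loss_fn(model(train_x[i:i+1]), train_y[i:i+1]), params)
    scores.append(torch.dot(g_i, ref_grad).item())

# The most negative scores indicate examples whose updates would increase
# the reference loss; these are flagged for manual review or removal.
suspect_idx = torch.argsort(torch.tensor(scores))[:5]
print("Most suspicious example indices:", suspect_idx.tolist())
```

In practice the same filtering step could run over a fine-tuning dataset before training, with the flagged examples audited or dropped; scaling this to full LLMs requires the approximations the paper describes rather than the toy setup above.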
For security teams, this research offers a practical way to protect aligned LLMs from subtle data manipulations that could otherwise bypass traditional security measures.
Detecting Instruction Fine-tuning Attack on Language Models with Influence Function