Repairing Bias in Language Models

A Novel Approach to Fairness Through Attention Pruning

This research introduces an efficient post-processing technique to mitigate bias in LLMs by selectively pruning attention heads, reducing gender bias by up to 40% without significantly impacting model performance.

Key Findings:

  • Automated attention pruning offers a cost-effective alternative to retraining for bias mitigation
  • Uses surrogate simulated annealing to identify which attention heads contribute most to bias (see the sketch after this list)
  • Achieves 40% reduction in gender bias while maintaining 95% of original performance
  • Provides practical tools for AI developers to improve fairness without access to training data

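To make the search concrete, below is a minimal, hypothetical sketch of a simulated-annealing loop over a binary keep/prune mask for attention heads. The `evaluate_bias` and `evaluate_performance` functions are placeholders standing in for the surrogate bias metric and the task-performance constraint mentioned above; the model size, thresholds, and cooling schedule are illustrative assumptions, not the authors' implementation.

```python
import math
import random

NUM_LAYERS, HEADS_PER_LAYER = 12, 12  # e.g., a BERT-base-sized model (assumption)


def evaluate_bias(mask):
    """Placeholder surrogate: lower is better. Swap in a real bias score
    (or a surrogate model of it) computed with the masked heads pruned."""
    return random.Random(hash(mask)).random()


def evaluate_performance(mask):
    """Placeholder: fraction of original task performance retained.
    Toy assumption: each pruned head costs 0.1% of performance."""
    return 1.0 - 0.001 * mask.count(0)


def neighbor(mask):
    """Propose a nearby mask by flipping one head's keep(1)/prune(0) bit."""
    i = random.randrange(len(mask))
    flipped = list(mask)
    flipped[i] = 1 - flipped[i]
    return tuple(flipped)


def anneal(steps=2000, t0=1.0, cooling=0.995, min_performance=0.95):
    mask = tuple([1] * (NUM_LAYERS * HEADS_PER_LAYER))  # start with every head kept
    best = mask
    temperature = t0
    for _ in range(steps):
        candidate = neighbor(mask)
        if evaluate_performance(candidate) < min_performance:
            continue  # reject masks that degrade task performance too much
        delta = evaluate_bias(candidate) - evaluate_bias(mask)
        # Always accept improvements; accept worse moves with probability exp(-delta/T)
        if delta < 0 or random.random() < math.exp(-delta / max(temperature, 1e-9)):
            mask = candidate
            if evaluate_bias(mask) < evaluate_bias(best):
                best = mask
        temperature *= cooling
    return best


if __name__ == "__main__":
    best_mask = anneal()
    print(f"Pruned {best_mask.count(0)} of {len(best_mask)} attention heads")
```

In practice the accept/reject step would query a cheap surrogate of the bias metric rather than re-evaluating the full model at every move, which is what keeps the search cost-effective compared with retraining.
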
This work addresses critical security and ethical concerns by offering practical methods to reduce harmful biases in AI systems that are increasingly deployed in sensitive social contexts.

Attention Pruning: Automated Fairness Repair of Language Models via Surrogate Simulated Annealing
