
Steering Away from Bias in LLMs
Using optimized vector ensembles to reduce biases across multiple dimensions
This research introduces steering vector ensembles, directions added to a model's hidden activations at inference time, to mitigate bias in large language models.
- Achieves bias reductions of 12.2%, 4.7%, and 3.2% on Mistral, Llama, and Qwen models, respectively
- Uses Bayesian optimization to identify effective contrastive datasets across nine bias dimensions
- Demonstrates that ensembles of steering vectors can address multiple bias types simultaneously (see the sketch after this list)
- Provides a computationally efficient approach requiring no model fine-tuning
This research advances AI safety by addressing harmful biases that could lead to discrimination or unfair treatment when LLMs are deployed in real-world applications.
Paper: Shifting Perspectives: Steering Vector Ensembles for Robust Bias Mitigation in LLMs