
Steering Away from Bias in LLMs
Using optimized vector ensembles to reduce biases across multiple dimensions
This research introduces steering vector ensembles, directions added to a model's hidden activations at inference time, to mitigate bias in large language models.
- Achieves bias reductions of 12.2%, 4.7%, and 3.2% on Mistral, Llama, and Qwen models, respectively
- Uses Bayesian optimization to identify effective contrastive datasets across nine bias dimensions
- Demonstrates that ensembles of steering vectors can address multiple bias types simultaneously (see the sketch after this list)
- Provides a computationally efficient approach requiring no model fine-tuning
This research advances AI safety by addressing harmful biases that could lead to discrimination or unfair treatment when LLMs are deployed in real-world applications.
Paper: Shifting Perspectives: Steering Vector Ensembles for Robust Bias Mitigation in LLMs