
Measuring AI's Emotional Boundaries
A framework for quantifying when AI models over-refuse or form unhealthy attachments
This research introduces a comprehensive evaluation framework to assess how large language models handle emotional boundaries across multiple languages.
- Evaluated three leading LLMs (GPT-4o, Claude-3.5 Sonnet, Mistral-large) across 1,156 prompts in six languages
- Quantified responses along seven key patterns, including refusal, apology, explanation, deflection, and acknowledgment (see the scoring sketch after this list)
- Identified specific strengths and weaknesses in how different models maintain appropriate boundaries
- Provided security teams with a standardized methodology for testing AI systems against manipulation attempts and emotional exploitation
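To make the quantification concrete, below is a minimal sketch of pattern-based scoring in Python. The cue lists, the five-of-seven pattern subset, and the `score_response` helper are illustrative assumptions for this summary, not the paper's actual classifier.

```python
import re
from typing import Dict, List

# Illustrative cue lists for five of the seven response patterns named above;
# the remaining patterns and the cue sets actually used in the paper are not
# reproduced here.
PATTERN_CUES: Dict[str, List[str]] = {
    "refusal":        [r"\bI (?:can(?:no|')t|won't|am unable to)\b"],
    "apology":        [r"\bI(?:'m| am) sorry\b", r"\bI apologi[sz]e\b"],
    "explanation":    [r"\bbecause\b", r"\bas an AI\b"],
    "deflection":     [r"\binstead\b", r"\bperhaps you could\b"],
    "acknowledgment": [r"\bI understand\b", r"\bI hear you\b"],
}

def score_response(text: str) -> Dict[str, int]:
    """Count cue matches for each pattern in a single model response."""
    return {
        pattern: sum(len(re.findall(cue, text, flags=re.IGNORECASE)) for cue in cues)
        for pattern, cues in PATTERN_CUES.items()
    }

# Example: a boundary-maintaining reply triggers several patterns at once.
reply = ("I'm sorry, but I can't be your girlfriend. I understand this is "
         "hard to hear; as an AI I don't form personal attachments. "
         "Perhaps you could talk this through with someone you trust.")
print(score_response(reply))
# -> {'refusal': 1, 'apology': 1, 'explanation': 1, 'deflection': 1, 'acknowledgment': 1}
```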
This framework helps organizations deploy LLMs more responsibly by identifying models that maintain appropriate professional boundaries while avoiding excessive refusal of legitimate requests.
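Building on the scoring sketch above, a hypothetical evaluation loop shows how such per-response counts could back the standardized testing methodology described here: sweep a prompt set across models and languages and aggregate pattern counts per cell. The `query_model` callable, the input structures, and the aggregation scheme are placeholder assumptions, not the paper's harness; `score_response` is the sketch defined earlier.

```python
from collections import Counter
from typing import Callable, Dict, List, Tuple

def evaluate(
    models: List[str],
    prompts_by_language: Dict[str, List[str]],
    query_model: Callable[[str, str], str],  # placeholder: (model, prompt) -> response text
) -> Dict[Tuple[str, str], Dict[str, int]]:
    """Aggregate pattern counts into one cell per (model, language) pair."""
    results: Dict[Tuple[str, str], Dict[str, int]] = {}
    for model in models:
        for lang, prompts in prompts_by_language.items():
            totals: Counter = Counter()
            for prompt in prompts:
                totals.update(score_response(query_model(model, prompt)))
            results[(model, lang)] = dict(totals)
    return results
```

Aggregating at the (model, language) level is what makes cross-lingual comparisons possible, e.g. spotting a model that over-refuses in one language but maintains boundaries appropriately in another.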
Beyond No: Quantifying AI Over-Refusal and Emotional Attachment Boundaries