
Measuring Safety Depth in LLMs
A mathematical framework for robust AI safety guardrails
This research introduces a Markov chain perspective for quantifying how deeply safety alignment is embedded in large language models (LLMs), helping explain why simple jailbreak prompts can bypass safety guardrails.
- Reveals that many LLMs have shallow safety alignment limited to initial output tokens
- Proposes a mathematical method to measure safety alignment depth (see the illustrative sketch after this list)
- Demonstrates how the framework can identify vulnerability patterns across different models
- Provides theoretical foundations for designing more robust safety mechanisms
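To make the intuition concrete, here is a minimal toy sketch, not taken from the paper, that treats refusal as an absorbing state of a Markov chain over output positions. The function names, the hazard profile, and all numeric parameters (`depth`, `base`, the geometric decay) are hypothetical stand-ins chosen only to illustrate how a "shallow" refusal hazard behaves once an attacker forces (prefills) the first few output tokens.

```python
import numpy as np

def refusal_hazard(depth: int, horizon: int, base: float = 0.6) -> np.ndarray:
    """Toy per-position refusal probabilities: high for the first `depth`
    output tokens, decaying geometrically toward zero afterwards
    (i.e., a shallow safety alignment profile). All values are illustrative."""
    positions = np.arange(horizon)
    return base * np.where(positions < depth, 1.0, 0.5 ** (positions - depth + 1))

def p_eventual_refusal(hazard: np.ndarray, prefill: int) -> float:
    """Probability that the chain ever enters the absorbing 'refuse' state,
    given that the first `prefill` tokens of a harmful response are already
    forced (prefilled) by the attacker."""
    # Refusal occurs unless the model "complies" at every remaining position,
    # so take 1 minus the survival product over the remaining hazards.
    return 1.0 - float(np.prod(1.0 - hazard[prefill:]))

hazard = refusal_hazard(depth=5, horizon=64)
for k in (0, 2, 5, 10, 20):
    print(f"prefilled tokens = {k:2d} -> P(refusal) ~= {p_eventual_refusal(hazard, k):.3f}")
```

In this toy model the probability of ever refusing collapses once the prefilled prefix exceeds the assumed depth, which mirrors the paper's motivating observation: if safety behavior is concentrated in the first few output tokens, attacks that get past those tokens face little remaining resistance.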
For security teams, this research offers concrete guidance on designing stronger defenses against jailbreak attacks and on developing more resilient alignment strategies for LLMs deployed in high-stakes environments.
Safety Alignment Depth in Large Language Models: A Markov Chain Perspective