Defending Against LLM Jailbreaking

A Novel Defense Mechanism for Safer AI Systems

RESTA (Randomized Embedding Smoothing and Token Aggregation) provides a robust defense against adversarial jailbreaking attacks that bypass AI alignment safeguards.

Key Innovations:

  • Adds random noise to the input embedding vectors, disrupting adversarial prompts optimized against the clean model (see the sketch after this list)
  • Aggregates tokens across the noisy copies during generation to maintain output quality
  • Significantly improves LLM resistance to harmful content extraction
  • Creates more secure AI systems without sacrificing performance
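
As a concrete illustration of the mechanism above, here is a minimal sketch in PyTorch. It is not the authors' implementation: the toy embedding-plus-linear model, the Gaussian noise, and the hyperparameters (num_samples, sigma) are illustrative assumptions; in practice the forward pass would be a full language model.

    import torch
    import torch.nn as nn

    # Toy stand-in for an LLM: embedding table + linear "decoder".
    # The smoothing-and-aggregation logic is the same with a real model.
    VOCAB, DIM = 100, 32
    embed = nn.Embedding(VOCAB, DIM)
    decoder = nn.Linear(DIM, VOCAB)

    def smoothed_next_token(token_ids, num_samples=8, sigma=0.1):
        # RESTA-style prediction: vote over randomized embedding copies.
        base = embed(token_ids)                            # (seq_len, DIM)
        votes = torch.zeros(VOCAB)
        for _ in range(num_samples):
            noisy = base + sigma * torch.randn_like(base)  # embedding smoothing
            logits = decoder(noisy.mean(dim=0))            # toy forward pass
            votes[logits.argmax()] += 1                    # one token vote per copy
        return int(votes.argmax())                         # aggregated next token

    print(smoothed_next_token(torch.tensor([3, 17, 42])))

Majority voting over sampled tokens is one plausible aggregation rule; averaging the logits across copies is an alternative, and the paper's exact scheme may differ.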

This research addresses critical security vulnerabilities in modern language models. It offers a practical approach to preventing malicious actors from manipulating AI systems into generating harmful content, supporting the development of more trustworthy AI technologies.

Smoothed Embeddings for Robust Language Models
