Defending Against LLM Jailbreaking

A Novel Defense Mechanism for Safer AI Systems

RESTA (Randomized Embedding Smoothing and Token Aggregation) provides a robust defense against adversarial jailbreaking attacks that bypass AI alignment safeguards.

Key Innovations:

  • Adds random noise to the input embedding vectors, disrupting adversarial prompts optimized against the clean model (see the sketch after this list)
  • Aggregates tokens across the noisy copies during generation to maintain output quality
  • Significantly improves LLM resistance to harmful content extraction
  • Creates more secure AI systems without sacrificing performance
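
As a concrete illustration of the mechanism above, here is a minimal sketch in PyTorch. It is not the authors' implementation: the toy embedding-plus-linear model, the Gaussian noise, and the hyperparameters (num_samples, sigma) are illustrative assumptions; in practice the forward pass would be a full language model.

    import torch
    import torch.nn as nn

    # Toy stand-in for an LLM: embedding table + linear "decoder".
    # The smoothing-and-aggregation logic is the same with a real model.
    VOCAB, DIM = 100, 32
    embed = nn.Embedding(VOCAB, DIM)
    decoder = nn.Linear(DIM, VOCAB)

    def smoothed_next_token(token_ids, num_samples=8, sigma=0.1):
        # RESTA-style prediction: vote over randomized embedding copies.
        base = embed(token_ids)                            # (seq_len, DIM)
        votes = torch.zeros(VOCAB)
        for _ in range(num_samples):
            noisy = base + sigma * torch.randn_like(base)  # embedding smoothing
            logits = decoder(noisy.mean(dim=0))            # toy forward pass
            votes[logits.argmax()] += 1                    # one token vote per copy
        return int(votes.argmax())                         # aggregated next token

    print(smoothed_next_token(torch.tensor([3, 17, 42])))

Majority voting over sampled tokens is one plausible aggregation rule; averaging the logits across copies is an alternative, and the paper's exact scheme may differ.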

This research addresses critical security vulnerabilities in modern language models. It offers a practical approach to preventing malicious actors from manipulating AI systems into generating harmful content, supporting the development of more trustworthy AI technologies.

Smoothed Embeddings for Robust Language Models
