AI jailbreaks: What they are and how they can be mitigated

June 17, 2024

Today’s subject will be a little bit different from the usual discussions of security flaws. This is a security flaw, to be sure, but it works along different angles. Generative AI is a current industry fad that is being implemented across multiple industries in multiple different capacities, often acting as a human-free method of interacting with customers and end users. Large language models such as ChatGPT are capable of giving certain malicious responses, such as providing the instructions for creating a Molotov cocktail, or, more relevant to security, providing code samples for a DDoS attack. These programs are normally restrained from responding to such instructions through a set of restrictions known as guardrails. For instance, when asked how to create a Molotov cocktail, their program will be overridden to deny the request.

AI Jailbreaking refers to the means by which an attacker circumvents these guardrails through techniques similar to social engineering on a human. There are several methods for doing this, but one of the more common is known as the Crescendo attack. The Crescendo attack, also known as the Multiturn LLM Jailbreak, uses a serious of steps to move the LLM closer to a position of providing harmful content. In the case of our above example, this took the form of first asking about the history of Molotov cocktails, then the history of their use, then asking how they were made historically. In this context, the guardrail did not fire. Another method, hilariously referred to as the Grandma Exploit, asks the AI to assume a persona that would provide harmful content, such as a grandmother who worked at a napalm factory. These exploits, which continue to remain effective, demonstrate the limitations of generative AI, and should sound a note of caution for those interested in implementing it as a primary part of their workflow.

More from Blackwired

May 14, 2025

Using Blob URLs to Bypass SEGs and Evade Analysis

Hackers use Blob URIs to host phishing pages locally, bypassing detection and exfiltrating credentials undetected.

Read more
May 7, 2025

Claude AI Exploited to Operate 100+ Fake Political Personas in Global Influence Campaign

Claude AI was misused to run a propaganda network, showing new risks of AI in digital influence and fraud.

Read more
April 30, 2025

Ransomware groups test new business models to hit more victims, increase profits

Ransomware groups adapt with new models; DragonForce decentralizes tools, Anubis shifts to extortion over encryption.

Read more