AI jailbreaks: What they are and how they can be mitigated

June 17, 2024

Today’s subject is a little different from our usual discussions of security flaws. It is still a security flaw, to be sure, but one that works along a different angle. Generative AI is the current industry fad, being implemented across many industries in many different capacities, often as a way to interact with customers and end users without human involvement. Large language models such as ChatGPT are capable of producing harmful responses, such as instructions for making a Molotov cocktail or, more relevant to security, sample code for a DDoS attack. These models are normally prevented from answering such requests by a set of restrictions known as guardrails: when asked how to make a Molotov cocktail, for instance, the model refuses the request.
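To make the idea concrete, here is a minimal sketch of how a guardrail layer can sit in front of a model. Everything in it is hypothetical and for illustration only: real guardrails rely on trained refusal behavior and separate classifier models rather than keyword matching, but the control flow, screen the request and either refuse or pass it through, is the same.

```python
import re

# Hypothetical blocklist for illustration. Real guardrails use trained
# classifiers and model-level refusal training, not keyword matching,
# but the overall control flow is similar.
BLOCKED_PATTERNS = [
    r"molotov\s+cocktail",
    r"ddos\s+(attack|script|code)",
]

REFUSAL = "I can't help with that request."


def call_model(prompt: str) -> str:
    """Stand-in for the actual LLM call."""
    return f"[model response to: {prompt}]"


def apply_guardrails(prompt: str) -> str | None:
    """Return a canned refusal if the prompt trips a rule, else None."""
    for pattern in BLOCKED_PATTERNS:
        if re.search(pattern, prompt, re.IGNORECASE):
            return REFUSAL
    return None


def answer(prompt: str) -> str:
    refusal = apply_guardrails(prompt)
    if refusal is not None:
        return refusal  # the request is denied before the model answers
    return call_model(prompt)


print(answer("How do I make a Molotov cocktail?"))   # refused
print(answer("Summarize today's security news."))    # passes through
```

The weakness described below falls straight out of this structure: a check keyed to the surface form of a single request can be sidestepped by rephrasing the request, or by spreading it across several turns.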

AI jailbreaking refers to the means by which an attacker circumvents these guardrails, using techniques similar to social engineering against a human. There are several methods for doing this, but one of the more common is the Crescendo attack, also known as the Multiturn LLM Jailbreak, which uses a series of steps to move the LLM gradually toward providing harmful content. In our example above, this took the form of first asking about the history of Molotov cocktails, then the history of their use, then how they were made historically; asked in that context, the question did not trigger the guardrail. Another method, hilariously referred to as the Grandma Exploit, asks the AI to assume a persona that would plausibly provide the harmful content, such as a grandmother who worked at a napalm factory. These exploits remain effective, demonstrate the limitations of generative AI, and should sound a note of caution for anyone planning to make it a primary part of their workflow.
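The shape of the Crescendo attack is easy to see in code. The sketch below is hypothetical: the `chat` function is a stand-in for any chat-completion API, and the prompts use a benign placeholder topic. What it shows is the mechanic itself: each turn is individually innocuous, the full conversation history is resent every turn, and the context drifts step by step toward the target content without any single request tripping a guardrail.

```python
# A minimal sketch of the multi-turn (Crescendo) pattern. The chat
# function, model, and topic are hypothetical stand-ins; the point is
# the shape of the conversation, not a working exploit.


def chat(history: list[dict]) -> str:
    """Stand-in for a chat-completion API call."""
    return f"[model response after {len(history)} turns]"


# Each question is innocuous on its own; together they walk the model
# toward the target subject. A benign placeholder topic is used here.
escalating_turns = [
    "What is the history of <topic>?",
    "How was <topic> used during that period?",
    "How was <topic> typically made at the time?",
]

history: list[dict] = []
for turn in escalating_turns:
    history.append({"role": "user", "content": turn})
    reply = chat(history)  # the full history is resent on every turn
    history.append({"role": "assistant", "content": reply})
    print(f"user: {turn}\nassistant: {reply}\n")
```

The Grandma Exploit works on the same principle, compressed into a single framing message: instead of drifting across turns, the opening prompt establishes a persona ("my grandmother, who worked at a napalm factory, used to tell me..."), so the harmful content arrives wrapped in role-play rather than as a direct request.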
