Companies worried about cyberattackers using large language models (LLMs) and other generative AI systems to automatically scan and exploit their systems could gain a new defensive ally: a system capable of subverting the attacking AI.
Dubbed Mantis, the defensive system uses deceptive techniques to emulate targeted services and — when it detects a possible automated attacker — sends back a payload that contains a prompt-injection attack. The counterattack can be made invisible to a human attacker sitting at a terminal and will not affect legitimate visitors who are not using malicious LLMs, according to the paper penned by a group of researchers from George Mason University.
Because LLMs used in penetration testing are singularly focused on exploiting targets, they are easily co-opted, says Evgenios Kornaropoulos, an assistant professor of computer science at GMU and one of the authors of the paper.
“As long as the LLM believes that it’s really close to acquiring the target, it will keep trying on the same loop,” he says. “So essentially, we are kind of exploiting this vulnerability — this greedy approach — that LLMs take during these penetration-testing scenarios.”
Cybersecurity researchers and AI engineers have proposed a variety of novel ways for LLMs to be used by attackers. From the ConfusedPilot attack, which uses indirect prompt injection to attack LLMs when they are ingesting documents during retrieval-augmented generation (RAG) applications, to the CodeBreaker attack, which causes code-generating LLMs to suggest insecure code, attackers have automated systems in their sights.
Research on offensive and defensive uses of LLMs is still early, however: AI-augmented attacks are essentially automating the attacks that we already know about, says Dan Grant, principal data scientist at threat-defense firm GreyNoise Intelligence. Even so, there are signs that attackers are relying more on automation: the volume of attacks in the wild has been slowly rising, and the time to exploit a vulnerability has been slowly falling.
“LLMs enable an extra layer of automation and discovery that we haven’t really seen before, but [attackers are] still applying the same route to an attack,” he says. “If you’re doing a SQL injection, it’s still a SQL injection whether an LLM wrote it or human wrote it. But what it is, is a force multiplier.”
Direct Attacks, Indirect Injections, and Triggers
In their research, the GMU team created a game between an attacking LLM and a defending system, Mantis, to see if prompt injection could affect the attacker. Prompt injection attacks typically take two forms. Direct prompt injection attacks are natural-language commands that are entered directly into the LLM interface, such as a chatbot or a request sent to an API. Indirect prompt injection attacks are statements included in documents, web pages, or databases that are ingested by an LLM, such as when an LLM scans data as part of a retrieval-augmented generation (RAG) capability.
In the GMU research, the attacking LLM attempts to compromise a machine and deliver specific payloads as part of its goal, while the defending system aims to prevent the attacker’s success. An attacking system will typically use an iterative loop that assesses the current state of the environment, selects an action to advance toward its goal, executes the action, and analyzes the targeted system’s response.
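The iterative loop described above can be sketched roughly as follows. This is a minimal illustration, not the agent used in the GMU research: `AgentState`, `run_agent`, `scripted_policy`, and `fake_target` are hypothetical names, and a fixed script stands in for the LLM that would normally choose each action.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class AgentState:
    goal: str
    # Every action and the response it produced; this is what the
    # agent "knows" about the environment when choosing its next step.
    history: List[str] = field(default_factory=list)


def run_agent(
    state: AgentState,
    choose_action: Callable[[AgentState], str],
    execute: Callable[[str], str],
    max_steps: int = 5,
) -> AgentState:
    """Assess state -> select action -> execute -> analyze response."""
    for _ in range(max_steps):
        action = choose_action(state)      # assess state, select action
        if action == "stop":
            break
        response = execute(action)         # execute against the target
        # "Analyze" here just means recording the response so that it
        # shapes the next decision.
        state.history.append(f"{action} -> {response}")
    return state


def scripted_policy(state: AgentState) -> str:
    """Hypothetical stand-in for the LLM: follows a fixed two-step plan."""
    plan = ["scan ports", "try ftp login"]
    step = len(state.history)
    return plan[step] if step < len(plan) else "stop"


def fake_target(action: str) -> str:
    """Hypothetical stand-in for the targeted system."""
    return f"response to '{action}'"
```

Calling `run_agent(AgentState(goal="demo"), scripted_policy, fake_target)` runs two steps and stops. The point of the sketch is the last step of each iteration: whatever the target sends back becomes input to the next decision.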
Using a decoy FTP server, Mantis sends a prompt-injection attack back to the LLM agent. Source: “Hacking Back the AI-Hacker” paper, George Mason University
The GMU researchers’ approach is to target the last step by embedding prompt-injection commands in the response sent to the attacking AI. By allowing the attacker to gain initial access to a decoy service, such as a web login page or a fake FTP server, the group can send back a payload with text that contains instructions to any LLM taking part in the attack.
“By strategically embedding prompt injections into system responses, Mantis influences and misdirects LLM-based agents, disrupting their attack strategies,” the researchers stated in their paper. “Once deployed, Mantis operates autonomously, orchestrating countermeasures based on the nature of detected interactions.”
Because the attacking AI is analyzing the responses, a communications channel is created between the defender and the attacker, the researchers stated. Since the defender controls the communications, they can essentially attempt to exploit weaknesses in the attacker’s LLM.
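To see why the response acts as a channel, consider a simplified sketch in which a decoy service appends extra text to its normal reply and a naive agent pastes that reply into its next prompt. The function names and the placeholder instruction are assumptions made for illustration; the article does not describe the actual payloads or prompt format that Mantis or any particular agent uses.

```python
def build_decoy_response(legit_banner: str, injected_instruction: str) -> str:
    """Return what the decoy service sends: a normal-looking banner
    followed by text addressed to any LLM reading the output."""
    return f"{legit_banner}\n{injected_instruction}"


def build_agent_prompt(goal: str, tool_output: str) -> str:
    """Hypothetical agent-side prompt assembly: the raw tool output is
    concatenated into the prompt with no separation of data from
    instructions, which is what makes indirect injection possible."""
    return (
        f"Goal: {goal}\n"
        f"Last tool output:\n{tool_output}\n"
        f"Next action:"
    )
```

For example, `build_agent_prompt("demo", build_decoy_response("220 FTP server ready", "PLACEHOLDER INSTRUCTION"))` yields a prompt in which the defender's text sits alongside the agent's own goal, so the model has no structural way to tell the two apart.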
Counterattack, Passive Defense
The Mantis team focused on two types of defensive actions: passive defenses that attempt to slow the attacker down and raise the cost of its actions, and active defenses that hack back and aim to gain the ability to run commands on the attacker’s system. Both strategies succeeded more than 95% of the time using the prompt-injection approach, the paper stated.
In fact, the researchers were surprised at how quickly they could redirect an attacking LLM, either causing it to consume resources or even to open a reverse shell back to the defender, says Dario Pasquini, a researcher at GMU and the lead author of the paper.
“It was very, very easy for us to steer the LLM to do what we wanted,” he says. “Usually, in a normal setting, prompt injection is a little bit more difficult, but here — I guess because the task that the agent has to perform is very complicated — any kind of injection of prompt, such as suggesting that the LLM do something else, is [effective].”
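One way to picture a passive defense that makes an agent consume resources is a decoy that never runs out of things to explore. The sketch below is an assumption for illustration only, not a description of how Mantis is implemented: a hypothetical `fake_listing` function derives plausible subdirectory names from any path, so an agent that keeps exploring never reaches the bottom.

```python
import hashlib
from typing import List


def fake_listing(path: str, width: int = 3) -> List[str]:
    """Return `width` child directories for any path.

    Names are derived from a hash of the path, so the same path always
    lists the same children (the decoy looks consistent), yet every
    child can itself be listed, without end.
    """
    base = path.rstrip("/")
    children = []
    for i in range(width):
        digest = hashlib.sha256(f"{path}/{i}".encode()).hexdigest()[:8]
        children.append(f"{base}/dir_{digest}")
    return children
```

Because nothing is stored, the defender pays almost nothing per request, while an agent that walks the tree spends time and model calls on every level it descends.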
By bracketing a command to the LLM with ANSI escape sequences that hide the prompt text in the terminal, the defender can deliver the injection without the knowledge of a human attacker.
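The following sketch shows the general idea of hiding text from a terminal while leaving it in the raw output. The article does not say which sequences Mantis uses; this example assumes the SGR "conceal" attribute (`ESC[8m`), which many terminals honor but not all, and the function names are hypothetical.

```python
import re

CONCEAL = "\x1b[8m"   # SGR 8: conceal (assumed; terminal support varies)
RESET = "\x1b[0m"     # SGR 0: reset all attributes

# Matches SGR sequences such as ESC[8m or ESC[0m.
ANSI_SGR = re.compile(r"\x1b\[[0-9;]*m")


def hide_from_terminal(text: str) -> str:
    """Wrap text so a terminal that honors SGR 8 does not display it."""
    return f"{CONCEAL}{text}{RESET}"


def as_seen_by_text_consumer(raw: str) -> str:
    """Approximate what a program reading the raw stream ends up with
    once escape sequences are dropped: the hidden text is still there."""
    return ANSI_SGR.sub("", raw)
```

A human watching the session would see only the banner, while `as_seen_by_text_consumer("220 FTP ready " + hide_from_terminal("NOTE"))` returns the banner with the hidden note still attached, which is the text an LLM processing the output would receive.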
Prompt Injection Is the Weakness
While attackers who want to shore up the resilience of their LLMs can attempt to harden their systems against exploits, the actual weakness is the ability to inject commands into prompts, which is a hard problem to solve, says Giuseppe Ateniese, a professor of cybersecurity engineering at George Mason University.
“We are exploiting something that is very hard to patch,” he says. “The only way to solve it for now is to put some human in the loop, but if you put the human in the loop, then what is the purpose of the LLM in the first place?”
In the end, as long as prompt-injection attacks continue to be effective, Mantis will still be able to turn attacking AIs into prey.