AI Guardrails and Advice: Mitigating LLM Attacks

In this article, we will explore advanced attack techniques against large language models (LLMs) in detail, then explain how to mitigate these risks using external guardrails and safety procedures.

This is Part 2 in our guide on understanding and preventing LLM attacks. If you haven’t seen it yet, take a look at Part 1 of this article – there we outline the main attack types on LLMs and share some key overarching principles for defending AI models against such threats.

Safety measures have a vital role in protecting AI models against threats – just as any type of computer software, system, or network requires cybersecurity.

Prompt injection, hijacking, and leaking

Following on from a brief description of prompt injection in Part 1, here is a deep dive into what it involves. Prompt injection is a type of vulnerability that targets LLMs.

This technique involves creating specially designed inputs to manipulate the AI model’s behavior, aiming to bypass its security controls and cause it to produce unintended outputs.

Prompt injections try to exploit the way LLMs process and generate text based on inputs. By inserting carefully worded text, without enough safeguards in place, an attacker could trick the model into:

Producing unauthorized content

Ignoring previous instructions or security measures

Accessing restricted data

The attack takes advantage of the inherent trust these models place in their inputs. The diagram below is a visualization from an AWS re:Inforce presentation covering Cloud security, demonstrating the principles behind a prompt injection:

Building on this, there are several other types of prompt-based attacks including:

Prompt hijacking: This technique aims to override the original prompts with new instructions controlled by the attacker. The goal is to make the LLM produce unintended or malicious outputs. Unlike jailbreaking, which targets the model’s safety filters directly, prompt hijacking focuses on manipulating the application’s intended behavior.

Prompt leaking: This is an attack that attempts to reveal the hidden instructions or system prompts that developers use to guide LLM behavior. The goal is to explore the developer’s prompts to find potential weaknesses. Attackers are aiming to obtain sensitive information or understand the underlying logic of an AI system.

Jailbreaking

Jailbreaking methods often involve complex prompts or sneaking in hidden instructions to try confusing an LLM, exploiting weaknesses in how it understands and processes language.

By using clever phrasing or unconventional requests, users attempt to lead the model to provide outputs that developers programmed it to avoid.

Let’s look at a few examples, again with the help of visualizations from AWS re:Inforce. This is a simple example showing how attackers could try using affirmative instructions:

The next example involves an adversarial suffix:

One of the most well-known prompts is ‘do anything now’ or DAN, as visualized in a recent research paper by Xinyue Shen et al. The full DAN prompt is omitted in the diagram below:

Research from Anthropic Claude has also explored many-shot jailbreaking, involving faux dialogue. In this article, Anthropic also explains steps the company has taken to prevent such attacks:

It’s important to note that jailbreaking attempts are constantly evolving, and LLM developers are working continuously to keep their models protected against such manipulations.

Guardrails to prevent LLM attacks

Guardrails introduce safeguards to block such attacks and implement responsible AI principles. They provide safety controls and filter out content that is harmful or against policies, customized to bespoke applications.

Here are some illustrative examples from an AWS Machine Learning blog article by Harel Gal et al, showing how such guardrails work after the initial user input.

Example 1:

Example 2:

Among the best safeguards in the industry feature in Amazon Bedrock Guardrails. Building on native protections already in foundation models, these guardrails are capable of blocking the vast majority of more harmful content.

Amazon Bedrock Guardrails can also filter at least 75% of AI hallucinated responses for retrieval-augmented generation (RAG) workloads. AI hallucinations are instances where LLMs generate and present false, misleading, or inaccurate information as if it were factual, producing outputs that are not based on their training data.

Other benefits of using Amazon Bedrock Guardrails include abilities to:

Redact sensitive or personally identifiable information (PII) to protect privacy

Block offensive content with a custom word filter

Build a consistent level of AI safety across applications

There are other guardrail options available to use in addition to those included in Amazon Bedrock, further increasing the level of protection against attacks:

They include, but are not limited to:

Amazon Comprehend: This application uses machine learning (ML) to produce insights from text and classify user intent. For example, its trust and safety feature can detect toxic content from adversarial prompts and redact PII that users may inadvertently provide.
NVIDIA NeMo: This open-source toolkit provides several key rails for LLM-powered conversational AI models. These include rails for fact-checking, keeping conversations on topic, moderating content, preventing hallucinations, and restricting jailbreaking attacks.
LLM Guard with Regex Scanner: This scanner cleans prompts using predefined keywords, patterns, and regular expressions to identify and undesired content.

Ensuring security in AI projects – a case study

Neurons Lab ensures privacy and security for clients in highly regulated industries such as banking which require the utmost diligence.

Clients retain full ownership of proprietary cloud environments or can choose on-premises deployment, providing complete control within your secure infrastructure.

For more details, find out how we enhanced compliance and data management in financial marketing with an LLM constructor for Visa:

Our locally deployed LLMs operate entirely within clients’ ecosystems, eliminating the risk of data breaches or unauthorized access. All data stays within the client’s controlled environment, guaranteeing compliance with data protection regulations and internal security policies.

With models running locally, there is no need for external data transfers, ensuring that proprietary information is never exposed to third-party systems. Bespoke security protocols tailored to specific requirements increase the protection of sensitive information.

Recommendations for preventing and mitigating LLM attacks

Some of the main attack types against LLMs include model extraction and evasion, prompt injection, data poisoning, inference, and exploiting supply chain vulnerabilities.

Protecting your AI model against such attacks requires implementing responsible principles and safety mechanisms, such as external guardrails.

Here are several recommendations for preventing and mitigating attacks against LLMs:

Follow the safety requirements in the OWASP Application Security Verification Standard

Create a secure MLOps strategy

Implement data cleaning methods early in the training model

Verify security procedures of third parties in your supply chain

Fine-tune the model to improve accuracy

Set limits on the user input length and format – many prompt attacks are long

Ringfence your application – only provide the LLM with the access it needs

Test your model thoroughly before launch – use a team to try ‘breaking’ it and fix any faults you find

Use data filtering to remove harmful or sensitive content – from both inputs and outputs

Identify and block any malicious users – they will often try and fail with their initial attacks, so monitor usage patterns and anomalies and take action quickly

For more recommendations, read our tips for effective risk management in AI delivery.

About us: Neurons Lab

Neurons Lab delivers AI transformation services to guide enterprises into the new era of AI. Our approach covers the complete AI spectrum, combining leadership alignment with technology integration to deliver measurable outcomes.

As an AWS Advanced Partner and GenAI competency holder, we have successfully delivered tailored AI solutions to over 100 clients, including Fortune 500 companies and governmental organizations.