Margin of Safety #4: Observability in Agentic Systems
Jimmy Park, Kathryn Shih
February 18, 2025
- Blog Post

How are you solving observability in agentic systems?
Observability is already a pain in cybersecurity, but agentic systems are about to make it worse. Why? Because they introduce new complexity in logging, delegation, and failure detection. In this post, we’ll explore the key challenges and how enterprises can plan ahead to avoid observability nightmares.
Why Observability is Getting Harder in Agentic Systems
- Trying to draw inferences across diverse logs from disparate providers is super annoying. Why don't more people just standardize on epoch timestamps?!
- If you’re trying to correlate behavior across 10 agentic systems from different providers and each one has a 99% chance of logging correctly on any given day, there is only about a 90% chance (0.99^10 ≈ 0.904) that all of your logs are intact. Restated, there is roughly a 10% chance your end-to-end logging is horked (scientifically speaking) on any given day; see the quick calculation after this list.
- Many agentic implementations will have more opportunities for things to go wrong, meaning you will need more elaborate logging hooks (or really good SLAs) to chase down issues.
- If your team hasn’t thought about what the failure case for an agentic system is, you might not even be prepared to catch it. How are you going to define a bad result from a threat investigation? How are you going to catch it?
- As an industry, tech infrastructure's monitoring of (and thinking about) delegation is poor. We do not have standard ways to log the notion that a service account is delegating something to another service (perhaps multiple layers deep) on behalf of an upstream user. You can pass a request ID through your entire stack, but what happens when that request spawns multiple subordinate workflows? Requests-of-requests? Reassembling a logical view of what happened from those logs will be exciting in a bad way.
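To make the compounding-reliability point above concrete, here is a quick back-of-the-envelope calculation. The 99% per-provider figure and the independence assumption are purely illustrative:

```python
# Back-of-the-envelope: probability that every provider logged correctly today,
# assuming each provider's daily logging reliability is independent.
def p_all_logs_intact(per_provider_reliability: float, num_providers: int) -> float:
    return per_provider_reliability ** num_providers

# 10 providers, each 99% reliable on any given day
p_intact = p_all_logs_intact(0.99, 10)
print(f"Chance all logs are intact today: {p_intact:.1%}")   # ~90.4%
print(f"Chance something is horked today: {1 - p_intact:.1%}")  # ~9.6%
```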
So… you’re saying I’m doomed as an enterprise consumer of agentic systems?
Not at all! We think these systems can be entirely manageable, as long as you plan ahead and make the right asks of your vendors and your teams.
How to Make Observability Work
Logs and Guardrails:
At a high level, you should understand how much each provider has built in the way of robust guardrails (e.g., can they detect if an agent begins misbehaving?) versus how much of this is on you. The more it's on them, the more questions you should ask to ensure that the guardrails are robust and that the provider is fully accountable for them: for example, what does their SLA say about a guardrail failure? The more it's on you, the more you should consider which logs will be necessary to enforce your own monitoring, and ensure *those* are available and covered by an SLA. Any reasonable provider will make it easy to monitor outcomes, but outcomes alone may not be enough to measure efficacy.
For example, if you’re trying to measure the efficacy of a customer support chatbot, you’ll obviously need to understand things like issue category, whether the issue was resolved, and how long the resolution took. But what if you need to monitor the chatbot during a product outage on your end? Outcomes will likely shift, and you’ll need to know whether it’s due to mix shift in the incoming issues or a failure of the bot to handle those conditions. Having detailed stats around specific issues, inbound customer sentiment, and whether the customer was directly affected by a service issue will also be key.
Generalizing this example, you should think about the full suite of telemetry you’re going to need to understand agentic performance – including during long tail or black swan events – and ensure that this telemetry is available, ideally in a unified location that can support both alerting and investigative workflows. You should then be sure to follow up and build those workflows before the agents are fully deployed.
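As a purely illustrative sketch of what one unified telemetry record for the support-chatbot example might capture, the structure and field names below are our own assumptions, not any vendor's schema:

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional

@dataclass
class ChatbotInteractionEvent:
    """One illustrative telemetry record per chatbot interaction."""
    event_id: str
    timestamp: datetime                 # UTC, epoch-friendly
    issue_category: str                 # e.g. "billing", "login", "outage"
    resolved: bool
    resolution_seconds: Optional[float]
    inbound_sentiment: float            # e.g. -1.0 (angry) .. 1.0 (happy)
    customer_affected_by_outage: bool   # were *we* having an incident?
    escalated_to_human: bool
    model_version: str                  # so regressions can be correlated

# Example: an interaction logged during a product outage on your side
event = ChatbotInteractionEvent(
    event_id="evt-0001",
    timestamp=datetime.now(timezone.utc),
    issue_category="outage",
    resolved=False,
    resolution_seconds=None,
    inbound_sentiment=-0.7,
    customer_affected_by_outage=True,
    escalated_to_human=True,
    model_version="support-bot-2025-02",
)
```

Records like this feed both sides of the house: alerting (outcome rates sliced by category and outage exposure) and investigation (pulling the long tail of individual interactions when the aggregates shift).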
Finally, you should have a plan for agentic misbehavior. Is there a kill switch? Do you fall back to people? Is that even viable, or would you be understaffed? Can the agent identify the subset of tasks on which it's misbehaving and either rapidly patch the issue or otherwise disable the problematic behavior? (Rapid patching may not be feasible if the misbehavior stems from a deep-seated reasoning error in an upstream LLM, and changing reasoning on the fly will always be risky since it implies skipping testing or bake-in cycles.)
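One simple pattern for that plan is a circuit-breaker-style wrapper that can disable a specific task category and route it back to humans. This is a minimal sketch under our own assumptions, not any particular agent framework's API:

```python
from collections import defaultdict, deque

class AgentCircuitBreaker:
    """Disable agent handling per task category when bad outcomes pile up."""

    def __init__(self, window: int = 50, max_failure_rate: float = 0.2):
        self.window = window
        self.max_failure_rate = max_failure_rate
        self.results = defaultdict(lambda: deque(maxlen=window))  # recent outcomes per category
        self.killed = set()  # categories where the manual kill switch is thrown

    def record(self, category: str, ok: bool) -> None:
        """Record whether the agent handled a task in this category acceptably."""
        self.results[category].append(ok)

    def kill(self, category: str) -> None:
        """Manual kill switch for a misbehaving task category."""
        self.killed.add(category)

    def should_use_agent(self, category: str) -> bool:
        """Route to the agent only if it isn't killed and its recent failure rate is acceptable."""
        if category in self.killed:
            return False
        outcomes = self.results[category]
        if len(outcomes) < self.window:
            return True  # not enough signal yet
        failure_rate = 1 - sum(outcomes) / len(outcomes)
        return failure_rate <= self.max_failure_rate

# Usage: route to the agent only while it behaves; otherwise fall back to people.
breaker = AgentCircuitBreaker()
breaker.record("refund_requests", ok=True)
if breaker.should_use_agent("refund_requests"):
    print("hand the task to the agent, then breaker.record(...) the outcome")
else:
    print("route the task to the human queue instead")
```

The design question this forces is a useful one: whatever thresholds and fallbacks you pick, they only work if the telemetry to compute them exists before the agents are fully deployed.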
Identity and Delegation:
Tracking delegation across services is already difficult, and agentic systems will only make it worse. In modern enterprises, service accounts, automation, and multi-layered workflows obscure who is actually initiating actions. When agentic systems autonomously trigger tasks across multiple services, naive logging implementations may not track accountability, making security and compliance more challenging. Without standardized ways to log delegation, incidents become harder to trace, and the root cause of failures remains unclear.
To fix this, enterprises must push vendors to log delegation chains clearly. Organizations should demand consistent request IDs across all layers and providers, and ensure actions can always be traced back to their origin. If a provider struggles to articulate how they track delegation, that's a red flag. Internally, companies must also establish their own standards for delegation auditing by requiring agents and automated services to log explicit delegation context, capturing who (or what) initiated a request, the chain of execution, and the permissions used at each step. Without these measures, visibility will continue to degrade as automation becomes more complex.
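As a sketch of the kind of delegation context worth logging, the structure below is purely illustrative: the point is that the original request ID is carried unchanged end to end, while each hop appends who acted, on whose behalf, and with what permissions.

```python
import json
import uuid
from datetime import datetime, timezone

def new_delegation_context(origin_user: str) -> dict:
    """Start a delegation chain at the human (or system) that initiated the request."""
    return {
        "request_id": str(uuid.uuid4()),  # carried unchanged across every hop
        "origin": origin_user,
        "chain": [],                      # one entry appended per delegation step
    }

def delegate(ctx: dict, actor: str, on_behalf_of: str, permissions: list[str]) -> dict:
    """Record one hop: actor acting for on_behalf_of with these permissions."""
    ctx["chain"].append({
        "actor": actor,
        "on_behalf_of": on_behalf_of,
        "permissions": permissions,
        "timestamp": datetime.now(timezone.utc).isoformat(),
    })
    return ctx

# A user asks an agent to investigate; the agent fans out to a ticketing service.
ctx = new_delegation_context("alice@example.com")
delegate(ctx, actor="triage-agent", on_behalf_of="alice@example.com",
         permissions=["read:alerts"])
delegate(ctx, actor="ticketing-service-account", on_behalf_of="triage-agent",
         permissions=["create:ticket"])
print(json.dumps(ctx, indent=2))  # emit alongside every log line for this request
```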
Adopting identity-aware observability is another crucial step. Security teams need tools that tie delegation events to identity providers (non-human identity, or NHI, is an area many identity companies are now tackling) so they can distinguish between human-initiated actions, system automations, and rogue agentic behavior. Observability should provide not just logs but intent tracking to clarify why an action occurred. Beyond tracking delegation, companies deploying agents must also plan for what happens when delegation goes wrong. Are there controls to prevent an agent from taking unauthorized actions? Does the observability stack allow for real-time intervention when a delegated process spirals out of control? Implementing fail-safes, such as kill switches or fallback mechanisms, ensures resilience and reduces operational risk.
Agentic systems are here to stay, and securing them requires more than just collecting logs—it demands clear accountability, intent tracking, and intervention capabilities. The biggest security question isn’t just “what happened?” but “why did it happen?” and “what did it impact?” Enterprises that solve delegation observability today will have a critical advantage in deploying agentic workflows.
I have enough logs already, did Splunk pay you to write this?
Definitely not, and we’re firm believers that the high costs associated with many mainstream observability solutions leave the area ripe for disruption. But at the same time, you can’t secure what you can’t see. And if you can’t see what’s going on inside your agents, how are you going to secure it?
Conclusion
Observability on its own is just a cost—another line item in an already expensive security budget. The real value comes when observability translates into actionable insights, whether that’s improving security posture, optimizing operational efficiency, or uncovering risks before they escalate. Strong observability unlocks enterprise adoption and deployment by giving organizations the confidence to scale agentic systems while maintaining accountability and control.
How are you solving observability in agentic systems? Let’s discuss in the comments or reach out—we’d love to hear your insights. You can reach us directly at: kshih@forgepointcap.com and jpark@forgepointcap.com.
This blog is also published on Margin of Safety, Jimmy and Kathryn’s substack, as they research the practical sides of security + AI so you don’t have to.