AgentOps: Monitoring AI Agents in Production

By Jason Osajima, former VP of AI at a $250M manufacturer

Quick answer

AgentOps monitoring explained for manufacturers: what to track once AI agents go live, the metrics that matter, and why pilots die without it.

The reason most AI agents never make it past pilot has a name now: missing AgentOps monitoring. Nobody set it up, so the moment something went wrong in production, there was no way to see it, explain it, or fix it, and trust collapsed. AgentOps is to AI agents what DevOps is to software: the discipline of running the thing reliably after it's built, not just building it. I ran AI at a $250M furniture manufacturer, and the agents that survived all had monitoring from day one. The ones that died were the ones we shipped blind.

Here's the trap. A demo agent gets judged on a handful of curated examples and looks great. A production agent faces thousands of real, messy, unpredictable inputs from real users. Without monitoring, you have no idea whether it's right 95% of the time or 60% of the time. You just wait for someone to complain. By then the plant manager has already decided the AI "doesn't work," and you've lost the room.

Why agents need their own ops discipline

Traditional software is deterministic. Same input, same output, every time. You monitor uptime and errors and you're mostly covered. Agents are different in three ways that break old monitoring:

This is why "it's deployed, we're done" fails. Deployment is the start of the job, not the end.

The four things to monitor

AgentOps monitoring covers four categories. Skip any one and you have a blind spot that eventually bites you.

1. Quality — is it right?

The one that matters most and the one most teams skip because it's the hardest. You can't eyeball thousands of outputs. So you sample and you score.
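
Here is a minimal sketch of "sample and score," assuming every run has an id you can pull from your logs and that a reviewer grades each sampled output as right or wrong. The function names and the sample size of 50 are illustrative assumptions, not part of any specific tool.

```python
import math
import random


def sample_for_review(run_ids, k, seed=None):
    """Pick up to k run ids uniformly at random for human grading."""
    rng = random.Random(seed)
    ids = list(run_ids)
    return rng.sample(ids, min(k, len(ids)))


def accuracy_with_margin(grades, z=1.96):
    """Return (accuracy, margin) for a list of True/False grades.

    The margin is a normal-approximation 95% half-width, which is
    rough for small samples or accuracies near 0% or 100%.
    """
    n = len(grades)
    if n == 0:
        raise ValueError("no grades to score")
    p = sum(grades) / n
    margin = z * math.sqrt(p * (1 - p) / n)
    return p, margin


# Hypothetical day of production traffic: 5,000 runs, 50 pulled for review.
run_ids = [f"run-{i}" for i in range(5000)]
to_review = sample_for_review(run_ids, k=50, seed=7)

# Hypothetical grades a reviewer might enter: 45 correct, 5 wrong.
grades = [True] * 45 + [False] * 5
accuracy, margin = accuracy_with_margin(grades)
print(f"accuracy {accuracy:.0%} +/- {margin:.0%} on {len(grades)} sampled runs")
```

The margin is the useful part. A small sample gives a wide range, so a one-day dip inside that range is probably noise rather than a real drop.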

2. Behavior — what is it doing?

You need to see what the agent actually does, step by step.
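
As a rough sketch of what step-by-step visibility can look like, the recorder below captures each step's inputs, output, status, and duration, then emits one JSON line per run. `RunTrace`, the agent name, and the step names are hypothetical, and a dedicated tracing tool would replace this in practice.

```python
import json
import time
import uuid
from contextlib import contextmanager


class RunTrace:
    """Collects the ordered steps of a single agent run."""

    def __init__(self, agent_name):
        self.run_id = str(uuid.uuid4())
        self.agent_name = agent_name
        self.steps = []

    @contextmanager
    def step(self, name, **inputs):
        """Record one step: its inputs, its status, and how long it took."""
        record = {"step": name, "inputs": inputs, "status": "ok"}
        start = time.perf_counter()
        try:
            yield record  # the caller stores its result in record["output"]
        except Exception as exc:
            record["status"] = "error"
            record["error"] = repr(exc)
            raise
        finally:
            elapsed_ms = (time.perf_counter() - start) * 1000
            record["duration_ms"] = round(elapsed_ms, 2)
            self.steps.append(record)

    def to_json_line(self):
        """Serialize the whole run as one line, ready to append to a log file."""
        return json.dumps(
            {"run_id": self.run_id, "agent": self.agent_name, "steps": self.steps}
        )


trace = RunTrace("reorder-agent")
with trace.step("lookup_inventory", sku="SKU-123") as s:
    s["output"] = {"on_hand": 12}  # stand-in for a real tool call
with trace.step("draft_purchase_order", sku="SKU-123", qty=40) as s:
    s["output"] = {"status": "drafted"}

print(trace.to_json_line())
```

When someone asks why the agent did something, you look up the run id and read the steps in order instead of guessing.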

3. Cost — what is it spending?

Agents cost money per run, and costs surprise people.
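
To make cost per run concrete, here is a small worked example. It assumes your agent is billed per token and that you log token counts for every model call. The prices are placeholders, not any provider's actual rates, and `ModelCall` and `run_cost` are hypothetical names.

```python
from dataclasses import dataclass

# Placeholder prices in USD per million tokens; substitute your provider's rates.
PRICE_PER_MILLION = {"input": 3.00, "output": 15.00}


@dataclass
class ModelCall:
    input_tokens: int
    output_tokens: int


def run_cost(calls):
    """Sum the cost of every model call made during one agent run."""
    total = 0.0
    for call in calls:
        total += call.input_tokens / 1_000_000 * PRICE_PER_MILLION["input"]
        total += call.output_tokens / 1_000_000 * PRICE_PER_MILLION["output"]
    return total


# One run that made three model calls, re-sending a growing context each time.
calls = [ModelCall(4_000, 500), ModelCall(9_000, 700), ModelCall(15_000, 900)]
cost = run_cost(calls)
print(f"cost per run: ${cost:.4f}")
print(f"at 2,000 runs/day: ${cost * 2000:.2f}/day")
```

A multi-step agent pays for its context again on every call, so cost per run can grow faster than the number of steps suggests.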

4. Reliability — is it up?

The traditional stuff still applies.
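
A minimal sketch of two traditional signals, error rate and latency percentiles, computed from run records. It assumes each run logs a status string and a duration in seconds. The sample values are made up for illustration.

```python
import math


def error_rate(statuses):
    """Fraction of runs that did not finish with status 'ok'."""
    if not statuses:
        return 0.0
    return sum(1 for s in statuses if s != "ok") / len(statuses)


def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of values at or below it."""
    if not values:
        raise ValueError("no values")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


# Hypothetical batch of ten runs, one of which timed out.
latencies_s = [1.2, 0.9, 1.4, 1.1, 8.5, 1.0, 1.3, 1.2, 0.8, 1.1]
statuses = ["ok"] * 9 + ["timeout"]

print(f"error rate: {error_rate(statuses):.0%}")
print(f"p50 latency: {percentile(latencies_s, 50)}s")
print(f"p95 latency: {percentile(latencies_s, 95)}s")
```

Watch the high percentile, not the average. One slow run barely moves the mean, but it is the run the user remembers.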

What good looks like, in numbers

Vague monitoring is no monitoring. Set actual thresholds and alert when they're crossed:

| Signal              | Healthy                     | Investigate              | Alert                  |
|---------------------|-----------------------------|--------------------------|------------------------|
| Eval accuracy       | At or above launch baseline | 5+ pts below baseline    | 10+ pts below baseline |
| Human override rate | Stable or falling           | Rising trend over a week | Doubles from baseline  |
| Business metric     | Moving toward target        | Flat                     | Reversing              |
| Cost per run        | Within budget               | 25% over                 | 50%+ over              |
| Latency             | Under user tolerance        | Creeping up              | Exceeds tolerance      |

The numbers will differ by use case. The point is to have them written down before launch, so a problem is a crossed threshold you can act on, not a vague feeling that the agent "seems worse lately."
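
Written down as code, two of the rows from the table above might look like the sketch below. The function names are hypothetical. The table leaves the gaps undefined (for example, 3 points below baseline, or 10% over budget), so this sketch assumes those count as healthy. Pick your own rule and write it down.

```python
def classify_eval_accuracy(current, baseline):
    """Compare eval accuracy (in percentage points) with the launch baseline."""
    drop = baseline - current
    if drop >= 10:
        return "alert"
    if drop >= 5:
        return "investigate"
    return "healthy"  # assumption: drops under 5 points count as healthy


def classify_cost_per_run(current, budget):
    """Compare cost per run with the budgeted cost per run."""
    over = (current - budget) / budget
    if over >= 0.50:
        return "alert"
    if over >= 0.25:
        return "investigate"
    return "healthy"  # assumption: under 25% over budget counts as healthy


# Hypothetical readings for one agent this week.
checks = {
    "eval_accuracy": classify_eval_accuracy(current=86.0, baseline=92.0),
    "cost_per_run": classify_cost_per_run(current=0.16, budget=0.10),
}
for signal, status in checks.items():
    print(f"{signal}: {status}")
```

With these readings, accuracy sits 6 points under baseline and lands in "investigate," while cost is about 60% over budget and lands in "alert."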

Monitoring is what earns write access

Here's the connection to everything else. The reason you can safely let an agent move from read-only to writing back into your ERP is that you can see what it's doing. Tracing, eval accuracy, and override rates are the evidence that an agent is reliable enough to trust with a real transaction. No monitoring, no write access. The two are linked.

Monitoring is also what keeps human-in-the-loop honest. The approval step on high-stakes actions only works if you're tracking how often the human disagrees with the agent. A near-zero override rate means you can widen the agent's autonomy. A rising one means pull it back.
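
One way to track this, sketched under a few assumptions: each approval is logged as "approved," "edited," or "rejected" (hypothetical labels), an edit counts as an override, and "near-zero" means 2% or less. That 2% cutoff is my placeholder, not a standard.

```python
from collections import Counter


def override_rate(decisions):
    """Share of reviewed actions where the human rejected or edited the agent's proposal."""
    counts = Counter(decisions)
    reviewed = sum(counts.values())
    if reviewed == 0:
        return 0.0
    return (counts["rejected"] + counts["edited"]) / reviewed


def autonomy_signal(current_rate, baseline_rate, widen_below=0.02):
    """Turn the override rate into a suggestion; a person still makes the call."""
    if baseline_rate > 0 and current_rate >= 2 * baseline_rate:
        return "pull back"  # rate has doubled from baseline
    if current_rate <= widen_below:
        return "candidate to widen"
    return "hold"


# Hypothetical week of approvals: 200 reviewed actions.
decisions = ["approved"] * 188 + ["edited"] * 8 + ["rejected"] * 4
rate = override_rate(decisions)
print(f"override rate: {rate:.1%}")
print(autonomy_signal(rate, baseline_rate=0.025))
```

Here the rate is 6% against a 2.5% baseline, so the signal is "pull back."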

Start simple, then build

You don't need a full observability platform on day one. You need, in order:

  1. Logging and tracing — capture every run from the first deployment. This is non-negotiable and cheap.
  2. A standing eval set — real cases with known answers, run on a schedule. Catches drift.
  3. A human review sample — a daily handful of outputs, graded.
  4. The business-metric dashboard — the number that justifies the agent's existence, visible to the sponsor.

Add cost and latency alerting as volume grows. Tooling exists — there are AgentOps platforms built for exactly this — but the discipline matters more than the tool. A spreadsheet of eval results beats a fancy dashboard nobody reads.
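
The standing eval set can start as small as the sketch below: a list of real cases with known answers, a function that runs the agent over them, and CSV output you can paste into a spreadsheet. `toy_agent` and the four cases are hypothetical stand-ins. You would call your real agent instead and trigger the script from whatever scheduler you already use.

```python
import csv
import io
from datetime import date


def run_eval(agent_fn, cases):
    """Run the agent on every case and return one result row per case."""
    rows = []
    for case in cases:
        got = agent_fn(case["input"])
        rows.append({
            "date": date.today().isoformat(),
            "case_id": case["id"],
            "expected": case["expected"],
            "got": got,
            "correct": got == case["expected"],
        })
    return rows


def accuracy_pct(rows):
    """Percentage of cases the agent got right."""
    return 100 * sum(r["correct"] for r in rows) / len(rows)


def to_csv(rows):
    """Render result rows as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def toy_agent(order):
    """Stand-in agent: a hypothetical rule that flags late orders for expedite."""
    return "expedite" if order["days_late"] > 3 else "normal"


EVAL_SET = [
    {"id": "c1", "input": {"days_late": 5}, "expected": "expedite"},
    {"id": "c2", "input": {"days_late": 0}, "expected": "normal"},
    {"id": "c3", "input": {"days_late": 3}, "expected": "expedite"},  # edge case the toy rule misses
    {"id": "c4", "input": {"days_late": 1}, "expected": "normal"},
]

rows = run_eval(toy_agent, EVAL_SET)
print(f"eval accuracy: {accuracy_pct(rows):.0f}%")
print(to_csv(rows))
```

Append each run's rows to the same sheet and drift shows up as a falling accuracy column over time.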

The agents that get out of pilot and stay out are the monitored ones. AgentOps monitoring is how you keep trust after launch, and trust is the entire game.


Want a monitoring setup mapped to your first agents before you ship them? Grab the free First 5 Agents teardown — I'll show you the exact quality, cost, and behavior signals to track for each one and where the thresholds should sit. Then book a 20-minute call and we'll build the monitoring plan that keeps your agents trusted in production, not stuck in pilot.

