
Kurt Fischman, Founder, Marshal
Kurt is the CEO of Marshal, a Managed AI Ops service built for small businesses. That means AI agents doing the work, leads coming from answer engines, and a team that keeps your business running at full speed.

AI agent performance metrics are the numbers a business tracks to confirm a deployed AI agent does its job correctly, safely, affordably, and fast enough to matter. Marshal sorts them into five families: quality, safety, cost, latency, and business results. A metric only functions once it carries a named owner, a numeric threshold, and a pre-agreed decision; the library below supplies both the numbers and that wiring.
AI agent performance metrics are the numbers a business tracks to know whether a deployed agent is doing its assigned job correctly, safely, at an acceptable cost and speed, and in a way that moves a business result somebody chose in advance. The consensus fix for an unreliable agent is more measurement: a bigger dashboard, a richer eval suite, another vendor demo. The consensus is wrong about the bottleneck. In every stalled deployment Marshal audits, the instrumentation already existed. What was missing was a human being whose name was attached to a number, and a pre-agreed answer to the question of what happens when that number moves. A number nobody owns changes nothing.
One distinction keeps the rest of this page honest. Performance metrics are not the same discipline as AI agent evaluation, and the two get conflated in almost every guide on the subject. Evaluation is the trust gate an agent passes before it ships: test suites, red-team runs, the judgment call about business fit. Performance metrics are what happen after the gate. They are the standing numbers that tell an operations leader, week after week, whether the thing in production is still earning its keep. Evaluation is the entrance exam. Metrics are the job review that never stops.
The library that follows covers the five families those reviews draw from: quality, safety, cost, latency, and business results. Most published guides cover the library and stop. The chapter after it, the one that says who reads each number and what decision fires when it trips, is the chapter this article exists to add. There is a reason that chapter is missing everywhere else, and the reason is worth two minutes of your attention.
Marshal probed Google and Perplexity for the query "AI agent performance metrics" on June 11, 2026, and the answer pool carries a structural bias: of the ten sources Perplexity cited, nine sell a platform with a stake in the answer. The one exception is a personal Medium post. The rest of the pool is eval and observability platforms like Weights & Biases and Confident AI, the enterprise software giants, agent-builder vendors, and consultancies. The top organic result, Galileo, opens with a banner announcing its acquisition by Cisco. The metric canon for AI agents is written by the people selling the measurement, and the canon is consolidating into the largest vendors as fast as the acquisitions can close.
None of this makes the metrics wrong. Galileo's metric definitions are genuinely strong, and IBM's process guidance is sane. The bias lives in what gets left out. A vendor documents the numbers its product computes, because that is what product content is for. No product computes "which human owns this number" or "what decision was agreed before launch," so no vendor writes that chapter. The public library is complete on instrumentation and silent on operation.
Read every metrics guide, this one included, as a document with an author and an incentive. Marshal builds and runs agent systems rather than selling measurement software, so the incentive here is different: agents that visibly work, on numbers a founder can read without buying anything.
Marshal sorts every agent metric worth tracking into five families: quality, safety, cost, latency, and business results. Each family answers one operator question and feeds one kind of decision. Quality asks whether the work got done. Safety asks what the agent did that nobody asked for. Cost and latency price the work. Business results connect the work to the operating statement. IBM slices similar territory into task-specific, ethical, interaction, and function-calling buckets in its evaluation guide, and the mapping between the two taxonomies is close. The difference is the organizing principle: IBM organizes by what an evaluator tests. Marshal organizes by who reads the number and what they do about it.
Organize by reader. A metric family that does not terminate in a person with authority to act is a filing system wearing a measurement program's clothes.
The table below sorts AI agent performance metrics into Marshal's five families, with the canonical metrics in each family and the decision each one feeds.
| Family | Core question | Canonical metrics | Decision the family feeds |
|---|---|---|---|
| Quality | Did the agent complete the job correctly | Task completion, resolution accuracy, tool selection quality, step progress, grounding | Fix prompts, tools, or knowledge; adjust autonomy level |
| Safety | What happened that nobody asked for | Permission violations, data exposure counts, landed injection attempts, escalation correctness | Freeze autonomy, tighten permissions, route to human review |
| Cost | What each unit of work costs | Cost per run, cost per successful resolution, token spend concentration | Re-route models, cache retrievals, renegotiate workflow scope |
| Latency | Is the agent fast enough to matter | Median response time, 95th percentile response time, queue depth, throughput | Re-order steps, parallelize tool calls, resize infrastructure |
| Business results | Did the operating statement move | Deflection rate, hours returned, cycle time, capacity created, revenue influenced | Scale the agent, expand scope, or retire the workflow |
Every family terminates in a decision. A metric that cannot name the action it triggers belongs in a log file, not in a measurement program.
Quality metrics answer the first question an operator asks about an AI agent: did it actually complete the task it was handed, for the reason it was supposed to, using the tools it should have used. Two altitudes matter, and conflating them is the most common quality-measurement mistake. Session-level numbers grade the whole job: task completion rate, resolution accuracy, whether the user got what they came for. Trace-level numbers grade the steps inside the job: did the agent pick the right tool with the right parameters, did each step move the work forward, did the answer stay grounded in the documents it retrieved. An agent can score well on steps and still fail the session, the way a rep can have a great call and lose the deal.
The trace layer is where the vendor research earns its citations. Galileo's work on agent metrics treats tool selection quality as the core agentic signal, and scores step progress on a scale where above 0.7 reads as clear advancement and below 0.3 reads as an agent spinning in place. Galileo's State of Eval Engineering survey of 500 or more enterprise practitioners puts the top 15 percent of teams at 90 to 100 percent behavior coverage, and carries the one finding worth taping to a monitor: better-instrumented teams report more incidents, not fewer, because they can finally see them. A clean incident log on a poorly measured agent is not reliability. The log is clean because the failures are invisible.
In practice, volume forces automation: a judge model grades each run against a rubric, and a human spot-audits a slice of the grades every week. The judge catches drift at scale. The human catches the judge.
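The judge-plus-auditor loop above can be sketched in a few lines of Python. This is a minimal illustration, not Marshal's tooling: the function names, the 10 percent audit fraction, and the pass/fail grade labels are all assumptions, and the agreement rate is one plausible way to let the human "catch the judge."

```python
import random

def pick_audit_sample(run_ids, fraction=0.1, seed=0):
    """Pick a reproducible random slice of graded runs for the human audit."""
    run_ids = list(run_ids)
    if not run_ids:
        return []
    # Always audit at least one run, never more than exist.
    k = min(len(run_ids), max(1, round(len(run_ids) * fraction)))
    return random.Random(seed).sample(run_ids, k)

def judge_agreement(judge_grades, human_grades):
    """Share of audited runs where the human grade matches the judge grade."""
    audited = [run_id for run_id in human_grades if run_id in judge_grades]
    if not audited:
        return None
    matches = sum(judge_grades[r] == human_grades[r] for r in audited)
    return matches / len(audited)

# Hypothetical grades: the judge passed r3, the human auditor did not.
judge = {"r1": "pass", "r2": "pass", "r3": "pass", "r4": "fail"}
human = {"r1": "pass", "r3": "fail"}
print(judge_agreement(judge, human))  # 0.5
```

A falling agreement rate is a signal about the judge's rubric, not about the agent, so it would belong to the same weekly review as the agent's own numbers.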
Safety metrics for an AI agent track the actions and outputs nobody asked for: policy violations, data exposure, prompt-injection attempts that landed, and actions taken outside the agent's permission envelope. Count these in absolute numbers, not rates. A 99.8 percent clean record sounds like an A grade until the 0.2 percent turns out to be a customer's billing data pasted into the wrong thread. For a founder-led business without a compliance department's margin for error, one mishandled disclosure can cost a relationship that took years to build, which is why the safety family reports to a named risk owner rather than to the team that built the agent.
Each safety count maps to a control surface, and the mapping is the metric. Actions held at approval gates, and the rate at which humans override them. Exception-queue depth, and the age of the oldest unreviewed item. Human review turnaround. Audit-trail coverage, meaning the share of agent actions that carry a written record of what the agent did and why. Marshal documents this control layer in its agent governance architecture: approval gates, exception queues, human review, and audit trails working as one system. Every agent Marshal ships runs behind approval gates and exception queues, and we learned more about which metrics matter from the first month of exception reviews than from any dashboard we have ever stood up.
An empty exception queue is not automatically good news. Either the agent is genuinely clean, or the thresholds are set so loose that nothing qualifies as an exception. The weekly review exists to tell those two stories apart.
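Three of the control-surface counts described above (exception-queue depth, age of the oldest unreviewed item, and audit-trail coverage) reduce to very small computations. The sketch below assumes each open exception is represented by the timestamp it was opened and each agent action by a dict with an optional `audit_record` field; both shapes are hypothetical.

```python
from datetime import datetime, timedelta

def queue_stats(opened_at, now):
    """Depth of the exception queue and age of its oldest unreviewed item."""
    if not opened_at:
        return 0, timedelta(0)
    return len(opened_at), now - min(opened_at)

def audit_coverage(actions):
    """Share of agent actions that carry a written audit record."""
    if not actions:
        return None
    recorded = sum(1 for action in actions if action.get("audit_record"))
    return recorded / len(actions)

# Hypothetical data for one weekly review.
now = datetime(2026, 6, 15, 9, 0)
open_exceptions = [datetime(2026, 6, 12, 9, 0), datetime(2026, 6, 14, 9, 0)]
depth, oldest_age = queue_stats(open_exceptions, now)
print(depth, oldest_age.days)  # 2 3

actions = [{"audit_record": "refund approved, policy 4.2"}, {"audit_record": None}]
print(audit_coverage(actions))  # 0.5
```

A depth of zero from `queue_stats` says nothing about why the queue is empty, which is the point of the paragraph above: the number needs a reader.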
Cost and latency metrics carry the unit economics of an AI agent: what a run costs, how long it takes, and where spend concentrates as volume grows. The number that matters is cost per successful resolution, not cost per run. Dividing spend by attempts lets failure flatter the economics, because a cheap agent that is wrong every third run is an expensive agent once a human cleans up the misses. Divide by successes, and charge the cleanup time to the agent that caused it: escalations priced in human minutes belong on the agent's bill.
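A short worked example makes the gap between the two denominators concrete. The figures are invented for illustration (300 runs, every third one wrong, twelve minutes of human cleanup per miss at a $40 loaded hourly rate), and the function names are assumptions.

```python
def cost_per_run(total_spend, runs):
    """Spend divided by attempts: the number that lets failure flatter the economics."""
    return total_spend / runs

def cost_per_successful_resolution(total_spend, successes,
                                   cleanup_minutes, loaded_rate_per_hour):
    """Spend plus human cleanup time, divided by successes only."""
    if successes == 0:
        raise ValueError("no successful resolutions to divide by")
    cleanup_cost = cleanup_minutes / 60 * loaded_rate_per_hour
    return (total_spend + cleanup_cost) / successes

# Hypothetical month: $90 of model spend across 300 runs, 100 of them misses.
runs, successes, misses = 300, 200, 100
spend = 90.0
cleanup_minutes = misses * 12  # human minutes charged to the agent

print(round(cost_per_run(spend, runs), 2))  # 0.3
print(round(cost_per_successful_resolution(spend, successes,
                                           cleanup_minutes, 40.0), 2))  # 4.45
```

Under these assumptions the agent that looks like thirty cents a run costs $4.45 per resolved job once the cleanup lands on its bill.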
Token spend rarely distributes evenly. One chatty retrieval call or one over-stuffed context window often dominates the invoice, and a concentration number, meaning the share of total spend attributable to the single most expensive step, tells you where to aim before you renegotiate anything else. Caching, smaller models on low-stakes steps, and trimming what gets stuffed into context are the usual fixes, in that order of effort.
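The concentration number is simple enough to compute from any per-step spend breakdown. A minimal sketch, assuming spend has already been attributed to named steps (the step names and dollar amounts are made up):

```python
def spend_concentration(spend_by_step):
    """Return the most expensive step and its share of total spend."""
    total = sum(spend_by_step.values())
    if total == 0:
        return None, 0.0
    top_step = max(spend_by_step, key=spend_by_step.get)
    return top_step, spend_by_step[top_step] / total

# Hypothetical monthly spend per step, in dollars.
spend = {"retrieve_docs": 60.0, "draft_reply": 25.0, "classify_intent": 15.0}
step, share = spend_concentration(spend)
print(step, share)  # retrieve_docs 0.6
```

Here one retrieval step carries 60 percent of the invoice, so caching that step is the first place to aim.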
Latency gets two numbers, never one. The median tells you what a normal run feels like. The 95th percentile tells you what your angriest user experienced, and tail latency is where queues form and SLAs quietly die. Pair both with throughput, the volume of completed work per hour, and queue depth at peak. A lead-response agent that answers in forty seconds at the median but eight minutes at the tail is not a forty-second agent. The tail is the product.
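The two latency numbers, plus throughput, can be computed from raw run durations with the standard library. This sketch uses the nearest-rank method for the 95th percentile, which is one common convention among several; the sample data mirrors the forty-second and eight-minute example above and is otherwise invented.

```python
import math
import statistics

def latency_summary(seconds):
    """Median and nearest-rank 95th percentile of run durations."""
    if not seconds:
        raise ValueError("need at least one run")
    ordered = sorted(seconds)
    rank = math.ceil(95 * len(ordered) / 100)  # nearest-rank position, 1-based
    return statistics.median(ordered), ordered[rank - 1]

def throughput_per_hour(completed_runs, hours):
    """Completed work per hour over the measured window."""
    return completed_runs / hours

# Hypothetical sample: 18 runs at 40 seconds, 2 runs at 8 minutes.
durations = [40] * 18 + [480] * 2
median, p95 = latency_summary(durations)
print(median, p95)  # 40.0 480
print(throughput_per_hour(len(durations), 4))  # 5.0
```

Reporting only the median here would describe an agent that two users in twenty never met.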
Business-results metrics tie an AI agent to the operating statement: tickets deflected, hours returned, cycle time cut, capacity created, revenue influenced. Every one of them is meaningless without a baseline captured before the agent went live, and the week the agent ships is the last week the before-number exists. Capture the workflow's volume, cycle time, error rate, and human-hours for four to six weeks before launch. Teams that skip this step spend the next quarter arguing about what improvement means, and the argument has no referee.
The published examples worth studying share one property: a hard number against a visible baseline. Apollo.io deflects roughly 40 percent of inbound support tickets with an AI agent, a result tray.ai's measurement write-up documents alongside its own framing of technical-versus-business measurement. The macro context makes the discipline urgent rather than optional: McKinsey's State of AI research finds 88 percent of companies now use AI somewhere while only about a third have scaled it beyond pilots. Agents that cannot prove movement on a business number are the ones that stall in the pilot graveyard.
Productivity, the question executives actually ask, resolves into two measurable numbers: hours returned per week to the people who used to run the workflow, priced at loaded cost, and capacity created, meaning workflows handled per operator after the agent takes the repetitive middle. Those numbers feed the larger investment argument Marshal lays out in the business case for AI agents, where the same baseline discipline decides whether an agent was worth building at all.
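Both productivity numbers are differences against the pre-launch baseline, which is why the baseline has to exist. A minimal sketch with invented figures; the function names and the $50 loaded rate are assumptions.

```python
def hours_returned(baseline_hours_per_week, current_hours_per_week,
                   loaded_rate_per_hour):
    """Weekly human hours freed by the agent, and their value at loaded cost."""
    hours = baseline_hours_per_week - current_hours_per_week
    return hours, hours * loaded_rate_per_hour

def capacity_created(baseline_per_operator, current_per_operator):
    """Extra workflows each operator handles compared with the baseline."""
    return current_per_operator - baseline_per_operator

# Hypothetical workflow: 30 human hours a week before launch, 12 after.
hours, value = hours_returned(30, 12, 50)
print(hours, value)  # 18 900

# Hypothetical capacity: 40 workflows per operator before, 65 after.
print(capacity_created(40, 65))  # 25
```

Neither function can run without the "before" argument, which is the code-level version of the argument with no referee.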
A metric becomes an instrument when three things are attached to it: a named owner, a numeric threshold, and a decision that fires when the threshold trips. Written out, a wired metric reads like a standing order. Grounding score: owned by the operations lead, weekly average must hold 0.85, one bad week triggers a knowledge-base re-index and a re-run of the eval set, two consecutive bad weeks drop the agent one autonomy level and route its actions through approval until the score recovers. Nothing in that sentence requires new software. All of it requires deciding, in advance, who acts and how.
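The grounding-score standing order above translates almost directly into code, which is a fair test of whether a metric is actually wired. The sketch below is one illustrative encoding: the class, the function, and the action strings are assumptions, and restoring autonomy once the score recovers is left out for brevity.

```python
from dataclasses import dataclass

@dataclass
class WiredMetric:
    name: str
    owner: str        # the named human who reads the number
    threshold: float  # the weekly average must hold at or above this value

def grounding_decision(metric, weekly_averages):
    """Return the pre-agreed action for a list of weekly averages, oldest first."""
    bad = [score < metric.threshold for score in weekly_averages]
    if len(bad) >= 2 and bad[-1] and bad[-2]:
        return "drop one autonomy level and route actions through approval"
    if bad and bad[-1]:
        return "re-index the knowledge base and re-run the eval set"
    return "no action"

grounding = WiredMetric(name="grounding_score", owner="operations lead",
                        threshold=0.85)

print(grounding_decision(grounding, [0.90, 0.88]))        # no action
print(grounding_decision(grounding, [0.90, 0.82]))        # one bad week
print(grounding_decision(grounding, [0.90, 0.82, 0.80]))  # two consecutive bad weeks
```

The code adds nothing the sentence did not already say. The work is in agreeing on the three fields and the two branches before launch.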
Most agent dashboards are meticulous records of decisions nobody made. The wiring pattern is the difference between a measurement program and a screensaver, and it runs on a cadence a small team can actually hold: automated grading runs daily, the owner reads the week's exceptions in one standing 30-minute review, and once a month the thresholds themselves get re-checked against the baseline so the bar moves when the business does.
The comparison below contrasts Marshal's decision-wired measurement pattern with the dashboard-first default most vendors sell and the manual spot-checking most teams actually practice.
| Dimension | Decision-wired measurement | Dashboard-first measurement | Manual spot checks |
|---|---|---|---|
| Core unit | One metric, one named owner, one pre-agreed decision | Panels of charts reviewed when convenient | Occasional transcript reads by whichever human has time |
| Reader | A named owner on a standing weekly cadence | Whichever stakeholder opens the dashboard | Nobody in particular |
| Threshold discipline | Numeric thresholds agreed before launch | Trends eyeballed after the fact | Gut feel |
| On a tripped threshold | A pre-agreed decision fires the same week | A meeting gets scheduled | A vague sense that something is off |
| Volume fit | Works from 50 runs a month upward | Earns its keep at enterprise volume | Collapses past a handful of runs a day |
| Failure mode | Thresholds go stale without the monthly re-check | Rich history, no operating consequence | Memory substitutes for measurement |
Decision-wired measurement is the only pattern of the three where a moving number reliably produces an action, which is the entire point of measuring.
AI agent performance metrics fail in two predictable ways: a team buys enterprise instrumentation for a workflow that runs 300 times a month, or a team optimizes one number until the number stops meaning anything. The first failure is a volume mismatch. At 300 runs a month, a two percent failure rate is six exceptions a month. Sampling strategies and statistical dashboards exist because at enterprise volume nobody can read everything; below roughly 1,000 runs a month, skip the sampling strategy: one named owner reading every exception in a standing 30-minute weekly review will find the failure modes faster than any statistics package. Most founder-led deployments never cross the line where reading everything stops working, which means most of the apparatus the vendors sell is built for a problem smaller companies do not have.
The second failure is Goodhart's law: a measure that becomes a target stops being a good measure. An agent tuned hard for deflection learns to close tickets that deserved a human. The defense is pairing. Every target metric gets a counter-metric watching the failure it would otherwise hide: deflection paired with reopen rate and escalation correctness, cost per run paired with resolution rate, speed paired with grounding. Boston Consulting Group's 10-20-70 rule prices AI value at roughly 10 percent algorithms, 20 percent technology and data, and 70 percent people and process, and the measurement loop is the 70 applied to agents: the instrumentation is the cheap part, and the owners, thresholds, and reviews are where the value actually accrues.
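The pairing rule can be checked mechanically against a metric plan. The sketch below encodes the three pairings named above and reports any target that is being chased without its counter-metric being tracked; the metric keys and the function name are hypothetical.

```python
# Pairings taken from the text: target metric -> counter-metrics that guard it.
COUNTER_METRICS = {
    "deflection_rate": {"reopen_rate", "escalation_correctness"},
    "cost_per_run": {"resolution_rate"},
    "response_time": {"grounding"},
}

def missing_counters(targets, tracked):
    """Map each target metric to the counter-metrics not yet being tracked."""
    gaps = {}
    for target in targets:
        if target not in COUNTER_METRICS:
            gaps[target] = ["(no counter-metric defined)"]
            continue
        missing = COUNTER_METRICS[target] - set(tracked)
        if missing:
            gaps[target] = sorted(missing)
    return gaps

# Hypothetical plan: deflection is a target, but nobody watches escalations.
targets = ["deflection_rate", "cost_per_run"]
tracked = {"deflection_rate", "cost_per_run", "reopen_rate", "resolution_rate"}
print(missing_counters(targets, tracked))
# {'deflection_rate': ['escalation_correctness']}
```

An empty result means every target has a watcher. It does not mean the watcher's threshold is set sensibly, which remains the owner's call.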
Marshal treats monitoring as the fifth stage of workflow design rather than a post-launch afterthought, and writes the metric plan when the workflow itself is designed, before the first production run. AI agent performance metrics work as a commitment device, not a scoreboard: the library says what to watch, and the wiring decides in advance who acts and how. Build the five families, attach the three wires, hold the cadence. The agent will still fail some weeks. With wiring, a failure costs you one review cycle instead of a quarter of blind spots.
AI agent performance metrics sort into five families: quality (task completion, resolution accuracy, tool selection quality, grounding), safety (permission violations, data exposure, landed injection attempts), cost (cost per run and per successful resolution), latency (median and 95th-percentile response time), and business results (deflection, hours returned, cycle time, capacity created). Model benchmarks grade a model in a lab; agent metrics grade a deployed system doing real work.
Evaluating a production AI agent means pairing session-level quality numbers, such as task completion and resolution accuracy, with trace-level checks on tool selection and step progress, then reviewing every exception the agent throws. A judge model grades runs at volume while a human owner spot-audits the grades weekly. Pre-launch testing is a separate discipline, AI agent evaluation, which gates whether the agent ships at all.
AI agent productivity is measured against a baseline captured before deployment: hours returned per week to the people who previously ran the workflow, priced at loaded cost, plus capacity created, meaning workflows handled per operator after the agent takes the repetitive middle. Cycle time from intake to completion rounds out the picture. Without the pre-launch baseline, none of the three numbers can prove anything.
The 10-20-70 rule, popularized by Boston Consulting Group, holds that capturing value from AI is roughly 10 percent algorithms, 20 percent technology and data, and 70 percent people and process change. AI agent performance metrics live inside the 70: instrumentation is cheap, and the value depends on named owners, agreed thresholds, and a held review cadence. Marshal's wiring pattern is a direct application of the rule.
AI agent performance metrics need one named owner per family, not a committee: typically an operations lead for quality, cost, and latency, a security or risk owner for safety, and the executive sponsor for business results. Ownership means reading the number on a standing cadence and holding the authority to act when a threshold trips.
AI agent evaluation is the pre-launch trust gate: test suites, red-team runs, and the business-fit judgment that decide whether an agent ships. AI agent performance metrics are the post-launch operating loop: standing numbers confirming the deployed agent still works. Evaluation is the entrance exam, and metrics are the job review that never stops.
AI agent performance metrics carry two structural limits. Below roughly 1,000 runs a month, statistical dashboards add little over a named owner reading every exception weekly, and any single metric chased as a target degrades, so every target needs a paired counter-metric. Metrics also cannot substitute for judgment: a number with no owner and no pre-agreed decision changes nothing.
Implementing AI agent performance metrics starts before launch: capture a four-to-six-week workflow baseline, pick one or two metrics per family from the five-family library, and write the wiring for each, meaning an owner, a threshold, and a decision. After launch, hold the cadence: automated grading daily, a 30-minute owner review of exceptions weekly, and a threshold re-check against the baseline monthly. Marshal writes this metric plan into stage five of workflow design.