Summary
- Evals are essential pre-production, but they have real blind spots once you ship
- Single-turn LLM graders miss frustration that builds across multi-turn conversations
- Roughly 78% of AI failures leave no explicit user signal for evals to catch
- High deflection rate is not the same as high resolution rate
- The teams getting this right have an automated layer that catches failures, verifies them, and closes the loop without waiting for a complaint
There is a lot of content out there on building production-ready support agents. Not enough on what happens after they ship.
Eval frameworks are the go-to solution for quality at scale, and they are genuinely useful. Writing grading criteria in plain language and running them across thousands of conversations catches things you would never find manually. But there is a class of production failure that even a well-configured eval setup does not reach.
Where single-turn graders fall short
Most LLM-as-judge evaluators look at one turn at a time. What they miss is what builds across the full conversation.
In multi-turn support interactions, failure rarely looks like one bad response. It looks like a user who rephrased the same question three times, got close-but-not-quite answers each time, and eventually gave up. The individual turns might all score fine. The conversation as a whole was a failure, and that kind of accumulated frustration is not something a single-turn grader picks up on.
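To make that concrete, here is a minimal sketch of a conversation-level judge: it grades the whole transcript in one pass, so cross-turn patterns like repeated rephrasing are visible to the grader. It assumes an OpenAI-style chat client; the rubric, model name, and output fields are illustrative choices, not a prescribed setup.

```python
import json
from openai import OpenAI

client = OpenAI()

# Illustrative rubric: the judge sees the whole conversation, not one turn.
RUBRIC = """You are grading an entire support conversation, not a single reply.
Count how many times the user restated the same question, judge whether the
issue was actually resolved, and return JSON:
{"resolved": true/false, "user_rephrase_count": int, "frustration": "none"|"mild"|"high"}"""

def grade_conversation(messages: list[dict]) -> dict:
    """Grade the full transcript in one pass so accumulated frustration
    (rephrasing, loops, abandonment) can surface in the verdict."""
    transcript = "\n".join(f"{m['role']}: {m['content']}" for m in messages)
    response = client.chat.completions.create(
        model="gpt-4o",  # assumption: any capable judge model works here
        response_format={"type": "json_object"},
        messages=[
            {"role": "system", "content": RUBRIC},
            {"role": "user", "content": transcript},
        ],
    )
    return json.loads(response.choices[0].message.content)
```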
The failures you are not seeing
A significant majority of AI failures produce no explicit user signal; recent research puts the figure around 78%. The user quietly accepts a wrong answer, abandons the thread, or moves on frustrated without saying anything. No complaint gets filed, no trace gets flagged for review.
And it is not always subtle either. Consider an agent that tells a user their refund has been processed. The eval scores it a pass. No API was actually called. The customer is still waiting, and nothing in your observability stack knows it yet.
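One way to catch that class of failure is to cross-check what the agent said against what the trace actually did. A rough sketch, assuming a generic trace with a list of messages and tool-call spans; the field names and the `process_refund` tool are invented for illustration:

```python
# Invented trace shape for illustration: {"messages": [...], "spans": [...]}.
REFUND_CLAIM_PHRASES = ("refund has been processed", "issued your refund")

def claims_refund(assistant_text: str) -> bool:
    """Cheap heuristic: did the agent tell the user a refund happened?"""
    text = assistant_text.lower()
    return any(phrase in text for phrase in REFUND_CLAIM_PHRASES)

def refund_tool_called(trace: dict) -> bool:
    """Did the trace actually record a successful refund API call?"""
    return any(
        span.get("name") == "process_refund" and span.get("status") == "ok"
        for span in trace.get("spans", [])
    )

def detect_ungrounded_refund(trace: dict) -> bool:
    """Flag conversations where the claim and the tool log disagree."""
    said_it = any(
        claims_refund(m["content"])
        for m in trace.get("messages", [])
        if m["role"] == "assistant"
    )
    return said_it and not refund_tool_called(trace)
```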
Langfuse is great for trace-level visibility and running evaluators at scale, but it is built around the assumption that there is something to grade. When the failure is silent, nothing surfaces itself for review.
The deflection trap
Many teams use deflection rate as their north star. But deflection without resolution is not a win:
- A user who gets three unhelpful responses and quietly gives up is a deflection
- An agent that answers confidently but incorrectly and closes the ticket is a deflection
- A multi-turn conversation that loops until the user just leaves is a deflection
If your eval layer grades coherence but not actual resolution, your dashboard can look healthy while the real experience quietly degrades. And waiting for users to tell you about it is not a reliable signal either. For every user who bothers to complain, there are likely several more who were just as frustrated and never said a word. Over time that silence does not mean satisfaction; it means eroding trust.
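To see the gap on your own data, a toy split of deflection into resolved versus silently abandoned is enough to start. This assumes conversations already carry an outcome label; the label names are invented for the example:

```python
from collections import Counter

def deflection_vs_resolution(outcomes: list[str]) -> dict:
    """Split 'never reached a human' into 'fixed' and 'gave up'."""
    counts = Counter(outcomes)
    total = len(outcomes)
    deflected = total - counts["escalated_to_human"]  # never reached a human
    resolved = counts["resolved"]                      # issue actually fixed
    return {
        "deflection_rate": deflected / total,
        "resolution_rate": resolved / total,
        "silent_failure_rate": (deflected - resolved) / total,
    }

print(deflection_vs_resolution(
    ["resolved", "abandoned", "resolved", "escalated_to_human", "abandoned"]
))
# {'deflection_rate': 0.8, 'resolution_rate': 0.4, 'silent_failure_rate': 0.4}
```

In this toy example the dashboard shows 80% deflection, but only 40% of conversations were resolved; the other 40% are exactly the silent failures the headline number hides.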
The pressure it puts on your team
When production monitoring relies purely on Langfuse traces and periodic eval runs, the engineering team ends up in a reactive loop:
- A pattern eventually surfaces in the logs
- Someone has to manually dig through conversation traces to piece together what went wrong
- By then the issue has already touched a significant number of users
That workflow made sense when humans were the primary operators flagging problems. With support agents running at volume, it does not scale.
Closing the loop properly
The better approach is not just alerting on failures. An alert without context just creates more work for an already stretched team.
What actually closes the loop is a layer that catches a potential failure, verifies it is high-fidelity rather than noise, and surfaces the full context behind it so an engineer can jump straight into a fix without a scavenger hunt through traces. From there you can automate the next step: ship a direct fix, or promote the real production instance into your eval dataset so your evals get sharper over time instead of staying static.
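Here is a sketch of the shape of that loop. Every name in it is a placeholder for your own detector, verifier, ticketing, and dataset store, not a real API:

```python
from dataclasses import dataclass

@dataclass
class FailureCandidate:
    conversation_id: str
    transcript: list[dict]
    reason: str

def detect(trace: dict) -> FailureCandidate | None:
    """First pass: cheap heuristics flag a *potential* failure."""
    if trace.get("user_rephrase_count", 0) >= 3:
        return FailureCandidate(trace["id"], trace["messages"], "rephrase loop")
    return None

def verify(candidate: FailureCandidate) -> bool:
    """Second pass: a stricter check (in practice, perhaps a
    conversation-level judge) so alerts stay high-fidelity."""
    return len(candidate.transcript) >= 4  # placeholder criterion

def close_the_loop(trace: dict, eval_dataset: list[dict]) -> None:
    candidate = detect(trace)
    if candidate is None or not verify(candidate):
        return  # noise never reaches an engineer
    # Surface full context: here just a print, in practice a ticket with
    # the transcript and relevant spans attached.
    print(f"[{candidate.reason}] {candidate.conversation_id}: "
          f"{len(candidate.transcript)} turns attached")
    # Learn: the real production case becomes a permanent eval item.
    eval_dataset.append({"input": candidate.transcript, "label": candidate.reason})
```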
That is how the feedback loop between production and improvement actually closes. Failures get contained before they reach more users, evals improve from real signal rather than synthetic cases, and your engineering team spends less time reacting and more time building.
That is the mission behind what we are building with Nexus.
