
Evaluating LLM Chat Agents with Real World Signals

How human follow-up behavior reveals response quality, and why specific instruction adherence outperforms vague relevance scoring.

Cole Hoffer

Evaluating chat agents is harder than evaluating single-turn completions. A response that looks good in isolation may still fail in context. It might not answer what the user actually asked, or it might be technically correct but miss the intent entirely.

Most eval frameworks default to LLM-as-judge approaches: "Is this response relevant? Rate 1-5." These produce scores, but the scores don't correlate well with actual user satisfaction. A response can be "relevant" and still require the user to ask three follow-up questions to get what they need.

I've found three approaches that produce more meaningful signal: evaluating document relevancy through citation patterns, using human follow-up behavior as implicit feedback, and tracking adherence to specific formatting instructions rather than vague quality criteria.

A Typical RAG Chatbot Setup

Before diving into evaluation, let's establish what's being evaluated. A typical RAG-enabled QA or summary chatbot has several core components:

Retrieval: The system searches a document corpus using semantic embeddings (and optionally BM25 for keyword matching). This returns K candidate documents (often 20-50) that might be relevant to the user's query.

Reranking (optional): A reranker model scores the retrieved documents to surface the most relevant ones. This helps when initial retrieval returns many candidates of varying quality.

Context Assembly: The top-ranked documents are formatted with unique identifiers and injected into the LLM's context window. Each document gets an ID like doc_1, doc_2, etc.

LLM Generation: The model generates a response using the retrieved documents as context. The prompt includes instructions on how to cite sources, typically requiring citations in a specific format (e.g., [doc_id]) placed directly after the text they support.

Here's what a typical prompt structure looks like:
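A minimal sketch of the context assembly step in Python, assuming documents arrive as (title, text) pairs; the helper name and prompt wording are illustrative, not a production prompt:

```python
# Sketch: assemble a RAG prompt with per-document IDs the model can cite.
# `build_prompt` and the instruction wording are illustrative assumptions.
def build_prompt(question: str, documents: list[tuple[str, str]]) -> str:
    # Tag each document with a stable ID (doc_1, doc_2, ...) for citations.
    context = "\n\n".join(
        f"[doc_{i}] {title}\n{text}"
        for i, (title, text) in enumerate(documents, start=1)
    )
    return (
        "Answer the question using only the documents below.\n"
        "Cite sources as [doc_id] directly after the text they support.\n\n"
        f"Documents:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )

print(build_prompt(
    "How do I reset my password?",
    [("Password FAQ", "Go to Settings > Security to manage your password.")],
))
```

The only structural requirements are that each document carries a unique identifier and that the citation format is stated explicitly, since both are what the later evaluations key off.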

The LLM sees K documents, attempts to use them to answer the question, and cites the ones it actually used. This citation behavior is crucial: it reveals which documents the model found useful, which becomes evaluation signal.

For deeper dives on retrieval techniques, see my articles on building rerankers from citations and applying HyDE to domain-specific search.

Why Traditional Evals Fall Short

Most evaluation approaches fall into two categories that sound reasonable but produce misleading signals.

Off-the-Shelf Metrics: BLEU, ROUGE, and Similar

Metrics like BLEU and ROUGE measure n-gram overlap between the generated response and a reference answer. They're designed for machine translation and summarization tasks where you have a "correct" reference.

The problem: they measure surface-level similarity, not semantic correctness or user satisfaction. A response can have low BLEU but perfectly answer the question. Conversely, a response can have high BLEU but miss the user's intent entirely.

Example: A user asks "What's the refund policy?" The system responds with accurate information about refunds, but uses different phrasing than the reference document. BLEU scores it low because the n-grams don't match, even though the answer is correct.

These metrics also miss pragmatic failures. A response might be technically accurate but summarize the wrong subset of information, or answer a related question instead of the one asked. BLEU won't catch this; it only sees word overlap.

Shallow LLM-as-Judge Questions

LLM-as-a-judge can be powerful, but most implementations ask questions that modern models handle trivially. Common examples:

  • "Did the response summarize the information well?"
  • "Was the response concise and informative?"
  • "Did the response follow the user's instructions?"

The problem: models like Opus 4.5 and GPT-5.2 don't struggle with basic summarization or following simple instructions. These evals measure "can the model do basic tasks," not "did it actually help the user."

These shallow questions miss what matters:

  • Task-specific utility: Did the response help accomplish the user's goal?
  • User intent alignment: Did it answer what was actually asked, not just what was literally asked?
  • System reliability: Did it follow the specific formatting rules required by downstream systems?

A response can score highly on "was it well-summarized?" while still requiring three follow-up questions to get what the user needed. The eval says it's good; the user experience says otherwise.

Three Focused Evaluation Approaches

Instead of measuring surface-level metrics or asking shallow questions, I focus on three evaluation approaches that capture what actually matters.

Document Relevancy via Citation Ranking

The quality of the document ranking fed to the LLM shows up in its citation patterns. This reveals whether your retrieval and reranking components are working.

How it works: Track which retrieved documents get cited versus ignored. Calculate metrics like NDCG@K where relevance is binary: cited (1) or not cited (0). This tells you whether relevant documents are appearing at the top of your retrieval results.

What it measures: RAG system component quality, specifically retrieval and reranking performance.

Example: If your top-5 retrieved documents are never cited, but documents ranked 6-10 are frequently cited, your retrieval is broken. The model is finding useful information, but it's buried in lower-ranked results.

Conversely, if cited documents consistently appear in the top-3 retrieved results, your retrieval is working well. The model is getting relevant context early in the ranking.

This evaluation approach connects directly to training rerankers from citation data. See building rerankers from RAG citations for how to use citation patterns as training labels.

Implementation: For each query-response pair, extract:

  • The ranking order of retrieved documents
  • Which documents were cited in the response
  • Calculate NDCG@K where K is your typical context window size (often 5-15 documents)

A high NDCG@K means relevant documents (those that get cited) are near the top of the ranking. A low NDCG@K means you're passing irrelevant documents to the LLM, wasting context window space and potentially confusing the model.
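A minimal sketch of the metric itself, with relevance treated as binary (cited = 1, not cited = 0); the function name and input shape are assumptions for illustration:

```python
import math

# Sketch: NDCG@K over a retrieval ranking where relevance is binary,
# 1 if the document at that rank was cited in the response, else 0.
def ndcg_at_k(cited_flags: list[int], k: int) -> float:
    """cited_flags[i] is 1 if the doc at rank i (0-based) was cited."""
    def dcg(flags: list[int]) -> float:
        # Standard discounted gain: rel / log2(rank + 2) for ranks 0..k-1.
        return sum(rel / math.log2(rank + 2)
                   for rank, rel in enumerate(flags[:k]))

    if not any(cited_flags):
        return 0.0  # nothing was cited; no ranking can score here
    ideal = sorted(cited_flags, reverse=True)  # all cited docs ranked first
    return dcg(cited_flags) / dcg(ideal)

# Cited docs sit at ranks 2 and 5 (1-based): one relevant doc is near
# the top, the other is buried, so the score lands in the middle.
print(round(ndcg_at_k([0, 1, 0, 0, 1], k=5), 2))
```

Perfect retrieval (all cited documents at the top of the ranking) scores 1.0; the further cited documents sink, the lower the score.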

Implementation details: You need citation extraction from responses and access to the retrieval ranking order, which means logging both the retrieval results and the final response with citations. Citation extraction accuracy matters: if you miss citations or extract false positives, your relevancy scores are wrong.

Position bias in citations is also real. LLMs over-cite documents that appear early in the context window, so a relevant document at position 30 may never get cited because the model found what it needed in positions 1-10. Mitigation strategies include shuffling document order or using multiple samples with different orderings.

Document Relevancy Evaluation (example)

Query: "How do I reset my password?"

Retrieved ranking: #1 Password FAQ, #2 Account Settings Guide, #3 Security Overview, #4 Password Reset Steps, #5 Login Troubleshooting.

NDCG@5 = 0.65. Moderate: some cited documents sit lower in the ranking.

User Follow-up Behavior Analysis

A user's next message is the most honest evaluation of the response that preceded it. This is implicit feedback that's already present in conversation logs; no additional labeling required.

Core idea: Use LLM-as-a-judge on follow-up messages to classify whether the prior response succeeded or failed.

Clarification-needed signals (low scores):

  • User correcting the assistant: "No, I meant X"
  • User adjusting filters: "Can you search for Y instead?"
  • User repeating their prior question
  • User explicitly fixing a misunderstanding
  • Negative sentiment: frustration, getting mad

Natural continuation signals (high scores):

  • Asking for more detail on what was provided
  • Exploring a related topic
  • Requesting a different format or visualization
  • Building on the response with new questions

Example classifications:

  • Low score: User asks "How do I reset my password?" → Assistant provides account settings info → User responds "No, I meant resetting the password, not changing account settings"
  • High score: User asks "What's the refund policy?" → Assistant explains policy → User responds "What about partial refunds?"

Implementation: Score each follow-up user message on a 0-1 scale using an LLM judge. The judge analyzes the follow-up text to determine if it indicates the prior response failed (correction, restatement, frustration) or succeeded (natural continuation). Aggregate scores across a conversation to get a quality metric that reflects actual user experience.
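A sketch of the judge plumbing, assuming some LLM client is available; the prompt wording, `call_judge` parameter, and fallback behavior are all illustrative assumptions, not a specific API:

```python
# Sketch: score a follow-up message with an LLM judge. The prompt text
# and the `call_judge` callable are placeholders for your own client.
JUDGE_PROMPT = """\
Given an assistant response and the user's follow-up message, output a
single score from 0.0 to 1.0. Low scores: the follow-up corrects the
assistant, repeats the question, or shows frustration. High scores: the
follow-up builds naturally on the response.

Assistant response: {response}
User follow-up: {followup}
Score:"""

def parse_score(judge_output: str) -> float:
    # Clamp to [0, 1] so a malformed judge reply can't skew aggregates.
    try:
        return min(1.0, max(0.0, float(judge_output.strip().split()[0])))
    except (ValueError, IndexError):
        return 0.5  # neutral fallback when the judge output is unparseable

def score_followup(response: str, followup: str, call_judge) -> float:
    prompt = JUDGE_PROMPT.format(response=response, followup=followup)
    return parse_score(call_judge(prompt))
```

Keeping the parsing defensive matters in practice: judge outputs drift, and a silent parse failure otherwise turns into a bogus conversation-level score.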

Why it works: This captures pragmatic failures that direct quality scoring misses. A response might accurately summarize feedback, but if it summarized the wrong subset of feedback, the user still has to course-correct.

Implementation details: You need multi-turn conversations. Score only follow-up messages (skip the first user message; there's no prior response to evaluate). Handle conversations where users never needed to clarify: these are successes, not missing data.

Edge cases: Some users always ask follow-up questions regardless of response quality (they're exploratory). Some users never ask follow-ups even when responses are poor (they give up). The signal is meaningful in aggregate but noisy for individual conversations.

Follow-up Behavior Analysis (example)

Assistant response: "To reset your password, navigate to Settings > Security. From there, you can change your password."

User follow-up: "No, I meant resetting the password, not changing account settings."

Classification: Clarification Needed (score 0.2). The user is correcting the assistant; the response missed the intent.

Specific Prompt Instruction Tracking

For individual response evaluation, I've shifted from broad quality dimensions to specific instruction adherence. Consider the difference:

Vague: "Is the response well-formatted?"

Specific: "Are citations formatted as [dataset_step_id-item_id] and placed directly after the supporting text?"

The vague version produces noisy scores: what counts as "well-formatted" varies by evaluator, by context, by mood. The specific version is a clear pass/fail that's consistent across all judges.

Similar instructions we have evaluated:

| Instruction | Rule |
| --- | --- |
| Citation format | Must follow [dataset_step_id-item_id] format, placed after supporting text. Max three consecutive. No source URLs. |
| Table constraints | Text decorators and citations allowed, but no newlines or lists within cells. |
| Platform-specific format | Slack responses skip markdown headers and complex formatting. Web responses use full markdown. |

Each instruction is a binary or near-binary check: did the response follow this rule or not? That's what makes them actionable. If 30% of responses fail citation formatting, you know exactly what to fix. If responses score 3.2/5 on "formatting quality," you don't.

I track instruction adherence as individual metrics, then aggregate into an overall compliance score. Regressions are immediately visible and traceable to specific instructions.

Implementation details:

  • Extract specific instructions from your prompts; each becomes an eval criterion
  • Prioritize instructions that matter most: formatting errors that break downstream systems, citation errors that mislead users
  • Create binary checks (regex patterns for citation format, parsing for table constraints)
  • Review instruction sets periodically-rules that made sense initially may become outdated, and over-rigid sets can prevent the model from adapting to edge cases
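The binary checks can be plain regex and arithmetic. A minimal sketch, assuming the [doc_id] citation format from earlier and an illustrative max-consecutive rule; the rule set and function names are assumptions, not the production checks:

```python
import re

# Sketch: binary instruction checks using the [doc_N] citation format.
CITATION = re.compile(r"\[doc_\d+\]")

def check_citation_format(response: str) -> bool:
    # Every bracketed token in the response must match the [doc_N] pattern.
    return all(CITATION.fullmatch(tok)
               for tok in re.findall(r"\[[^\]]*\]", response))

def check_max_consecutive(response: str, limit: int = 3) -> bool:
    # Fail if any run of back-to-back citations exceeds `limit`.
    runs = re.findall(r"(?:\[doc_\d+\]\s*){%d,}" % (limit + 1), response)
    return not runs

def compliance(response: str) -> float:
    # Aggregate pass/fail checks into a single compliance score.
    checks = [check_citation_format(response),
              check_max_consecutive(response)]
    return sum(checks) / len(checks)

print(compliance("Go to Settings > Security [doc_1]. Click reset [doc_1]."))
```

Each check stays independently reportable, so a compliance regression points straight at the rule that broke rather than at a blended score.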

Instruction Adherence Tracking (example)

Response: "To reset your password, go to Settings > Security [1]. Click the reset option [1]."

  • Citation format: citations must follow the [doc_id] format
  • Citation placement: citations must appear immediately after supporting text
  • Max consecutive citations: no more than 3 consecutive citations

Compliance score: 2/3 (67%)

Combining the Three Approaches

These three evaluation approaches measure different things:

  • Document relevancy: RAG system component quality (retrieval, reranking)
  • User follow-ups: User satisfaction proxy
  • Instruction adherence: System reliability proxy

A response can score well on one dimension but poorly on others, and each combination tells you something different:

| Relevancy | Follow-ups | Instructions | What It Means |
| --- | --- | --- | --- |
| High | Low | High | Retrieval good, but response quality poor: model got relevant docs but didn't answer well |
| Low | High | High | Retrieval poor but LLM compensates: model is working around bad retrieval |
| High | High | Low | Following rules but missing intent: got right docs, helped user, but broke formatting |
| High | High | High | All systems working: ideal state |
| Low | Low | Low | Multiple failures: retrieval broken, response unhelpful, formatting wrong |

Practical implementation: Run all three evaluations. Log at the message level, aggregate at the conversation level. Message-level scores reveal specific failures. Conversation-level scores reveal overall quality. Build dashboards that show trends over time: are relevancy scores improving? Are follow-up scores declining? Are instruction adherence rates stable?
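The message-to-conversation rollup can be a few lines. A sketch under the assumption that the three metrics are logged per assistant message as a dict; the field names are illustrative, and a missing follow-up score (e.g., on the last turn) is simply skipped:

```python
from statistics import mean

# Sketch: aggregate message-level metrics into a conversation-level view.
# Keys ("ndcg", "followup_score", "compliance") are assumed field names.
def conversation_summary(messages: list[dict]) -> dict:
    def avg(key: str):
        # Ignore messages where the metric wasn't logged (None or absent).
        vals = [m[key] for m in messages if m.get(key) is not None]
        return round(mean(vals), 2) if vals else None

    return {
        "relevancy": avg("ndcg"),
        "followups": avg("followup_score"),
        "instructions": avg("compliance"),
    }

print(conversation_summary([
    {"ndcg": 0.9, "followup_score": 0.2, "compliance": 1.0},
    {"ndcg": 0.8, "followup_score": None, "compliance": 1.0},
]))
```

Emitting `None` rather than zero for unlogged metrics keeps "no data" distinguishable from "scored badly" when the summaries land on a dashboard.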

This multi-dimensional view gives you actionable signals. If relevancy is high but follow-ups are low, focus on response generation quality. If relevancy is low, focus on retrieval/reranking. If instructions are failing, focus on prompt engineering.

Cite this post


Cole Hoffer. (Feb 2026). Evaluating LLM Chat Agents with Real World Signals. Cole Hoffer. https://www.colehoffer.ai/articles/evaluating-chat-agents
