
[evals]
Evaluating LLM Chat Agents with Real World Signals
How human follow-up behavior reveals response quality, and why specific instruction adherence outperforms vague relevance scoring.

Evaluating chat agents is harder than evaluating single-turn completions. A response that looks good in isolation may still fail in context—it might not answer what the user actually asked, or it might be technically correct but miss the intent entirely.
Most eval frameworks default to LLM-as-judge approaches: "Is this response relevant? Rate 1-5." These produce scores, but the scores don't correlate well with actual user satisfaction. A response can be "relevant" and still require the user to ask three follow-up questions to get what they need.
We've found three approaches that produce more meaningful signal: evaluating document relevancy through citation patterns, using human follow-up behavior as implicit feedback, and tracking adherence to specific formatting instructions rather than vague quality criteria.
A Typical RAG Chatbot Setup
Before diving into evaluation, let's establish what we're evaluating. A typical RAG-enabled QA or summary chatbot has several core components:
Retrieval: The system searches a document corpus using semantic embeddings (and optionally BM25 for keyword matching). This returns K candidate documents (often 20-50) that might be relevant to the user's query.
Reranking (optional): A reranker model scores the retrieved documents to surface the most relevant ones. This helps when initial retrieval returns many candidates of varying quality.
Context Assembly: The top-ranked documents are formatted with unique identifiers and injected into the LLM's context window. Each document gets an ID like doc_1, doc_2, etc.
LLM Generation: The model generates a response using the retrieved documents as context. The prompt includes instructions on how to cite sources—typically requiring citations in a specific format (e.g., [doc_id]) placed directly after the text they support.
Here's what a typical prompt structure looks like:
```
System: You are a helpful assistant. Answer questions using the provided documents.
When you reference information from a document, cite it using [doc_id] format immediately
after the supporting text. Use no more than three consecutive citations.

Documents:
[doc_1] Title: Password Reset Guide
Content: To reset your password, go to Settings > Security...

[doc_2] Title: Account Management
Content: Account settings can be accessed from the dashboard...

---

User: How do I reset my password?

Assistant: To reset your password, navigate to Settings > Security [doc_1]. From there,
you'll see the password reset option. Click it and follow the prompts to create a new password.
```
The LLM sees K documents, attempts to use them to answer the question, and cites the ones it actually used. This citation behavior is crucial—it reveals which documents the model found useful, which becomes evaluation signal.
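To make the citation signal concrete, here's a minimal sketch of context assembly and citation extraction in Python. The `Doc` dataclass, `format_context`, and `extract_citations` names are illustrative, not from any particular framework:

```python
import re
from dataclasses import dataclass

@dataclass
class Doc:
    doc_id: str   # e.g. "doc_1"
    title: str
    content: str

def format_context(docs: list[Doc]) -> str:
    """Format retrieved documents with unique IDs for the LLM context window."""
    blocks = [f"[{d.doc_id}] Title: {d.title}\nContent: {d.content}" for d in docs]
    return "Documents:\n" + "\n\n".join(blocks)

CITATION_PATTERN = re.compile(r"\[(doc_\d+)\]")

def extract_citations(response: str) -> list[str]:
    """Return cited doc IDs in order of first appearance."""
    cited: list[str] = []
    for doc_id in CITATION_PATTERN.findall(response):
        if doc_id not in cited:
            cited.append(doc_id)
    return cited

# extract_citations("Go to Settings > Security [doc_1]. See also [doc_2].")
# -> ["doc_1", "doc_2"]
```

The same extracted citations feed the relevancy and instruction-adherence evals described below.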
For deeper dives on retrieval techniques, see our articles on building rerankers from citations and applying HyDE to domain-specific search.
Why Traditional Evals Fall Short
Most evaluation approaches fall into two categories that sound reasonable but produce misleading signals.
Off-the-Shelf Metrics: BLEU, ROUGE, and Similar
Metrics like BLEU and ROUGE measure n-gram overlap between the generated response and a reference answer. They're designed for machine translation and summarization tasks where you have a "correct" reference.
The problem: they measure surface-level similarity, not semantic correctness or user satisfaction. A response can have low BLEU but perfectly answer the question. Conversely, a response can have high BLEU but miss the user's intent entirely.
Example: A user asks "What's the refund policy?" The system responds with accurate information about refunds, but uses different phrasing than the reference document. BLEU scores it low because the n-grams don't match, even though the answer is correct.
These metrics also miss pragmatic failures. A response might be technically accurate but summarize the wrong subset of information, or answer a related question instead of the one asked. BLEU won't catch this—it only sees word overlap.
Shallow LLM-as-Judge Questions
LLM-as-a-judge can be powerful, but most implementations ask questions that modern models handle trivially. Common examples:
- "Did the response summarize the information well?"
- "Was the response concise and informative?"
- "Did the response follow the user's instructions?"
The problem: models like Opus 4.5 and GPT-5.2 don't struggle with basic summarization or following simple instructions. These evals measure "can the model do basic tasks" not "did it actually help the user."
These shallow questions miss what matters:
- Task-specific utility: Did the response help accomplish the user's goal?
- User intent alignment: Did it answer what the user actually meant, not just the literal words of the question?
- System reliability: Did it follow the specific formatting rules required by downstream systems?
A response can score highly on "was it well-summarized?" while still requiring three follow-up questions to get what the user needed. The eval says it's good; the user experience says otherwise.
Three Focused Evaluation Approaches
Instead of measuring surface-level metrics or asking shallow questions, we focus on three evaluation approaches that capture what actually matters.
Document Relevancy via Citation Ranking
Which documents the model actually uses shows up in its citation patterns, and comparing those citations against the retrieval ranking reveals whether your retrieval and reranking components are working.
How it works: Track which retrieved documents get cited versus ignored. Calculate metrics like NDCG@K where relevance is binary: cited (1) or not cited (0). This tells you whether relevant documents are appearing at the top of your retrieval results.
What it measures: RAG system component quality—specifically retrieval and reranking performance.
Example: If your top-5 retrieved documents are never cited, but documents ranked 6-10 are frequently cited, your retrieval is broken. The model is finding useful information, but it's buried in lower-ranked results.
Conversely, if cited documents consistently appear in the top-3 retrieved results, your retrieval is working well. The model is getting relevant context early in the ranking.
This evaluation approach connects directly to training rerankers from citation data. See building rerankers from RAG citations for how to use citation patterns as training labels.
Implementation: For each query-response pair, record:
- The ranking order of the retrieved documents
- Which documents were cited in the response
Then calculate NDCG@K, where K is your typical context window size (often 5-15 documents).
A high NDCG@K means relevant documents (those that get cited) are near the top of the ranking. A low NDCG@K means you're passing irrelevant documents to the LLM, wasting context window space and potentially confusing the model.
Implementation details: You need citation extraction from responses and access to the retrieval ranking order, which means logging both the retrieval results and the final response with its citations. Citation extraction accuracy matters: if you miss citations or extract false positives, your relevancy scores are wrong. Position bias in citations is also real. LLMs over-cite documents that appear early in the context window, so a relevant document at position 30 may never get cited because the model found what it needed in positions 1-10. Mitigation strategies include shuffling document order or sampling multiple responses with different orderings.
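As a concrete sketch, NDCG@K with binary cited/not-cited relevance takes only a few lines of Python; the function below assumes you've already logged the retrieval order and extracted citations, and the document IDs are illustrative:

```python
import math

def ndcg_at_k(retrieval_order: list[str], cited_ids: set[str], k: int) -> float:
    """NDCG@K with binary relevance: a document is relevant iff it was cited."""
    relevances = [1.0 if doc_id in cited_ids else 0.0 for doc_id in retrieval_order[:k]]
    dcg = sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances))
    # Ideal DCG: every cited document ranked at the top.
    ideal_hits = min(len(cited_ids), k)
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0

# Two of the top-5 retrieved docs were cited, at ranks 1 and 4.
score = ndcg_at_k(
    retrieval_order=["doc_3", "doc_7", "doc_1", "doc_9", "doc_2"],
    cited_ids={"doc_3", "doc_9"},
    k=5,
)  # ~0.88: cited docs are near the top, but one is buried at rank 4
```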
Example (document relevancy evaluation): for the query "How do I reset my password?", the score is moderate because some of the cited documents sit lower in the retrieval ranking.
User Followup Behavior Analysis
How a human responds to an assistant message tells you a lot about whether that message was good. This is implicit feedback that's already present in conversation logs—no additional labeling required.
Core idea: Use LLM-as-a-judge on followup questions to classify whether the prior response succeeded or failed.
Signals to detect:
Clarification-needed signals (low scores):
- User correcting the assistant: "No, I meant X"
- User adjusting filters: "Can you search for Y instead?"
- User repeating their prior question
- User explicitly fixing a misunderstanding
- Negative sentiment: frustration or visible annoyance
Natural continuation signals (high scores):
- Asking for more detail on what was provided
- Exploring a related topic
- Requesting a different format or visualization
- Building on the response with new questions
Example classifications:
- Low score: User asks "How do I reset my password?" → Assistant provides account settings info → User responds "No, I meant resetting the password, not changing account settings"
- High score: User asks "What's the refund policy?" → Assistant explains policy → User responds "What about partial refunds?"
Implementation: Score each follow-up user message on a 0-1 scale using an LLM judge. The judge analyzes the followup text to determine if it indicates the prior response failed (correction, restatement, frustration) or succeeded (natural continuation). Aggregate scores across a conversation to get a quality metric that reflects actual user experience.
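A minimal sketch of that judge step, assuming a placeholder `call_llm(prompt) -> str` helper standing in for whatever LLM client you use and an illustrative judge prompt:

```python
JUDGE_PROMPT = """You are evaluating a chat assistant. Given the assistant's response and the
user's follow-up message, decide whether the follow-up indicates the response FAILED
(correction, restatement of the same question, frustration) or SUCCEEDED (natural
continuation: more detail, a related topic, a different format request).

Assistant response: {response}
User follow-up: {followup}

Answer with a single number between 0 and 1, where 0 = clear failure and 1 = clear success."""

def call_llm(prompt: str) -> str:
    """Placeholder: swap in your actual LLM client call."""
    raise NotImplementedError

def score_followup(response: str, followup: str) -> float:
    """Score one follow-up message; 0 = prior response failed, 1 = it succeeded."""
    raw = call_llm(JUDGE_PROMPT.format(response=response, followup=followup))
    try:
        return min(max(float(raw.strip()), 0.0), 1.0)
    except ValueError:
        return 0.5  # unparseable judge output; treat as neutral

def conversation_score(turns: list[dict]) -> float | None:
    """Average follow-up scores across a conversation.

    `turns` alternates user/assistant messages; the first user message is skipped
    because there is no prior response to evaluate.
    """
    scores = [
        score_followup(turns[i - 1]["content"], turns[i]["content"])
        for i in range(2, len(turns))
        if turns[i]["role"] == "user" and turns[i - 1]["role"] == "assistant"
    ]
    return sum(scores) / len(scores) if scores else None
```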
Why it works: This captures pragmatic failures that direct quality scoring misses. A response might accurately summarize feedback, but if it summarized the wrong subset of feedback, the user still has to course-correct. Direct quality scoring might miss this; follow-up scoring catches it.
Implementation details: You need multi-turn conversations. Single-turn evals can't use this approach. Score only follow-up messages (skip the first user message—there's no prior response to evaluate). Handle conversations where users never needed to clarify—these are successes, not missing data. Use an LLM-as-judge with a prompt that classifies followup messages into correction/restatement (low score) vs natural continuation (high score).
Edge cases: Some users always ask follow-up questions regardless of response quality—they're exploratory. Some users never ask follow-ups even when responses are poor—they give up. The signal is meaningful in aggregate but noisy for individual conversations.
Example (followup behavior analysis): the assistant responds "To reset your password, navigate to Settings > Security. From there, you can change your password." The user replies "No, I meant resetting the password, not changing account settings." The user is correcting the assistant; the response missed the intent and scores low.
Specific Prompt Instruction Tracking
For individual response evaluation, we've shifted from broad quality dimensions to specific instruction adherence.
Core idea: Evaluate adherence to specific formatting instructions, not vague quality criteria.
Consider the difference:
Vague: "Is the response well-formatted?"
Specific: "Are citations formatted as [dataset_step_id-item_id] and placed directly after the supporting text?"
The vague version produces noisy scores. What counts as "well-formatted" varies by evaluator, by context, by mood. The specific version has a clear pass/fail condition that's consistent across evaluations.
Example: The same response might score 3/5 on "formatting quality" from one judge and 4/5 from another. But on "are citations in [doc_id] format?", it's either yes or no—consistent across all judges.
Instructions we actually evaluate:
Citation format: Citations must follow [dataset_step_id-item_id] format. They must appear after the text they support. No more than three consecutive citations. No source URLs.
Table constraints: Tables can include text decorators and citations, but not newlines or lists within cells.
Platform-specific formatting: Slack responses skip markdown headers and complex formatting. Web responses can use full markdown.
Each instruction is a binary or near-binary check. Did the response follow this rule or not? These checks are fast, consistent, and actionable—if a response fails, you know exactly what to fix.
The compound effect: Specific instruction evals reveal patterns that vague evals miss. If 30% of responses fail citation formatting, that's a clear signal about prompt engineering or model capability. You can address it directly. If responses score 3.2/5 on "formatting quality," what do you do with that? The score is real but not actionable.
We track instruction adherence as individual metrics, then aggregate into an overall compliance score. Regressions are immediately visible and traceable to specific instructions.
Implementation details: Extract the specific instructions from your prompts. Each instruction becomes an eval criterion. Start with the instructions that matter most—formatting errors that break downstream systems, citation errors that mislead users. Create binary checks (regex patterns work for citation format, parsing for table constraints). Rules that made sense initially may become outdated—review instruction sets periodically to prune ones that no longer matter. Over-rigid instruction sets can prevent the model from adapting to edge cases where breaking a rule produces a better user experience.
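Here's what such binary checks can look like in Python; the regexes below are simplified stand-ins for the actual citation format and platform rules:

```python
import re

# Simplified stand-in for the [dataset_step_id-item_id] citation format.
CITATION_RE = re.compile(r"\[[A-Za-z0-9_]+-[A-Za-z0-9_]+\]")
URL_RE = re.compile(r"https?://\S+")

def check_citation_format(response: str) -> bool:
    """Pass if at least one well-formed citation appears and no raw URLs leak through."""
    return bool(CITATION_RE.search(response)) and not URL_RE.search(response)

def check_citation_run_length(response: str, max_run: int = 3) -> bool:
    """Pass if no more than `max_run` citations appear back to back."""
    pattern = r"(?:\[[A-Za-z0-9_]+-[A-Za-z0-9_]+\]\s*)" + "{%d,}" % (max_run + 1)
    return re.search(pattern, response) is None

def check_slack_formatting(response: str) -> bool:
    """Pass if a Slack-bound response avoids markdown headers."""
    return re.search(r"^#{1,6}\s", response, flags=re.MULTILINE) is None

CHECKS = {
    "citation_format": check_citation_format,
    "citation_run_length": check_citation_run_length,
    "slack_formatting": check_slack_formatting,
}

def compliance(response: str) -> dict[str, bool]:
    """Run every instruction check; the pass rate is the fraction of True values."""
    return {name: check(response) for name, check in CHECKS.items()}
```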
Example (instruction adherence tracking), correct formatting: "To reset your password, go to Settings > Security [1]. Click the reset option [1]."
Combining the Three Approaches
These three evaluation approaches measure different things:
- Document relevancy: RAG system component quality (retrieval, reranking)
- User followups: User satisfaction proxy
- Instruction adherence: System reliability proxy
A response can score well on one dimension but poorly on others, and each combination tells you something different:
| Relevancy | Followups | Instructions | What It Means |
|---|---|---|---|
| High | Low | High | Retrieval good, but response quality poor—model got relevant docs but didn't answer well |
| Low | High | High | Retrieval poor but LLM compensates—model is working around bad retrieval |
| High | High | Low | Helpful but non-compliant: right docs, satisfied user, but formatting breaks downstream requirements |
| High | High | High | All systems working—ideal state |
| Low | Low | Low | Multiple failures—retrieval broken, response unhelpful, formatting wrong |
Practical implementation: Run all three evaluations. Log at the message level, aggregate at the conversation level. Message-level scores reveal specific failures. Conversation-level scores reveal overall quality. Build dashboards that show trends over time—are relevancy scores improving? Are followup scores declining? Are instruction adherence rates stable?
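As a sketch of the logging shape this implies (field names are illustrative, not a prescribed schema):

```python
from dataclasses import dataclass
from statistics import mean

@dataclass
class MessageEval:
    conversation_id: str
    relevancy_ndcg: float | None      # None when no documents were retrieved
    followup_score: float | None      # None when there was no follow-up to judge
    instruction_pass_rate: float      # fraction of binary checks passed

def conversation_summary(evals: list[MessageEval]) -> dict[str, float | None]:
    """Aggregate message-level scores into one conversation-level record."""
    def avg(values):
        present = [v for v in values if v is not None]
        return mean(present) if present else None

    return {
        "relevancy": avg(e.relevancy_ndcg for e in evals),
        "followups": avg(e.followup_score for e in evals),
        "instructions": avg(e.instruction_pass_rate for e in evals),
    }
```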
This multi-dimensional view gives you actionable signals. If relevancy is high but followups are low, focus on response generation quality. If relevancy is low, focus on retrieval/reranking. If instructions are failing, focus on prompt engineering.
Cite this post
Cole Hoffer. (Feb 2026). Evaluating LLM Chat Agents with Real World Signals. Cole Hoffer. https://www.colehoffer.ai/articles/evaluating-chat-agents