
Mining Reranker Training Data from RAG Citations
Using citation behavior in production RAG systems to generate labeled training data for domain-specific reranking models.

The quality of your RAG system depends on which documents make it into the LLM's context window. Get the wrong documents, and the model hallucinates or gives generic answers. Get the right ones, and it produces accurate, domain-specific responses.
Document ranking is the bottleneck. Your retrieval system might return 1K candidates, but you can only send 50 to the LLM. Which ones? That decision (reranking) determines whether your RAG system works or fails.
Training a domain-specific reranker requires labeled data: (query, document) pairs with relevance scores. You have two options, and both have problems.
Manual annotation is expensive and inconsistent. Annotators need to understand your full corpus context to make relative judgments. They can't just label documents as "relevant" or "not relevant" in isolation. Relevance is relative, not absolute. Two annotators will rank the same documents differently. And it doesn't scale: annotating 100K+ query-document pairs is prohibitively expensive.
Off-the-shelf rerankers plateau on your domain. Pretrained rerankers like Cohere Rerank or bge-reranker-large are trained on web search data. They learn what "relevance" means for Wikipedia articles and Q&A forums, not your customer support docs or internal documentation. They hit a ceiling: good enough to beat embedding similarity, but not good enough for production.
You need domain-specific training, but that brings you back to the annotation problem.
But your LLM is already making relevance judgments. It just doesn't call it that.
When an LLM in a production RAG system generates a response with citations, it sees K candidate documents, attempts to use them to answer a real user question, and implicitly judges which documents are actually useful. It tells you via citations: "I used these documents successfully."
This is better than human annotation because the LLM is making judgments in the context of a real task. It's not an isolated relevance judgment. It's a practical utility test. The model either successfully used the document or it didn't.
This happens automatically for every production query. No annotation cost, no inter-annotator variance, scales infinitely.
Cited documents are positives. Non-cited documents that were retrieved but ignored are hard negatives. The model saw them and rejected them.
Here's what this looks like in practice:
[Interactive demo: an example RAG interaction for the query "How do I reset my password?", showing the top 5 retrieved documents before reranking and the training labels generated from the response's citations.]
The LLM made relevance judgments automatically. It saw 5 documents, used 2 to answer the question, and told you which ones via citations. This creates labeled training data for free.
Measuring Reranker Quality
The demo above calculates NDCG (Normalized Discounted Cumulative Gain), the standard metric for ranking quality. Here's what to track:
| Metric | What It Measures |
|---|---|
| NDCG@K | Ranking quality with position discounting. Are relevant docs near the top? |
| MAP | Mean average precision. Rewards getting all relevant docs, not just the first. |
| Recall@K | Coverage. What fraction of relevant docs appear in the top K? |
NDCG@10 is typically the headline metric. It captures whether your reranker is surfacing relevant documents in the positions that actually get used.
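For concreteness, here is a minimal sketch of NDCG@K computed over citation-derived labels. The function name and the binary-label convention are illustrative, not from any particular library:

```python
import math

def ndcg_at_k(ranked_labels, k=10):
    """NDCG@K for one query.

    ranked_labels: relevance labels in the order your reranker produced them
    (1 = cited, 0 = not cited, or graded scores in [0, 1]).
    """
    dcg = sum(rel / math.log2(i + 2) for i, rel in enumerate(ranked_labels[:k]))
    ideal = sorted(ranked_labels, reverse=True)[:k]
    idcg = sum(rel / math.log2(i + 2) for i, rel in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0

# The reranker placed the two cited documents at positions 1 and 4 of 5
print(ndcg_at_k([1, 0, 0, 1, 0], k=5))  # ~0.88
```

Average this over a held-out set of queries to get your headline number.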
Should You Build a Custom Reranker?
Before investing in custom training, ask whether it's worth it for your situation.
When Off-the-Shelf Is Enough
Start by benchmarking your current options:
- Embedding similarity only (your current retrieval)
- Off-the-shelf rerankers (Cohere Rerank, Jina Reranker, bge-reranker-large)
- LLM-based reranking (expensive but sets an upper bound on quality)
If bge-reranker-large already gets you 90% of the way to LLM-quality reranking on your domain, the ROI on custom training may not be there.
When Custom Wins
Domain-specific rerankers tend to outperform when:
- Your corpus has specialized vocabulary (legal, medical, technical docs)
- Your queries have patterns that differ from web search
- Commercial rerankers are a significant cost line item at your scale
- You're processing 1M+ queries/month
Data Requirements
| Dataset Size | Expected Outcome |
|---|---|
| 10K pairs | Enough to see if the approach works; won't beat strong baselines |
| 50-100K pairs | Competitive with off-the-shelf rerankers on your domain |
| 500K+ pairs | Potential to meaningfully outperform commercial solutions |
The Intercom team reported training on 400K queries × 40 candidates (16M pairs) to beat Cohere Rerank-v3.5. That's achievable if you have production traffic generating citations continuously.
The ROI Calculation
The case for building:
- Cost: At scale, commercial reranker APIs add up. Self-hosted can reduce costs by 70-90%.
- Latency: You control the hardware and can optimize for your P99 requirements.
- Domain adaptation: Your model learns your corpus's vocabulary and users' query patterns.
The case for buying:
- Time to value: Commercial rerankers work out of the box. Custom training is a multi-month investment.
- Maintenance: Models drift, need retraining, require monitoring.
- Opportunity cost: Is reranking the highest-leverage improvement right now?
Rule of thumb: If you're processing < 100K queries/month, commercial rerankers are probably the right call. Above 1M queries/month with specialized domain needs, custom training starts to pay off.
Implementation
If you've decided to build, here's how to do it.
Extracting Training Data
For each query-response interaction, extract three things:
- Query: The user's question (or a summarized version if you're condensing conversation history)
- Positive documents: All documents cited in the response
- Hard negatives: All documents that were retrieved and sent to the LLM, but not cited
The extraction logic is straightforward: iterate through all retrieved documents. If a document appears in the citation list, it's a positive (label: 1). If it was retrieved but not cited, it's a hard negative (label: 0).
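A minimal sketch of that extraction step; the `id`, `text`, and `cited_doc_ids` fields are assumptions about your logging schema, not from the post:

```python
def extract_training_pairs(query, retrieved_docs, cited_doc_ids):
    """Turn one logged RAG interaction into (query, document, label) triplets.

    retrieved_docs: every document sent to the LLM (dicts with 'id' and 'text')
    cited_doc_ids:  set of document ids the LLM actually cited in its response
    """
    pairs = []
    for doc in retrieved_docs:
        # Cited -> positive (1); retrieved but ignored -> hard negative (0)
        label = 1 if doc["id"] in cited_doc_ids else 0
        pairs.append({"query": query, "document": doc["text"], "label": label})
    return pairs
```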
Hard negatives are valuable. These aren't random documents. They're documents your retrieval system thought were relevant but the LLM disagreed with. Training on these teaches the reranker to make the same distinctions.
Graded Relevance
Binary labels (cited/not-cited) work, but you're leaving signal on the table:
- Documents cited multiple times in a response vs cited once
- Documents cited across multiple queries on the same topic
- Position in response where the citation appears (earlier often indicates higher utility)
To compute graded relevance, track citation frequency across interactions. For each document, count how many times it was cited when it appeared in retrieved results. A document cited in 8/10 interactions for similar queries is a stronger positive than one cited 2/10 times.
Aggregating across interactions lets you construct graded relevance scores instead of binary labels. This gives your reranker more nuanced training signal.
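A sketch of that aggregation, assuming repeat and near-duplicate queries have already been grouped under a normalized query key (the data shapes here are assumptions):

```python
from collections import defaultdict

def graded_relevance(interactions):
    """Aggregate citation frequency into graded relevance scores.

    interactions: iterable of (query_key, retrieved_ids, cited_ids) tuples,
    where query_key groups repeat/similar queries together.
    Returns {(query_key, doc_id): times_cited / times_shown}.
    """
    shown = defaultdict(int)
    cited = defaultdict(int)
    for query_key, retrieved_ids, cited_ids in interactions:
        for doc_id in retrieved_ids:
            shown[(query_key, doc_id)] += 1
            if doc_id in cited_ids:
                cited[(query_key, doc_id)] += 1
    return {key: cited[key] / shown[key] for key in shown}
```

A document scoring 0.8 (cited in 8 of 10 appearances) becomes a stronger positive than one scoring 0.2.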
Cleaning Noisy Labels
Raw citation extraction gets you started, but label noise will cap your model's performance. Watch for:
Position bias: LLMs over-cite documents that appear early in the context window. Mitigation: shuffle document order across requests, or use multiple samples with different orderings and look for citation consistency.
Citation style variance: Some responses cite heavily, others synthesize without attribution. Mitigation: filter to interactions where at least one citation exists; treat zero-citation responses as unlabeled rather than all-negative.
Query-document mismatch: If your initial retrieval is poor, you're labeling noise. Mitigation: use a precision-oriented retrieval cutoff, or filter to queries where the response indicates the documents were helpful.
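Two of these mitigations are cheap to implement up front. A sketch, with hypothetical field names:

```python
import random

def shuffle_context(retrieved_docs):
    """Randomize document order before prompting to reduce position bias."""
    docs = list(retrieved_docs)
    random.shuffle(docs)
    return docs

def usable_for_training(interaction):
    """Keep interactions with at least one citation; treat the rest as
    unlabeled rather than all-negative."""
    return len(interaction["cited_doc_ids"]) > 0

# clean = [ix for ix in logged_interactions if usable_for_training(ix)]
```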
Training the Reranker
After extraction, your training data consists of (query, document, label) triplets. Here's what it looks like:
| Query | Document (excerpt) | Label |
|---|---|---|
| How do I reset my password? | To reset your password, go to... | 1 |
| How do I reset my password? | Account settings allow you to... | 0 |
| How do I reset my password? | Our platform uses encryption... | 0 |
| Where can I find account settings? | Account settings allow you to... | 1 |
For binary labels, use 1 (cited) or 0 (not cited). For graded relevance, labels are continuous scores (0.0 to 1.0) based on citation frequency across interactions.
You'll have many more negatives than positives. That's expected. A typical interaction might yield 2-3 positives and 20-30 hard negatives. This imbalance is fine; the hard negatives are valuable training signal.
To train your reranker, you can either send your citation-derived data to a third-party fine-tuning service such as Cohere's rerank fine-tuning, or fine-tune an open-source cross-encoder like ms-marco-MiniLM-L-12-v2 or bge-reranker-base yourself.
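If you go the open-source route, a minimal fine-tuning sketch using the sentence-transformers CrossEncoder interface looks like this. Hyperparameters are placeholders, `training_pairs` is the output of the extraction step above, and newer sentence-transformers releases also offer a Trainer-based API:

```python
from torch.utils.data import DataLoader
from sentence_transformers import InputExample
from sentence_transformers.cross_encoder import CrossEncoder

# Citation-derived triplets: (query, document, label); labels may be binary or graded
train_samples = [
    InputExample(texts=[p["query"], p["document"]], label=float(p["label"]))
    for p in training_pairs
]

# num_labels=1 -> the model outputs a single relevance score per (query, document) pair
model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-12-v2", num_labels=1)
train_dataloader = DataLoader(train_samples, shuffle=True, batch_size=16)

model.fit(train_dataloader=train_dataloader, epochs=1, warmup_steps=100)
model.save("domain-reranker")
```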
Important: This setup enables continuous fine-tuning. Even after your reranker is deployed, you keep gathering citation data from production traffic. You can periodically retrain on the accumulated data, continuously improving the model over time without manual annotation.
The Flywheel
Once deployed, your reranker improves the data pipeline that trains the next version:
- Better reranking → more relevant documents in context
- More relevant context → higher quality responses
- Higher quality responses → cleaner citation patterns
- Cleaner citations → better training labels
- Better labels → better next reranker
This compounds. The first iteration might match your baseline. The third iteration, trained on cleaner data from a better system, can meaningfully outperform it.
Monitor for regression. If response quality drops (measured via user feedback, resolution rate, or LLM-as-judge), your reranker may be confidently wrong on some query class. Keep human-in-the-loop evaluation for edge cases.