
Mining Reranker Training Data from RAG Citations
Using citation behavior in production RAG systems to generate labeled training data for domain-specific reranking models.

The quality of your RAG system depends on which documents make it into the LLM's context window. Get the wrong documents, and the model hallucinates or gives generic answers. Get the right ones, and it produces accurate, domain-specific responses.
Document ranking is the bottleneck. Your retrieval system might return 1K candidates, but you can only send 50 to the LLM. Which ones? That decision (reranking) determines whether your RAG system works or fails.
Training a domain-specific reranker requires labeled data: (query, document) pairs with relevance scores. You have two options, and both have problems.
Manual annotation is expensive and inconsistent. Annotators need to understand your full corpus context to make relative judgments. They can't just label documents as "relevant" or "not relevant" in isolation. Relevance is relative, not absolute. Two annotators will rank the same documents differently. And it doesn't scale: annotating 100K+ query-document pairs is prohibitively expensive.
Off-the-shelf rerankers plateau on your domain. Pretrained rerankers like Cohere Rerank or bge-reranker-large are trained on web search data. They learn what "relevance" means for Wikipedia articles and Q&A forums, not your customer support docs or internal documentation. They hit a ceiling: good enough to beat embedding similarity, but not good enough for production.
You need domain-specific training, but that brings you back to the annotation problem.
But your LLM is already making relevance judgments. It just doesn't call it that.
When an LLM in a production RAG system generates a response with citations, it sees K candidate documents, attempts to use them to answer a real user question, and implicitly judges which documents are actually useful. It tells you via citations: "I used these documents successfully."
This is better than human annotation because the LLM is making judgments in the context of a real task. It's not an isolated relevance judgment. It's a practical utility test. The model either successfully used the document or it didn't.
This happens automatically for every production query. No annotation cost, no inter-annotator variance, scales infinitely.
Cited documents are positives. Non-cited documents that were retrieved but ignored are hard negatives. The model saw them and rejected them.
Here's what this looks like in practice:
[Interactive demo: an example RAG interaction for the query "How do I reset my password?", showing the top 5 retrieved documents before reranking and the training labels generated from the response's citations.]
The LLM made relevance judgments automatically. It saw 5 documents, used 2 to answer the question, and told you which ones via citations. This creates labeled training data for free.
Measuring Reranker Quality
The demo above calculates NDCG (Normalized Discounted Cumulative Gain), the standard metric for ranking quality. Here's what to track:
| Metric | What It Measures |
|---|---|
| NDCG@K | Ranking quality with position discounting. Are relevant docs near the top? |
| MAP | Mean average precision. Rewards getting all relevant docs, not just the first. |
| Recall@K | Coverage. What fraction of relevant docs appear in the top K? |
NDCG@10 is typically the headline metric. It captures whether your reranker is surfacing relevant documents in the positions that actually get used.
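For concreteness, here is a minimal sketch of NDCG@K computed over citation-derived labels. The function name and the binary-label convention are illustrative, not from any particular library:

```python
import math

def ndcg_at_k(ranked_labels, k=10):
    """NDCG@K for one query.

    ranked_labels: relevance labels in the order your reranker produced them
    (1 = cited, 0 = not cited, or graded scores in [0, 1]).
    """
    dcg = sum(rel / math.log2(i + 2) for i, rel in enumerate(ranked_labels[:k]))
    ideal = sorted(ranked_labels, reverse=True)[:k]
    idcg = sum(rel / math.log2(i + 2) for i, rel in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0

# The reranker placed the two cited documents at positions 1 and 4 of 5
print(ndcg_at_k([1, 0, 0, 1, 0], k=5))  # ~0.88
```

Average this over a held-out set of queries to get your headline number.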
Should You Build a Custom Reranker?
Before investing in custom training, ask whether it's worth it for your situation.
When Off-the-Shelf Is Enough
Start by benchmarking your current options:
- Embedding similarity only (your current retrieval)
- Off-the-shelf rerankers (Cohere Rerank, Jina Reranker, bge-reranker-large)
- LLM-based reranking (expensive but sets an upper bound on quality)
If bge-reranker-large already gets you 90% of the way to LLM-quality reranking on your domain, the ROI on custom training may not be there.
When Custom Wins
Domain-specific rerankers tend to outperform when:
- Your corpus has specialized vocabulary (legal, medical, technical docs)
- Your queries have patterns that differ from web search
- Commercial rerankers are a significant cost line item at your scale
- You're processing 1M+ queries/month
Data Requirements
| Dataset Size | Expected Outcome |
|---|---|
| 10K pairs | Enough to see if the approach works; won't beat strong baselines |
| 50-100K pairs | Competitive with off-the-shelf rerankers on your domain |
| 500K+ pairs | Potential to meaningfully outperform commercial solutions |
The Intercom team reported training on 400K queries × 40 candidates (16M pairs) to beat Cohere Rerank-v3.5. That's achievable if you have production traffic generating citations continuously.
The ROI Calculation
The case for building:
- Cost: At scale, commercial reranker APIs add up. Self-hosted can reduce costs by 70-90%.
- Latency: You control the hardware and can optimize for your P99 requirements.
- Domain adaptation: Your model learns your corpus's vocabulary and users' query patterns.
The case for buying:
- Time to value: Commercial rerankers work out of the box. Custom training is a multi-month investment.
- Maintenance: Models drift, need retraining, require monitoring.
- Opportunity cost: Is reranking the highest-leverage improvement right now?
Rule of thumb: If you're processing < 100K queries/month, commercial rerankers are probably the right call. Above 1M queries/month with specialized domain needs, custom training starts to pay off.
Implementation
If you've decided to build, here's how to do it.
Extracting Training Data
For each query-response interaction, extract three things:
- Query: The user's question (or a summarized version if you're condensing conversation history)
- Positive documents: All documents cited in the response
- Hard negatives: All documents that were retrieved and sent to the LLM, but not cited
The extraction logic is straightforward: iterate through all retrieved documents. If a document appears in the citation list, it's a positive (label: 1). If it was retrieved but not cited, it's a hard negative (label: 0).
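A minimal sketch of that extraction step; the `id`, `text`, and `cited_doc_ids` fields are assumptions about your logging schema, not from the post:

```python
def extract_training_pairs(query, retrieved_docs, cited_doc_ids):
    """Turn one logged RAG interaction into (query, document, label) triplets.

    retrieved_docs: every document sent to the LLM (dicts with 'id' and 'text')
    cited_doc_ids:  set of document ids the LLM actually cited in its response
    """
    pairs = []
    for doc in retrieved_docs:
        # Cited -> positive (1); retrieved but ignored -> hard negative (0)
        label = 1 if doc["id"] in cited_doc_ids else 0
        pairs.append({"query": query, "document": doc["text"], "label": label})
    return pairs
```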
Hard negatives are valuable. These aren't random documents. They're documents your retrieval system thought were relevant but the LLM disagreed with. Training on these teaches the reranker to make the same distinctions.
Graded Relevance
Binary labels (cited/not-cited) work, but you're leaving signal on the table:
- Documents cited multiple times in a response vs cited once
- Documents cited across multiple queries on the same topic
- Position in response where the citation appears (earlier often indicates higher utility)
To compute graded relevance, track citation frequency across interactions. For each document, count how many times it was cited when it appeared in retrieved results. A document cited in 8/10 interactions for similar queries is a stronger positive than one cited 2/10 times.
Aggregating across interactions lets you construct graded relevance scores instead of binary labels. This gives your reranker more nuanced training signal.
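A sketch of that aggregation, assuming repeat and near-duplicate queries have already been grouped under a normalized query key (the data shapes here are assumptions):

```python
from collections import defaultdict

def graded_relevance(interactions):
    """Aggregate citation frequency into graded relevance scores.

    interactions: iterable of (query_key, retrieved_ids, cited_ids) tuples,
    where query_key groups repeat/similar queries together.
    Returns {(query_key, doc_id): times_cited / times_shown}.
    """
    shown = defaultdict(int)
    cited = defaultdict(int)
    for query_key, retrieved_ids, cited_ids in interactions:
        for doc_id in retrieved_ids:
            shown[(query_key, doc_id)] += 1
            if doc_id in cited_ids:
                cited[(query_key, doc_id)] += 1
    return {key: cited[key] / shown[key] for key in shown}
```

A document scoring 0.8 (cited in 8 of 10 appearances) becomes a stronger positive than one scoring 0.2.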
Cleaning Noisy Labels
Raw citation extraction gets you started, but label noise will cap your model's performance. Watch for:
Position bias: LLMs over-cite documents that appear early in the context window. Mitigation: shuffle document order across requests, or use multiple samples with different orderings and look for citation consistency.
Citation style variance: Some responses cite heavily, others synthesize without attribution. Mitigation: filter to interactions where at least one citation exists; treat zero-citation responses as unlabeled rather than all-negative.
Query-document mismatch: If your initial retrieval is poor, you're labeling noise. Mitigation: use a precision-oriented retrieval cutoff, or filter to queries where the response indicates the documents were helpful.
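Two of these mitigations are cheap to implement up front. A sketch, with hypothetical field names:

```python
import random

def shuffle_context(retrieved_docs):
    """Randomize document order before prompting to reduce position bias."""
    docs = list(retrieved_docs)
    random.shuffle(docs)
    return docs

def usable_for_training(interaction):
    """Keep interactions with at least one citation; treat the rest as
    unlabeled rather than all-negative."""
    return len(interaction["cited_doc_ids"]) > 0

# clean = [ix for ix in logged_interactions if usable_for_training(ix)]
```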
Training the Reranker
After extraction, your training data consists of (query, document, label) triplets. Here's what it looks like:
| Query | Document (excerpt) | Label |
|---|---|---|
| How do I reset my password? | To reset your password, go to... | 1 |
| How do I reset my password? | Account settings allow you to... | 0 |
| How do I reset my password? | Our platform uses encryption... | 0 |
| Where can I find account settings? | Account settings allow you to... | 1 |
For binary labels, use 1 (cited) or 0 (not cited). For graded relevance, labels are continuous scores (0.0 to 1.0) based on citation frequency across interactions.
You'll have many more negatives than positives. That's expected. A typical interaction might yield 2-3 positives and 20-30 hard negatives. This imbalance is fine; the hard negatives are valuable training signal.
To train your reranker, you can either send your citation-derived data to a third-party fine-tuning service such as Cohere's rerank fine-tuning, or fine-tune an open-source cross-encoder like ms-marco-MiniLM-L-12-v2 or bge-reranker-base yourself.
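If you go the open-source route, a minimal fine-tuning sketch using the sentence-transformers CrossEncoder interface looks like this. Hyperparameters are placeholders, `training_pairs` is the output of the extraction step above, and newer sentence-transformers releases also offer a Trainer-based API:

```python
from torch.utils.data import DataLoader
from sentence_transformers import InputExample
from sentence_transformers.cross_encoder import CrossEncoder

# Citation-derived triplets: (query, document, label); labels may be binary or graded
train_samples = [
    InputExample(texts=[p["query"], p["document"]], label=float(p["label"]))
    for p in training_pairs
]

# num_labels=1 -> the model outputs a single relevance score per (query, document) pair
model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-12-v2", num_labels=1)
train_dataloader = DataLoader(train_samples, shuffle=True, batch_size=16)

model.fit(train_dataloader=train_dataloader, epochs=1, warmup_steps=100)
model.save("domain-reranker")
```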
Important: This setup enables continuous fine-tuning. Even after your reranker is deployed, you keep gathering citation data from production traffic. You can periodically retrain on the accumulated data, continuously improving the model over time without manual annotation.
The Flywheel
Once deployed, your reranker improves the data pipeline that trains the next version:
- Better reranking → more relevant documents in context
- More relevant context → higher quality responses
- Higher quality responses → cleaner citation patterns
- Cleaner citations → better training labels
- Better labels → better next reranker
This compounds. The first iteration might match your baseline. The third iteration, trained on cleaner data from a better system, can meaningfully outperform it.
Monitor for regression. If response quality drops (measured via user feedback, resolution rate, or LLM-as-judge), your reranker may be confidently wrong on some query class. Keep human-in-the-loop evaluation for edge cases.