
Relevancy Filtering in Retrieval Pipelines
Using LLM-based classification as a second pass to filter retrieval candidates when similarity thresholds fail to generalize.

Part 1 covered filter extraction. Part 2 covered HyDE for vocabulary bridging. Even with both, retrieval systems face a fundamental problem: where do you draw the line between relevant and not relevant?
The standard answer is a similarity threshold: documents above, say, 0.75 cosine similarity are returned; anything below is filtered out. This works until you have multiple corpora or customer datasets with different characteristics.
Two Different Use Cases
Before diving into solutions, it's worth distinguishing between two retrieval patterns:
General top-k semantic filtering — Documents are retrieved for LLM synthesis, where the model can filter out irrelevant content during response generation. In this case, some bad results are acceptable: a top-k that's slightly too high or too low doesn't matter much, because the LLM acts as a final filter when synthesizing the response.
Query-to-dataset mapping — The goal is mapping a user query to a usable, complete dataset where every document must be relevant. This is common in customer support systems, knowledge bases, or any scenario where users need to review or act on the retrieved documents directly. Here, precision matters more than recall, and you need confidence that every returned document is actually relevant.
The strategy we'll cover focuses on the second use case: ensuring high precision when documents are shown directly to users.
Thresholds Don't Generalize
We've seen the same threshold produce high-precision results for one customer and noisy results for another. The issue is that embedding similarity scores reflect relative position in vector space, not absolute relevance.
What affects score distributions: document length, vocabulary diversity, embedding model quirks, topic specificity. A threshold tuned for one distribution doesn't transfer.
You can tune per-corpus, but that doesn't scale. Even within a single corpus, optimal thresholds vary by query type.
Candidate Generation, Not Final Results
A reframe that's worked for us: treat initial retrieval as candidate generation. Cast a wide net with a permissive threshold. Then evaluate candidates for true relevance in a second pass.
The second pass is binary classification: "Is this document relevant to this query?" An LLM handles this well—it's applying semantic understanding that embedding similarity can't capture.
A document about "checkout failures" gets recognized as relevant to "payment problems" even if the embedding similarity is marginal. A document about "API response times" gets filtered out of a query about "customer response times" even if the score is high.
Interactive Demo: Relevancy Filtering with Early Stopping
The demo below shows how relevancy filtering works with early stopping based on rolling average relevance. Try different queries to see how the system processes candidates and stops early when relevance drops.
Why This Is Feasible Now
A few years ago, this approach would have been impractical. Running hundreds of LLM calls per query? The cost and latency would be prohibitive.
The economics have shifted. OpenAI's rate limits now support 10,000+ requests per minute on standard tiers. Smaller, faster models handle binary classification effectively—you don't need GPT-4 for a yes/no relevance decision. Per-token costs have dropped to the point where classifying 100 candidates adds cents, not dollars, to a query.
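A quick back-of-the-envelope check makes the point, using illustrative numbers rather than quoted pricing: assume roughly 500 tokens per classification (query, document, and instructions) and a small model priced around $0.15 per million input tokens.

```python
# Illustrative cost estimate; token counts and prices are assumptions, not quoted rates.
candidates = 100
tokens_per_candidate = 500              # query + document + instructions
price_per_million_input = 0.15          # USD, assumed small-model input price

total_tokens = candidates * tokens_per_candidate             # 50,000 tokens
cost = total_tokens / 1_000_000 * price_per_million_input    # 0.0075 USD

print(f"~${cost:.4f} per query for {candidates} classifications")
```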
The parallel nature of these classifications means you're not waiting for sequential API calls. You can process dozens or hundreds of candidates concurrently, limited only by rate limits. Combined with early stopping, actual compute per query is often much lower than the candidate count suggests.
This changes what's architecturally viable. Techniques that were theoretically sound but practically impossible are now just... practical.
The Strategy: Wide Search + Parallel Classification
The approach combines structured pre-filtering and HyDE with a wide initial search, then runs a highly parallelizable set of binary relevancy checks using small, fast LLMs.
Here's how it works:
- Structured pre-filtering — Extract and apply structured filters (dates, categories, metadata) to reduce the corpus
- HyDE — Generate a hypothetical document to bridge vocabulary gaps between query and corpus
- Wide semantic search — Cast a wide net with a permissive similarity threshold (e.g., top 200-500 candidates)
- Parallel binary classification — Run hundreds of small, cheap "does document X match query Y?" checks in parallel
- Early stopping with rolling average — Track a rolling average of relevance (e.g., last 10 classifications). When the rolling average drops below a threshold (e.g., 30%), stop processing since lower-ranked documents are unlikely to be relevant
This dynamically adjusts the result set size query-to-query while maintaining high precision. Queries with many relevant documents process more candidates. Queries with few relevant matches stop early, saving compute.
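Put together, the strategy looks roughly like the sketch below. The filter extraction, HyDE, and search components are passed in as callables because their implementations were covered in Parts 1 and 2; the names, defaults, and thresholds here are placeholders, not a real API.

```python
from collections import deque
from typing import Callable

def retrieve_relevant(
    query: str,
    *,
    extract_filters: Callable[[str], dict],              # Part 1: structured filter extraction
    generate_hyde: Callable[[str], str],                 # Part 2: hypothetical document generation
    wide_search: Callable[[str, dict, int], list[str]],  # permissive vector search, ranked best-first
    classify: Callable[[str, str], bool],                # binary relevance check (e.g. is_relevant above)
    top_k: int = 300,
    window: int = 10,
    stop_below: float = 0.3,
) -> list[str]:
    """Wide candidate generation followed by per-document relevance checks
    with rolling-average early stopping."""
    filters = extract_filters(query)                      # 1. structured pre-filtering
    hyde_doc = generate_hyde(query)                       # 2. vocabulary bridging
    candidates = wide_search(hyde_doc, filters, top_k)    # 3. wide semantic search

    relevant: list[str] = []
    recent: deque = deque(maxlen=window)                  # rolling window of recent verdicts

    for doc in candidates:                                # 4. binary classification, in rank order
        verdict = classify(query, doc)
        recent.append(verdict)
        if verdict:
            relevant.append(doc)
        # 5. early stopping: once the window is full and mostly irrelevant, stop
        if len(recent) == window and sum(recent) / window < stop_below:
            break

    return relevant
```

The sequential loop is written that way for clarity; the next section looks at running the classifications in parallel chunks without giving up early stopping.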
Making It Efficient
Running classification on every candidate still requires thought. A few patterns:
Parallel processing — These binary classifications are highly parallelizable. You can run hundreds of small LLM calls concurrently, limited only by rate limits, not sequential processing.
Early stopping with rolling average — Track the relevance rate of the last N classifications (e.g., 10). When this rolling average drops below a threshold (e.g., 30%), stop processing. The logic: if the last 10 documents were mostly irrelevant, the remaining lower-ranked documents are unlikely to be relevant either.
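Parallelism and early stopping pull against each other: firing off every classification at once wastes the early exit, while a strictly sequential loop wastes the rate limit. One middle ground is to classify ranked candidates in small concurrent chunks and check the rolling average between chunks. A rough sketch with the async OpenAI client follows; the chunk size, window, and threshold are illustrative.

```python
import asyncio
from collections import deque

from openai import AsyncOpenAI

client = AsyncOpenAI()  # assumes OPENAI_API_KEY is set in the environment

async def classify_async(query: str, document: str) -> bool:
    """Async variant of the yes/no relevance check."""
    response = await client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative small model
        temperature=0,
        messages=[
            {"role": "system", "content": "Answer with exactly one word: yes or no."},
            {"role": "user", "content": f"Query: {query}\n\nDocument:\n{document}\n\nRelevant?"},
        ],
    )
    return (response.choices[0].message.content or "").strip().lower().startswith("yes")

async def filter_candidates(
    query: str,
    candidates: list[str],   # ranked best-first by similarity score
    chunk_size: int = 20,
    window: int = 10,
    stop_below: float = 0.3,
) -> list[str]:
    relevant: list[str] = []
    recent: deque = deque(maxlen=window)

    for start in range(0, len(candidates), chunk_size):
        chunk = candidates[start:start + chunk_size]
        # Classify the whole chunk concurrently; concurrency is bounded by chunk_size.
        verdicts = await asyncio.gather(*(classify_async(query, doc) for doc in chunk))
        for doc, verdict in zip(chunk, verdicts):
            recent.append(verdict)
            if verdict:
                relevant.append(doc)
        # Stop between chunks if recent candidates are mostly irrelevant.
        if len(recent) == window and sum(recent) / window < stop_below:
            break

    return relevant
```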
Batching — Process multiple candidates in a single LLM call when possible. Reduces round-trip overhead significantly.
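Here's one hedged way to batch, assuming the model will follow a "return the numbers of the relevant documents" instruction; parsing has to be defensive, since nothing guarantees clean output.

```python
import re

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def classify_batch(query: str, documents: list[str], model: str = "gpt-4o-mini") -> list[bool]:
    """Judge several candidates in one call; returns one verdict per input document."""
    numbered = "\n\n".join(f"[{i + 1}] {doc}" for i, doc in enumerate(documents))
    response = client.chat.completions.create(
        model=model,
        temperature=0,
        messages=[
            {
                "role": "system",
                "content": (
                    "For each numbered document, decide whether it is relevant to the query. "
                    "Reply with only the numbers of the relevant documents, comma-separated, "
                    "or the word 'none'."
                ),
            },
            {"role": "user", "content": f"Query: {query}\n\nDocuments:\n{numbered}"},
        ],
    )
    text = response.choices[0].message.content or ""
    chosen = {int(n) for n in re.findall(r"\d+", text)}
    return [(i + 1) in chosen for i in range(len(documents))]
```

Smaller batches (roughly 5 to 10 documents) tend to keep the per-document judgment crisp; larger ones save round trips but make mistakes harder to attribute.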
Async processing with streaming — For user-facing applications, start returning results as soon as relevant candidates are identified. Users see partial results while classification continues in the background.
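An async generator is one way to implement that, building on the `classify_async` sketch above: yield each confirmed document as its chunk finishes so the caller can render partial results immediately.

```python
import asyncio
from typing import AsyncIterator

async def stream_relevant(
    query: str,
    candidates: list[str],   # ranked best-first
    chunk_size: int = 20,
) -> AsyncIterator[str]:
    """Yield relevant documents as each concurrent chunk of classifications finishes."""
    for start in range(0, len(candidates), chunk_size):
        chunk = candidates[start:start + chunk_size]
        verdicts = await asyncio.gather(*(classify_async(query, doc) for doc in chunk))
        for doc, verdict in zip(chunk, verdicts):
            if verdict:
                yield doc  # the caller can render this immediately

# Usage (render is a placeholder for whatever the UI layer does):
#   async for doc in stream_relevant(query, candidates):
#       render(doc)
```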
The combination of parallel processing and early stopping means actual LLM calls per query are often 30-50% of the candidate count, not 100%.
The Full Pipeline
The three techniques in this series layer together:
| Layer | What It Handles |
|---|---|
| Filter Extraction | Structured constraints that shouldn't go to semantic search |
| HyDE | Vocabulary mismatch between query and corpus |
| Relevancy Filtering | Score threshold calibration across different data distributions |
The flow: extract structured filters → apply filters → generate HyDE document → wide semantic search (top 200-500) → parallel binary classification with early stopping → final relevant results.
Not every query needs every layer. Simple queries with matching vocabulary skip HyDE. Queries without structural constraints skip filter extraction. High-confidence initial results might skip relevancy filtering.
The pipeline adapts based on query characteristics and result confidence. The goal is applying compute where it improves outcomes, not uniformly on every request.
For the query-to-dataset mapping use case, relevancy filtering is especially valuable. It ensures that every document shown to users is actually relevant, not just semantically similar. The early stopping mechanism means the system dynamically adjusts result set size based on how many relevant documents exist, rather than using a fixed top-k that might be too high or too low.
What We're Watching
Rate limits keep rising and per-token costs keep falling. Models optimized for classification tasks keep getting faster. The threshold for "worth doing" keeps dropping.
We expect the pattern of using LLMs for post-retrieval refinement to become standard as the economics make it obvious. The interesting questions shift from "can we afford this?" to "what else can we do with cheap inference?"
Cite this post
Cole Hoffer. (Jan 2026). Relevancy Filtering in Retrieval Pipelines. Cole Hoffer. https://www.colehoffer.ai/articles/advanced-rag-relevancy-filtering