Advanced RAG

Relevancy Filtering in Retrieval Pipelines

Using LLM-based classification as a second pass to filter retrieval candidates when similarity thresholds fail to generalize.

Cole Hoffer

Part 1 covered filter extraction. Part 2 covered HyDE for vocabulary bridging. Even with both, retrieval systems face a fundamental problem: where do you draw the line between relevant and not relevant?

The standard answer is a similarity threshold: documents above 0.75 cosine similarity are returned, anything below is filtered out. This works until you have multiple corpora or customer datasets with different characteristics.

Two Different Use Cases

Before diving into solutions, it's worth distinguishing between two retrieval patterns:

General top-k semantic filtering — Documents are retrieved for LLM synthesis, where the model can filter out irrelevant content during response generation. In this case, some bad results are acceptable. A top-k that's slightly too high or too low doesn't matter much, because the LLM acts as a final filter when synthesizing the response.

Query-to-dataset mapping — The goal is mapping a user query to a usable, complete dataset where every document must be relevant. This is common in customer support systems, knowledge bases, or any scenario where users need to review or act on the retrieved documents directly. Here, precision matters more than recall, and you need confidence that every returned document is actually relevant.

The strategy we'll cover focuses on the second use case: ensuring high precision when documents are shown directly to users.

Thresholds Don't Generalize

We've seen the same threshold produce high-precision results for one customer and noisy results for another. The issue is that embedding similarity scores reflect relative position in vector space, not absolute relevance.

What affects score distributions: document length, vocabulary diversity, embedding model quirks, topic specificity. A threshold tuned for one distribution doesn't transfer.

You can tune per-corpus, but that doesn't scale. Even within a single corpus, optimal thresholds vary by query type.

Candidate Generation, Not Final Results

A reframe that's worked for us: treat initial retrieval as candidate generation. Cast a wide net with a permissive threshold. Then evaluate candidates for true relevance in a second pass.

The second pass is binary classification: "Is this document relevant to this query?" An LLM handles this well—it's applying semantic understanding that embedding similarity can't capture.

A document about "checkout failures" gets recognized as relevant to "payment problems" even if the embedding similarity is marginal. A document about "API response times" gets filtered out of a query about "customer response times" even if the score is high.
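A minimal sketch of that second-pass check, assuming the OpenAI Python SDK; the model name and prompt wording are illustrative, and any small instruction-following model should work:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def is_relevant(query: str, document: str, model: str = "gpt-4o-mini") -> bool:
    """Second-pass check: is this document actually relevant to this query?"""
    response = client.chat.completions.create(
        model=model,  # a small, fast model is enough for a yes/no decision
        messages=[
            {
                "role": "system",
                "content": "You judge whether a document is relevant to a search "
                           "query. Answer with exactly one word: YES or NO.",
            },
            {
                "role": "user",
                "content": f"Query: {query}\n\nDocument: {document}",
            },
        ],
        max_tokens=1,
        temperature=0,
    )
    return response.choices[0].message.content.strip().upper().startswith("YES")
```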

Worked Example: Relevancy Filtering with Early Stopping

The example below traces relevancy filtering with early stopping over a single ranked candidate list: each candidate is classified in turn, a rolling average of recent relevance is tracked, and processing stops once that average drops too low.

Summary of the run: 15 / 15 candidates processed, 7 classified relevant, 46.7% precision, final rolling average 40.0%.

| # | Similarity score | Rolling avg | Document |
| --- | --- | --- | --- |
| 1 | 0.89 | n/a | Checkout process fails when using credit card payment method |
| 2 | 0.85 | 100% | Payment gateway error occurs every time at final step |
| 3 | 0.82 | 100% | Transaction fails repeatedly during payment processing |
| 4 | 0.79 | 100% | User unable to reset password via email link |
| 5 | 0.76 | 75% | Cannot complete purchase, payment button does nothing |
| 6 | 0.73 | 80% | Checkout crashes when entering credit card details |
| 7 | 0.70 | 83% | Payment processing API returns timeout errors |
| 8 | 0.67 | 86% | Login page crashes on mobile browsers |
| 9 | 0.64 | 75% | API authentication token expiration not working correctly |
| 10 | 0.61 | 67% | Database connection timeout errors |
| 11 | 0.58 | 60% | Credit card validation fails during checkout flow |
| 12 | 0.55 | 60% | App is extremely slow, takes forever to load |
| 13 | 0.52 | 50% | Performance is terrible, constant lag and freezing |
| 14 | 0.49 | 40% | Interface is confusing, hard to find settings |
| 15 | 0.46 | 40% | Would love to see dark mode option added |
How it works: documents are ranked by embedding similarity (the scores shown above), but that ranking is imperfect: notice how some irrelevant documents appear early (high scores, wrong topic) and some relevant documents appear later. Each document is classified for relevance in parallel, and a rolling average of the last 10 classifications is tracked. When the rolling average drops below 30%, classification stops early, since lower-ranked documents are unlikely to be relevant. This filters out false positives from the similarity ranking and dynamically adjusts the result set size while maintaining high precision.

Why This Is Feasible Now

A few years ago, this approach would have been impractical. Running hundreds of LLM calls per query? The cost and latency would be prohibitive.

The economics have shifted. OpenAI's rate limits now support 10,000+ requests per minute on standard tiers. Smaller, faster models handle binary classification effectively—you don't need GPT-4 for a yes/no relevance decision. Per-token costs have dropped to the point where classifying 100 candidates adds cents, not dollars, to a query.
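A rough back-of-the-envelope check of that claim; the prices and token counts here are placeholders, not quoted rates, so adjust them for your own model and document sizes:

```python
# Placeholder numbers, not quoted rates: adjust for your model and documents.
price_per_million_input_tokens = 0.15  # dollars, small classification-grade model
tokens_per_classification = 500        # prompt + query + one candidate document
candidates = 100

total_tokens = candidates * tokens_per_classification             # 50,000 tokens
cost = total_tokens / 1_000_000 * price_per_million_input_tokens
print(f"~${cost:.4f} per query")                                  # about $0.0075
```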

The parallel nature of these classifications means you're not waiting for sequential API calls. You can process dozens or hundreds of candidates concurrently, limited only by rate limits. Combined with early stopping, actual compute per query is often much lower than the candidate count suggests.

This changes what's architecturally viable. Techniques that were theoretically sound but practically impossible are now just... practical.

The Strategy: Wide Search + Parallel Classification

The approach combines structured pre-filtering and HyDE with a wide initial search, then runs a highly parallelizable set of binary relevancy checks using small, fast LLMs.

Here's how it works:

  1. Structured pre-filtering — Extract and apply structured filters (dates, categories, metadata) to reduce the corpus
  2. HyDE — Generate a hypothetical document to bridge vocabulary gaps between query and corpus
  3. Wide semantic search — Cast a wide net with a permissive similarity threshold (e.g., top 200-500 candidates)
  4. Parallel binary classification — Run hundreds of small, cheap "does document X match query Y?" checks in parallel
  5. Early stopping with rolling average — Track a rolling average of relevance (e.g., last 10 classifications). When the rolling average drops below a threshold (e.g., 30%), stop processing since lower-ranked documents are unlikely to be relevant

This dynamically adjusts the result set size query-to-query while maintaining high precision. Queries with many relevant documents process more candidates. Queries with few relevant matches stop early, saving compute.
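Sketched end to end, the strategy looks roughly like this. The helper names are hypothetical stand-ins: `extract_filters` and `generate_hyde_document` for the components from Parts 1 and 2, `vector_search` for your vector store query, and `classify_with_early_stopping` for the second pass described in the next section:

```python
def retrieve_relevant_documents(query: str, top_k: int = 300) -> list[str]:
    """Wide-net retrieval followed by a relevancy-filtering second pass (sketch)."""
    # 1. Structured pre-filtering: pull dates, categories, and metadata from the query.
    filters = extract_filters(query)            # covered in Part 1

    # 2. HyDE: search with a hypothetical answer document instead of the raw query.
    hyde_doc = generate_hyde_document(query)    # covered in Part 2

    # 3. Wide semantic search: a permissive cutoff produces candidates, not results.
    candidates = vector_search(hyde_doc, filters=filters, top_k=top_k)

    # 4 & 5. Parallel binary classification with rolling-average early stopping.
    return classify_with_early_stopping(query, candidates)
```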

Making It Efficient

Running classification on every candidate still requires thought. A few patterns:

Parallel processing — These binary classifications are highly parallelizable. You can run hundreds of small LLM calls concurrently, limited only by rate limits, not sequential processing.
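A sketch of the concurrent version, assuming the OpenAI Python SDK's AsyncOpenAI client; the concurrency limit and model name are placeholders to tune against your own rate limits:

```python
import asyncio

from openai import AsyncOpenAI

client = AsyncOpenAI()  # reads OPENAI_API_KEY from the environment

async def is_relevant_async(query: str, document: str, sem: asyncio.Semaphore) -> bool:
    """One binary relevance check, gated by a semaphore to respect rate limits."""
    prompt = (
        f"Query: {query}\n\nDocument: {document}\n\n"
        "Is the document relevant to the query? Answer YES or NO."
    )
    async with sem:
        response = await client.chat.completions.create(
            model="gpt-4o-mini",  # any small, fast model works here
            messages=[{"role": "user", "content": prompt}],
            max_tokens=1,
            temperature=0,
        )
    return response.choices[0].message.content.strip().upper().startswith("YES")

async def classify_all(query: str, candidates: list[str], concurrency: int = 50) -> list[bool]:
    """Classify every candidate concurrently; results keep candidate order."""
    sem = asyncio.Semaphore(concurrency)
    return await asyncio.gather(*(is_relevant_async(query, doc, sem) for doc in candidates))
```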

Early stopping with rolling average — Track the relevance rate of the last N classifications (e.g., 10). When this rolling average drops below a threshold (e.g., 30%), stop processing. The logic: if the last 10 documents were mostly irrelevant, the remaining lower-ranked documents are unlikely to be relevant either.
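A sketch of the early-stopping loop, reusing the `is_relevant` check from earlier; the window size and threshold are the illustrative values above:

```python
from collections import deque

def classify_with_early_stopping(
    query: str,
    candidates: list[str],    # sorted by similarity, best first
    window: int = 10,
    min_rate: float = 0.3,
) -> list[str]:
    """Stop classifying once the recent relevance rate drops below min_rate."""
    recent: deque[bool] = deque(maxlen=window)  # last `window` verdicts
    relevant: list[str] = []

    for doc in candidates:
        verdict = is_relevant(query, doc)  # binary LLM check from earlier
        recent.append(verdict)
        if verdict:
            relevant.append(doc)

        # Only consult the rolling average once the window is full.
        if len(recent) == window and sum(recent) / window < min_rate:
            break  # lower-ranked candidates are unlikely to be relevant

    return relevant
```

In practice you would process candidates in small ranked batches with the parallel classifier, checking the rolling average between batches, so early stopping and parallelism complement each other rather than conflict.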

Batching — Process multiple candidates in a single LLM call when possible. Reduces round-trip overhead significantly.
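A sketch of one batched call, with a hypothetical line-per-document answer format; parsing free-text model output like this is brittle, so structured output features are worth considering if your provider supports them:

```python
from openai import OpenAI

client = OpenAI()

def classify_batch(query: str, documents: list[str], model: str = "gpt-4o-mini") -> list[bool]:
    """Judge several candidates in one call; returns one verdict per document."""
    numbered = "\n\n".join(f"[{i}] {doc}" for i, doc in enumerate(documents))
    prompt = (
        f"Query: {query}\n\nDocuments:\n{numbered}\n\n"
        "For each document, say whether it is relevant to the query. "
        "Reply with one line per document, formatted as '<index>: YES' or '<index>: NO'."
    )
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )

    verdicts = [False] * len(documents)
    for line in response.choices[0].message.content.splitlines():
        index, _, answer = line.partition(":")
        index = index.strip().strip("[]")
        if index.isdigit() and int(index) < len(verdicts):
            verdicts[int(index)] = answer.strip().upper().startswith("YES")
    return verdicts
```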

Async processing with streaming — For user-facing applications, start returning results as soon as relevant candidates are identified. Users see partial results while classification continues in the background.
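A sketch using an async generator that yields documents as their classifications complete, reusing `is_relevant_async` from the parallel example; how results reach the user (server-sent events, websockets, etc.) is out of scope here:

```python
import asyncio
from typing import AsyncIterator

async def stream_relevant(
    query: str, candidates: list[str], concurrency: int = 50
) -> AsyncIterator[str]:
    """Yield each relevant document as soon as its classification finishes."""
    sem = asyncio.Semaphore(concurrency)

    async def check(doc: str) -> tuple[str, bool]:
        return doc, await is_relevant_async(query, doc, sem)  # from the parallel sketch

    tasks = [asyncio.create_task(check(doc)) for doc in candidates]
    for finished in asyncio.as_completed(tasks):
        doc, verdict = await finished
        if verdict:
            yield doc  # the UI can render this result immediately
```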

The combination of parallel processing and early stopping means actual LLM calls per query are often 30-50% of the candidate count, not 100%.

The Full Pipeline

The three techniques in this series layer together:

| Layer | What It Handles |
| --- | --- |
| Filter Extraction | Structured constraints that shouldn't go to semantic search |
| HyDE | Vocabulary mismatch between query and corpus |
| Relevancy Filtering | Score threshold calibration across different data distributions |

The flow: extract structured filters → apply filters → generate HyDE document → wide semantic search (top 200-500) → parallel binary classification with early stopping → final relevant results.

Not every query needs every layer. Simple queries with matching vocabulary skip HyDE. Queries without structural constraints skip filter extraction. High-confidence initial results might skip relevancy filtering.

The pipeline adapts based on query characteristics and result confidence. The goal is applying compute where it improves outcomes, not uniformly on every request.

For the query-to-dataset mapping use case, relevancy filtering is especially valuable. It ensures that every document shown to users is actually relevant, not just semantically similar. The early stopping mechanism means the system dynamically adjusts result set size based on how many relevant documents exist, rather than using a fixed top-k that might be too high or too low.

What We're Watching

Rate limits and costs continue improving. Models optimized for classification tasks keep getting faster. The threshold for "worth doing" keeps dropping.

We expect the pattern of using LLMs for post-retrieval refinement to become standard as the economics make it obvious. The interesting questions shift from "can we afford this?" to "what else can we do with cheap inference?"

