Advanced RAG

Structured Pre-Filtering for RAG Systems

Why decomposing queries into structured filters before semantic search improves retrieval precision and performance.

Cole Hoffer

There's a pattern we see in most RAG implementations: the entire user query gets embedded and sent to vector search. For simple queries, this works fine. But it falls apart when users include structured constraints alongside semantic intent.

"Show me customer complaints about checkout from the last quarter."

"Last quarter" isn't a semantic concept. It's a date range that maps directly to indexed columns. Embedding this phrase alongside "customer complaints about checkout" dilutes the search signal. The retrieval system now has to do work that a timestamp filter handles in milliseconds.

The solution is a two-step retrieval process: extract structured filters first to reduce the corpus, then run semantic search on the filtered set. This post walks through how to implement this pattern using structured output prompts and why it improves both precision and performance.

The Problem with Embedding Everything

When you embed a query like "bugs from last quarter," the embedding model tries to capture both the temporal constraint ("last quarter") and the semantic intent ("bugs"). This creates several problems:

  1. Diluted search signal — The embedding space has to encode both structured constraints and semantic meaning, reducing the signal-to-noise ratio for actual semantic matching
  2. Inefficient search — Vector search runs over the entire corpus when a simple date filter could eliminate 80% of documents first
  3. Poor handling of constraints — Embeddings don't naturally express "between these dates" or "equals this category" — they express similarity, not exact matches

Consider a corpus of 100,000 documents. A query like "critical bugs from Q1 2024" might match semantically with 5,000 documents, but only 200 actually fall in that date range and category. Running vector search on all 100,000 documents wastes compute and risks returning irrelevant results that happen to be semantically similar but don't match the constraints.

Decomposition as a First-Class Step

We treat query decomposition as an explicit stage in the pipeline, not an afterthought. Before any embedding happens, we separate:

Structured filters — constraints that hit database indexes (dates, categories, metadata fields, enums)

Semantic intent — the conceptual meaning that actually needs vector search

The flow looks like this:

flowchart TD
  A["User Query"] --> B["Filter Extraction<br/>(LLM with structured output)"]
  B --> C["Structured Filters"]
  B --> D["Sanitized Query<br/>(semantic only)"]
  C --> E["Apply Filters"]
  D --> F["Embed & Search"]
  E --> G["Filtered Corpus"]
  F --> G
  G --> H["Final Results"]

This isn't a novel insight. But we've found that making it a deliberate, LLM-powered step rather than relying on regex patterns or keyword matching handles the messy reality of how users phrase queries.

The extraction prompt gives the model the schema of available filters—what fields exist, what values are valid, what operators apply. The model returns both structured filters and the remaining text for semantic search.

Example: Before and After Decomposition

Consider the query: "Show me critical bugs from Q1 2024 about authentication"

Without decomposition:

  • Entire query gets embedded: embed("Show me critical bugs from Q1 2024 about authentication")
  • Vector search runs on full corpus
  • Results may include bugs from other quarters or non-critical issues

With decomposition:

  • Filter extraction identifies:
    • Date range: 2024-01-01 to 2024-03-31
    • Category: bug
    • Priority: 5 (critical)
  • Sanitized query: "authentication" (filter-matched parts removed)
  • Filters reduce corpus from 100,000 → 200 documents
  • Vector search runs on filtered set with query "authentication"
  • Results are guaranteed to match all constraints

The key insight: "Q1 2024", "critical", and "bugs" aren't semantic concepts—they're exact matches that should hit indexes, not embedding space.

Interactive Demo: Explore Pre-Filtering

The interactive demo below lets you experiment with different queries to see how filter extraction and pre-filtering work in practice. Try queries that combine date ranges, categories, and priority levels to see how the corpus size reduces and how results change.

[Interactive demo: "Structured Pre-Filtering: Two-Step Retrieval"]

Example: "Show me bugs from last quarter"
Step 1 extracts a date range (Jan 1, 2024 - Mar 31, 2024) and a category (bug), reducing the corpus from 26 to 5 documents (an 81% reduction).
Step 2 runs semantic search over the filtered corpus using the sanitized query "bugs" (filter-matched parts removed), alongside a comparison search over the full 26-document corpus.

How it works: The query is first decomposed into structured filters (date, category, priority) and semantic intent. Filters reduce the corpus size before semantic search runs. The sanitized query (with filter-matched parts removed) is used for text search, improving precision by focusing on the remaining semantic content.

The Structured Output Pattern

Filter extraction uses the same structured output pattern we use for classification tasks. The prompt defines a JSON schema that the model must conform to, ensuring reliable extraction.

Here's the prompt structure:

System:
You are a query decomposition system. Extract structured filters from user queries.

Available filter fields:
- date: ISO date strings (YYYY-MM-DD). Can extract ranges from phrases like "last quarter", "this month", "Q1 2024"
- category: One of ["bug", "feature", "support", "documentation"]
- priority: Integer 1 to 5, where 5 is critical

Return a JSON object with this exact schema:
{
  "filters": {
    "dateRange": {
      "start": string | null,   // ISO date or null
      "end": string | null      // ISO date or null
    } | null,
    "category": string | null,  // One of the valid categories or null
    "priorityRange": {
      "min": number | null,     // 1 to 5 or null
      "max": number | null      // 1 to 5 or null
    } | null
  },
  "sanitizedQuery": string      // Original query with filter-matched parts removed
}

Only extract filters for fields that exist in the schema. If the query mentions concepts that aren't filterable, leave them in sanitizedQuery.

Note: This schema can be defined using Zod for runtime validation:

import { z } from 'zod'

const FilterSchema = z.object({
  filters: z.object({
    dateRange: z.object({
      start: z.string().nullable(),
      end: z.string().nullable()
    }).nullable(),
    category: z.enum(["bug", "feature", "support", "documentation"]).nullable(),
    priorityRange: z.object({
      min: z.number().int().min(1).max(5).nullable(),
      max: z.number().int().min(1).max(5).nullable()
    }).nullable()
  }),
  sanitizedQuery: z.string()
})
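To make the call itself concrete, here is a minimal sketch of the extraction step, assuming the OpenAI Node SDK and a DECOMPOSITION_PROMPT constant holding the system prompt above; any provider that can return JSON works the same way, and the model name is only an example:

import OpenAI from 'openai'

const client = new OpenAI() // assumes OPENAI_API_KEY is set in the environment

// Hypothetical constant: the decomposition system prompt shown earlier.
const DECOMPOSITION_PROMPT = `You are a query decomposition system. ...` // full prompt omitted

async function extractFilters(query: string) {
  const response = await client.chat.completions.create({
    model: 'gpt-4o-mini',                      // assumption: any JSON-capable model works
    response_format: { type: 'json_object' },  // force a JSON reply
    messages: [
      { role: 'system', content: DECOMPOSITION_PROMPT },
      { role: 'user', content: query }
    ]
  })
  // Validate against the Zod schema; this throws if the model returned
  // filters that don't conform (wrong enum value, out-of-range priority, etc.).
  return FilterSchema.parse(JSON.parse(response.choices[0].message.content ?? '{}'))
}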

Example conversation:

User:
Show me critical bugs from Q1 2024 about authentication

Assistant:
{
  "filters": {
    "dateRange": {
      "start": "2024-01-01",
      "end": "2024-03-31"
    },
    "category": "bug",
    "priorityRange": {
      "min": 5,
      "max": 5
    }
  },
  "sanitizedQuery": "authentication"
}

The structured output ensures:

  • Reliable parsing — No regex or string matching needed
  • Schema validation — Model can't hallucinate invalid filter values
  • Explicit sanitization — The sanitized query is part of the output, making it clear what text remains for semantic search

Handling Ambiguous Cases

The ambiguous cases are where this pays off. When a user mentions "mobile issues," should "mobile" be a filter or semantic search? It depends on your schema.

If you have a platform field with "mobile" as an enum value, extract it as a structured filter. If "mobile" is just a topic users discuss but not an indexed field, pass it through to semantic search. The LLM can make this distinction when given the schema context.

This also prevents hallucinated filters. Users will reference concepts that sound like they should be filterable but aren't in your schema. Rather than guessing, the model leaves these for semantic search where they belong.

For example, if a user queries "mobile app crashes" but your schema doesn't have a platform field, the model should:

  • Not extract "mobile" as a filter (it's not in the schema)
  • Leave "mobile app crashes" in the sanitized query for semantic search

The schema acts as a guardrail, preventing the model from inventing filters that don't exist.
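If you want validation to reject invented fields outright rather than silently drop them, one option (a sketch building on the Zod schema above, not the only approach) is to mark the objects as strict:

// Sketch: .strict() turns unknown keys into validation errors instead of stripping them.
const StrictFilterSchema = z.object({
  filters: z.object({
    dateRange: z.object({
      start: z.string().nullable(),
      end: z.string().nullable()
    }).strict().nullable(),
    category: z.enum(["bug", "feature", "support", "documentation"]).nullable(),
    priorityRange: z.object({
      min: z.number().int().min(1).max(5).nullable(),
      max: z.number().int().min(1).max(5).nullable()
    }).strict().nullable()
  }).strict(),
  sanitizedQuery: z.string()
}).strict()

// A hallucinated "platform" filter now fails validation instead of leaking downstream.
const result = StrictFilterSchema.safeParse({
  filters: { dateRange: null, category: null, priorityRange: null, platform: "mobile" },
  sanitizedQuery: "mobile app crashes"
})
console.log(result.success) // false: unrecognized key "platform"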

The Two-Step Retrieval Process

Once filters are extracted, the retrieval process becomes two sequential steps:

Step 1: Apply Structured Filters

Filter application happens at the database/index level, not in embedding space. This is fast—typically milliseconds for indexed fields.

The corpus size reduction can be dramatic. A date range filter might eliminate 90% of documents. Combined with category and priority filters, you can reduce a 100,000 document corpus to a few hundred before running any vector search.
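As an illustration, the extracted filters translate directly into an indexed WHERE clause. This is a sketch against a hypothetical documents table; the column names (created_at, category, priority) are assumptions, and it reuses the FilterSchema type from above:

// Hypothetical: Filters is the validated "filters" object from FilterSchema.
type Filters = z.infer<typeof FilterSchema>['filters']

// Build a parameterized WHERE clause over indexed columns of a "documents" table.
function buildWhereClause(filters: Filters): { sql: string; params: unknown[] } {
  const clauses: string[] = []
  const params: unknown[] = []

  if (filters.dateRange?.start) {
    params.push(filters.dateRange.start)
    clauses.push(`created_at >= $${params.length}`)
  }
  if (filters.dateRange?.end) {
    params.push(filters.dateRange.end)
    clauses.push(`created_at <= $${params.length}`)
  }
  if (filters.category) {
    params.push(filters.category)
    clauses.push(`category = $${params.length}`)
  }
  if (filters.priorityRange?.min != null) {
    params.push(filters.priorityRange.min)
    clauses.push(`priority >= $${params.length}`)
  }
  if (filters.priorityRange?.max != null) {
    params.push(filters.priorityRange.max)
    clauses.push(`priority <= $${params.length}`)
  }

  return { sql: clauses.length ? `WHERE ${clauses.join(' AND ')}` : '', params }
}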

Step 2: Semantic Search on Filtered Set

After filtering, run semantic search on the remaining documents using the sanitized query. This is where embeddings shine—matching conceptual meaning rather than exact strings.

The sanitized query is critical here. If you used the original query "critical bugs from Q1 2024 about authentication", the embedding would still encode the temporal and categorical constraints, diluting the semantic signal. By using just "authentication", the embedding focuses purely on the semantic concept.
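A minimal in-memory sketch of this step, assuming an embed() helper that wraps your embedding model and documents that already carry their embeddings; in production you would push this into a vector index that supports metadata filtering:

// Assumption: embed() wraps whatever embedding model you use.
declare function embed(text: string): Promise<number[]>

// Hypothetical document shape: an id plus a precomputed embedding.
interface Doc { id: string; embedding: number[] }

function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i]
    normA += a[i] * a[i]
    normB += b[i] * b[i]
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB))
}

// Rank only the pre-filtered documents against the sanitized query.
async function semanticSearch(sanitizedQuery: string, filteredDocs: Doc[], topK = 10) {
  const queryVec = await embed(sanitizedQuery)
  return filteredDocs
    .map(doc => ({ doc, score: cosineSimilarity(queryVec, doc.embedding) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, topK)
}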

For best results, consider using hybrid search (combining semantic and keyword search) on the filtered set. You can also apply HyDE (Hypothetical Document Embeddings) to the sanitized query to bridge vocabulary gaps between user queries and your corpus. HyDE generates a hypothetical document that matches your corpus style, then searches for documents similar to that hypothetical—this is especially valuable when user query language doesn't match your document vocabulary.
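A rough sketch of applying HyDE to the sanitized query, reusing the OpenAI client and embed() helper assumed above; the prompt wording is purely illustrative:

// Generate a hypothetical document for the sanitized query, then embed that
// instead of the raw query so the vector better matches corpus vocabulary.
async function hydeEmbed(sanitizedQuery: string): Promise<number[]> {
  const completion = await client.chat.completions.create({
    model: 'gpt-4o-mini', // assumption: any capable model works here
    messages: [
      { role: 'system', content: 'Write a short, realistic support ticket that would answer this query.' },
      { role: 'user', content: sanitizedQuery }
    ]
  })
  const hypotheticalDoc = completion.choices[0].message.content ?? sanitizedQuery
  return embed(hypotheticalDoc)
}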

Query Sanitization: Why It Matters

Query sanitization removes the parts of the query that were matched to structured filters. This serves two purposes:

  1. Cleaner semantic signal — The embedding only encodes semantic intent, not structured constraints
  2. Prevents double-filtering — You don't want the embedding to try to match "Q1 2024" semantically when you've already filtered by date

Consider what happens if you don't sanitize:

# Bad: Using original query after filtering
filters = extract_filters("bugs from Q1 2024")
filtered = apply_filters(corpus, filters)
results = vector_search(embed("bugs from Q1 2024"), filtered)  # Still encoding the date!

The embedding for "bugs from Q1 2024" encodes both "bugs" (semantic) and "Q1 2024" (temporal). But you've already filtered by date, so the temporal encoding is redundant and potentially harmful: it can bias retrieval toward documents that happen to mention "Q1 2024" in their text, even though you've already filtered to that range.

With sanitization:

# Good: Using sanitized query
filters = extract_filters("bugs from Q1 2024")
filtered = apply_filters(corpus, filters)
results = vector_search(embed("bugs"), filtered)  # Pure semantic signal

The embedding for "bugs" focuses purely on the semantic concept, which is what you want after filtering.

Performance Analysis: When This Pays Off

Filter extraction adds latency—typically 200-500ms for the LLM call. Whether this makes sense depends on your corpus size and query patterns.

Corpus Size Considerations

Large corpora (100k+ documents) — Pre-filtering almost always pays off. Even a modest filter (e.g., "this month") might reduce the corpus by 90%, making vector search 10x faster.

Medium corpora (10k-100k documents) — Depends on filter selectivity. If filters reduce the corpus by 50%+, the latency trade-off is usually worth it.

Small corpora (less than 10k documents) — May not be worth it unless filters are highly selective. The LLM call overhead might exceed the vector search time saved.

Query Pattern Considerations

Frequent structured constraints — If 30%+ of queries include date ranges, categories, or other structured constraints, pre-filtering is almost certainly worth it.

Pure semantic queries — If most queries are purely conceptual ("explain authentication"), pre-filtering adds latency without benefit. Consider making it optional or skipping the LLM call when no structured constraints are detected.

Mixed patterns — The best approach is adaptive: extract filters, but if no filters are found, skip directly to semantic search on the full corpus.
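A sketch of that adaptive path, reusing the hypothetical extractFilters and semanticSearch helpers from earlier; applyFilters stands in for however you apply the extracted filters to your store (for example, via the WHERE-clause builder sketched above):

// Hypothetical helper: applies extracted filters to the document store.
declare function applyFilters(corpus: Doc[], filters: Filters): Doc[]

// Skip the filtering step entirely when the extractor found nothing to filter on.
async function retrieve(query: string, corpus: Doc[]) {
  const { filters, sanitizedQuery } = await extractFilters(query)
  const hasFilters =
    filters.dateRange !== null || filters.category !== null || filters.priorityRange !== null

  // No structured constraints: plain semantic search over the full corpus with the original query.
  const candidates = hasFilters ? applyFilters(corpus, filters) : corpus
  const searchText = hasFilters ? sanitizedQuery : query
  return semanticSearch(searchText, candidates)
}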

Integration with Existing RAG Pipelines

Pre-filtering integrates cleanly with existing RAG systems. The pattern is:

  1. Before embedding — Extract filters and sanitize query
  2. Before vector search — Apply filters to reduce corpus
  3. Vector search — Run on filtered corpus with sanitized query
  4. After retrieval — Continue with your existing reranking/LLM steps

This means you can add pre-filtering to existing systems without changing the rest of your pipeline. The vector search step just receives a smaller corpus and a different query—everything else stays the same.

Combining with Other Techniques

Pre-filtering works well with other RAG enhancements:

  • HyDE — Apply HyDE to the sanitized query after filtering
  • Reranking — Rerank the filtered results as usual
  • Relevancy filtering — Apply relevancy checks to filtered results

The techniques layer naturally: filter → HyDE → vector search → rerank → relevancy check.

Conclusion

Structured pre-filtering represents a meaningful improvement for RAG systems with rich metadata. By separating structured constraints from semantic intent, you get:

Better precision — Results are guaranteed to match user constraints, not just semantically similar

Better performance — Filtering reduces corpus size before expensive vector search

Better scalability — As corpora grow, filtering becomes more valuable

Cleaner semantics — Sanitized queries focus embeddings on actual semantic concepts

The trade-off is latency. The LLM call for filter extraction adds 200-500ms per query. For large corpora with frequent structured constraints, this is almost always worth it. For small corpora or purely semantic queries, consider making filtering adaptive or optional.

The structured output pattern makes implementation straightforward: define your schema, prompt the model, and parse the JSON. No complex regex or keyword matching needed. And as your schema evolves, you can add new filter fields by updating the prompt—no code changes required.

If your corpus has metadata and users frequently constrain on it, structured pre-filtering is a better way to do RAG retrieval than embedding the entire query.

Cite this post


Cole Hoffer. (Jan 2026). Structured Pre-Filtering for RAG Systems. Cole Hoffer. https://www.colehoffer.ai/articles/advanced-rag-structured-pre-filtering
