Advanced RAG

Applying HyDE to Domain-Specific Search

Using Hypothetical Document Embeddings to bridge vocabulary gaps between user queries and specialized document corpora.

Cole Hoffer

Part 1 covered extracting structured filters before semantic search. That handles the "when" and "what category" parts of a query. But the semantic search itself can still underperform.

The failure mode: vocabulary mismatch. A user searching for "frustration with the payment flow" may not find documents where customers wrote "the checkout process keeps failing." Semantically related, but the embedding similarity falls below retrieval thresholds.

This gap is especially pronounced in domain-specific corpora. Users ask questions in casual language. Documents contain specialized terminology, abbreviations, or just different phrasing for the same concepts.

Hypothetical Documents as Search Anchors

HyDE—Hypothetical Document Embeddings, from Gao et al. (2022)—addresses this directly. Instead of embedding the query, generate a hypothetical document that would answer it. Search for documents similar to the hypothetical.

The intuition: an LLM can bridge the vocabulary gap by generating text that sounds like your corpus. That generated text becomes a better search anchor than the original query phrasing.
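
In code, the core loop is small. Here is a minimal sketch of HyDE retrieval, where generate, embed, doc_vectors, and docs are hypothetical placeholders for your own LLM call, embedding model, and pre-computed corpus embeddings rather than parts of any specific library:

import numpy as np

def hyde_search(query, generate, embed, doc_vectors, docs, top_k=5):
    # 1. Ask the LLM to write a plausible document that would answer the query.
    hypothetical = generate(query)
    # 2. Embed the hypothetical instead of the raw query.
    q_vec = embed(hypothetical)
    # 3. Rank documents by cosine similarity to the hypothetical's embedding.
    sims = doc_vectors @ q_vec / (
        np.linalg.norm(doc_vectors, axis=1) * np.linalg.norm(q_vec)
    )
    top = np.argsort(-sims)[:top_k]
    return [(docs[i], float(sims[i])) for i in top]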

Interactive Visualization: Embedding Space Transformation

The visualization below shows how HyDE bridges vocabulary gaps in a 2D embedding space. Try different queries to see how the standard query embedding (red) is far from relevant documents, while the HyDE query embedding (green) moves closer to the relevant cluster. The curved arrow shows the transformation from standard to HyDE.

HyDE: Bridging Vocabulary Gaps in Embedding Space

Query: "show me people who are mad when checking out"

HyDE Document: "Tried to buy something yesterday and the app crashed right when I hit the pay button. This is the third time this week. My card is fine, but the transaction never goes through. Really frustrating experience."

The HyDE document is generated by sampling actual feedback items from the corpus to match style, then converting the query into a hypothetical feedback item in that same style.

[Scatter plot: Standard and HyDE query embeddings plotted against Embedding Dimension 1 and Embedding Dimension 2]

Standard Query - Top 3 Matches

  • Very slow response times when navigating betwee...
  • App is extremely slow, takes forever to load
  • App crashes frequently, especially on older dev...

HyDE Query - Top 3 Matches

  • Payment gateway error occurs every time at fina...
  • Transaction fails repeatedly during payment pro...
  • Cannot complete purchase, payment button does n...

How it works: Semantic search uses cosine similarity to measure how similar query and document embeddings are. The standard query embedding (red) has lower cosine similarity with relevant documents due to vocabulary mismatch. HyDE generates a hypothetical document by sampling actual corpus documents to match style, then converting the query into a hypothetical feedback item. This moves the query embedding (green) closer to the relevant cluster, improving cosine similarity scores.

Understanding the Embedding Space

Important: While the visualization above looks like a standard 2D scatter plot (which naturally suggests Euclidean distance), semantic search actually uses cosine similarity, not Euclidean distance. This is a crucial distinction.

Cosine similarity measures the angle between vectors from the origin, not their distance apart. Two embeddings can be far apart in Euclidean distance but have high cosine similarity if they point in similar directions. Conversely, two embeddings can be close in Euclidean distance but have low cosine similarity if they point in different directions.
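
A tiny numeric example (the vectors are illustrative, not real embeddings) makes the distinction concrete:

import numpy as np

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

a = np.array([1.0, 1.0])
b = np.array([4.0, 4.2])    # same direction as a, but far from it
c = np.array([1.2, -0.9])   # nearby, but pointing a different way

print(np.linalg.norm(a - b), cosine(a, b))  # distance ~4.4, cosine ~1.00
print(np.linalg.norm(a - c), cosine(a, c))  # distance ~1.9, cosine ~0.14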

In the visualization, imagine drawing a line from the origin (0,0) to each point. The angle between these lines determines cosine similarity—not how far apart the points are. This is why the visualization is centered at (0,0): it emphasizes that we care about direction, not distance.

Try the "show me people who are mad when checking out" query in the visualization. Notice how the standard query embedding (red) points in a different direction from the payment cluster documents. Even though they might appear "close" in the 2D plot, their cosine similarity is low because they point in different directions from the origin. The query's vocabulary ("mad", "checking out") doesn't align with the document vocabulary ("checkout process", "payment gateway"), creating this directional mismatch.

HyDE bridges this gap by generating a hypothetical document that uses the same vocabulary as your corpus. The HyDE query embedding (green) points in a direction much closer to the relevant cluster—the angle from the origin is more similar. This improves cosine similarity, which directly improves retrieval effectiveness.

The "Top 3 Matches" panels below the visualization show the practical impact: HyDE consistently retrieves more relevant documents because the cosine similarity (angle alignment) is higher, even if the Euclidean distance in the 2D projection looks similar.

This isn't just a visualization artifact—it reflects real retrieval behavior. Lower cosine similarity means lower retrieval scores, which means relevant documents may not appear in your top-k results. HyDE improves similarity by aligning the query embedding's direction with the document vocabulary, making retrieval more effective.

Making It Practical

The basic formulation searches with a single hypothetical. In practice, we generate several and merge the results. This provides robustness against any single generation being off-target.

A detail that matters: the hypotheticals should match the style of your actual documents. We include a few randomly sampled documents from the corpus in the generation prompt. This grounds the output in reality—appropriate length, tone, terminology.

Without style samples, hypotheticals tend toward generic, formal text that may not resemble your actual data at all.

Style-Matched Generation Example

Here's what happens when you include style samples versus when you don't:

Query: "show me people who are mad when checking out"

Without style samples: The user is experiencing frustration with the payment process. The checkout functionality is not working as expected, causing customer dissatisfaction with the transaction workflow.

With style samples (sampled from actual feedback): The checkout process keeps failing when I try to pay. Payment gateway error occurs every time at final step. Cannot complete purchase, payment button does nothing.

Notice how the style-matched version matches the casual, feedback-style tone of the actual documents—exactly like the HyDE document shown in the visualization above. This vocabulary alignment is what makes HyDE effective.

Here's a simplified prompt structure that achieves this:

You are generating hypothetical customer feedback items that would
answer this query: "{query}"

Here are some examples of actual feedback from our corpus:
{sample_doc_1}
{sample_doc_2}
{sample_doc_3}

Generate 3-5 hypothetical feedback items in the same style and tone
as the examples above that would answer the query. Each should use
different phrasing while maintaining the same vocabulary and style.

Return a JSON object with this schema:
{
  "hypotheticals": [
    "first hypothetical feedback item",
    "second hypothetical feedback item",
    "third hypothetical feedback item"
  ]
}
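
Wired up, the generation step might look like the sketch below. It assumes the OpenAI Python client and a JSON-capable model purely for illustration; the model name, the corpus list, and generate_hypotheticals are my own placeholders, and any LLM client with structured output works the same way:

import json
import random
from openai import OpenAI  # assumption: OpenAI client shown only as an example

client = OpenAI()

def generate_hypotheticals(query: str, corpus: list[str], n_samples: int = 3) -> list[str]:
    # Ground the generation in corpus style with a few randomly sampled documents.
    samples = random.sample(corpus, k=min(n_samples, len(corpus)))
    prompt = (
        'You are generating hypothetical customer feedback items that would '
        f'answer this query: "{query}"\n\n'
        "Here are some examples of actual feedback from our corpus:\n"
        + "\n".join(samples) + "\n\n"
        "Generate 3-5 hypothetical feedback items in the same style and tone "
        "as the examples above that would answer the query. Each should use "
        "different phrasing while maintaining the same vocabulary and style.\n"
        'Return a JSON object: {"hypotheticals": ["...", "...", "..."]}'
    )
    resp = client.chat.completions.create(
        model="gpt-4o-mini",                         # placeholder model name
        messages=[{"role": "user", "content": prompt}],
        temperature=0.8,                             # moderate temperature for varied phrasings
        response_format={"type": "json_object"},     # ask for parseable JSON
    )
    return json.loads(resp.choices[0].message.content)["hypotheticals"]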

Prompt Engineering Best Practices

  • Style samples: Include 3-5 randomly sampled documents from your corpus. Choose representative examples that capture the typical length, tone, and terminology.
  • Diversity matters: Don't just sample from one cluster. Mix examples from different topics to help the model understand the overall corpus style.
  • Temperature: Use moderate temperature (0.7-0.9) to get variation while maintaining coherence. Lower temperatures (0.3-0.5) produce more conservative, similar outputs.
  • Edge cases: For very short queries, the model may need more guidance. For very long queries, consider truncating or summarizing before generation.

Merging Multiple Results

With multiple hypothetical embeddings, you get multiple result sets. The architecture looks like this:

Query → Multiple HyDE generations → Multiple embeddings → Multiple searches → Merge results

Each hypothetical gets its own semantic search (and optionally BM25 search). Then you merge the ranked lists. Important: Always include the original query as one of the search queries alongside the HyDE hypotheticals. The original query is especially valuable for BM25 keyword matching, while HyDE hypotheticals improve semantic search.
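
As a sketch, the fan-out stage looks like this, with generate_hypotheticals, semantic_search, and bm25_search standing in as hypothetical callables for your own components:

def fan_out_search(query, generate_hypotheticals, semantic_search, bm25_search, top_k=20):
    # Always search with the original query alongside the hypotheticals.
    queries = [query] + generate_hypotheticals(query)
    ranked_lists = []
    for q in queries:
        ranked_lists.append(semantic_search(q, top_k))  # embedding similarity
        ranked_lists.append(bm25_search(q, top_k))      # keyword matching (optional)
    # Each entry is a ranked list of (doc_id, score) pairs, best first,
    # ready for the merge step described next.
    return ranked_lists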

Merging Strategies

Two basic approaches we've experimented with:

Average pooling — a document's score is the average across all searches. Favors documents consistently relevant to different phrasings.

Max pooling — a document's score is its best score across any search. Favors documents that strongly match at least one interpretation.

Max pooling works better when queries have multiple valid interpretations. Average pooling works better when you want high-confidence matches across all interpretations.

A hybrid approach: use max pooling to select top-k candidates from each search, then average pool those candidates for final ranking.
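
Here is a minimal sketch of both pooling strategies, assuming each search returns (doc_id, score) pairs on a comparable scale; treating a document missing from a list as scoring zero in the average is my assumption, not a fixed rule:

from collections import defaultdict

def pool_results(result_lists, mode="max"):
    # Collect every score each document received across the searches.
    scores = defaultdict(list)
    for results in result_lists:
        for doc_id, score in results:
            scores[doc_id].append(score)

    merged = {}
    for doc_id, s in scores.items():
        if mode == "max":
            merged[doc_id] = max(s)                      # best score from any single search
        else:
            merged[doc_id] = sum(s) / len(result_lists)  # average, missing counted as 0
    return sorted(merged.items(), key=lambda kv: kv[1], reverse=True)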

Using RRF for Multiple HyDE Embeddings

Creating multiple HyDE embeddings and fusing results can further improve performance. Instead of score-based pooling, use Reciprocal Rank Fusion (RRF) to combine results from multiple hypotheticals.

RRF works with rank positions rather than scores, making it robust to different score scales. It's particularly effective when combining:

  • Original query search results
  • Multiple HyDE hypothetical search results
  • BM25 and semantic results for each query

For details on implementing RRF, see our Reciprocal Rank Fusion guide. The benefits include:

  • Robustness: More resilient to individual generation failures—if one hypothetical is off-target, others compensate
  • Phrasing diversity: Captures different ways of expressing the same concept
  • Scale independence: No need to normalize scores across different retrievers

Always include the original query in the RRF fusion alongside HyDE hypotheticals. This ensures you get both keyword matches (from the original query via BM25) and semantic matches (from HyDE hypotheticals).
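
A minimal RRF sketch, where each input is a list of doc ids ranked best first and k=60 is the constant most implementations default to (the doc ids in the usage example are placeholders):

from collections import defaultdict

def rrf_merge(ranked_lists, k=60):
    # score(d) = sum over lists of 1 / (k + rank of d in that list)
    scores = defaultdict(float)
    for ranked in ranked_lists:
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Example: fuse the original-query ranking with two HyDE hypothetical rankings.
fused = rrf_merge([
    ["doc_12", "doc_7", "doc_3"],   # original query
    ["doc_7", "doc_9", "doc_12"],   # hypothetical 1
    ["doc_7", "doc_12", "doc_5"],   # hypothetical 2
])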

Combining with Hybrid Search

HyDE improves semantic search, but keyword matching still catches exact terms that embeddings miss. The complete architecture:

HyDE generation → Multiple hypotheticals (plus original query) → Each gets BM25 + semantic search → Merge all results

As shown in the visualization above, the improved cosine similarity from HyDE makes semantic search more effective. But BM25 still catches exact term matches that embeddings might miss. Running both for each query (original + hypotheticals) provides comprehensive coverage.

Merging Strategy for Hybrid

For each query (original + each HyDE hypothetical), you get two ranked lists:

  1. BM25 results (keyword matching)
  2. Semantic search results (embedding similarity)

Combine these using RRF or score-based fusion, then merge across all queries (original + hypotheticals) using RRF again.

Key point: Always include the original query as one of the search queries alongside HyDE hypotheticals. The original query is especially valuable for BM25 keyword matching—users often include exact terms that match your corpus vocabulary. HyDE hypotheticals improve semantic search by bridging vocabulary gaps. Together they provide comprehensive coverage: exact matches via BM25 on the original query, and conceptual matches via semantic search on HyDE hypotheticals.
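
Putting the two levels together, a sketch of the full hybrid pipeline might look like the following. It reuses the rrf_merge function from the sketch above, and the search and generation callables are again hypothetical stand-ins returning (doc_id, score) pairs:

def hybrid_hyde_search(query, generate_hypotheticals, bm25_search, semantic_search, top_k=20):
    queries = [query] + generate_hypotheticals(query)   # original query always included
    per_query_rankings = []
    for q in queries:
        # Level 1: fuse BM25 and semantic results for this query.
        fused = rrf_merge([
            [doc_id for doc_id, _ in bm25_search(q, top_k)],
            [doc_id for doc_id, _ in semantic_search(q, top_k)],
        ])
        per_query_rankings.append([doc_id for doc_id, _ in fused])
    # Level 2: fuse across the original query and all hypotheticals.
    final = rrf_merge(per_query_rankings)
    return final[:top_k]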

When to Use HyDE

HyDE isn't free. Generation adds 500-1500ms before search begins. Use it when:

  • Query vocabulary likely mismatches corpus — User-facing queries with casual language, while your corpus uses specialized terminology
  • Corpus has specialized terminology — Technical documentation, domain-specific jargon, or industry-specific phrasing
  • Precision matters more than speed — You can afford the latency for better results
  • Vocabulary gap is the problem — Standard semantic search underperforms due to phrasing differences, not other issues

Skip HyDE when:

  • Query vocabulary already matches corpus well — Your users already use the same terminology as your documents
  • Sub-100ms latency requirements — Real-time systems where generation latency is prohibitive
  • Simple, straightforward queries — Queries that work fine with standard semantic search
  • Small corpus — Where standard search already works well

It can also introduce noise if the LLM generates concepts that aren't actually in your data: the hypothetical uses terminology that no real document contains, which pulls retrieval toward poor matches. This is why style samples matter: they ground the generation in your actual corpus vocabulary.

Consider a hybrid approach: route complex queries with vocabulary mismatch to HyDE, while simple queries go directly to standard search. A/B test to measure the impact.
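
One way to sketch that routing, under the assumption that a weak best match for the raw query signals a vocabulary gap (the 0.35 threshold is illustrative, not a recommendation):

import numpy as np

def route_query(query, embed, doc_vectors, hyde_search, standard_search, threshold=0.35):
    # Probe with the raw query first.
    q_vec = embed(query)
    sims = doc_vectors @ q_vec / (
        np.linalg.norm(doc_vectors, axis=1) * np.linalg.norm(q_vec)
    )
    if float(sims.max()) < threshold:
        # Weak direct match: likely vocabulary mismatch, worth HyDE's latency.
        return hyde_search(query)
    return standard_search(query)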

Conclusion

As the visualization above demonstrates, HyDE bridges vocabulary gaps when query language doesn't match corpus style. By generating hypothetical documents that mirror your corpus vocabulary, HyDE moves query embeddings closer to relevant document clusters, improving cosine similarity and retrieval effectiveness.

This technique is worth the complexity when you have vocabulary mismatch problems and can afford the 500-1500ms generation latency. For user-facing queries with casual language searching specialized corpora, HyDE can meaningfully improve precision—especially when combined with hybrid search and proper merging strategies.

Cite this post


Cole Hoffer. (Jan 2026). Applying HyDE to Domain-Specific Search. Cole Hoffer. https://www.colehoffer.ai/articles/advanced-rag-hyde
