May 2026 · 8 min read

Building RAG Pipelines That Actually Work

AI, RAG, LangChain

Retrieval-Augmented Generation (RAG) is simple in a tutorial but hard in production. Standard RAG pipelines tend to retrieve the wrong chunks, pad the prompt with irrelevant ones, and run into context-length limits.

Here are the four changes that improved my production setup the most:

1. Advanced Chunking Strategies

Instead of simple character splits, use parent-child chunking. Index small chunks (e.g., 100-200 tokens) so that vector search matches precisely, but return the larger parent paragraph (e.g., 800 tokens) to give the LLM richer context.
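A minimal sketch of the parent-child idea, under loud assumptions: chunk sizes are counted in whitespace-separated words rather than tokens, and `overlap_score` is a toy word-overlap stand-in for real vector similarity. The names `Child`, `build_index`, and `retrieve_parents` are hypothetical, not from any library.

```python
from dataclasses import dataclass


@dataclass
class Child:
    text: str
    parent_id: int  # index of the parent passage this chunk came from


def build_index(parents: list[str], child_size: int = 8) -> list[Child]:
    """Split each parent into small word windows that remember their parent."""
    children = []
    for parent_id, parent in enumerate(parents):
        words = parent.split()
        for start in range(0, len(words), child_size):
            window = " ".join(words[start:start + child_size])
            children.append(Child(window, parent_id))
    return children


def overlap_score(query: str, text: str) -> int:
    """Toy relevance: number of shared lowercase words (stand-in for embeddings)."""
    return len(set(query.lower().split()) & set(text.lower().split()))


def retrieve_parents(query: str, parents: list[str], children: list[Child],
                     top_k: int = 2) -> list[str]:
    """Rank the small chunks, then return their deduplicated parent passages."""
    ranked = sorted(children, key=lambda c: overlap_score(query, c.text), reverse=True)
    seen, result = set(), []
    for child in ranked[:top_k]:
        if child.parent_id not in seen:
            seen.add(child.parent_id)
            result.append(parents[child.parent_id])
    return result


parents = [
    "The billing service retries failed payments three times before it gives up "
    "and alerts the on-call engineer.",
    "Search latency is dominated by the re-ranking stage, which runs a "
    "cross-encoder over every candidate chunk.",
]
children = build_index(parents, child_size=8)
hits = retrieve_parents("retries failed payments", parents, children, top_k=1)
```

The match happens on an eight-word window, but `hits` holds the full parent passage, which is the text that would go into the prompt.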

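2. Hybrid Search

Combine dense vector embeddings (for semantic matching) with sparse lexical search (BM25, for matching specific names, IDs, or keywords). Merge the two result lists with Reciprocal Rank Fusion (RRF), which works on each document's rank rather than its raw score, so the two scoring scales never have to be normalized against each other.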
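RRF itself is only a few lines. The sketch below assumes each retriever returns an ordered list of document IDs, best first; the function name and the sample IDs are hypothetical. The constant `k = 60` is the commonly used default, and each document's fused score is the sum of `1 / (k + rank)` over the lists it appears in.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of document IDs into one list, best first."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by fused score, breaking ties by ID so the output is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


dense_hits = ["doc_a", "doc_b", "doc_c"]   # assumed output of the vector search
sparse_hits = ["doc_c", "doc_a", "doc_d"]  # assumed output of BM25
fused = reciprocal_rank_fusion([dense_hits, sparse_hits])
```

Documents that both retrievers rank highly (`doc_a`, `doc_c`) rise above documents that only one of them found.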

3. Re-Ranking

Run a cross-encoder model (like Cohere Rerank or BGE-Rerank) on the top 25-50 retrieved chunks. It scores each query-chunk pair jointly, which is more precise than comparing embeddings, so you can keep only the highest-scoring chunks and drop the irrelevant ones before they reach the LLM.
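The re-ranking step can be sketched independently of any particular model. Here `score_pair` is a hypothetical callable that takes a query and a chunk and returns a relevance score; in a real pipeline it would wrap a cross-encoder call, while `toy_score` below is only a word-overlap stand-in so the example runs without a model. The `top_n` and `min_score` parameters are likewise assumptions for illustration.

```python
from typing import Callable


def rerank(query: str, chunks: list[str],
           score_pair: Callable[[str, str], float],
           top_n: int = 3, min_score: float = 0.0) -> list[str]:
    """Score every (query, chunk) pair, keep the best top_n above min_score."""
    scored = [(score_pair(query, chunk), chunk) for chunk in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk for score, chunk in scored[:top_n] if score >= min_score]


def toy_score(query: str, chunk: str) -> float:
    """Stand-in scorer: fraction of query words found in the chunk."""
    query_words = set(query.lower().split())
    if not query_words:
        return 0.0
    return len(query_words & set(chunk.lower().split())) / len(query_words)


candidates = [
    "Invoices are emailed monthly",
    "Open dashboard settings to reset an API key",
    "API rate limits reset every hour",
]
kept = rerank("reset api key", candidates, toy_score, top_n=3, min_score=0.5)
```

Passing the scorer in as a function keeps the filtering logic testable and makes it easy to swap one re-ranking model for another.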

4. Query Rewriting

Users rarely write optimal search queries. Introduce a query pre-processing step that rewrites the user's input into multiple search queries, or expands it with synonyms, before hitting the vector database.
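As a rough illustration of the synonym-expansion variant, the sketch below uses a hand-written synonym table; in practice the rewrites would more likely come from an LLM. `expand_query`, `multi_query_search`, the `search` callable, and the sample data are all hypothetical.

```python
from typing import Callable


def expand_query(query: str, synonyms: dict[str, list[str]]) -> list[str]:
    """Return the original query plus variants with one word swapped for a synonym."""
    variants = [query]
    words = query.lower().split()
    for i, word in enumerate(words):
        for alternative in synonyms.get(word, []):
            variant = " ".join(words[:i] + [alternative] + words[i + 1:])
            if variant not in variants:
                variants.append(variant)
    return variants


def multi_query_search(queries: list[str],
                       search: Callable[[str], list[str]],
                       per_query: int = 3) -> list[str]:
    """Run every query variant and merge the hits, dropping duplicates in order."""
    seen, merged = set(), []
    for query in queries:
        for doc_id in search(query)[:per_query]:
            if doc_id not in seen:
                seen.add(doc_id)
                merged.append(doc_id)
    return merged


synonyms = {"login": ["sign-in", "authentication"], "error": ["failure"]}
queries = expand_query("fix login error", synonyms)

# Stand-in for the real retriever: a fixed lookup table of results per query.
fake_results = {
    "fix login error": ["d1", "d2"],
    "fix sign-in error": ["d2", "d3"],
}
merged = multi_query_search(queries, lambda q: fake_results.get(q, []))
```

The merge here only removes duplicates; the per-variant result lists could just as well be combined by rank, as in the hybrid search step.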
