Large language models (LLMs) are astonishing pattern-completion engines, but they are also static archives: everything they “know” is baked into billions of parameters frozen at training time. Once the world changes, or the knowledge you need lives in sensitive company manuals that were never public, the model is stuck in yesterday’s reality.

Retrieval-Augmented Generation (RAG) fixes that. This recipe solves three chronic LLM pain points:

  1. Freshness – swap in new documents; no expensive re-training.

  2. Transparency & traceability – show the user where each fact came from and reduce hallucinations.

  3. Parameter efficiency – you don’t need to fine-tune the LLM for every domain.

The idea crystallised in the paper “Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks” by Patrick Lewis et al. (2020). Since then it has become a de facto blueprint for building production-grade, knowledge-grounded chatbots and assistants.

The Two Pipelines That Matter

The diagram below captures both flows: the lower-left block is the index pipeline, and the upper-right block is the RAG pipeline, where retrieval meets generation.

Index Pipeline — Make knowledge searchable

Your documents go through an embedding model that converts each text chunk into a vector. Those vectors are written to a vector store (FAISS, pgvector, DuckDB, etc.) where “distance ≈ semantic similarity”. My blog article “Vector database – what, why, and how” covers the basics of storing vectors:

Document ─▶ Loading & Cleansing ─▶ Chunking ─▶ Embedding Model ─▶ [0.11, 0.42, …] ─▶ Vector Store
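As a rough illustration, here is a minimal index-pipeline sketch in Python. It assumes the sentence-transformers and faiss packages are installed; the model name and the two example documents are placeholders, not recommendations.

```python
# Minimal index-pipeline sketch (assumptions: sentence-transformers and faiss installed;
# `docs` stands in for your cleaned, chunked documents).
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

docs = [
    "RAG combines retrieval with generation.",
    "FAISS performs fast nearest-neighbor search.",
]

model = SentenceTransformer("all-MiniLM-L6-v2")           # 384-dimensional embeddings
vectors = model.encode(docs, normalize_embeddings=True)   # L2-normalised: cosine == dot product

index = faiss.IndexFlatIP(vectors.shape[1])               # exact inner-product index
index.add(np.asarray(vectors, dtype="float32"))           # write the vectors to the store
```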

RAG (Query-Time) Pipeline — Answer with context

The query pipeline uses the knowledge base created in the vector store by the index pipeline to answer questions; a minimal query-time sketch follows the four steps below.

  1. A user question arrives.

  2. Retriever does a similarity search in the vector store.

  3. Retrieved passages are added to the prompt (augmentation).

  4. The LLM reads this enriched prompt and generates the final answer.
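
Continuing the index sketch above (reusing `model`, `index`, and `docs` from that block), the query-time flow might look like this; the question text is only an example.

```python
# Query-time sketch: retrieve, augment, then hand off to the LLM.
import numpy as np

question = "What does FAISS do?"
q_vec = model.encode([question], normalize_embeddings=True)

# 2. Similarity search: top-2 nearest chunks (inner product == cosine on normalised vectors).
scores, ids = index.search(np.asarray(q_vec, dtype="float32"), 2)
context = "\n".join(docs[i] for i in ids[0])

# 3. Augment the prompt with the retrieved context.
prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
# 4. `prompt` is now ready for the LLM call (see the Generation section below).
```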

[Diagram: RAG with pipelines – index and query-time flows]

Hands-On Workshop: What You’ll Build

Interested in experimenting with RAG and learning it hands-on? My GitHub repository for the TDWI 2025 conference contains a working example in a notebook. The notebook is structured into the following sections:

Workshop Step | Pipeline Stage | Why It Matters
--- | --- | ---
Load & clean data – quality first | Pre-Index | Garbage in ⇒ garbage vectors out
Chunking – slice the elephant | Index | Keeps vectors inside context window
Embedding – pour the world into vectors | Index | Numeric keys for semantic search
Vector Store – math, not magic (FAISS / DuckDB / PostgreSQL) | Index → Retriever | Fast nearest-neighbor lookup
Prompt polishing – smartly weave sources | Augmentation | Controls context length & attribution
“Chat-GPT-ified” – context in, hallucinations out | LLM Generation | Grounds answers; reduces fabrications

Technical Walk-Through — Deep Dive

Data Ingestion & Cleaning

Before a single vector is born, your raw corpus must be typed, de-noised, and traceable (a small ingestion sketch follows this list).

  • Type detection & parsing – Use multi-format loaders so PDFs, HTML, Markdown, and plain text all arrive as uniform text objects.

  • Boiler-plate stripping – Navigation bars, cookie banners, and “back to top” links poison embeddings because they appear verbatim across pages; strip them early.

  • Language & encoding normalisation – Only embed languages the model supports, and fix mojibake so “RAG” ≠ “R�G”.

  • Metadata stamping – Attach {url, section, timestamp} to every chunk; this powers later citations, TTL expiry, and rollback in case of data-quality regressions.
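
A hedged sketch of that ingestion step, assuming BeautifulSoup for HTML parsing; the `pages` dictionary and the list of stripped tags are illustrative assumptions.

```python
# Ingestion sketch: strip boiler-plate from HTML and stamp metadata onto each document.
from datetime import datetime, timezone
from bs4 import BeautifulSoup

def clean_page(url: str, html: str) -> dict:
    soup = BeautifulSoup(html, "html.parser")
    # Drop typical boiler-plate elements that would otherwise poison the embeddings.
    for tag in soup(["nav", "header", "footer", "script", "style"]):
        tag.decompose()
    text = " ".join(soup.get_text(separator=" ").split())    # normalise whitespace
    return {
        "text": text,
        "metadata": {                                         # powers citations, TTL expiry, rollback
            "url": url,
            "section": soup.title.string if soup.title else "",
            "timestamp": datetime.now(timezone.utc).isoformat(),
        },
    }

pages = {"https://example.com/manual": "<html><nav>Menu</nav><body>Press reset for 5 seconds.</body></html>"}
documents = [clean_page(url, html) for url, html in pages.items()]
```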

Smart Chunking

Chunk size is the silent killer of retrieval quality; a chunking sketch follows the list below.

  • Balancing act – Large chunks carry more context but can bury the relevant information; tiny chunks keep the relevant information sharp but may lose its surrounding context.

  • Overlap matters – Introduce 10-20 % overlap so sentences that cross boundaries aren’t torn apart, especially for legal or medical texts with heavy pronoun use.

  • Domain-specific splitters – Section-aware splitting keeps logical blocks intact in manuals or docs.
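
A simple sliding-window chunker illustrates the overlap idea; the chunk size and overlap below are example values, not tuned recommendations, and production code would more likely split on sentence or section boundaries.

```python
# Chunking sketch: fixed-size character windows with roughly 15 % overlap.
def chunk_text(text: str, chunk_size: int = 800, overlap: int = 120) -> list[str]:
    step = chunk_size - overlap                    # slide forward, keeping `overlap` characters
    chunks = []
    for start in range(0, len(text), step):
        chunk = text[start:start + chunk_size]
        if chunk.strip():
            chunks.append(chunk)
    return chunks

chunks = chunk_text("RAG pipelines slice documents into overlapping windows. " * 40)
print(len(chunks), "chunks; first chunk has", len(chunks[0]), "characters")
```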

Embeddings

Here, text turns into mathematics (the embedding sketch after this list shows batching and normalisation).

  • Model choice – Open-source sentence transformers give you on-prem freedom; vendor APIs (e.g. OpenAI) trade cost for turnkey quality and speed.

  • Dimensionality & normalisation – Most modern models output 384- to 1024-dimensional vectors. ℓ²-normalising them turns cosine similarity into a simple dot product, lowering computation cost downstream.

  • Batching & streaming – Stream chunks in batches to avoid exhausting GPU RAM; persist vectors incrementally so failed runs can resume without re-embedding everything.
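
A hedged embedding sketch that batches chunks and persists each batch so an interrupted run can resume; the model name, batch size, and output directory are assumptions.

```python
# Embedding sketch: encode in batches, L2-normalise, and persist vectors incrementally.
import numpy as np
from pathlib import Path
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
out_dir = Path("embeddings")
out_dir.mkdir(exist_ok=True)

def embed_in_batches(chunks: list[str], batch_size: int = 64) -> None:
    for i in range(0, len(chunks), batch_size):
        part = out_dir / f"batch_{i:06d}.npy"
        if part.exists():                          # resume: skip batches already on disk
            continue
        vectors = model.encode(
            chunks[i:i + batch_size],
            batch_size=batch_size,
            normalize_embeddings=True,             # cosine similarity becomes a dot product
        )
        np.save(part, np.asarray(vectors, dtype="float32"))

embed_in_batches(chunks)   # `chunks` from the chunking sketch above
```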

Vector Store Design

Where those vectors actually live determines latency, recall, and scalability (a pgvector sketch follows the list).

  • FAISS – Ideal for single-node or GPU-accelerated use cases; supports HNSW graphs and IVF for high-recall ANN. Easy to start.

  • DuckDB – Great for local notebooks or browser demos; minimal install footprint and instant SQL analytics on your embeddings.

  • PostgreSQL + pgvector – Brings ACID guarantees, SQL joins, and enterprise orchestration. Works well when you need to mix structured filters (e.g. WHERE language='de') with semantic search.

  • Tuning knobs – HNSW and IVF parameters (e.g. efSearch, nprobe) trade retrieval speed against retrieval accuracy.

  • Combined search – Similarity search alone is often not enough; combine it with keyword or full-text search (hybrid search) so exact terms are still found.
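
The pgvector bullet above can be made concrete with a small sketch. It assumes PostgreSQL with the pgvector extension, a table chunks(text, language, embedding vector(384)), the psycopg and pgvector Python packages, and a placeholder connection string.

```python
# pgvector sketch: mix a structured SQL filter with semantic ranking in one query.
import psycopg
from pgvector.psycopg import register_vector
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
query_vector = model.encode("Wie konfiguriere ich das Gerät?", normalize_embeddings=True)

with psycopg.connect("dbname=rag user=rag") as conn:
    register_vector(conn)                          # teach psycopg the vector type
    rows = conn.execute(
        """
        SELECT text, embedding <=> %s AS distance  -- <=> is pgvector's cosine distance
        FROM chunks
        WHERE language = 'de'                      -- structured filter
        ORDER BY distance                          -- semantic ranking
        LIMIT 3
        """,
        (query_vector,),
    ).fetchall()
```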

Prompt Augmentation

Retrieval and prompt augmentation are more than “k-nearest neighbours” (a prompt-building sketch follows this list).

  • First-stage ANN – A fast, approximate search returns the top k candidates in a few milliseconds.

  • Grounding guardrails – Explicitly instruct the model: “Answer using only the context. If unsure, say ‘I don’t know.’” This simple line slashes hallucination rates.

  • Top 2-3 candidates – The top-1 result is often not enough, so pass the top 2-3 candidates into the prompt; raw similarity scores are only a rough proxy for relevance.
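
A sketch of the augmentation step, weaving source IDs and a grounding instruction into the prompt; the `passages` list is example data standing in for real retrieval results.

```python
# Prompt-augmentation sketch: top-3 passages, source IDs, and an explicit grounding guardrail.
passages = [
    {"id": "S1", "url": "https://example.com/manual#setup", "text": "Press the reset button for 5 seconds."},
    {"id": "S2", "url": "https://example.com/manual#faq",   "text": "The status LED blinks during a reset."},
    {"id": "S3", "url": "https://example.com/manual#care",  "text": "Do not reset the device while charging."},
]

def build_prompt(question: str, passages: list[dict]) -> str:
    context = "\n".join(f"[{p['id']}] ({p['url']}) {p['text']}" for p in passages)
    return (
        "Answer using only the context below. If unsure, say 'I don't know'.\n"
        "Cite the source IDs you used, e.g. [S1].\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )

prompt = build_prompt("How do I reset the device?", passages)
```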

Generation

This is where retrieval hands off to generation; a short generation sketch follows the list.

  • Grounding guardrails – Use a low temperature to reduce hallucination rates, and validate the output, especially when it may contain sensitive data the user is not allowed to see.

  • Citation scaffolding – Embed source IDs (e.g. [S1]) and URLs in the final answer so the user gets more context and can verify the result.
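
A hedged generation sketch using the OpenAI Python client (v1); the model name is an example choice, and an API key is assumed to be set in the environment.

```python
# Generation sketch: send the augmented prompt with a low temperature.
from openai import OpenAI

client = OpenAI()                                    # reads OPENAI_API_KEY from the environment
response = client.chat.completions.create(
    model="gpt-4o-mini",                             # example model, swap for your deployment
    messages=[{"role": "user", "content": prompt}],  # `prompt` from the augmentation sketch
    temperature=0.1,                                 # low temperature reduces fabrication
)
answer = response.choices[0].message.content
print(answer)                                        # should cite sources like [S1]; validate before display
```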


These six stages—cleaning, chunking, embedding, storing, prompting, and generation—form the spine of any robust RAG system. Master their trade-offs and you’ll unlock fast, fresh, and faithful answers from your knowledge base.

Quick Recap — Why Your RAG Pipeline Matters

Cleaning → Chunking → Embedding → Vector Storage → Retrieval & Prompt Augmentation → Generation.
Those six steps turn raw, ever-changing documents into grounded, low-hallucination answers. Get any one wrong and you either lose recall, blow past the context window, or feed the LLM junk it will faithfully regurgitate. Nail them all and you unlock fresh, auditable knowledge on demand.

Ready to experiment & learn?

  • Clone the notebook: Drop your own web pages into the “Data Ingestion” cells.

  • Guardrails and security: RAG also introduces security risks such as prompt injection and data leakage; internal users could coax confidential information out of the system via prompt injection. Guardrails and access controls are essential before going to production.

  • Improve it: a production pipeline should live in properly engineered Python modules (or another language), not in a notebook. There is still room for many other improvements, such as richer metadata, evaluation metrics, PDF file handling, …. The demo notebook is just a starting point.

Ready? Fork the repo, swap in your corpus, and share your first grounded chatbot with the world.

Looking Ahead — From “Plain RAG” to Agentic RAG & the Model Context Protocol

  • Agentic RAG – What changes: instead of a single, linear pipeline, autonomous agents decide when to reformulate queries, call specialised tools (SQL, Jira, web search), or loop until confidence thresholds are met; reflection, planning, and multi-step tool use make answers both deeper and more resilient. Why it matters: early benchmarks show 10-20 point gains in answer accuracy on multi-hop tasks, because the system can think, try, and verify rather than fire-and-forget.

  • Model Context Protocol (MCP) – What changes: an open standard (backed by Anthropic, Microsoft, etc.) that lets any LLM or agent discover and call external data sources through a single, secure handshake, think “HTTP for AI tools”. Why it matters: it slashes the M × N integration mess, since one client can talk to many tools and many models can share one tool server; early enterprise pilots report integration time cut by 70 %.

What This Means

  1. Design for autonomy: Break monolithic RAG flows into callable skills (retrieve, rank, validate) so future agents can orchestrate them.

  2. Adopt MCP-ready endpoints: Expose your vector store, SQL warehouse, or function library behind MCP today; tomorrow’s agents will plug-and-play.

  3. Iterate safely: Keep the evaluation harness running; as agents begin to loop and plan, automated tests will catch drift before users do.