
RAG Patterns Explained Simply

5 min read

Data Eng

RAG is ETL + search + LLM. Your pipelines feed the index. You own chunking and freshness.

Backend

Implement it as: embed docs → store in vector DB → at query time: retrieve, inject, prompt.

ML Eng

You know embeddings. RAG = retrieval layer + prompt engineering. No fine-tuning required.

TL;DR

  • RAG = give the LLM your own data instead of relying on its training. Retrieve relevant chunks, stuff them in the prompt.
  • When to use: Q&A over docs, support bots, internal knowledge, anything where "the answer is in our data."
  • Pitfalls: bad chunking, stale index, retrieval misses, prompt overflow.

What RAG Actually Is

You have documents. Users ask questions. The LLM doesn't know your documents — it was trained on the public internet. So you:

  1. Chunk your docs into pieces (paragraphs, sections, or semantic units)
  2. Embed each chunk into a vector (a list of numbers)
  3. Store embeddings in a vector DB (Pinecone, Weaviate, pgvector, etc.)
  4. At query time: embed the user's question, find the K most similar chunks, put them in the prompt
  5. Prompt: "Here's the context: [chunks]. Answer the question using only this context."

The LLM answers from your data. No hallucination about things it doesn't know. (Or less, anyway.)

The Simple Flow

[Your Docs] → Chunk → Embed → [Vector DB]
                                 ↑
[User Question] → Embed → Retrieve top-K → [Prompt: context + question] → LLM → Answer
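
To make this concrete, here is a toy, self-contained sketch of the flow above (all names are made up for illustration): the "embedding" is a hashed bag-of-words vector and the index is an in-memory list, so it runs with no external services. In practice you'd swap in a real embedding model, a vector DB, and an LLM call.

import numpy as np

def toy_embed(text: str, dim: int = 256) -> np.ndarray:
    # Hashed bag-of-words "embedding": a stand-in for a real embedding model.
    vec = np.zeros(dim)
    for word in text.lower().split():
        vec[hash(word) % dim] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm else vec

class ToyVectorIndex:
    # Stand-in for a vector DB: stores (vector, chunk) pairs, searches by cosine similarity.
    def __init__(self):
        self.vectors: list[np.ndarray] = []
        self.chunks: list[str] = []

    def add(self, chunk: str) -> None:
        self.vectors.append(toy_embed(chunk))
        self.chunks.append(chunk)

    def search(self, question: str, top_k: int = 3) -> list[str]:
        q = toy_embed(question)
        scores = [float(np.dot(q, v)) for v in self.vectors]  # cosine: vectors are unit length
        best = sorted(range(len(scores)), key=scores.__getitem__, reverse=True)[:top_k]
        return [self.chunks[i] for i in best]

def build_prompt(question: str, context_chunks: list[str]) -> str:
    context = "\n\n".join(context_chunks)
    return (f"Here's the context:\n{context}\n\n"
            "Answer the question using only this context.\n"
            f"Question: {question}")

# Index the chunks once, then retrieve + prompt per question.
index = ToyVectorIndex()
for chunk in ["Employees get 20 vacation days per year.",
              "Returns are accepted within 30 days of purchase."]:
    index.add(chunk)

question = "What's our vacation policy?"
prompt = build_prompt(question, index.search(question, top_k=1))
# `prompt` now goes to whatever LLM client you use.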

When RAG Beats Plain LLM

Use Case                       | RAG? | Why
"What's our vacation policy?"  | Yes  | Policy lives in your HR docs
"Summarize this PDF"           | No   | The PDF is the input; no retrieval
"Answer from our API docs"     | Yes  | Docs are your corpus
"Write a haiku"                | No   | No corpus; LLM's training is enough

Rule: If the answer is in your data, use RAG.

Common Pitfalls

Chunking too small. "What is the return policy?" — if "return policy" is split across 3 chunks, retrieval may miss it. Chunk by semantic meaning, not fixed character count.

Chunking too big. Huge chunks = noisy retrieval, and you hit prompt token limits. Balance: 300–800 tokens per chunk often works.
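
A sketch of fixed-window chunking with overlap, using whitespace-split words as a rough stand-in for model tokens (a real pipeline would count with the embedding model's tokenizer):

def chunk_text(text: str, chunk_size: int = 512, overlap: int = 64) -> list[str]:
    # Sliding window over whitespace tokens; overlapping windows keep phrases intact at boundaries.
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break
    return chunks

# A 1,200-word doc with 512-word chunks and 64 words of overlap yields 3 chunks.
print(len(chunk_text("word " * 1200)))  # 3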

Stale index. Docs change. Re-index on publish, or run a nightly job. RAG with outdated data = wrong answers.
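
One way to keep the index fresh without re-embedding everything: hash each doc's content and re-index only what changed. A sketch, where the upsert callable and the hash store are placeholders for your vector store's write path and a small metadata table:

import hashlib
from typing import Callable

def reindex_changed_docs(
    docs: dict[str, str],                # doc_id -> current content
    seen_hashes: dict[str, str],         # doc_id -> content hash from the last run (persist this)
    upsert: Callable[[str, str], None],  # re-chunks, re-embeds, and overwrites one doc in the store
) -> list[str]:
    # Returns the doc_ids that actually changed and were re-indexed.
    changed = []
    for doc_id, content in docs.items():
        digest = hashlib.sha256(content.encode()).hexdigest()
        if seen_hashes.get(doc_id) != digest:
            upsert(doc_id, content)
            seen_hashes[doc_id] = digest
            changed.append(doc_id)
    return changed

# Second run with unchanged content touches nothing.
hashes: dict[str, str] = {}
print(reindex_changed_docs({"faq": "Returns accepted within 30 days."}, hashes, lambda i, c: None))  # ['faq']
print(reindex_changed_docs({"faq": "Returns accepted within 30 days."}, hashes, lambda i, c: None))  # []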

Retrieval misses. User phrasing doesn't match doc phrasing. Use hybrid search (keyword + vector) or query expansion.
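
A sketch of hybrid scoring that blends keyword overlap with vector similarity, so exact terms still count when phrasing differs. The 50/50 weighting and plain term overlap (instead of BM25) are simplifying assumptions, and the function names are made up:

import numpy as np

def keyword_score(question: str, chunk: str) -> float:
    # Fraction of question terms that literally appear in the chunk.
    q_terms = set(question.lower().split())
    c_terms = set(chunk.lower().split())
    return len(q_terms & c_terms) / max(len(q_terms), 1)

def hybrid_score(question: str, chunk: str,
                 q_vec: np.ndarray, c_vec: np.ndarray, alpha: float = 0.5) -> float:
    # Blend vector similarity with keyword overlap; rank chunks by this instead of cosine alone.
    cosine = float(np.dot(q_vec, c_vec) / (np.linalg.norm(q_vec) * np.linalg.norm(c_vec)))
    return alpha * cosine + (1 - alpha) * keyword_score(question, chunk)

Query expansion attacks the same miss from the other side: rewrite the question into a few paraphrases, retrieve for each, and merge the results.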

No "I don't know." If retrieval returns junk, the LLM will still answer. Add: "If the context doesn't contain the answer, say 'I don't have that information.'"

Minimal RAG Stack (Conceptual)

  • Chunking: By section, or fixed-size windows with overlap (e.g., 512 tokens with a 64-token overlap)
  • Embedding model: OpenAI text-embedding-3-small, Cohere, or open source (sentence-transformers)
  • Vector store: pgvector (Postgres), Pinecone, or Chroma for prototyping
  • LLM: Same as always. GPT-4, Claude, etc. Prompt = context + question

# At query time: embed the question, retrieve the top-K chunks, stuff them into the prompt.
# Assumes embed_model, vector_db, and llm are already-configured clients for the pieces above.
def answer_question(question: str, top_k: int = 3) -> str:
    q_embedding = embed_model.embed(question)
    chunks = vector_db.search(q_embedding, top_k=top_k)
    context = "\n\n".join(c["text"] for c in chunks)
    prompt = (
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "If the context doesn't contain the answer, say 'I don't have that information.'"
    )
    return llm.complete(prompt)

Quick Check

Your RAG returns 3 chunks. The LLM still hallucinates. What's the most likely fix?

Do This Next

  1. Pick one document set — your API docs, a FAQ, or internal wiki. Export as text.
  2. Chunk it — by section or 500-token windows. How many chunks do you get? (A quick counting sketch follows this list.)
  3. Prototype — use a hosted vector DB (Pinecone free tier) or pgvector. Embed chunks, embed 3 test questions, retrieve top-3. Are they relevant?
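
For step 2, a rough chunk count takes a few lines (word windows as a token proxy; the file path is a placeholder for your export):

from pathlib import Path

words = Path("docs_export.txt").read_text(encoding="utf-8").split()  # placeholder path
chunks = [" ".join(words[i:i + 500]) for i in range(0, len(words), 500)]
print(f"{len(chunks)} chunks from {len(words)} words")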