
RAG Patterns Explained Simply

5 min read

Data Eng

RAG is ETL + search + LLM. Your pipelines feed the index. You own chunking and freshness.

Backend

Implement it as: embed docs → store in vector DB → at query time: retrieve, inject, prompt.

ML Eng

You know embeddings. RAG = retrieval layer + prompt engineering. No fine-tuning required.

TL;DR

  • RAG = give the LLM your own data instead of relying on its training. Retrieve relevant chunks, stuff them in the prompt.
  • When to use: Q&A over docs, support bots, internal knowledge, anything where "the answer is in our data."
  • Pitfalls: bad chunking, stale index, retrieval misses, prompt overflow.

What RAG Actually Is

You have documents. Users ask questions. The LLM doesn't know your documents — it was trained on the public internet. So you:

  1. Chunk your docs into pieces (paragraphs, sections, or semantic units)
  2. Embed each chunk into a vector (a list of numbers)
  3. Store embeddings in a vector DB (Pinecone, Weaviate, pgvector, etc.)
  4. At query time: embed the user's question, find the K most similar chunks, put them in the prompt
  5. Prompt: "Here's the context: [chunks]. Answer the question using only this context."

The LLM answers from your data. No hallucination about things it doesn't know. (Or less, anyway.)

The Simple Flow

[Your Docs] → Chunk → Embed → [Vector DB]
                                 ↑
[User Question] → Embed → Retrieve top-K → [Prompt: context + question] → LLM → Answer
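
To make this concrete, here is a toy, self-contained sketch of the flow above (all names are made up for illustration): the "embedding" is a hashed bag-of-words vector and the index is an in-memory list, so it runs with no external services. In practice you'd swap in a real embedding model, a vector DB, and an LLM call.

import numpy as np

def toy_embed(text: str, dim: int = 256) -> np.ndarray:
    # Hashed bag-of-words "embedding": a stand-in for a real embedding model.
    vec = np.zeros(dim)
    for word in text.lower().split():
        vec[hash(word) % dim] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm else vec

class ToyVectorIndex:
    # Stand-in for a vector DB: stores (vector, chunk) pairs, searches by cosine similarity.
    def __init__(self):
        self.vectors: list[np.ndarray] = []
        self.chunks: list[str] = []

    def add(self, chunk: str) -> None:
        self.vectors.append(toy_embed(chunk))
        self.chunks.append(chunk)

    def search(self, question: str, top_k: int = 3) -> list[str]:
        q = toy_embed(question)
        scores = [float(np.dot(q, v)) for v in self.vectors]  # cosine: vectors are unit length
        best = sorted(range(len(scores)), key=scores.__getitem__, reverse=True)[:top_k]
        return [self.chunks[i] for i in best]

def build_prompt(question: str, context_chunks: list[str]) -> str:
    context = "\n\n".join(context_chunks)
    return (f"Here's the context:\n{context}\n\n"
            "Answer the question using only this context.\n"
            f"Question: {question}")

# Index the chunks once, then retrieve + prompt per question.
index = ToyVectorIndex()
for chunk in ["Employees get 20 vacation days per year.",
              "Returns are accepted within 30 days of purchase."]:
    index.add(chunk)

question = "What's our vacation policy?"
prompt = build_prompt(question, index.search(question, top_k=1))
# `prompt` now goes to whatever LLM client you use.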

When RAG Beats Plain LLM

Use Case                       | RAG? | Why
"What's our vacation policy?"  | Yes  | Policy lives in your HR docs
"Summarize this PDF"           | No   | The PDF is the input; no retrieval
"Answer from our API docs"     | Yes  | Docs are your corpus
"Write a haiku"                | No   | No corpus; LLM's training is enough

Rule: If the answer is in your data, use RAG.

Common Pitfalls

Chunking too small. "What is the return policy?" — if "return policy" is split across 3 chunks, retrieval may miss it. Chunk by semantic meaning, not fixed character count.

Chunking too big. Huge chunks = noisy retrieval, and you hit prompt token limits. Balance: 300–800 tokens per chunk often works.
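
A sketch of fixed-window chunking with overlap, using whitespace-split words as a rough stand-in for model tokens (a real pipeline would count with the embedding model's tokenizer):

def chunk_text(text: str, chunk_size: int = 512, overlap: int = 64) -> list[str]:
    # Sliding window over whitespace tokens; overlapping windows keep phrases intact at boundaries.
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break
    return chunks

# A 1,200-word doc with 512-word chunks and 64 words of overlap yields 3 chunks.
print(len(chunk_text("word " * 1200)))  # 3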

Stale index. Docs change. Re-index on publish, or run a nightly job. RAG with outdated data = wrong answers.
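
One way to keep the index fresh without re-embedding everything: hash each doc's content and re-index only what changed. A sketch, where the upsert callable and the hash store are placeholders for your vector store's write path and a small metadata table:

import hashlib
from typing import Callable

def reindex_changed_docs(
    docs: dict[str, str],                # doc_id -> current content
    seen_hashes: dict[str, str],         # doc_id -> content hash from the last run (persist this)
    upsert: Callable[[str, str], None],  # re-chunks, re-embeds, and overwrites one doc in the store
) -> list[str]:
    # Returns the doc_ids that actually changed and were re-indexed.
    changed = []
    for doc_id, content in docs.items():
        digest = hashlib.sha256(content.encode()).hexdigest()
        if seen_hashes.get(doc_id) != digest:
            upsert(doc_id, content)
            seen_hashes[doc_id] = digest
            changed.append(doc_id)
    return changed

# Second run with unchanged content touches nothing.
hashes: dict[str, str] = {}
print(reindex_changed_docs({"faq": "Returns accepted within 30 days."}, hashes, lambda i, c: None))  # ['faq']
print(reindex_changed_docs({"faq": "Returns accepted within 30 days."}, hashes, lambda i, c: None))  # []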

Retrieval misses. User phrasing doesn't match doc phrasing. Use hybrid search (keyword + vector) or query expansion.
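
A sketch of hybrid scoring that blends keyword overlap with vector similarity, so exact terms still count when phrasing differs. The 50/50 weighting and plain term overlap (instead of BM25) are simplifying assumptions, and the function names are made up:

import numpy as np

def keyword_score(question: str, chunk: str) -> float:
    # Fraction of question terms that literally appear in the chunk.
    q_terms = set(question.lower().split())
    c_terms = set(chunk.lower().split())
    return len(q_terms & c_terms) / max(len(q_terms), 1)

def hybrid_score(question: str, chunk: str,
                 q_vec: np.ndarray, c_vec: np.ndarray, alpha: float = 0.5) -> float:
    # Blend vector similarity with keyword overlap; rank chunks by this instead of cosine alone.
    cosine = float(np.dot(q_vec, c_vec) / (np.linalg.norm(q_vec) * np.linalg.norm(c_vec)))
    return alpha * cosine + (1 - alpha) * keyword_score(question, chunk)

Query expansion attacks the same miss from the other side: rewrite the question into a few paraphrases, retrieve for each, and merge the results.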

No "I don't know." If retrieval returns junk, the LLM will still answer. Add: "If the context doesn't contain the answer, say 'I don't have that information.'"

Minimal RAG Stack (Conceptual)

  • Chunking: By section, or fixed-size windows with overlap (e.g., 512 tokens with a 64-token overlap)
  • Embedding model: OpenAI text-embedding-3-small, Cohere, or open source (sentence-transformers)
  • Vector store: pgvector (Postgres), Pinecone, or Chroma for prototyping
  • LLM: Same as always. GPT-4, Claude, etc. Prompt = context + question

# At query time: embed the question, retrieve the top-K chunks, stuff them into the prompt.
# Assumes embed_model, vector_db, and llm are already-configured clients for the pieces above.
def answer_question(question: str, top_k: int = 3) -> str:
    q_embedding = embed_model.embed(question)
    chunks = vector_db.search(q_embedding, top_k=top_k)
    context = "\n\n".join(c["text"] for c in chunks)
    prompt = (
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "If the context doesn't contain the answer, say 'I don't have that information.'"
    )
    return llm.complete(prompt)

Quick Check

Your RAG returns 3 chunks. The LLM still hallucinates. What's the most likely fix?

Do This Next

  1. Pick one document set — your API docs, a FAQ, or internal wiki. Export as text.
  2. Chunk it — by section or 500-token windows. How many chunks do you get? (A quick counting sketch follows this list.)
  3. Prototype — use a hosted vector DB (Pinecone free tier) or pgvector. Embed chunks, embed 3 test questions, retrieve top-3. Are they relevant?
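
For step 2, a rough chunk count takes a few lines (word windows as a token proxy; the file path is a placeholder for your export):

from pathlib import Path

words = Path("docs_export.txt").read_text(encoding="utf-8").split()  # placeholder path
chunks = [" ".join(words[i:i + 500]) for i in range(0, len(words), 500)]
print(f"{len(chunks)} chunks from {len(words)} words")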