RAG Patterns Explained Simply
Data Eng
RAG is ETL + search + LLM. Your pipelines feed the index. You own chunking and freshness.
Backend
Implement it as: embed docs → store in vector DB → at query time: retrieve, inject, prompt.
ML Eng
You know embeddings. RAG = retrieval layer + prompt engineering. No fine-tuning required.
TL;DR
- RAG = give the LLM your own data instead of relying on its training. Retrieve relevant chunks, stuff them in the prompt.
- When to use: Q&A over docs, support bots, internal knowledge, anything where "the answer is in our data."
- Pitfalls: bad chunking, stale index, retrieval misses, prompt overflow.
What RAG Actually Is
You have documents. Users ask questions. The LLM doesn't know your documents — it was trained on the public internet. So you:
- Chunk your docs into pieces (paragraphs, sections, or semantic units)
- Embed each chunk into a vector (a list of numbers)
- Store embeddings in a vector DB (Pinecone, Weaviate, pgvector, etc.)
- At query time: embed the user's question, find the K most similar chunks, put them in the prompt
- Prompt: "Here's the context: [chunks]. Answer the question using only this context."
The LLM answers from your data. No hallucination about things it doesn't know. (Or less, anyway.)
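To make the indexing half concrete, here's a minimal sketch of chunk → embed → store. `embed_model`, `vector_db`, and their `embed`/`upsert` methods are placeholders rather than a specific SDK; swap in whatever embedding API and vector store you actually use.

```python
# Indexing-side sketch: chunk each doc, embed each chunk, store vector + text.
# `embed_model` and `vector_db` are placeholder clients, matching the
# query-time snippet later in this post.

def chunk_text(text: str) -> list[str]:
    # Naive chunker: split on blank lines, i.e. by paragraph.
    return [p.strip() for p in text.split("\n\n") if p.strip()]

def index_documents(docs: list[dict]) -> None:
    for doc in docs:
        for i, chunk in enumerate(chunk_text(doc["text"])):
            embedding = embed_model.embed(chunk)      # chunk -> vector
            vector_db.upsert(                         # store the vector plus the raw text
                id=f"{doc['id']}-{i}",                # stable ID so re-indexing overwrites
                vector=embedding,
                metadata={"text": chunk, "source": doc["id"]},
            )
```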
The Simple Flow
```
[Your Docs] → Chunk → Embed → [Vector DB]
                                   ↑  (retrieval queries the index)
[User Question] → Embed → Retrieve top-K → [Prompt: context + question] → LLM → Answer
```
When RAG Beats Plain LLM
| Use Case | RAG? | Why |
|---|---|---|
| "What's our vacation policy?" | Yes | Policy lives in your HR docs |
| "Summarize this PDF" | No | The PDF is the input; no retrieval |
| "Answer from our API docs" | Yes | Docs are your corpus |
| "Write a haiku" | No | No corpus; LLM's training is enough |
Rule: If the answer is in your data, use RAG.
Common Pitfalls
Chunking too small. "What is the return policy?" — if "return policy" is split across 3 chunks, retrieval may miss it. Chunk by semantic meaning, not fixed character count.
Chunking too big. Huge chunks = noisy retrieval, and you hit prompt token limits. Balance: 300–800 tokens per chunk often works.
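A sliding window with overlap is the usual middle ground between the two. A rough sketch (word-based for simplicity; a real pipeline would count tokens with the embedding model's own tokenizer):

```python
# Sliding-window chunking: fixed-size windows with overlap, so text that
# straddles a boundary still appears whole in at least one chunk.
def chunk_with_overlap(text: str, window: int = 512, overlap: int = 64) -> list[str]:
    words = text.split()          # stand-in for real tokenization
    step = window - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + window]))
        if start + window >= len(words):
            break                 # the last window already covers the tail
    return chunks
```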
Stale index. Docs change. Re-index on publish, or run a nightly job. RAG with outdated data = wrong answers.
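One cheap way to keep the index fresh: hash each document and re-embed only what changed, on publish or on a schedule. The sketch below reuses the hypothetical `index_documents` from earlier; persisting `seen_hashes` between runs (file, table, wherever) is up to you.

```python
import hashlib

# Re-index only documents whose content hash changed since the last run.
def reindex_changed(docs: list[dict], seen_hashes: dict[str, str]) -> dict[str, str]:
    changed = []
    for doc in docs:
        digest = hashlib.sha256(doc["text"].encode("utf-8")).hexdigest()
        if seen_hashes.get(doc["id"]) != digest:   # new or edited doc
            changed.append(doc)
            seen_hashes[doc["id"]] = digest
    if changed:
        index_documents(changed)                   # re-chunk + re-embed just these
    return seen_hashes                             # persist this between runs
```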
Retrieval misses. User phrasing doesn't match doc phrasing. Use hybrid search (keyword + vector) or query expansion.
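Hybrid can be as simple as running a keyword query and a vector query separately, then merging the two ranked lists. A sketch using reciprocal rank fusion (RRF), where each search is a placeholder returning chunk IDs in rank order:

```python
# Merge two ranked result lists with reciprocal rank fusion (RRF).
def rrf_merge(keyword_ids: list[str], vector_ids: list[str],
              k: int = 60, top_k: int = 3) -> list[str]:
    scores: dict[str, float] = {}
    for ranked in (keyword_ids, vector_ids):
        for rank, chunk_id in enumerate(ranked):
            # A chunk ranked near the top of either list gets a large boost.
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)[:top_k]
```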
No "I don't know." If retrieval returns junk, the LLM will still answer. Add: "If the context doesn't contain the answer, say 'I don't have that information.'"
Minimal RAG Stack (Conceptual)
- Chunking: By section or by overlap (e.g., 512 tokens, 64 overlap)
- Embedding model: OpenAI text-embedding-3-small, Cohere, or open source (sentence-transformers)
- Vector store: pgvector (Postgres), Pinecone, or Chroma for prototyping
- LLM: Same as always. GPT-4, Claude, etc. Prompt = context + question
```python
# At query time: embed the question, retrieve the nearest chunks, stuff the prompt.
# `embed_model`, `vector_db`, and `llm` are placeholder clients for your
# embedding API, vector store, and LLM provider.
def answer_question(question: str, top_k: int = 3) -> str:
    q_embedding = embed_model.embed(question)             # question -> vector
    chunks = vector_db.search(q_embedding, top_k=top_k)   # K most similar chunks
    context = "\n\n".join(c["text"] for c in chunks)      # stitch retrieved text together
    prompt = (
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "If the context doesn't contain the answer, say 'I don't have that information.'"
    )
    return llm.complete(prompt)
```

Quick Check
Your RAG returns 3 chunks. The LLM still hallucinates. What's the most likely fix?
Do This Next
- Pick one document set — your API docs, a FAQ, or internal wiki. Export as text.
- Chunk it — by section or 500-token windows. How many chunks do you get?
- Prototype — use a hosted vector DB (Pinecone free tier), pgvector, or local Chroma. Embed chunks, embed 3 test questions, retrieve top-3. Are they relevant? (A local Chroma starting point is sketched below.)
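If you'd rather start local, a minimal prototype with Chroma's Python client looks roughly like this. Chroma embeds the documents for you with its default model; the chunks and the test question here are made-up placeholders.

```python
import chromadb

# Local, in-memory prototype: add a few chunks, then run a test query.
client = chromadb.Client()
collection = client.get_or_create_collection("docs")

chunks = [
    "Refunds are accepted within 30 days of purchase.",   # placeholder chunks --
    "Shipping takes 3-5 business days.",                  # replace with your own
    "Support is available Monday through Friday.",
]
collection.add(ids=[f"chunk-{i}" for i in range(len(chunks))], documents=chunks)

results = collection.query(query_texts=["What is the return policy?"], n_results=3)
print(results["documents"][0])   # top matches for the first (and only) query
```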