The Problem RAG Solves
Large Language Models like GPT or Claude are trained on a snapshot of public internet data. They know nothing about your company's internal documents, your private customer database, your latest product specs, or anything that happened after their training cutoff.
If you ask the model "what's our refund policy?" it has no idea. If it answers anyway, it's making something up (this is called hallucination).
You have three options to fix this:
Fine-tuning: retrain the model on your data. Expensive, slow, and the model still hallucinates on edge cases.
Prompt stuffing: paste all your documents into every prompt. Breaks down as soon as your corpus exceeds the context window, and wastes tokens even when it fits.
RAG: store your documents externally, retrieve the relevant ones at query time, and add only those to the prompt. This is what the vast majority of production "AI on company data" systems use.
The Core Idea
RAG (Retrieval-Augmented Generation) splits the work into two phases: an indexing phase done ahead of time (steps 1 to 3 below) and a query phase done per request (steps 4 and 5).
Step 1: Embeddings (Turn Text Into Numbers)
An embedding is a fixed-size vector of numbers that represents the meaning of a piece of text. Similar text gets similar vectors. Different text gets different vectors.
Example: the sentence "I love dogs" might embed to [0.21, -0.04, 0.87, ...]. The sentence "Puppies make me happy" embeds to a vector that's close to that one because they mean similar things, even though they share no words.
Embeddings come from specialized models (OpenAI's text-embedding-3, Cohere's embed-v3, open-source models like all-MiniLM). You pass them text and they return a vector (typically 384 to 1536 dimensions).
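A minimal sketch using the openai Python client (an assumption; any of the providers above works the same way) to embed the example sentences and compare them:

```python
# pip install openai numpy
import numpy as np
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

resp = client.embeddings.create(
    model="text-embedding-3-small",  # assumption: any embedding model works here
    input=["I love dogs", "Puppies make me happy", "Quarterly tax filing deadline"],
)
vecs = [np.array(d.embedding) for d in resp.data]

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine(vecs[0], vecs[1]))  # high: similar meaning despite no shared words
print(cosine(vecs[0], vecs[2]))  # low: unrelated topics
```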
Step 2: Chunking
You can't embed an entire 200-page PDF as one vector. Embedding models have token limits (typically 8K tokens) and the resulting vector wouldn't represent fine-grained details.
Instead, split documents into chunks of a few hundred to a few thousand tokens. Embed each chunk separately. Now retrieval can find the specific chunk that answers a question, not the whole document.
Common chunking strategies:
Fixed-size: every 500 tokens. Simple but breaks context (a sentence might split across chunks).
Sentence-aware: split on sentence boundaries. Better preserves meaning.
Semantic: use NLP to find natural section breaks (paragraphs, headers). Best quality, more complex.
Overlapping: chunks share some content (e.g., 500 tokens with 50-token overlap). Reduces "missed context" near chunk boundaries. A minimal version of this is sketched below.
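A sketch of fixed-size chunking with overlap. It counts whitespace-separated words as a crude stand-in for tokens; a real pipeline would use the embedding model's own tokenizer (e.g., tiktoken for OpenAI models):

```python
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into fixed-size chunks of `chunk_size` words,
    each sharing `overlap` words with the previous chunk."""
    words = text.split()  # word count as a rough token proxy; swap in a real tokenizer
    step = chunk_size - overlap
    return [
        " ".join(words[i : i + chunk_size])
        for i in range(0, max(len(words) - overlap, 1), step)
    ]
```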
Step 3: Vector Database
Once you have millions of chunk embeddings, you need fast similarity search. Traditional databases can't do "find me the 5 vectors most similar to this one" efficiently.
Vector databases specialize in this. They use approximate nearest neighbor (ANN) algorithms (HNSW, IVF) to find similar vectors in milliseconds even across billions of entries.
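For intuition, here is the exact computation as a brute-force numpy scan; an ANN index like HNSW returns (approximately) the same top K without visiting every vector:

```python
import numpy as np

def top_k(query: np.ndarray, index: np.ndarray, k: int = 5) -> np.ndarray:
    """Return indices of the k rows of `index` most similar to `query`
    by cosine similarity. This is an O(n) linear scan; fine for thousands
    of vectors, and exactly what ANN indexes exist to avoid at scale."""
    index_n = index / np.linalg.norm(index, axis=1, keepdims=True)
    query_n = query / np.linalg.norm(query)
    sims = index_n @ query_n          # cosine similarity against every row
    return np.argsort(-sims)[:k]      # indices of the k highest scores
```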
Popular options:
Pinecone: managed, easy, expensive at scale.
Weaviate: open source, full-featured.
Qdrant: open source, performant.
Milvus: open source, designed for very large scale.
pgvector: a Postgres extension. Use this if you already have Postgres and your scale is moderate.
Step 4: Retrieval
At query time:
1. Embed the user's question with the same embedding model used for indexing.
2. Ask the vector database: "find the K most similar chunks to this query embedding."
3. The database returns the top K chunks (typically K = 3 to 10).
This is similarity search via cosine distance, dot product, or Euclidean distance. Cosine is most common.
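A quick check of how the metrics relate: many embedding models (OpenAI's, for example) return unit-length vectors, and on unit vectors all three metrics produce the same ranking, since cosine similarity reduces to a dot product and Euclidean distance is a monotone function of it:

```python
import numpy as np

a = np.random.randn(768); a /= np.linalg.norm(a)  # unit-normalize
b = np.random.randn(768); b /= np.linalg.norm(b)

cos = a @ b                           # cosine similarity == dot product on unit vectors
euc = np.linalg.norm(a - b)
assert np.isclose(euc**2, 2 - 2*cos)  # ||a-b||^2 = 2 - 2*cos, so rankings agree
```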
Step 5: Augmenting the Prompt
Take the retrieved chunks and stuff them into the LLM prompt:
You are a helpful assistant. Answer the user's question using
the context below. If the context doesn't contain the answer,
say so honestly.
Context:
[chunk 1: our refund policy is 30 days, no questions asked...]
[chunk 2: returns must include original packaging...]
[chunk 3: digital products cannot be refunded after download...]
Question: How long do I have to return a physical product?
Answer:
The LLM now has fresh, specific data to ground its answer. Instead of inventing a refund policy, it can cite the real one.
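A sketch of this generation step, assuming the openai client; `retrieve` here is a hypothetical stand-in for your step-4 query function:

```python
from typing import Callable
from openai import OpenAI

client = OpenAI()

def answer(question: str, retrieve: Callable[[str, int], list[str]], k: int = 3) -> str:
    """`retrieve` is your step-4 function: embed the question, query the
    vector DB, and return the top-k chunk texts."""
    chunks = retrieve(question, k)
    context = "\n".join(f"[chunk {i + 1}: {c}]" for i, c in enumerate(chunks))
    prompt = (
        "You are a helpful assistant. Answer the user's question using\n"
        "the context below. If the context doesn't contain the answer,\n"
        "say so honestly.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # assumption: substitute whatever chat model you use
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content
```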
Production Gotchas
Embedding drift: if you switch embedding models, you must re-embed everything. Embeddings from different models are not interchangeable.
Stale data: when source documents change, you need to re-embed and update the vector database. Build a sync pipeline early.
Bad retrieval = bad answers: if retrieval pulls wrong chunks, the LLM answers wrong with high confidence. Test retrieval quality independently of generation quality.
Hybrid search: pure vector search misses exact keyword matches (someone searching for a specific part number). Combine it with traditional BM25 keyword search and merge the results. This is called hybrid retrieval and almost always improves quality; one merging approach is sketched after this list.
Reranking: the top 10 from vector search aren't always the best. Run a smaller reranker model on those 10 to pick the best 3 to include in the prompt.
Context window limits: retrieved chunks compete for space in the prompt. If you retrieve 10 long chunks, you might exceed the model's context. Trade off K vs chunk size.
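One standard way to merge the vector and keyword result lists (not the only one) is reciprocal rank fusion, where each chunk scores by its rank in each list; the constant 60 is the conventional default from the RRF literature:

```python
def rrf_merge(vector_hits: list[str], keyword_hits: list[str], k: int = 60) -> list[str]:
    """Merge two ranked lists of chunk IDs with reciprocal rank fusion:
    each chunk scores sum(1 / (k + rank)) over every list it appears in."""
    scores: dict[str, float] = {}
    for hits in (vector_hits, keyword_hits):
        for rank, chunk_id in enumerate(hits, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```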
RAG vs Fine-tuning
|  | RAG | Fine-tuning |
|---|---|---|
| Data updates | Easy: re-embed changed docs | Hard: full retraining |
| Cost | Mostly inference costs | Heavy upfront training cost |
| Citations | Returns source chunks | No source attribution |
| Best for | Knowledge that changes | Style, format, domain language |
| Hallucination risk | Lower (grounded in retrieved text) | Higher (model still confabulates) |
For most "answer questions about my data" use cases, RAG wins. Fine-tuning makes sense when you need the model to adopt a specific style or become fluent in a domain's language, not just gain access to facts.
The One Thing to Remember
RAG is not magic. It's a pipeline: split your data into chunks, embed them, store the vectors, search at query time, and feed the relevant pieces into the LLM's prompt. Each step has knobs that affect quality. Get all the knobs roughly right and you can build production-grade AI on private data without ever touching model training.