The Problem Traditional Databases Can't Solve
"Find the row where id = 42." Easy. Index lookup, microseconds.
"Find the 10 vectors most similar to this one." A vector is a list of 1536 floating point numbers. "Similar" is measured by cosine distance or Euclidean distance, not equality. There's no obvious way to index this efficiently. Naively, you'd compute distance to every vector in the database, which is O(N) and unsustainable past a few hundred thousand vectors.
Vector databases exist to solve this exact problem. They make "find similar" queries fast even on billions of vectors.
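The naive O(N) scan that vector databases exist to avoid is easy to sketch. A minimal numpy version, using toy random data in place of real model embeddings:

```python
import numpy as np

def brute_force_top_k(query, vectors, k=10):
    """Exact nearest-neighbor search: score every stored vector, O(N) per query.

    Cosine similarity; `vectors` is an (N, d) array, `query` is (d,).
    """
    # Normalize so a dot product equals cosine similarity.
    q = query / np.linalg.norm(query)
    v = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    scores = v @ q                    # one dot product per stored vector
    return np.argsort(-scores)[:k]   # indices of the k most similar

rng = np.random.default_rng(0)
db = rng.normal(size=(100_000, 128))          # toy stand-in for embeddings
query = db[42] + 0.01 * rng.normal(size=128)  # near-duplicate of row 42
top = brute_force_top_k(query, db, k=10)
print(top[0])  # 42 — the near-duplicate ranks first
```

Every query touches all 100,000 rows; that linear cost is exactly what ANN indexes remove.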
Why It Matters Now
Embedding models turn anything (text, images, audio) into vectors. Similar inputs produce similar vectors. So similarity search on vectors lets you do:
Semantic search: find documents that mean the same thing as a query, not just keyword matches.
Image search: find images similar to a query image.
Recommendation: find products similar to ones a user liked.
RAG: retrieve relevant context for an LLM. See the RAG article.
Clustering and deduplication: group similar items.
The proliferation of LLMs and embedding models made vector databases essential. Five years ago they were niche; now they're standard infrastructure.
Approximate Nearest Neighbor (ANN)
Exact nearest-neighbor search is O(N). Too slow. Vector databases use approximate algorithms that trade a small amount of accuracy for massive speed gains.
Typical recall (the fraction of true top-K returned): 95-99%. Speed gain: 100-10,000x faster than exact search. For most use cases, slightly imperfect retrieval is fine if it's instant.
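Recall is straightforward to measure once you have exact results for a sample of queries; a minimal sketch with made-up id lists:

```python
def recall_at_k(exact_ids, approx_ids):
    """Fraction of the true top-K that the approximate search returned."""
    exact, approx = set(exact_ids), set(approx_ids)
    return len(exact & approx) / len(exact)

# e.g. the ANN index returned 9 of the true top 10:
exact  = [3, 17, 42, 56, 61, 70, 88, 91, 95, 99]
approx = [3, 17, 42, 56, 61, 70, 88, 91, 95, 12]
print(recall_at_k(exact, approx))  # 0.9
```

In practice you'd compute this over hundreds of sample queries and tune the index until recall meets your target.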
The Two Main ANN Algorithms
HNSW (Hierarchical Navigable Small World)
The dominant algorithm in modern vector databases. Builds a graph where each vector is a node connected to a few others. Multiple layers: a sparse top layer for fast routing, dense bottom layers for precision.
To search: enter at the top, greedily move to nearer neighbors, descend layers, refine. Most of the search space is skipped.
Pros: very fast, very accurate, good for filtered queries.
Cons: memory-heavy. Index typically larger than the original vectors.
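The greedy hop at the heart of HNSW can be sketched on a single graph layer. This is a toy with hypothetical 1-D points; a real index adds the layer hierarchy and tracks a beam of candidates rather than a single node:

```python
import numpy as np

def greedy_graph_search(query, vectors, neighbors, entry=0):
    """Greedy descent on one proximity-graph layer (the core HNSW move).

    From the entry node, hop to any neighbor closer to the query; stop at
    a local minimum. HNSW repeats this per layer, feeding the result in
    as the entry point of the layer below.
    """
    current = entry
    best = np.linalg.norm(vectors[current] - query)
    improved = True
    while improved:
        improved = False
        for n in neighbors[current]:
            d = np.linalg.norm(vectors[n] - query)
            if d < best:
                current, best = n, d
                improved = True
    return current

# Toy 1-D "embeddings" laid out on a line, chained into a graph.
vectors = np.array([[0.0], [1.0], [2.0], [3.0], [4.0]])
neighbors = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
print(greedy_graph_search(np.array([3.9]), vectors, neighbors))  # 4
```

The search visits only the nodes along the greedy path (here 0 → 1 → 2 → 3 → 4), which is how most of the space gets skipped.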
IVF (Inverted File Index)
Cluster the vectors into partitions (typically with k-means). To search, find the few partitions whose centroids are nearest the query and only check vectors within them.
Pros: memory-efficient, scales to billions.
Cons: recall depends on how many partitions you probe. Tuning matters.
Often combined with Product Quantization (PQ): compress vectors to a fraction of their size with minimal accuracy loss. IVF+PQ is the workhorse for very large vector indexes.
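A stripped-down IVF index (hand-rolled k-means partitions, an nprobe parameter, no PQ compression) might look like:

```python
import numpy as np

def build_ivf(vectors, n_lists=16, iters=10, seed=0):
    """Tiny IVF index: k-means centroids plus one inverted list per centroid."""
    rng = np.random.default_rng(seed)
    centroids = vectors[rng.choice(len(vectors), n_lists, replace=False)].copy()
    for _ in range(iters):  # a few Lloyd iterations of k-means
        assign = np.argmin(
            np.linalg.norm(vectors[:, None] - centroids[None], axis=2), axis=1)
        for c in range(n_lists):
            members = vectors[assign == c]
            if len(members):
                centroids[c] = members.mean(axis=0)
    # Final assignment against the final centroids.
    assign = np.argmin(
        np.linalg.norm(vectors[:, None] - centroids[None], axis=2), axis=1)
    lists = {c: np.flatnonzero(assign == c) for c in range(n_lists)}
    return centroids, lists

def ivf_search(query, vectors, centroids, lists, k=5, nprobe=3):
    """Probe only the nprobe partitions nearest the query, then rank those."""
    near = np.argsort(np.linalg.norm(centroids - query, axis=1))[:nprobe]
    cand = np.concatenate([lists[c] for c in near])
    dists = np.linalg.norm(vectors[cand] - query, axis=1)
    return cand[np.argsort(dists)[:k]]

rng = np.random.default_rng(1)
db = rng.normal(size=(5000, 32))
centroids, lists = build_ivf(db)
hits = ivf_search(db[7], db, centroids, lists)
print(hits[0])  # 7 — the query's own partition is probed first
```

Raising nprobe trades speed for recall: more partitions scanned, fewer true neighbors missed. PQ would additionally compress each stored vector so the scan touches far less memory.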
The Major Players
Pinecone: fully managed. Easy onboarding. Expensive at scale. Most popular among smaller teams.
Weaviate: open source, full-featured. Hybrid search (vector + keyword) built in. Self-hosted or managed.
Qdrant: open source, written in Rust. Fast and memory-efficient. Good filtered search.
Milvus: open source, designed for very large scale. Multiple ANN algorithms.
pgvector: a Postgres extension. Lets you do vector search in your existing Postgres. Use this if scale is moderate (millions, not billions) and you want to consolidate infrastructure.
Elasticsearch / OpenSearch: added vector search. Useful if you already have a hybrid search setup.
Filtered Search
Real applications combine vector similarity with metadata filters: "find documents similar to my query, but only from the last 30 days, only in English, only from author X."
Two approaches:
Pre-filter: apply metadata filter first, then vector search on the smaller result set. Fast if the filter is selective.
Post-filter: vector search first, then drop results that fail the metadata filter. Can return fewer than K results.
Modern databases pick between the two automatically based on filter selectivity; some also support filter-aware index structures that combine both.
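Both strategies are easy to sketch with a boolean metadata mask (toy data; real engines apply filters inside the index rather than over a dense scan):

```python
import numpy as np

def pre_filter_search(query, vectors, mask, k=10):
    """Pre-filter: shrink to rows passing the metadata filter, then rank them."""
    idx = np.flatnonzero(mask)
    dists = np.linalg.norm(vectors[idx] - query, axis=1)
    return idx[np.argsort(dists)[:k]]

def post_filter_search(query, vectors, mask, k=10, overfetch=4):
    """Post-filter: rank everything, over-fetch, then drop non-matches.

    Can still return fewer than k hits when matches are rare in the top ranks.
    """
    dists = np.linalg.norm(vectors - query, axis=1)
    cand = np.argsort(dists)[: k * overfetch]
    return cand[mask[cand]][:k]

rng = np.random.default_rng(2)
db = rng.normal(size=(1000, 16))
recent = np.zeros(1000, dtype=bool)
recent[:100] = True               # pretend only rows 0-99 pass the filter
query = db[5]
print(pre_filter_search(query, db, recent)[0])   # 5
print(post_filter_search(query, db, recent)[0])  # 5
```

With a 10% filter, pre-filtering scans 100 rows instead of 1000; with a 90% filter, post-filtering with modest over-fetch is usually cheaper.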
Hybrid Search
Pure vector search misses exact-match cases. Someone searching for "iPhone 16 Pro Max 1TB" wants exact keyword matches as much as semantic similarity. Hybrid search combines BM25 (traditional keyword scoring) with vector search and merges results, often boosting quality 10-30%.
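One common way to merge the keyword and vector result lists is reciprocal rank fusion (RRF); a minimal sketch with made-up doc ids:

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge ranked result lists (e.g. BM25 and vector search) with RRF.

    A doc's fused score is the sum of 1 / (k + rank) over every list it
    appears in; k=60 is the conventional damping constant.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25   = ["d3", "d1", "d7", "d2"]   # keyword ranking, best first
vector = ["d1", "d5", "d3", "d9"]   # semantic ranking, best first
fused = reciprocal_rank_fusion([bm25, vector])
print(fused[0])  # "d1" — ranked high by both lists
```

RRF only needs ranks, not scores, so it sidesteps the problem that BM25 scores and cosine similarities live on incompatible scales.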
Operational Concerns
Indexing throughput: ingesting millions of vectors takes time. Plan for it.
Update cost: some indexes are expensive to keep fresh. HNSW handles inserts well but deletions are awkward (typically handled with tombstones and periodic compaction); IVF-style indexes may need periodic retraining as the data distribution drifts.
Memory vs disk: large indexes don't fit in RAM. Some databases support disk-backed indexes with caching.
Multi-tenancy: if many users have separate vector sets, isolate them properly. Cross-tenant leaks are catastrophic.
The One Thing to Remember
Vector databases solve a specific problem (fast similarity search) that traditional databases handle poorly. The math (HNSW, IVF, PQ) is interesting; the practical decision is which database fits your scale, your team, and your existing stack. For most starting projects, pgvector or Pinecone is the right choice. Specialized vector databases like Qdrant or Milvus pay off at very large scale or specific access patterns.