What an Embedding Is
An embedding is a fixed-size list of numbers (typically 384 to 1536 dimensions) that represents the meaning of something. A given model maps every input, short or long, to a vector of the same size. Crucially: similar inputs produce similar vectors.
The sentence "I love dogs" might embed to [0.21, -0.04, 0.87, ...]. The sentence "Puppies make me happy" embeds to a vector that's close in vector space because the meaning overlaps, even though the words don't.
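"Close in vector space" is usually measured with cosine similarity. A minimal sketch, using made-up 4-dimensional vectors rather than real model outputs:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means identical direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 4-dimensional vectors (real models produce hundreds of dimensions).
dogs    = [0.21, -0.04, 0.87, 0.30]
puppies = [0.25, -0.01, 0.80, 0.35]   # close: overlapping meaning
taxes   = [-0.60, 0.70, -0.10, 0.05]  # far: unrelated meaning

print(cosine_similarity(dogs, puppies))  # high, near 1.0
print(cosine_similarity(dogs, taxes))    # low
```

With real embeddings the same comparison works unchanged; only the source of the numbers differs.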
This single property unlocks an enormous range of applications.
Why Keyword Search Fails
Traditional search (BM25, Elasticsearch defaults) matches words. If you search for "fast car" and a document says "speedy automobile," keyword search misses it entirely. They're synonymous but share no words.
Embedding-based search finds them anyway. The vectors for "fast car" and "speedy automobile" are close. Distance-based retrieval surfaces the match.
How Embeddings Are Created
Embedding models are neural networks trained to produce vectors where semantically similar inputs are close. Training uses techniques like contrastive learning: show the model pairs of similar things and pairs of dissimilar things, and train the network to pull the similar pairs' vectors together and push the dissimilar pairs' vectors apart.
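The training signal can be sketched with a triplet-style objective. This is a simplified illustration of the idea, not any specific model's loss function: given an anchor, a similar "positive," and a dissimilar "negative," there is loss to reduce whenever the positive doesn't outscore the negative by a margin.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Contrastive (triplet) objective: zero loss only when the anchor
    scores the positive at least `margin` higher than the negative."""
    return max(0.0, margin - dot(anchor, positive) + dot(anchor, negative))

# Positive barely outscores negative: small loss remains, so training
# would keep pulling the positive closer.
print(triplet_loss([1.0, 0.0], [0.9, 0.1], [0.0, 1.0]))  # → 0.1
```

Gradient descent on millions of such pairs is what gives the vector space its "distance tracks meaning" property.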
You don't train them yourself. You use a pre-trained model:
OpenAI text-embedding-3-small / large: high quality, easy API.
Cohere embed-v3: strong on retrieval tasks.
Voyage AI: specialized for retrieval.
Sentence-Transformers (all-MiniLM, all-mpnet-base): open source, run locally.
BGE, E5, GTE: open source, often top-of-leaderboard.
Embeddings Aren't Just for Text
Models exist for almost any modality:
Images: CLIP, DINO. Useful for image similarity, image search.
Audio: embed audio for music similarity, speaker identification.
Code: CodeBERT, GraphCodeBERT. "Find similar functions in our codebase."
Multimodal: embed text and images in the same space (CLIP). Search images by text or vice versa.
Building Semantic Search
The pipeline:
1. Index: for each document, generate an embedding. Store the vector and a pointer to the document.
2. Query: at search time, embed the user's query.
3. Retrieve: find the K vectors closest to the query embedding.
4. Return: the documents associated with those vectors.
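The four steps fit in a few lines. In this sketch the embed function is a hashed bag-of-words stand-in so the example runs anywhere; it captures the pipeline's shape, not real semantics. A production system would replace embed with a call to an embedding API or local model.

```python
import hashlib
import math

DIM = 64

def embed(text):
    """Stand-in for a real embedding model: hashed bag-of-words,
    L2-normalized. Swap in a real model call here."""
    vec = [0.0] * DIM
    for word in text.lower().split():
        h = int(hashlib.md5(word.encode()).hexdigest(), 16)
        vec[h % DIM] += 1.0
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

# 1. Index: embed each document, store vector + pointer to the document.
docs = ["the cat sat on the mat", "dogs love long walks", "stock prices fell today"]
index = [(embed(d), d) for d in docs]

def search(query, k=2):
    # 2. Query: embed the user's query.
    q = embed(query)
    # 3. Retrieve: rank stored vectors by similarity to the query vector.
    scored = sorted(index, key=lambda item: -sum(a * b for a, b in zip(q, item[0])))
    # 4. Return: the documents behind the top-k vectors.
    return [doc for _, doc in scored[:k]]

print(search("dogs love walks", k=1))
```

In practice step 3 is handled by a vector database rather than a linear scan, but the interface is the same: query vector in, nearest documents out.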
The retrieval step uses a vector database (see the Vector Databases article). The embedding step uses a model API or a local model.
Hybrid Search: The Practical Sweet Spot
Pure semantic search misses exact-match cases. Someone searching for an SKU number, a phone number, or an exact phrase wants keyword matches. Hybrid search combines:
Vector similarity (semantic).
BM25 keyword scoring (lexical).
Optionally, metadata filters (date, language, category).
Results from each are merged with a scoring function (often Reciprocal Rank Fusion). In practice, hybrid retrieval typically improves quality by roughly 10-30% over either approach alone.
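Reciprocal Rank Fusion is simple enough to show in full: each result list contributes 1/(k + rank) per item, so documents ranked well by both the vector and keyword sides rise to the top. A minimal sketch with made-up document IDs:

```python
def rrf_merge(rankings, k=60):
    """Reciprocal Rank Fusion: sum 1/(k + rank) across input rankings.
    k=60 is the conventional smoothing constant."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

vector_hits  = ["doc_a", "doc_b", "doc_c"]   # semantic ranking
keyword_hits = ["doc_b", "doc_d", "doc_a"]   # BM25 ranking
print(rrf_merge([vector_hits, keyword_hits]))
# → ['doc_b', 'doc_a', 'doc_d', 'doc_c']
```

Note how doc_b wins: it appears near the top of both lists, even though neither list ranked it first.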
Reranking
The first-pass retrieval (vector + BM25) returns the top 50-100 candidates. A second-pass reranker model scores each candidate against the query more precisely. Rerankers read the query and candidate together, which makes them more accurate than embedding models but too slow to run over the whole corpus; that's why they only see the shortlist. Output: a final top 10.
Common rerankers: Cohere Rerank, BGE-reranker, cross-encoder models.
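The two-stage shape is the important part. In this sketch the scoring function is a trivial word-overlap placeholder; a real system would call a cross-encoder such as a BGE-reranker or the Cohere Rerank API at that point.

```python
def rerank(query, candidates, score_fn, top_n=10):
    """Second pass: score each first-pass candidate against the query
    with a more precise (and more expensive) scorer, keep the best."""
    scored = sorted(candidates, key=lambda doc: score_fn(query, doc), reverse=True)
    return scored[:top_n]

# Placeholder scorer: word overlap. Replace with a cross-encoder call.
def overlap_score(query, doc):
    return len(set(query.lower().split()) & set(doc.lower().split()))

first_pass = ["dogs love walks", "cats sleep all day", "long walks with dogs"]
print(rerank("dogs walks", first_pass, overlap_score, top_n=2))
# → ['dogs love walks', 'long walks with dogs']
```

Because the reranker only sees 50-100 candidates rather than the whole corpus, its per-candidate cost stays affordable.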
Common Pitfalls
Long documents: a single embedding can't represent a 50-page PDF well. Chunk first.
Embedding drift: swapping models means re-embedding everything. Vectors from different models are not comparable.
Asymmetric retrieval: queries are short, documents are long. Some models have separate "query embedding" and "document embedding" modes.
Out-of-distribution queries: the model wasn't trained on your domain. Medical text, legal text, code: you may need a domain-specific embedding model.
Cost: embedding millions of documents adds up. Open-source models help.
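For the long-document pitfall, the standard fix is chunking with overlap, so sentences that straddle a chunk boundary appear intact in at least one chunk. A minimal sketch using word counts (production systems often chunk by tokens or by sentence boundaries instead):

```python
def chunk_text(text, chunk_size=200, overlap=50):
    """Split text into word-count chunks; consecutive chunks share
    `overlap` words so boundary context isn't lost."""
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break
    return chunks

doc = " ".join(f"word{i}" for i in range(450))
chunks = chunk_text(doc)
print(len(chunks))  # → 3
```

Each chunk then gets its own embedding; at query time you retrieve chunks and map them back to their parent documents.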
Beyond Search: Other Uses of Embeddings
Clustering: group similar items. K-means on embeddings.
Classification: nearest-neighbor classification using a small set of labeled examples.
Deduplication: find near-duplicate content.
Recommendation: "users who liked X also liked Y" falls out of nearest-neighbor lookups on item or user embeddings.
Anomaly detection: items far from any cluster are unusual.
Visualization: project high-dimensional embeddings to 2D (UMAP, t-SNE) for exploration.
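The classification case is worth spelling out because it needs no training: label a new embedding by majority vote of its nearest labeled neighbors. A sketch with toy 2-D "embeddings" standing in for real model vectors:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def knn_classify(vec, labeled, k=3):
    """Majority vote among the k nearest labeled embeddings."""
    nearest = sorted(labeled, key=lambda item: -cosine(vec, item[0]))[:k]
    labels = [label for _, label in nearest]
    return max(set(labels), key=labels.count)

# Toy 2-D vectors with labels; real vectors come from an embedding model.
labeled = [([1.0, 0.1], "sports"),   ([0.9, 0.2], "sports"),
           ([0.1, 1.0], "politics"), ([0.2, 0.9], "politics")]
print(knn_classify([0.95, 0.15], labeled, k=3))  # → sports
```

A handful of labeled examples per class is often enough to get a usable classifier this way, which is why embeddings are a common shortcut before reaching for fine-tuning.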
The One Thing to Remember
Embeddings turn meaning into geometry. Once your data is in vector space, distance equals similarity. That single trick reshapes how search, recommendation, and many other features can be built. The technical pieces (which model, which vector database, what chunk size) are details. The conceptual leap (storing meaning as vectors) is the foundation everything else builds on.