The Problem That Looks Simple

You start typing in the Google search box. After two letters, you see ten suggested completions. By the third letter, the list refines. The whole interaction feels instantaneous: suggestions update within tens of milliseconds of each keystroke.

That single feature, search autocomplete, hides one of the most demanding real-time systems in the world. Imagine the requirements: every keystroke, for billions of users, returning the most useful 10 suggestions out of millions of possibilities, in under 100ms, while constantly updating to reflect breaking news and trending queries.

This article walks through how to build a system that meets those constraints.

Step 1: Requirements

Functional Requirements

Prefix Suggestions: given a typed prefix, return the top N most likely completions.
Ranking: suggestions ranked by popularity, recency, personalization, geography, and quality signals.
Real-time Trending: breaking news appears as a suggestion within minutes, not hours or days.
Localization: suggestions reflect the user's language, region, and time of day.
Filtering: spam, profanity, and low-quality queries are excluded.
Typo Tolerance: common misspellings still produce sensible suggestions.

Non-Functional Requirements

Latency: end-to-end under 100ms p99. The suggestion list must update almost as fast as the user can type.
Throughput: billions of keystrokes per day. Each one is a query.
Availability: 99.99%. Autocomplete is one of the most user-visible features in any search product.
Freshness: trending topics appear within minutes.
Cost: the system runs constantly. Cost per query has to be tiny.

Step 2: Capacity Estimation

Daily search users: ~1 billion (given)
Searches per user per day: ~5 (given, average)
Total search queries per day: 1B × 5 = ~5 billion
Keystrokes per query: ~10 (average)
Total autocomplete requests per day: 5B × 10 = ~50 billion
Average QPS: 50B / 86,400 ≈ 580,000/sec
Peak QPS: ~5× average ≈ 3 million/sec

Three million autocomplete requests per second at peak. Each must complete in under 100ms. This is a read-heavy, latency-critical workload: the architecture must aggressively pre-compute everything possible and serve from cache.

Step 3: The Right Data Structure — Tries

The fundamental data structure for prefix matching is the trie (pronounced "try"), short for "retrieval tree."

A trie is a tree where each node represents one character. Walking down from the root spells out a string. Each node can store metadata: search count, last seen, language tags, etc.

To find suggestions for prefix "ny":

1. Start at the root.
2. Walk to the 'n' child.
3. Walk to the 'y' child.
4. The subtree rooted there contains every word that starts with "ny."
5. Sort that subtree by some score (popularity, recency).
6. Return the top 10.

Lookup is O(prefix_length). Critically, this is independent of total dataset size. A trie with billions of strings has the same lookup cost as one with thousands.
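
A minimal sketch of that walk in Python (the node fields, counts, and ranking are illustrative, not a production layout):

class TrieNode:
    def __init__(self):
        self.children = {}   # char -> TrieNode
        self.count = 0       # how often this exact query was searched

class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, query, count=1):
        node = self.root
        for ch in query:
            node = node.children.setdefault(ch, TrieNode())
        node.count += count  # the terminal node holds the query's popularity

    def top_k(self, prefix, k=10):
        node = self.root
        for ch in prefix:                  # O(len(prefix)), independent of trie size
            if ch not in node.children:
                return []
            node = node.children[ch]
        completions = []                   # collect the subtree, then rank by count
        stack = [(node, prefix)]
        while stack:
            cur, text = stack.pop()
            if cur.count > 0:
                completions.append((cur.count, text))
            for ch, child in cur.children.items():
                stack.append((child, text + ch))
        completions.sort(reverse=True)
        return [text for _, text in completions[:k]]

trie = Trie()
for q, c in [("news", 1200), ("new york", 900), ("netflix", 800), ("nyc subway", 300)]:
    trie.insert(q, c)
print(trie.top_k("n"))   # ['news', 'new york', 'netflix', 'nyc subway']

The subtree scan is the expensive part; removing that cost from the read path is exactly what the pre-computed top-K cache in Step 4 does.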

Memory Cost of a Trie

Tries are memory-hungry. A trie storing 10 million unique queries averaging 20 characters each, with 26 child pointers per node, can easily reach gigabytes. Optimizations help:

Compressed tries (Patricia tries): merge chains of nodes with single children into a single node holding a substring.
Ternary search trees: a space/time compromise between binary search trees and full tries.
Encoded character indices: instead of a fixed array of 26 pointers per node, store children in a hash map, which pays off for sparse alphabets.

Step 4: Storage Strategy

Even with optimization, the global trie is too big for one machine. Two approaches dominate.

Strategy A: Sharded Trie Service

The trie lives in a distributed service. Different machines own different sub-trees, sharded by the first few characters.

Lookups route based on the prefix. A query starting with "ny" goes to one shard. A query starting with "tech" goes to another.

Pros: handles arbitrary and rare prefixes; capacity scales out as the dataset grows.
Cons: complex shard rebalancing. Hot shards (popular first letters) need replicas.

Strategy B: Pre-computed Top-K Cache (Used in Production)

The simpler and faster approach: don't traverse a trie at query time. Pre-compute the top 10 suggestions for every reasonable prefix. Store them in a key-value cache.

The key is the prefix. The value is the ranked list of suggestions.

"n"     -> ["news", "netflix", "nfl", "new york", ...]
"ny"    -> ["new york", "nyt", "nyc", "ny times", ...]
"nyc"   -> ["nyc subway", "nyc weather", "nyc times square", ...]
"nyc s" -> ["nyc subway", "nyc storm", "nyc summer", ...]

Lookup is one Redis GET. Microseconds. The cost is in offline computation: rebuilding this cache from query logs.
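
A rough sketch of that offline rebuild, assuming query counts have already been aggregated from the logs (the "ac:" key naming and the redis-py usage are illustrative):

import json
from collections import Counter, defaultdict
import redis

def build_topk_cache(query_counts, k=10, max_prefix_len=20):
    # query_counts: {"news": 1200, "new york": 900, ...} aggregated from search logs
    per_prefix = defaultdict(Counter)
    for query, count in query_counts.items():
        for i in range(1, min(len(query), max_prefix_len) + 1):
            per_prefix[query[:i]][query] = count
    r = redis.Redis()
    pipe = r.pipeline()
    for prefix, counter in per_prefix.items():
        top = [q for q, _ in counter.most_common(k)]
        pipe.set(f"ac:{prefix}", json.dumps(top))   # one ranked list per prefix
    pipe.execute()

build_topk_cache({"news": 1200, "new york": 900, "netflix": 800, "nyc subway": 300})

At real scale this runs as a distributed batch job over the logs rather than a single process, but the shape of the computation is the same.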

How Many Prefixes?

Every prefix that gets typed at least once. In practice, that is millions to tens of millions. Storage: at 10 suggestions per prefix and 100 bytes each, 100 million prefixes is 100 GB. Fits in a Redis cluster.

Most production systems use the cached top-K approach for the common case and fall back to a sharded trie for rare prefixes.

Step 5: Full Architecture

Autocomplete Architecture (diagram)

Read path: user types -> CDN edge cache -> Autocomplete API -> Prefix Cache (Redis); on a cache miss, fall back to the sharded Trie Service.
Offline build pipeline: Search Logs (Kafka) -> Aggregator -> Ranker -> Trie Builder -> Top-K Cache (rebuilt).
Real-time overlay: Real-time Stream -> Trending Detector -> Trending Overlay.

How a Request Flows

1. User types a character. The client sends the current prefix to the Autocomplete API.
2. API checks the Prefix Cache (Redis). If hit, return immediately.
3. The API also checks the Trending Overlay for any urgent trending suggestions matching this prefix.
4. Merge cached top-K with trending. Apply per-user personalization tweaks.
5. Return the final list.

If the cache misses (rare prefix), the API falls back to the Trie Service to compute live. This is slower (a few ms) but acceptable as a fallback.
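
Roughly, the read path looks like this (a sketch, not a reference implementation: the "trend:" key, the trie-service call, and the merge policy are assumptions):

import json
import redis

r = redis.Redis()

def query_trie_service(prefix, k):
    # Fallback for rare prefixes: a live traversal on the sharded trie service.
    return []   # placeholder for the RPC call

def autocomplete(prefix, k=10):
    cached = r.get(f"ac:{prefix}")                  # one GET for the pre-computed top-K
    suggestions = json.loads(cached) if cached else query_trie_service(prefix, k)

    raw = r.get(f"trend:{prefix}")                  # tiny, minute-fresh trending overlay
    trending = json.loads(raw) if raw else []

    merged, seen = [], set()                        # trending entries jump ahead of the base list
    for s in trending + suggestions:
        if s not in seen:
            merged.append(s)
            seen.add(s)
    return merged[:k]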

Step 6: Ranking Suggestions

The hardest question: out of millions of completions for "ny," which 10 do you show?

The score is a weighted combination of signals.

Frequency

Most basic. How often has this query been searched globally? More frequent = higher score.

Computed from search logs. Aggregated daily or hourly.

Recency

Trending right now? "Election results 2026" should appear immediately on election day, even if it has zero historical frequency.

A boost based on recent search velocity (e.g., last 1 hour vs last 30 days).

Personalization

Show this specific user what they have searched before. If a user often searches for "neural networks," when they type "n" the suggestion appears prominently.

Per-user search history kept in a separate fast lookup. Merged with global suggestions on every request.

Geography

"ny" suggests "new york" in NY but "nyt" in Tokyo (where "new york" is less searched). Per-region top-K caches solve this.

Time of Day

"weather" is more popular in the morning. "movie times" in the evening. Time-aware suggestions improve quality.

Quality Signals

Avoid spam, profanity, low-quality queries. Some suggestions are blocklisted globally (sensitive content). Others are demoted via quality scores.

Combining Signals

The score is something like:

score = base_frequency + w_recency × recency_boost + w_personal × personalization_boost + ...

Weights tuned offline from labeled data ("which suggestion did users actually click?").
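
As a concrete sketch (the signal names and weights here are placeholders; real weights come from that offline tuning):

import math

WEIGHTS = {"recency": 3.0, "personal": 5.0, "geo": 1.5}   # illustrative values

def score(candidate, user, region):
    s = math.log1p(candidate["global_count"])               # damp huge head queries
    s += WEIGHTS["recency"] * candidate["recent_velocity"]  # last hour vs 30-day average
    if candidate["query"] in user["recent_queries"]:        # has this user searched it before?
        s += WEIGHTS["personal"]
    s += WEIGHTS["geo"] * candidate["region_share"].get(region, 0.0)
    return s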

Step 7: How Fresh Should Suggestions Be?

Three update cadences serve different needs:

Daily: base trie rebuilt from yesterday's search logs. Captures slow-moving popularity. Good enough for most prefixes.
Hourly: for medium-velocity changes. New product launches, conference keywords.
Real-time (within minutes): for breaking news. A separate "trending" pipeline operates here.

Production systems combine all three: a daily-built base trie plus an hourly delta plus a real-time trending overlay.

The Trending Pipeline

A streaming job watches search query velocity. When a query suddenly sees a 10x increase in volume in the last 30 minutes, it gets flagged as trending. Trending queries are pushed into a small, separate cache that the API consults alongside the base cache.

Trending detection itself uses techniques like exponential moving averages and statistical anomaly detection. A query whose 10-minute volume far exceeds its expected baseline is "trending."
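
A sketch of that velocity check, using an exponential moving average as the baseline (the smoothing factor, spike factor, and minimum volume are illustrative):

class TrendingDetector:
    def __init__(self, alpha=0.05, spike_factor=10.0, min_volume=100):
        self.alpha = alpha                 # small alpha = slow-moving baseline
        self.spike_factor = spike_factor   # "10x the expected volume" rule
        self.min_volume = min_volume       # ignore noise from tiny counts
        self.baseline = {}                 # query -> expected volume per window

    def observe(self, query, window_count):
        # Call once per query per window (e.g. every 10 minutes) with its count.
        expected = self.baseline.get(query, window_count)
        is_trending = (window_count >= self.min_volume
                       and window_count > self.spike_factor * expected)
        # Update the slow baseline after the comparison, so the spike itself
        # doesn't immediately become "normal."
        self.baseline[query] = (1 - self.alpha) * expected + self.alpha * window_count
        return is_trending

detector = TrendingDetector()
for count in [40, 45, 50, 48, 52, 900]:    # a sudden spike in the last window
    flagged = detector.observe("election results", count)
print(flagged)   # True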

Step 8: Latency Optimizations

Total budget: 100ms. Time spent on:

Network round trip from client to server: ~30-50ms (mostly fixed by physics)
Cache lookup: ~1ms
Personalization merge: ~5ms
Server-side processing: ~10ms
Response serialization: ~5ms

Edge Caching

Popular prefixes (single letters, common bigrams) are cached at CDN edges. The first letter "a" gets the same suggestions for everyone in your region. Edge cache turns the round trip into ~10ms.

Client-Side Debouncing

Don't fire a request on every keystroke. Wait 50ms after the user stops typing. If they keep typing, cancel the in-flight request and start a new one. Saves request count without hurting UX.
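
In the browser this lives in JavaScript, but the logic is simple enough to sketch in Python with asyncio for consistency with the other examples (the 50ms delay and the fetch hook are illustrative):

import asyncio

class DebouncedAutocomplete:
    def __init__(self, fetch, delay=0.05):
        self.fetch = fetch        # coroutine: prefix -> suggestions
        self.delay = delay        # wait this long after the last keystroke
        self._pending = None

    def on_keystroke(self, prefix):
        if self._pending and not self._pending.done():
            self._pending.cancel()               # user kept typing: drop the pending wait
        self._pending = asyncio.create_task(self._debounced(prefix))

    async def _debounced(self, prefix):
        await asyncio.sleep(self.delay)          # only fires if no newer keystroke arrives
        return await self.fetch(prefix)

async def fake_fetch(prefix):
    print("fetching", prefix)
    return []

async def main():
    box = DebouncedAutocomplete(fake_fetch)
    for prefix in ["n", "ny", "nyc"]:
        box.on_keystroke(prefix)
        await asyncio.sleep(0.01)                # keystrokes 10ms apart: only "nyc" fires
    await asyncio.sleep(0.1)

asyncio.run(main())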

Pre-fetching

When the user types "n," pre-fetch results for "na," "ne," "ni," "no," "nu" in the background. Most users will type one of those next; the response is already cached when they do.

Local Caching

The client caches recent results. If the user types "ny," then deletes back to "n," then types "new," the responses for "n" and "ny" might already be in local memory.

Step 9: Personalization Approaches

Personalization adds significant value but complicates everything. Two main approaches.

Server-Side Blend

Include user_id in the request. Server reads user history (recent searches, preferences), merges global top-K with personal top-K, returns the blended list.

Pros: high quality. Server has access to long-term history and ML models.
Cons: more expensive. Latency budget reduced. User history must be in fast lookup storage.

Client-Side Merge

Server returns global suggestions. Client maintains local recent searches. Client merges them.

Pros: faster, less server work, privacy-friendly.
Cons: client cannot do sophisticated ranking. Limited to "show recent searches that match prefix."

Most production systems use both: client-side merge for recent history, server-side blend for ML-driven personalization that needs broader signals.
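
The client-side half can be very small. A sketch of the merge (the cap on personal suggestions is an assumption):

def merge_suggestions(prefix, recent_searches, global_topk, k=10, max_personal=3):
    # A few matching recent searches go first, then the server's global list.
    personal = [q for q in recent_searches if q.startswith(prefix)][:max_personal]
    merged = list(personal)
    for q in global_topk:
        if q not in merged:
            merged.append(q)
    return merged[:k]

print(merge_suggestions("n",
                        ["neural networks", "nyc weather", "rust traits"],
                        ["news", "netflix", "new york", "nfl"]))
# ['neural networks', 'nyc weather', 'news', 'netflix', 'new york', 'nfl']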

Step 10: Typo Tolerance

Real users mistype. "necflix" should still suggest "netflix." The system must handle this without mangling queries that are intentionally spelled in unusual ways (brand names often look like typos).

Approaches:

Edit-distance lookup: when there is no exact prefix match, search the trie for prefixes within edit distance 1 (one character off). The candidate set is computed during the offline build (a sketch follows below).
Phonetic matching: for some languages, words that sound the same are mapped together (Soundex, Metaphone).
Common-typo dictionary: learned from query logs. "necflix" -> "netflix" gets recorded because users frequently type "necflix" then refine to "netflix."
ML-based correction: a sequence-to-sequence model that learns to correct typos in context.

The simpler approaches handle most cases. Sophisticated correction matters for languages with frequent transliteration (e.g., people typing English approximations of non-English words).
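
The edit-distance approach can be sketched as generating every string one edit away from the typed prefix, then retrying the cache or trie with each candidate:

import string

def edit1_prefixes(prefix, alphabet=string.ascii_lowercase):
    # All strings within edit distance 1: deletions, transpositions, substitutions, insertions.
    splits = [(prefix[:i], prefix[i:]) for i in range(len(prefix) + 1)]
    deletes = {a + b[1:] for a, b in splits if b}
    transposes = {a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1}
    substitutes = {a + c + b[1:] for a, b in splits if b for c in alphabet}
    inserts = {a + c + b for a, b in splits for c in alphabet}
    candidates = deletes | transposes | substitutes | inserts
    candidates.discard(prefix)
    return candidates

print("netf" in edit1_prefixes("necf"))   # True: substituting 'c' with 't' recovers the real prefix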

Step 11: Storage Choices

Top-K Prefix Cache: Redis cluster. Sharded by prefix. Capacity: 50-200 GB depending on language coverage and depth. Hot shards replicated.

Trending Overlay: Redis. Tiny (a few thousand entries at most). Updated every minute.

Search Log: Kafka for the streaming side, plus a long-term store (HDFS, S3) for offline aggregation jobs. Volume: petabytes per year.

Aggregated Frequency Tables: stored in a wide-column store (Cassandra, Bigtable). Sharded by query. Updated daily.

Trie Service Storage: in-memory in the service itself (gigabytes per shard). Cold storage of trie snapshots in object storage for failover.

User History: per-user Redis hash, with longer-term storage in a SQL or NoSQL database.

Step 12: Edge Cases and Operational Concerns

The Empty Prefix

What do you show before the user types anything? Common patterns: recent personal searches, recent global trending, sponsored "trending now" topics, or a search history dropdown.

Single-Letter Prefixes

"a" produces enormous candidate sets. The cache must serve them very quickly because they are the most-hit prefixes. Heavy edge caching at CDNs.

Sensitive Content Filtering

Some queries should never appear as suggestions: child safety topics, self-harm topics, banned political queries in some jurisdictions. A blocklist runs on every output. The blocklist is updated continuously by trust and safety teams.

Spam and Manipulation

Bots can artificially inflate query volume to surface unwanted suggestions. Anti-abuse filters look for suspicious patterns (sudden spikes from few IPs, abnormal query distributions). Suspect queries are demoted or quarantined.

Internationalization

Non-Latin scripts (Chinese, Arabic, Devanagari) need their own tokenization. CJK languages especially: a "prefix" in Chinese is not a Latin letter; it might be a partial Pinyin transcription, a partial Hanzi, or a stroke pattern. Each major language has its own tokenization pipeline feeding the trie.

Cache Invalidation

When the daily rebuild produces a new top-K cache, you need to swap it in atomically. Two strategies: cache aside with versioned keys, or blue-green cache deployment.
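
A sketch of the versioned-keys strategy (the key names are an assumption): the rebuild writes under a fresh version namespace, and a single pointer update makes it live.

import json
import redis

r = redis.Redis()

def publish_new_build(per_prefix_topk, version):
    # Write the freshly built lists under a versioned namespace, then flip the pointer.
    pipe = r.pipeline()
    for prefix, suggestions in per_prefix_topk.items():
        pipe.set(f"ac:v{version}:{prefix}", json.dumps(suggestions))
    pipe.execute()
    r.set("ac:current_version", version)   # readers pick up the new build atomically

def lookup(prefix):
    version = r.get("ac:current_version").decode()
    cached = r.get(f"ac:v{version}:{prefix}")
    return json.loads(cached) if cached else []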

Failure Modes

Redis cluster goes down? Fall back to trie service. Trie service unreachable? Show last known cache (stale but better than nothing). Network partition? CDN edge serves cached suggestions even with origin down.

Autocomplete is the front door of search. The fallback strategy is "always show something."

Step 13: Recap of Key Decisions

Pre-computed top-K cache as the primary store. One key-value lookup per request. Microseconds.
Sharded trie service as fallback. Rare prefixes still get sensible answers.
Daily base + hourly delta + real-time trending overlay. Three update cadences, each tuned to its data velocity.
Edge caching for the most popular prefixes. Cuts round-trip latency dramatically.
Server-side ML personalization plus client-side history merge. Best of both performance and quality.
Typo tolerance via edit-distance and learned corrections. Robust to real user input.
Sensitive-content blocklist applied at output. Safety always overrides relevance.
Anti-abuse filters in the offline pipeline. Bots cannot pollute the trie.

The One Thing to Remember

Autocomplete is a problem where the right data structure (trie) plus aggressive precomputation (cached top-K per prefix) collapse a hard problem into a single key-value lookup. The architecture is a clean separation of online reads (fast cache lookups) from offline builds (heavy trie rebuilding from search logs). The same pattern applies to many "rank top N from a huge corpus" problems: feed candidate generation, recommendation precomputation, search ranking. The sophistication is in the offline pipeline, not the online serving. Get the offline pipeline right and serving becomes trivial.