Why Cache Invalidation Is Famously Hard

Phil Karlton once said: "There are only two hard things in computer science: cache invalidation and naming things." Three decades later, it is still true. Naming is hard because it requires precision in language. Cache invalidation is hard because it requires precision about time: knowing exactly when a piece of cached data has become stale and acting on that knowledge correctly across distributed systems.

The setup of every cache problem is the same. You have data living somewhere authoritative (a database). You make copies of it somewhere faster (a cache). The cache lets you serve millions of reads per second without melting your database. Wonderful.

Then the data changes. The cache still has the old version. Someone reads from the cache and sees stale data. Sometimes this is harmless. Sometimes it costs millions of dollars in fraud, customer complaints, or lawsuits. The job of cache invalidation is to make sure that "sometimes harmless" never tips into "sometimes catastrophic."

This article walks through every major strategy, when each works, when each fails, and how to combine them.

Step 1: Why It Is Genuinely Hard

The naive answer is "when the data changes, delete the cache entry." Easy, right? It is not. Every part of that sentence hides a problem.

"When the Data Changes"

How do you know data has changed? Three ways:

You wrote the change. The application that performed the write knows. But maybe not: the write may have gone straight to the database via SQL while your cache lives in a different service, or a background job may have changed the data without telling you.
The data has a TTL. You agreed in advance that data is "fresh enough" for N seconds. After that, it is stale by definition.
The data version increments. Some external version number tells you when it changes. But what tells you that the version changed?

"Delete the Cache Entry"

Which entry? In which cache? You probably have multiple caches: the database query cache, the application-level cache, Redis, the CDN, browser caches. Invalidation must propagate to all of them. If you forget one, users see stale data from that one.

"Across Distributed Systems"

If your cache is replicated across multiple nodes, the invalidation must reach all of them. Network failures, race conditions, ordering issues all complicate this. You can issue an "invalidate" command and have it succeed on three nodes and fail on the fourth.

The Result

Cache invalidation requires you to think simultaneously about consistency, distribution, ordering, and timing. Get any one wrong and stale data leaks. The bugs are intermittent, hard to reproduce, and only appear at scale. This is why it is famously hard.

Step 2: Strategy 1 — TTL (Time To Live) Expiration

The simplest strategy. Every cache entry has a duration in seconds. After that duration passes, the cache automatically deletes the entry. The next read fetches fresh data and re-populates the cache.

How It Works

When you write to the cache, set an expiration time:

cache.set("user:42", user_data, ex=300)  # 5 minutes

For 5 minutes, reads hit the cache. After 5 minutes, the cache returns "not found" and the application falls through to the database, then re-caches.
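In code, the whole read path is a few lines. A minimal read-through sketch, assuming the same redis-py style cache client and a db helper as in the snippet above:

def get_user(user_id):
    cached = cache.get(f"user:{user_id}")
    if cached is not None:
        return cached                              # hit: fresh within the TTL
    user = db.get(user_id)                         # miss: fall through to the database
    cache.set(f"user:{user_id}", user, ex=300)     # re-cache for another 5 minutes
    return user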

Trade-offs

Pros: Zero application logic. Just set the TTL and forget. Works with any cache backend (Redis, Memcached, and most others support TTL natively). Self-healing: even if you forgot to invalidate, the data eventually catches up.
Cons: Stale-data window equals the TTL. A 60-second TTL means up to 60 seconds of staleness for any change.

When It Is Right

The data changes infrequently and stale-by-N-seconds is acceptable. Examples: product listings on an e-commerce site, weather data, exchange rates, paginated lists, search index summaries.

Pick the TTL based on how stale is acceptable. Aggressive (1-30 seconds) for things that change often and matter. Generous (hours) for things that rarely change.

When It Is Wrong

Changes need to be reflected instantly. Account balances, inventory counts, real-time chat, security tokens. Stale data here causes real problems: a user transferring money might see an outdated balance and overdraft; a user buying the last item in stock might be told the order succeeded when inventory has run out.

Variants

Sliding TTL: the timer resets every time the entry is accessed. Used in session caches: as long as the user is active, their session stays cached.
Per-key TTL: different entries get different TTLs based on their importance. Cheap and easy in Redis.
Random jitter on TTL: add some randomness (e.g., 300 seconds plus or minus 30) so that not every entry expires at the same instant. Prevents simultaneous expiration storms (covered below).

Step 3: Strategy 2 — Event-Based Invalidation

The simple strategy at the other end of the spectrum. When the source of truth changes, explicitly delete or update the cache entry. The next read fetches fresh data.

How It Works

def update_user(user_id, new_data):
    db.update("users", user_id, new_data)
    cache.delete(f"user:{user_id}")

Two operations: update database, then invalidate cache. Order matters: invalidate AFTER the database write succeeds, otherwise a concurrent read might re-cache the old data.

Trade-offs

Pros: Zero stale data. The cache is in sync within milliseconds of the change.
Cons: Every code path that modifies data has to remember to invalidate. Easy to miss one. Once a stale-cache bug exists, it is famously hard to find.

The Discipline Problem

You have ten different code paths that update users: the API endpoint, an admin tool, a batch job, a webhook handler, a migration script, a CLI command, a cron job, a periodic sync, an event consumer, a background worker. Each one needs to perform the invalidation. Miss any of them and stale data leaks.

This is the #1 source of stale-cache bugs in production. The fix is structural: route all writes through a single layer that handles invalidation. Repository pattern, write-through service, or change data capture (covered later).

Update vs Invalidate

Two flavors:

Invalidate: delete the cache entry. Next read repopulates from source. Simple. Slight latency penalty on the first read after change.
Update: overwrite the cache entry with the new value. No re-fetch needed. Slightly faster reads but more code (you must construct the new cache value at the point of write).

Most teams default to invalidate. It is simpler and the latency penalty is negligible.

Step 4: Strategy 3 — Write-Through Caching

Writes go through the cache layer, which synchronously writes to the database and then updates its own copy. Reads always see fresh data because writes update the cache immediately.

How It Works

The cache layer sits between the application and the database. The application writes to the cache. The cache writes to the database synchronously, then updates its own copy.

def cache_write_through(key, value):
    db.update(key, value)       # blocks until DB confirms
    cache.set(key, value)       # only after DB success

If the database write fails, the cache write doesn't happen. If both succeed, both are consistent.

Trade-offs

Pros: Consistency without separate invalidation logic. Reads after writes always see the new value. Simple mental model.
Cons: Slower writes (must wait for both). Adds dependency on cache being available for writes. Mixed-cache scenarios are hard (which cache layer writes through?).

When It Is Right

Critical-consistency data where you cannot tolerate any staleness window. Inventory systems, financial transactions, real-time configuration.

Related Patterns

Write-around: writes go directly to the database, bypassing the cache entirely. The cache populates only from reads, so it holds just the data that is actually being read and is not polluted by write-heavy, rarely-read records. The costs: the first read after a write always misses, and an already-cached key still needs invalidation or a TTL to catch up.

Write-back (write-behind): writes go to the cache only; the cache lazily writes back to the database. Fastest writes, dangerous on cache failure (data loss). Used in disk caches, rarely in distributed application caches.

The Caching Strategies overview article covers all of these in detail.

Step 5: Strategy 4 — Versioning (Cache Busting)

Don't invalidate at all. Change the cache key. Old entries linger but no one asks for them. Eventually the cache evicts them by LRU or other policy.

How It Works

Common with static assets. Append a hash or version to the URL or key:

/static/style.v123abc.css   (today)
/static/style.v124def.css   (after change)

When the CSS changes, the URL becomes a new version. Browsers (and CDNs) treat it as a new resource. Cached copies of v123 sit unused until evicted.

For application-level caches:

cache.set("user:42:v3", user_data)
# After data changes:
cache.set("user:42:v4", user_data)

Reads use the current version number. Old versions remain until evicted.
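One way to track the current version is a pointer key alongside the data. A sketch (the pointer could equally live in the database; key names are illustrative):

def read_user(user_id):
    version = int(cache.get(f"user:{user_id}:version") or 1)
    return cache.get(f"user:{user_id}:v{version}")

def write_user(user_id, new_data):
    current = int(cache.get(f"user:{user_id}:version") or 1)
    cache.set(f"user:{user_id}:v{current + 1}", new_data)   # populate the new version first...
    cache.set(f"user:{user_id}:version", current + 1)       # ...then swap the pointer

Writing the new version before swapping the pointer means readers never see a half-written entry; concurrent writers would need an atomic increment, which this sketch omits.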

Trade-offs

Pros: No invalidation needed. Multiple versions can coexist (useful during deploys). Cache hit rate is high because nobody ever reads stale data.
Cons: Cache fills up with old versions. Need an eviction policy that handles them. The version number itself must be stored somewhere reliable.

When It Is Right

Static assets with a clear notion of "version": bundled JS/CSS, product photos, immutable user-uploaded files. Build tools generate versioned filenames automatically.

Also good for objects that change occasionally and where you want to atomically swap to the new version.

Step 6: Strategy 5 — Tag-Based Invalidation

Tag cache entries by what they depend on. When a tag's underlying data changes, invalidate everything with that tag.

How It Works

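# Hypothetical tag-aware cache API; see Implementations below for real options.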
cache.set("user:42:profile", data, tags=["user:42"])
cache.set("user:42:settings", data, tags=["user:42"])
cache.set("homepage:popular", data, tags=["user:42", "users", "popularity"])

# When user 42 changes anything:
cache.invalidate_tag("user:42")
# All three entries above are gone.

The Power

This solves the "this view depends on multiple data sources" problem. A homepage shows popular products, recent reviews, user-specific recommendations. If any of those source data sets changes, the homepage view should be invalidated. Tags let you express this without writing complex invalidation logic per route.

Implementations

Varnish: tag-style invalidation via bans matching on custom headers, or via the xkey module (surrogate keys).
Cloudflare (purge by Cache-Tag), Fastly (surrogate keys), and other CDNs: tag-based purge, typically on paid tiers.
Custom Redis-based: you build it yourself. A tag is a Redis set holding the keys with that tag. Invalidating a tag iterates through the set and deletes the keys.
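A minimal version of that custom Redis approach, assuming a redis-py client r (function names are illustrative):

def set_with_tags(key, value, tags, ttl=300):
    r.set(key, value, ex=ttl)
    for tag in tags:
        r.sadd(f"tag:{tag}", key)              # tag -> set of keys carrying it

def invalidate_tag(tag):
    keys = r.smembers(f"tag:{tag}")
    if keys:
        r.delete(*keys)                        # drop every key carrying the tag
    r.delete(f"tag:{tag}")                     # drop the tag set itself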

Trade-offs

Pros: Express complex invalidation rules cleanly. One change can invalidate many related entries.
Cons: Storage overhead (tag-to-keys mappings). Risk of over-invalidation if tags are too broad. Tag explosion if every key gets many tags.

When It Is Right

Complex applications where many cached views depend on overlapping source data. CMS-like systems, e-commerce homepages, dashboards. Anywhere you find yourself writing "if X changes, invalidate A, B, C, and D."

Step 7: Strategy 6 — CDC-Based Invalidation

Connect cache invalidation to the database's transaction log. When the database commits a change, the cache automatically invalidates relevant entries.

How It Works

Tools like Debezium, Maxwell, or AWS DMS watch the database's replication log (the binlog in MySQL, the write-ahead log in Postgres). Every commit produces a change event with old and new values. A consumer maps these events to cache invalidations.

For example, the consumer sees: "row in users table with id=42 was updated." It invalidates user:42 in Redis, plus any tagged entries that depend on it.
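A sketch of such a consumer, assuming Debezium-style change events arriving on Kafka (the topic name and event shape are illustrative, and error handling is omitted):

import json
from kafka import KafkaConsumer                # kafka-python; any Kafka client works

consumer = KafkaConsumer("dbserver.public.users")
for message in consumer:
    event = json.loads(message.value)["payload"]
    row = event["after"] or event["before"]    # deletes carry only "before"
    cache.delete(f"user:{row['id']}")          # plus any tagged entries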

Trade-offs

Pros: Single source of truth. The database knows all changes, regardless of which application wrote them. No risk of forgetting an invalidation in some code path. Works for direct SQL writes, ORM writes, batch jobs, migrations, and external tools.
Cons: Adds infrastructure complexity (CDC pipeline). Latency window between commit and invalidation (typically seconds, sometimes more). Initial setup is non-trivial.

When It Is Right

Systems with multiple writers (services, scripts, admin tools) where coordinated cache invalidation is otherwise impossible. CDC removes the discipline problem from event-based invalidation.

The CDC article goes into detail on this.

Step 8: The Cache Stampede Problem

This deserves its own section. It is the #1 reason cached systems fail under load.

The Problem

A popular cache entry expires. Within milliseconds, a million concurrent users all miss the cache simultaneously. They all hit the database for the same data. The database melts.

The Cache Stampede
Before expiry
1M concurrent users
Cache: HOT
all hits served
DB: idle
cache expires
Stampede
1M users
Cache: MISS
all bypass
DB: 1M concurrent queries
OVERLOADED

Solutions

Probabilistic Early Refresh

As the TTL approaches, occasionally refresh the entry before it expires. Each request has a small (and growing) probability of triggering refresh. By the time TTL hits zero, someone has likely already refreshed.

Implementation: each read draws a random number in [0, 1). If it is less than beta × (now − issued) / (expiry − issued), refresh now. A small constant beta (typically 1.0) tunes how aggressively to refresh.
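A sketch of that check, assuming you store the issue and expiry timestamps alongside the entry:

import random
import time

def should_refresh_early(issued_at, expires_at, beta=1.0):
    # Refresh probability grows linearly from 0 (just written)
    # to beta (about to expire).
    age_fraction = (time.time() - issued_at) / (expires_at - issued_at)
    return random.random() < beta * age_fraction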

Single-Flight / Request Coalescing

When many concurrent requests miss the cache for the same key, only one request actually goes to the database. Others wait for it to complete and use its result.

Languages and frameworks provide primitives:

Go: singleflight from x/sync.
Java: Caffeine cache library.
Python: implemented manually with locks.
JS/Node: implemented manually with promise reuse.

Effect: even with a million concurrent cache misses, only one database call happens. The other 999,999 requests wait briefly and get the result from cache.

Almost every production cache system needs this. Without it, traffic spikes will eventually take down origin.
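A minimal single-flight sketch in Python, using a per-key event and a shared result (the class and method names are illustrative, not a library API):

import threading

class SingleFlight:
    def __init__(self):
        self._mu = threading.Lock()
        self._calls = {}                       # key -> in-flight call record

    def do(self, key, fetch):
        with self._mu:
            call = self._calls.get(key)
            if call is None:                   # we are the leader for this key
                call = {"done": threading.Event(), "result": None, "error": None}
                self._calls[key] = call
                is_leader = True
            else:
                is_leader = False
        if is_leader:
            try:
                call["result"] = fetch()       # the single origin call
            except Exception as exc:
                call["error"] = exc
            finally:
                with self._mu:
                    del self._calls[key]
                call["done"].set()
        else:
            call["done"].wait()                # followers wait for the leader
        if call["error"] is not None:
            raise call["error"]
        return call["result"]

Wrap the cache-miss path in it: flight.do(key, fetch_from_db) guarantees at most one fetch_from_db per key at a time.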

Lock-Based Regeneration

Similar to single-flight but using an explicit lock (often Redis-based). The first request acquires a lock; others either wait or briefly serve stale data; the holder regenerates and releases.
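A sketch with redis-py, relying on the atomicity of SET NX EX (key names and timeouts are illustrative):

import time

def get_with_lock(key, fetch, ttl=300, lock_ttl=10):
    value = cache.get(key)
    if value is not None:
        return value
    if r.set(f"lock:{key}", "1", nx=True, ex=lock_ttl):    # atomic SET NX EX
        try:
            value = fetch()                    # only the lock holder hits origin
            cache.set(key, value, ex=ttl)
            return value
        finally:
            r.delete(f"lock:{key}")
    for _ in range(50):                        # everyone else polls briefly
        time.sleep(0.1)
        value = cache.get(key)
        if value is not None:
            return value
    return fetch()                             # give up and fetch directly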

Pre-warming

For very hot keys, refresh proactively (via a background job) before they expire. Ensures the cache is always populated. Used heavily in CDNs for predicted-popular content.

Stale-While-Revalidate

An HTTP cache directive that says: "if the entry is expired, serve it anyway and refresh in the background." Users always get a fast response. The cache catches up asynchronously.

Eliminates stampedes because nobody waits for refresh during a miss. The trade-off: brief staleness window (seconds to a few minutes depending on background refresh latency).

Random TTL Jitter

If many keys have the same TTL, they all expire at the same time, multiplying the stampede. Add randomness: TTL of 300 seconds becomes 270 to 330 seconds (10% jitter). Different keys expire at different times. Smooths the load.
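In code, jitter is a one-liner on the write path:

import random

cache.set("user:42", user_data, ex=300 + random.randint(-30, 30))   # 300s ± 10%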

Step 9: Negative Caching

Don't just cache positive responses. Cache "not found" results too, with a short TTL.

Why It Matters

Without it, every request for a non-existent item hits the database. If an attacker repeatedly requests random non-existent IDs, every request bypasses cache and hammers the database. Or if your application has a bug that requests bad keys, your database load spikes.

With negative caching, the second request for "missing key" hits the cache and gets a cheap negative result instead of going to the database.

Implementation

NOT_FOUND_SENTINEL = "__NOT_FOUND__"   # any marker that cannot collide with real data

def get_user(user_id):
    cached = cache.get(f"user:{user_id}")
    if cached is not None:
        if cached == NOT_FOUND_SENTINEL:
            return None                # cached negative result
        return cached

    user = db.get(user_id)
    if user is None:
        # Cache the miss with a short TTL so new records surface quickly.
        cache.set(f"user:{user_id}", NOT_FOUND_SENTINEL, ex=60)
        return None
    cache.set(f"user:{user_id}", user, ex=300)
    return user

Caveats

Use a short TTL for negatives (60 seconds typical). Otherwise newly-created records get hidden until the negative cache expires.

Distinguish "not found" from "lookup failed." Don't cache temporary failures (database timeout, network error) as negative results; they will recover.

Step 10: HTTP Caching Specifics

HTTP has its own cache control mechanisms. Browsers and CDNs respect them automatically.

Cache-Control Header

The most important header. Controls how clients (browsers, CDNs) cache the response.

Cache-Control: public, max-age=3600 — cache anywhere for 1 hour.
Cache-Control: private, max-age=600 — only the user's browser, not CDNs.
Cache-Control: no-cache — must revalidate before serving.
Cache-Control: no-store — never cache, anywhere.
Cache-Control: max-age=300, stale-while-revalidate=86400 — fresh for 5 minutes, then serve stale up to 1 day while refreshing.

ETag and If-None-Match

Conditional revalidation. The server sends an ETag (a content fingerprint) with each response. The client's next request includes If-None-Match with the ETag. If the content hasn't changed, the server returns 304 Not Modified instead of the full body. Saves bandwidth.
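A typical exchange, with illustrative values:

First response:
HTTP/1.1 200 OK
ETag: "a1b2c3d4"

Client's next request:
GET /api/users/42 HTTP/1.1
If-None-Match: "a1b2c3d4"

Server's reply if the content is unchanged:
HTTP/1.1 304 Not Modified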

Last-Modified and If-Modified-Since

Same idea, timestamp-based. Less reliable than ETag: timestamps have only one-second precision, and a modification time can change, or fail to change, independently of the content.

Vary Header

Tells caches that the response depends on certain request headers. Vary: Accept-Language means cache separately per Accept-Language value.

Be careful: every Vary axis multiplies cache entries. Vary: User-Agent can fragment a single object into millions of cache entries.

Step 11: Choosing a Strategy

Decision Tree

Question 1: Can I tolerate staleness for some seconds/minutes? If yes, TTL is the simplest.

Question 2: Do I need invalidation faster than my TTL allows? If yes, add event-based or CDC.

Question 3: Do I have multiple writers (services, jobs, tools)? If yes, CDC removes the "forgot to invalidate" risk.

Question 4: Do many cached views depend on overlapping source data? If yes, tag-based.

Question 5: Is the data immutable once created or rarely changed? If yes, versioned URLs are cleanest.

Question 6: Can I tolerate any cache stampede on hot keys? Probably not. Add stale-while-revalidate or single-flight.

Combining Strategies

Real systems combine strategies per data type:

User profile: TTL of 5 minutes + event-based invalidate on profile updates.
Static assets: versioned URLs, never invalidated.
CDN edge cache: stale-while-revalidate plus tag-based purge for special events.
Application read-through cache: TTL with random jitter and single-flight protection.
Configuration data: write-through cache for instant consistency.

The wrong answer is "one strategy fits all." Different data has different needs.

Step 12: Edge Cases and Operational Concerns

The Read-After-Write Race

You write to the database, then invalidate the cache. Meanwhile a concurrent reader misses the cache, fetches the row from the database before your write commits, and writes that old value into the cache after your invalidation runs. The cache now holds stale data even though you just invalidated.

Fix: invalidate AFTER the database write, and shrink the window further by invalidating twice, once before the write and once more after a short delay (the "delayed double delete"). Or use single-flight on the read path so concurrent reads coalesce.
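A sketch of the double-delete variant (the delay is illustrative and must exceed your read path's worst-case latency; in practice the second delete is often scheduled on a background worker rather than slept inline):

import time

def update_user(user_id, new_data):
    cache.delete(f"user:{user_id}")        # clear the current entry
    db.update("users", user_id, new_data)
    time.sleep(0.5)                        # wait out in-flight reads
    cache.delete(f"user:{user_id}")        # evict anything they re-cached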

Cache Inconsistency Across Replicas

If your cache is replicated (Redis cluster, multi-region), invalidations must propagate. A small replication lag can mean one replica still has stale data while another is fresh.

For most use cases, eventual consistency is fine. For strict consistency, write-through or quorum-based caches.

Cache Warming on Cold Start

You restart the cache cluster. All keys are gone. Every request now hits origin. Origin cannot handle the load. Cluster cannot warm up because new requests overwhelm origin.

Fix: warm the cache from a backup before opening to traffic. Or open to traffic gradually. Or pre-load critical keys from origin before going live. Or use single-flight to bound concurrent origin queries.

Cache Memory Pressure and Eviction

Caches have bounded memory. When full, they evict. The eviction policy (LRU, LFU, random) determines what stays. Wrong policy can evict your hot keys to make room for cold ones.

Monitor cache hit rate. If it drops, investigate: maybe eviction is too aggressive, or your working set has grown beyond cache size.

Distributed Cache Failure

The cache cluster goes down. All requests fall through to origin. Origin overwhelmed. Compounds in milliseconds.

Defense: fallback caches in-process; degraded service mode that serves slightly stale or simplified data; circuit breakers that quickly fail with cached "we're degraded" responses rather than overloading origin.

Multi-Layer Cache Coherence

Modern systems have many cache layers. Browser cache, CDN cache, in-process cache, distributed cache, database query cache. Invalidation must propagate through all of them.

The right pattern: explicit cache hierarchies where each layer knows its parent, or consistent Cache-Control headers set at origin so that every HTTP layer expires entries on a coordinated schedule.

Avoiding Over-Invalidation

Tag-based invalidation can over-invalidate if tags are too broad. "Invalidate all user-related caches" might wipe out millions of entries when you only meant to update one user.

Fix: more granular tags. Hierarchical tags (user:42:profile vs. user:42:settings). Track which tags actually invalidate frequently and refine.

Auditability

When stale data appears in production, you want to debug. Log every cache write, invalidation, and miss. Sample at high rates so you have data when something goes wrong. Without logs, debugging cache issues is shooting in the dark.

Step 13: Recap of Key Decisions

TTL is the simplest strategy. Use it as the default unless you need stricter staleness control.
Event-based invalidation gives instant consistency. But requires discipline across all writers.
CDC removes the discipline problem. Connects invalidation to the database transaction log.
Write-through caching for critical-consistency data. Slower writes, instant reads.
Versioned URLs for immutable assets. No invalidation needed.
Tag-based invalidation for complex view dependencies. Express "invalidate all derivatives of X" cleanly.
Always have stampede protection. Single-flight or stale-while-revalidate.
Negative caching to prevent miss storms. Cache "not found" too.
Random TTL jitter to prevent simultaneous expiration. Smooths the cache miss load.
Combine strategies per data type. No single strategy fits everything.

The One Thing to Remember

There is no universal cache invalidation strategy. Each approach has a failure mode: TTL accepts staleness, event-based requires discipline, write-through slows writes, tag-based risks over-invalidation, versioning requires URL changes. The skill is recognizing which failure mode you can live with for each piece of data, then layering strategies. TTL plus event-based plus stale-while-revalidate is a strong default for most caches. Add CDC if you have multiple writers. Add tag-based if you have complex view dependencies. Always add stampede protection because traffic spikes will eventually find your weak points. The hardest part of cache invalidation is not the algorithms; it is admitting upfront which staleness window your business can tolerate, and designing around that constraint instead of pretending caches and consistency can both be perfect.