Why Distributed Systems Cannot Always Be Consistent
Imagine you have a single database. You write data to it, then you read it, and you always see exactly what you just wrote. Simple. Predictable. This is called strong consistency. ACID guarantees give it to you out of the box, as long as everything fits on one machine.
Now imagine your system is too big for one machine. You have a payments service, an inventory service, a shipping service, a notifications service. Each runs on its own servers, with its own database, in different regions. A single user action (like placing an order) needs to update all of them.
You suddenly hit a wall. Forcing all those services to agree on the same state at the exact same moment is impossibly slow, fragile, or both. So distributed systems make a different bargain: eventually, all the services will agree. For a brief window, they will not.
That bargain is called eventual consistency, and it is the foundation of nearly every system you use at scale: Amazon's shopping cart, Twitter's timeline, your bank's transaction history, Uber's location tracking. They all live in this world.
The challenge is not whether to accept eventual consistency. It is which pattern to use to manage it. Get the pattern wrong and you get duplicate orders, lost messages, or impossible-to-debug race conditions. Get it right and your system feels seamless to users despite running across hundreds of servers.
This article walks through the four core patterns: Event-based, Background Sync, Saga, and CQRS. By the end, you will know when to reach for each one.
The Consistency Spectrum
Consistency is not a yes-or-no property. It is a spectrum. On one end you have absolute strong consistency: every read sees the latest write, no matter what. On the other end you have eventual consistency: reads might see stale data for a while, but eventually they catch up. In between are many flavors: read-your-writes, monotonic reads, causal consistency, and more.
Most real systems pick a position on this spectrum based on their needs. A bank balance needs to be strongly consistent. A social media like count is fine being eventually consistent. The trick is picking correctly per use case, not for the whole system.
Pattern 1: Event-Based Eventual Consistency
This is the most common pattern in modern systems. The idea is simple: when one service changes its data, it publishes an event describing what happened. Other services that care about that event subscribe and update their own data accordingly.
The services never call each other directly. They communicate through a message broker (like Kafka, RabbitMQ, or Pulsar). This keeps them loosely coupled. Service A does not know or care which other services are listening. It just shouts "OrderCreated" into the void, and the void figures out who needs to hear it.
Real-world example: An e-commerce platform. When the Orders service writes a new order, it publishes OrderCreated. Five different downstream services consume that event independently. Inventory drops the stock count. Billing kicks off the charge. Shipping creates a label. Notifications emails the customer. Analytics tracks the conversion. None of these services know about each other.
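To make that concrete, here is a minimal sketch of the producer side, assuming the kafka-python client, a topic named "orders", and an event_id field for deduplication (all illustrative choices, not prescribed above):

```python
import json
import uuid
from datetime import datetime, timezone

from kafka import KafkaProducer  # pip install kafka-python

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

def place_order(order_id: str, customer_id: str, items: list) -> None:
    # Step 1: write the order to the Orders service's own database (omitted here).
    # Step 2: announce the fact. The producer has no idea who is listening.
    event = {
        "event_id": str(uuid.uuid4()),  # consumers use this to deduplicate
        "type": "OrderCreated",
        "occurred_at": datetime.now(timezone.utc).isoformat(),
        "order_id": order_id,
        "customer_id": customer_id,
        "items": items,
    }
    producer.send("orders", value=event)
```

One caveat worth knowing: the database write and the broker publish are two separate operations, and one can succeed while the other fails. Production systems usually close that gap with a transactional outbox, which is beyond the scope of this sketch.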
Pros:
- Loose coupling. Add new consumers without touching the producer.
- High throughput. The event bus absorbs spikes.
- Audit trail. Every change is an immutable event.
- Resilient. If one consumer is down, others keep working.
Cons:
- Hard to debug across services.
- Out-of-order events can cause weird states.
- You must handle duplicate events (idempotency; see the consumer sketch after this list).
- No straightforward rollback if something fails downstream.
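Here is the matching consumer side, sketching the idempotency point from the list above. The processed_events table and the event_id field are illustrative, and SQLite stands in for the consumer's own database:

```python
import json
import sqlite3

from kafka import KafkaConsumer  # pip install kafka-python

db = sqlite3.connect("inventory.db")
db.execute("CREATE TABLE IF NOT EXISTS processed_events (event_id TEXT PRIMARY KEY)")

def decrement_stock(items: list) -> None:
    ...  # the service's actual local state change (stubbed out)

consumer = KafkaConsumer(
    "orders",
    bootstrap_servers="localhost:9092",
    group_id="inventory-service",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)

for message in consumer:
    event = message.value
    try:
        # A primary-key insert fails on replay, so a redelivered event is
        # skipped instead of being applied twice.
        db.execute("INSERT INTO processed_events VALUES (?)", (event["event_id"],))
    except sqlite3.IntegrityError:
        continue  # already processed this one
    decrement_stock(event["items"])
    db.commit()  # dedup marker and state change commit together
```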
When to use it: Default choice for microservices. Use it whenever a state change in one service should trigger reactions in others, and you do not need to roll back the whole chain on failure.
Pattern 2: Background Sync
This pattern does not react to events at all. Instead, a scheduled job runs periodically and copies data from one place to another. It is the boring, reliable workhorse of eventual consistency.
You set up a cron job (or an Airflow DAG, or a Lambda on a timer) that runs every 5 minutes, every hour, every night. The job reads from the source, transforms what it needs, and writes to the target. Differences between the two systems get reconciled with each run.
The source remains the system of record; the target is only ever a derived view.
Real-world example: A data warehouse. Operational databases serve the live application. Every night at 3 AM, an ETL job copies fresh data into the warehouse so analysts can query it without slowing down production. Reports lag the live data by a day, but everyone is okay with that.
Another example: search indexes. A nightly job rebuilds the Elasticsearch index from the canonical Postgres data. New articles take up to 24 hours to appear in search, which is fine for many use cases.
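A sketch of such a job, with SQLite standing in for both stores and an updated_at column driving the incremental pull (assumptions for illustration, not requirements stated above):

```python
import sqlite3

def sync_products(source: sqlite3.Connection, target: sqlite3.Connection) -> None:
    # Read the high-water mark left behind by the previous run.
    row = target.execute(
        "SELECT value FROM sync_state WHERE key = 'last_synced_at'"
    ).fetchone()
    watermark = row[0] if row else "1970-01-01T00:00:00"

    # Pull only rows changed since the last run, not the whole table.
    changed = source.execute(
        "SELECT id, name, price, updated_at FROM products "
        "WHERE updated_at > ? ORDER BY updated_at",
        (watermark,),
    ).fetchall()

    for id_, name, price, updated_at in changed:
        # Upsert into the derived view, so the job is safe to re-run.
        target.execute(
            "INSERT INTO products VALUES (?, ?, ?, ?) "
            "ON CONFLICT(id) DO UPDATE SET name = excluded.name, "
            "price = excluded.price, updated_at = excluded.updated_at",
            (id_, name, price, updated_at),
        )
        watermark = updated_at

    target.execute(
        "INSERT INTO sync_state VALUES ('last_synced_at', ?) "
        "ON CONFLICT(key) DO UPDATE SET value = excluded.value",
        (watermark,),
    )
    target.commit()

# Demo setup; in practice the schedule (cron, Airflow, a timer) calls sync_products.
source = sqlite3.connect(":memory:")
target = sqlite3.connect(":memory:")
source.execute("CREATE TABLE products (id TEXT PRIMARY KEY, name TEXT, price REAL, updated_at TEXT)")
target.execute("CREATE TABLE products (id TEXT PRIMARY KEY, name TEXT, price REAL, updated_at TEXT)")
target.execute("CREATE TABLE sync_state (key TEXT PRIMARY KEY, value TEXT)")
source.execute("INSERT INTO products VALUES ('p1', 'widget', 9.99, '2024-01-05T10:00:00')")
sync_products(source, target)
```

Note that the watermark query never sees rows deleted at the source, which is exactly the deletes problem listed below; soft-delete flags or periodic full scans are the usual workarounds.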
Pros:
- Dead simple to implement and reason about.
- No event infrastructure needed.
- Easy to re-run if something fails (just trigger again).
- Easy to add new sync jobs without touching producers.
Cons:
- Lag equals the schedule interval. Always stale by some amount.
- Big jobs that scan entire tables get slow at scale.
- Hard to handle deletes (the source no longer has the row).
- Wasted compute when there's nothing new to sync.
When to use it: When the lag is acceptable (minutes to hours), the data volume is manageable, and you want simplicity over real-time freshness. Common for analytics, reporting, search indexing, and cross-region sync.
Pattern 3: Saga
The hardest case is when a single business transaction spans multiple services and you need all of them to succeed (or all of them to be rolled back). You cannot use a database transaction because the data lives in different databases. You cannot just hope events work out, because if step 3 fails after step 2 already happened, you are left with corrupt state.
A saga solves this. It is a sequence of local transactions. Each step writes to one service's database. If any step fails, the saga runs compensating transactions in reverse order to undo the previous steps.
There are two flavors of saga: choreography and orchestration.
Saga Flavor 1: Choreography
No central coordinator. Each service listens for events and decides what to do next. The flow emerges from the conversation between services.
Pros: No single point of failure. Easy to add new steps. Fits nicely with event-driven systems.
Cons: The overall flow is hidden. You have to read multiple services' code to understand what happens end-to-end. Cycles and ambiguity become real risks as the saga grows.
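A toy sketch of the idea, with an in-memory bus standing in for the broker; the service and event names are made up for illustration:

```python
from collections import defaultdict

handlers = defaultdict(list)

def subscribe(event_type):
    def register(fn):
        handlers[event_type].append(fn)
        return fn
    return register

def publish(event_type, data):
    for fn in handlers[event_type]:
        fn(data)

# Each service reacts to the previous step's event and emits its own.
@subscribe("OrderCreated")
def billing(order):
    print("billing: charging card for", order["id"])
    publish("PaymentCaptured", order)  # or "PaymentFailed" on error

@subscribe("PaymentCaptured")
def shipping(order):
    print("shipping: creating label for", order["id"])
    publish("OrderShipped", order)

@subscribe("PaymentFailed")
def cancel_order(order):
    print("orders: cancelling", order["id"])  # a compensating step

publish("OrderCreated", {"id": "o-42"})
```

Notice there is no single place to read the whole flow: it exists only in the sum of the handlers, which is precisely the debuggability cost described above.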
Saga Flavor 2: Orchestration
A central orchestrator coordinates the saga. It calls each service in sequence, knows the expected outcome, and triggers compensating transactions if any step fails.
Real-world example: Booking a vacation package. The orchestrator reserves a flight, then a hotel, then a car. If the car reservation fails, it cancels the hotel, then the flight. The user either gets the full package or nothing.
Pros: Clear, centralized logic. Easy to understand and monitor. Compensations are explicit.
Cons: The orchestrator becomes a single point of complexity (though tools like Temporal, AWS Step Functions, and Camunda make this manageable), and some coupling is reintroduced: the orchestrator has to know about every service it calls.
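A minimal sketch of that vacation-package saga, with stubbed booking calls; durable tools like Temporal or Step Functions are, in effect, production-grade versions of this loop:

```python
class SagaFailed(Exception):
    pass

def book_flight(trip): print("flight reserved")
def cancel_flight(trip): print("flight cancelled")
def book_hotel(trip): print("hotel reserved")
def cancel_hotel(trip): print("hotel cancelled")
def book_car(trip): raise RuntimeError("no cars available")  # simulated failure
def cancel_car(trip): print("car cancelled")

# Each step pairs a local transaction with the compensation that undoes it.
SAGA = [
    (book_flight, cancel_flight),
    (book_hotel, cancel_hotel),
    (book_car, cancel_car),
]

def run_saga(trip: dict) -> None:
    completed = []
    for action, compensation in SAGA:
        try:
            action(trip)
            completed.append(compensation)
        except Exception as exc:
            # Undo every committed step in reverse order, then surface the failure.
            for undo in reversed(completed):
                undo(trip)
            raise SagaFailed(f"{action.__name__} failed: {exc}") from exc

try:
    run_saga({"destination": "Lisbon"})
except SagaFailed as e:
    print(e)  # flight and hotel were booked, then both were cancelled
```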
Compensating Transactions
The most important concept in sagas. Since you cannot use database rollback across services, you have to undo previously committed work by writing compensating actions.
Note that some compensations are tricky or impossible. You can refund a charge, but you cannot un-send an email. Real saga design requires thinking about this carefully.
When to use sagas: When you need transactional semantics across services. Booking systems, payment flows, multi-step workflows, anything where partial completion is unacceptable.
Pattern 4: CQRS (Command Query Responsibility Segregation)
The previous patterns were about consistency between services. CQRS is about something different: separating writes from reads within a single domain.
The key insight: most applications have very different needs for writes versus reads. Writes need integrity, validation, business rules. Reads need speed, flexibility, denormalization. Trying to use the same data model for both forces compromises.
CQRS splits them. The write model handles commands (create order, update profile, cancel subscription). The read model handles queries (show me my orders, show top products, show user profile). They live in separate databases, optimized differently, kept in sync via events.
In short: the write model validates and applies business rules over normalized, ACID storage, while the read model serves fast lookups from denormalized, query-optimized storage.
Real-world example: An e-commerce platform's product page. Writes (admin updating prices, inventory changes) go to a normalized Postgres database with full ACID. Reads (millions of users browsing) go to a denormalized read model in Elasticsearch or Redis with all the data needed for the page already pre-joined and pre-formatted. New writes propagate to the read model within a second or two via events.
Another example: a banking dashboard. The transaction ledger is an append-only write model with strict ACID. The user-facing account balance is a separately maintained read model that pre-computes balances from the ledger. The user always sees a snappy summary even though the underlying ledger is huge.
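To ground this, here is a minimal sketch of one service's split, with SQLite standing in for the normalized write store and Redis as the denormalized read model (both store choices are illustrative). The projection runs in-process here; a real system would ship the event through a broker:

```python
import sqlite3

import redis  # pip install redis; assumes a local Redis instance

write_db = sqlite3.connect("orders.db")
write_db.execute(
    "CREATE TABLE IF NOT EXISTS orders "
    "(id TEXT PRIMARY KEY, customer_id TEXT, total REAL, status TEXT)"
)
read_store = redis.Redis(decode_responses=True)

def create_order(order_id: str, customer_id: str, total: float) -> None:
    # Command side: validate, enforce business rules, write normalized rows.
    if total <= 0:
        raise ValueError("order total must be positive")
    write_db.execute(
        "INSERT INTO orders VALUES (?, ?, ?, 'new')",
        (order_id, customer_id, total),
    )
    write_db.commit()
    # In production the event would flow through a broker; we project inline.
    project_order_created(
        {"order_id": order_id, "customer_id": customer_id, "total": total}
    )

def project_order_created(event: dict) -> None:
    # Read side: a pre-joined, page-ready document keyed for O(1) lookup.
    read_store.hset(f"order:{event['order_id']}", mapping={
        "customer_id": event["customer_id"],
        "total": event["total"],
        "status": "new",
    })

def get_order(order_id: str) -> dict:
    # Query side never touches the write database.
    return read_store.hgetall(f"order:{order_id}")
```

The read-after-write caveat in the list below falls straight out of this structure: once the projection moves behind a broker, get_order can briefly return nothing for an order that was just created.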
Pros:
- Reads and writes scale independently.
- Different data models for different needs (normalized writes, denormalized reads).
- Different storage engines per side (Postgres for writes, Elasticsearch for reads).
- Read side can power many different views without affecting writes.
Cons:
- Complexity: two data models, two databases, sync layer between them.
- Read-after-write surprises (just wrote something, immediately query, do not see it yet).
- Eventual consistency lag must be visible to users or designed around.
- Overkill for simple CRUD apps.
When to use CQRS: When read and write workloads have very different characteristics. Heavy read traffic on data that changes slowly. Complex domain with strict write rules but simple lookup needs. High-throughput systems where mixing reads and writes on the same database hurts both.
Common Failure Modes
Eventual consistency introduces a class of bugs that simply do not exist in single-database, strongly consistent systems. Knowing these in advance saves a lot of pain. The recurring ones, all touched on above:
- Duplicate delivery. The same event arrives twice; without idempotent handlers, a consumer charges or decrements twice.
- Out-of-order delivery. An event arrives before the event it logically depends on.
- Read-after-write gaps. A user writes, immediately reads, and sees stale data.
- Missing compensations. A multi-step flow fails partway and nothing undoes the steps that already committed.
- Irreversible side effects. You can refund a charge, but you cannot un-send an email.
Which Pattern Should You Use?
A rough decision guide, pulling together the per-pattern advice above:
- Event-based: the default for microservices. A state change in one service should trigger reactions in others, and you do not need to roll back the whole chain.
- Background Sync: lag of minutes to hours is acceptable and you want simplicity. Analytics, reporting, search indexing, cross-region sync.
- Saga: one business transaction spans services and partial completion is unacceptable. Bookings, payments, multi-step workflows.
- CQRS: read and write workloads differ sharply within one domain. Heavy read traffic, strict write rules, pre-computed views.
The One Thing to Remember
Eventual consistency is not a compromise. It is a tool. Picking it gives you scale, availability, and looser coupling that strong consistency cannot match. The price you pay is reasoning about time: knowing that for some window, your different services will disagree.
The four patterns covered here are not mutually exclusive. Real systems use multiple at once. Events flow between services for cross-service consistency. CQRS optimizes a single service's read and write needs. Sagas wrap critical multi-step flows. Background sync fills in when nothing else needs to be real-time.
The goal is not to avoid eventual consistency. It is to choose where to place it deliberately, with the right pattern, and design around the fact that the system will sometimes be temporarily inconsistent. The teams that get this right build systems that feel seamless to users despite running across hundreds of servers in different parts of the world. The teams that get it wrong get woken up at 3 AM trying to figure out why an order has been refunded twice.