Database Replication - Mohammadali Bazyar

Why Replication Exists

You have one database. It works. Then a hard drive fails, the data center loses power, or a process crashes mid-write, and your data is suddenly unavailable or worse, gone. The whole product is down.

Or your traffic grows. One database can handle 10,000 reads per second. Now you need 100,000. The single machine is the bottleneck.

Or your users span continents. A database in Virginia is 200ms away from Tokyo. Every query feels slow.

All three problems have the same answer: replication. Run multiple copies of the database, kept in sync, distributed across machines and regions. Each copy is a real database that can serve traffic. Failures of one copy don't take down the system. Reads spread across copies. Copies near users serve them faster.

Replication is the foundation of every production-grade database setup. Even modest applications run with at least one replica for failover. This article walks through the mechanisms, the trade-offs, and the operational concerns that make it work.

Step 1: The Three Reasons to Replicate

Availability

If one server dies, another keeps serving. The product stays up. This is the #1 reason most teams replicate. Even tiny applications benefit: a 99.9% uptime SLA with one server typical of cloud infrastructure means roughly 9 hours of downtime per year. Two replicas in standby cuts that dramatically.

Read Scalability

Distribute read traffic across multiple copies. Each replica handles a slice of the load. If your application is read-heavy (most are: 90% reads, 10% writes is typical), this multiplies your effective capacity.

Writes still go to one place (in leader-follower setups), so write scalability requires sharding (covered in another article).

Geography

Keep replicas near users. A database in the same region as the user gives 5-50ms latency. Cross-continent gives 100-200ms. For interactive applications, the difference is felt.

Multi-region replication is non-trivial: replication lag, conflict resolution, regulatory considerations all come into play. But for global apps it is usually worth the effort.

Step 2: The Three Replication Topologies

Three fundamental approaches differ in how writes are handled.

Topology 1: Leader-Follower (Master-Slave)

One server is the leader. All writes go to it. The leader streams its changes to one or more followers. Followers serve reads only; they never accept writes directly.

Leader-Follower Topology

Leader
writes + reads

replicates to

Follower 1
reads only

Follower 2
reads only

Follower 3
reads only

Pros: simple to reason about. No write conflicts because there's only one writer. Easy to scale reads (add followers).
Cons: the leader is a write bottleneck and a single point of failure. Promotion of a new leader during failover is tricky and one of the riskiest operations in distributed databases.

Used by: Postgres (streaming replication), MySQL (replication), Redis (replication), MongoDB (replica sets). The default and most common topology in nearly all relational databases.

Topology 2: Multi-Leader (Master-Master)

Multiple servers can accept writes. Each one replicates its changes to the others.

Pros: writes can happen anywhere. Local writes possible per region (a user in Tokyo writes to Tokyo's leader, not Virginia's). No single bottleneck for writes.
Cons: conflict resolution. If two leaders accept conflicting writes simultaneously (user A is changed differently in two places), the system has to merge them. Most conflict-resolution approaches are imperfect (last-write-wins discards data; CRDT requires careful design).

Used by: Cassandra (sort of, peer-based with quorums), CouchDB, multi-region setups in some cloud SQL services, MySQL multi-master configurations.

Multi-leader is most commonly used for cross-region writes where you want local-region latency. Otherwise it adds complexity without clear benefit.

Topology 3: Leaderless

No designated leader. Every node accepts writes. Clients write to several nodes simultaneously and read from several to ensure consistency.

Uses quorums: with N replicas, W writes acknowledged and R reads acknowledged, you get strong consistency when W + R > N. Common config: N=3, W=2, R=2.

Pros: very fault-tolerant. No leader to fail. Writes succeed even if some nodes are down. Linearly scalable.
Cons: the client (or coordinator) handles all the complexity of figuring out which value is current. Requires "read repair" and "anti-entropy" background processes to keep replicas converged. More complex client behavior.

Used by: DynamoDB, Cassandra (when configured this way), Riak. Inspired by the original Dynamo paper from Amazon.

Leaderless is the choice when extreme write availability matters more than simpler operational models.

Step 3: Synchronous vs Asynchronous Replication

Orthogonal to topology. The question: when does the leader confirm a write to the client?

Synchronous Replication

Wait until all (or some) followers acknowledge the write before responding to the client. Strong durability. The write is safe even if the leader crashes immediately after responding.

The cost: every write pays the round-trip to followers. If a follower is slow, all writes slow down. If a follower is down, writes might fail entirely.

Asynchronous Replication

Respond to the client as soon as the leader has the write. Followers catch up later. Fast and resilient: a slow follower doesn't slow down writes.

The cost: if the leader crashes before replicating, those last few writes are lost. Tens of milliseconds of data potentially gone, which matters for some applications and is fine for many.

Semi-Synchronous

Wait for at least one follower to acknowledge (not all). Balances safety and speed. Used by MySQL, MongoDB (with write concerns), and others.

Most production databases default to semi-sync because it gives most of the durability of full sync without the worst-case latency.

What to Choose

Pure async: speed-critical workloads where small data loss windows are acceptable.
Pure sync: regulated workloads (banking, healthcare) where every committed write must be durable on multiple machines.
Semi-sync: most everything else. The pragmatic default.

Step 4: Replication Lag

With async (or semi-sync), followers are always slightly behind the leader. Usually milliseconds. Sometimes seconds. Occasionally minutes during heavy load or network issues.

This causes the most-discussed problem in replicated systems: read-your-write inconsistency.

The Problem

A user writes data. Their client immediately reads. The read goes to a follower that hasn't caught up yet. The user sees old data. They panic: "Did my change save?"

Common scenarios: user updates their profile, refreshes the page, sees the old version. User posts a comment, sees the page without their comment. User makes a payment, sees their balance still showing the old amount.

Mitigations

Read from leader after writes. For a short window after a user writes, route their reads to the leader. Sticky session by user ID. Simple and effective. Trade-off: the leader gets more read load.

Track replication position. Remember the leader's log position when the user writes. Make subsequent reads wait until the follower catches up to that position. More complex but precise.

Show pending state in UI. Tell the user "saving..." until the change is confirmed propagated. Application-level workaround.

Read repair. If you read different values from different followers, ask the leader. Fix the lagging follower.

Monotonic Reads

A weaker consistency guarantee but useful: ensure a single user never sees data go "backward in time." If user A reads version N from one follower, all subsequent reads (in the same session) must see at least version N. Even if a different follower would otherwise return an older state.

Implementation: route a session's reads to the same follower (sticky), or track minimum acceptable replication position per session.

Step 5: Failover — When the Leader Dies

The leader crashes. One of the followers needs to take over. This is failover, and it is one of the most operationally complex moments in distributed databases.

Detection

How do you know the leader is really down vs just slow? Heartbeats with timeouts. The cluster monitor pings the leader every few seconds. If multiple heartbeats fail, declare it dead.

The hard part: avoiding false positives. A 30-second network blip should not trigger failover; the leader will recover. Aggressive timeouts cause unnecessary failovers; lazy timeouts cause user-visible outages while the system waits to declare death.

New Leader Selection

Which follower takes over? Usually the one with the most recent data (the smallest replication lag). All followers compare their replication positions; the most up-to-date wins.

This requires consensus: the cluster must agree on the new leader. Multiple followers running for leader simultaneously creates split-brain. Modern systems use Raft or Paxos for the election; older ones use heuristic approaches that can fail.

Reconfiguration

Once the new leader is elected, all clients and remaining followers must learn about it. Connections to the old leader fail; new connections target the new leader.

This is where many client libraries fail. They might keep retrying the dead leader, miss the announcement, or attempt to write to a different node. Modern client libraries handle this gracefully (sentinel for Redis, replica set drivers for MongoDB, etc.).

Data Loss During Failover

If async replication and the old leader had writes that hadn't been replicated yet, those writes are lost when the new leader takes over. Period. There is no way to recover them once the old leader is gone.

For critical data, use sync replication so this can't happen. For everything else, accept that small data loss windows are real and design around it.

Split Brain

The old leader was thought to be dead. The cluster elected a new leader. The old leader was actually alive (just network-partitioned from the cluster). Now both think they are leader. They both accept writes. The data diverges.

This is a catastrophic failure mode in distributed databases. Mitigation:

Fencing tokens. Every leader gets an incrementing epoch number. Storage and clients reject writes from old epochs. The old leader's writes are rejected once the new leader takes over.
STONITH ("Shoot The Other Node In The Head"). Forcefully kill the old leader (often via remote management interfaces) before promoting a new one.
Quorum-based election. A new leader requires majority. The old leader, partitioned, cannot form a majority and gradually times out.

The Leader Election article goes deeper on these mechanisms.

Tools That Automate Failover

Patroni for Postgres. MHA or Orchestrator for MySQL. Sentinel for Redis. The database's own clustering features (MongoDB replica sets, etcd Raft, etc.).

All have their own quirks. None is foolproof. Test failover before you need it; it will happen at a bad time.

Step 6: How Replication Actually Works (Under the Hood)

Three main implementations.

Statement-Based Replication

The leader ships every write SQL statement to followers; followers replay them.

Brittle. Non-deterministic functions like NOW(), RAND(), UUID() produce different results on different machines. Some statements with side effects (auto-increment, triggers) are hard to replicate exactly. MySQL had problems with statement-based replication for years.

Mostly historical now. Modern systems prefer the alternatives.

Write-Ahead Log (WAL) Shipping

The leader writes a binary log of changes (this log already exists for crash recovery). The leader streams that log to followers; followers apply the log entries to their own state.

Reliable: the same bytes that the leader committed get applied to followers. No interpretation. Used by Postgres streaming replication.

Trade-off: tightly coupled to the database engine. Cross-version replication can be fragile (a Postgres 16 follower might not understand a Postgres 17 WAL format).

Logical (Row-Based) Replication

Ship row-level change events: "row X was updated to {a:1, b:2}." More portable across versions and even across databases.

Used by MySQL row-based replication (the modern default), Postgres logical replication (10+), and CDC tools (Debezium etc.).

Modern logical replication can do things WAL shipping can't: replicate a subset of tables, transform rows on the way, replicate to a different database engine. The cost: slightly more overhead and complexity than raw WAL shipping.

Step 7: Architecture Patterns

Single-Region Production Setup

One leader, two followers in the same region. Semi-sync replication (wait for at least one follower to acknowledge). Reads spread across followers. Failover automated via Patroni or similar.

This is a baseline for many web applications. It survives single-server failures and gives modest read scalability.

Multi-Region Read Replicas

Leader in primary region. Read replicas in other regions for low-latency reads. Writes always cross region to the primary.

Trade-off: writes are slow for users far from the primary. Reads are fast everywhere. Suitable when reads dominate and writes are tolerated to be slower.

Multi-Region Active-Active (Multi-Leader)

Each region has its own leader. Local writes are fast. Replication propagates to other regions asynchronously. Conflicts handled per the application's chosen strategy.

Most complex setup. Used by global services that need write availability everywhere.

Read Replicas for Analytics

A separate replica dedicated to analytics queries. Big, expensive aggregations don't impact the primary OLTP workload. Modern variations use Change Data Capture into a data warehouse (covered in the CDC article).

Hot Standby vs Cold Standby

Hot standby: replica is actively serving reads, ready to be promoted in seconds.
Warm standby: replica is up-to-date but not actively serving traffic. Promotion takes seconds.
Cold standby: backups stored elsewhere; restore takes hours but storage cost is minimal.

Most applications use hot standby for primary protection plus cold/cloud backups for disaster recovery.

Step 8: Edge Cases and Operational Concerns

Initial Replica Setup

Spinning up a new replica from a multi-TB database takes hours. The replica must catch up from scratch. Methods:

Take a backup of the leader and restore it to the new replica, then catch up via WAL.
Use a tool like pg_basebackup or innobackupex that streams the data while the leader keeps running.
For huge databases, set up replication from a snapshot to minimize impact.

Replication Slot Bloat (Postgres-Specific)

If a follower is slow or disconnected, the leader keeps WAL data so the follower can catch up. If the follower stays disconnected, the leader's WAL grows indefinitely until disk fills.

Monitor replication slot lag. Set max_slot_wal_keep_size to bound the growth. Recreate the replica if it falls too far behind.

Schema Migrations on Replicated Databases

Adding a column on a multi-TB table is expensive. With replication, both leader and replicas pay the cost. Online schema change tools (gh-ost, pt-online-schema-change for MySQL) help.

Some changes (renaming columns, changing types) require downtime or careful multi-step migrations to avoid breaking replication.

Monitoring

Critical metrics:

Replication lag (in bytes and in seconds). Should be low and stable.
Replication errors (failed transactions, breaks).
Connection state to all followers.
Disk space on leader and followers.
WAL/binlog size growth.

Lag spikes are early warning of trouble.

Backup Strategies on Top of Replication

Replication is not backup. It protects against single-server failures, not data corruption (which propagates) or accidental deletes (which propagate). Run regular backups separately, including point-in-time recovery via WAL archives.

Network Costs

Cross-region replication ships every change across the network. For high-write workloads, the bandwidth bill is real. Some setups compress WAL streams or use logical replication with filtering to reduce bandwidth.

Read Replicas Lag During Heavy Writes

A bulk import or migration on the leader produces enormous WAL. Replicas struggle to keep up. Read consistency degrades. Plan around large bulk operations: do them during low-traffic periods, throttle them, or temporarily redirect reads to the leader.

Clock Drift Between Nodes

Some replication mechanisms use timestamps. Clock drift between leader and followers can cause issues. Use NTP or similar to keep clocks within a few milliseconds.

Cascading Replication

A follower can replicate to its own followers (reducing leader load). Useful for very large fan-out. Adds another level of lag but spreads bandwidth load.

Step 9: Recap of Key Decisions

Replication is not optional in production. Even one replica saves you from inevitable disk failures.
Leader-follower is the right default. Simple, well-understood, supported everywhere.
Multi-leader for multi-region writes. Pay for it with conflict resolution complexity.
Leaderless for extreme write availability. The Dynamo lineage.
Async or semi-sync for most workloads. Pure sync is too slow for typical apps.
Replication lag is real and visible to users. Design around read-your-writes and monotonic reads.
Failover is the riskiest moment. Automate it. Test it. Plan for split-brain.
Replication is not backup. Always run separate backups.
Monitor lag, errors, and disk space. Lag spikes are the earliest warning.

The One Thing to Remember

Replication is the foundation of every production database setup. The question is not whether to use it but which topology and how synchronous. Leader-follower with async or semi-sync is the right starting point for almost everyone. Multi-leader and leaderless are powerful but bring conflict-resolution complexity that most teams don't actually need. Whichever you pick, design around replication lag, plan failover before you need it, test it before production, and remember that the most dangerous moment in any replicated system is not a quiet failure; it is a leader transition that goes wrong. Lots of databases promise easy replication. Few of them are easy when something breaks. Invest in monitoring, in tooling, and in your understanding of how it actually works.