Why Distributed Systems Need a Leader

Many distributed systems need a single node to be the "leader" for some task: the one accepting writes (in databases), the one running a scheduled job, the one handing out unique IDs. If two nodes think they're the leader at the same time (split brain), things break in subtle, terrifying ways: duplicated writes, inconsistent state, lost updates.

Leader election is the mechanism for nodes to agree on who's in charge. It sounds simple. It is one of the deepest problems in computer science.

The Hard Part: It's an Agreement Problem

If you have 5 nodes and they all need to agree on which one is the leader, you cannot just have node 1 declare itself the leader. What if node 1 has a network issue and the others can't see it? What if node 2 declared itself first and node 3 already started following node 2?

This is a classic distributed consensus problem. The famous FLP impossibility result shows that in a fully asynchronous network, no deterministic algorithm can guarantee consensus will terminate if even one node may crash. In practice, systems accept slightly weaker guarantees: they rely on timeouts and randomization and reach agreement with high probability under realistic network behavior.

The Bully Algorithm

The simplest leader election. Every node has a unique ID. The node with the highest ID wins.

When a node thinks the leader is dead (no heartbeats), it sends an "election" message to all higher-ID nodes. If none reply, it declares itself the leader and announces that to everyone. If a higher-ID node replies, that node takes over and runs its own election.
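To make the flow concrete, here is a minimal sketch of one election round in Go. The peer IDs and the sendElection/announceCoordinator helpers are hypothetical stand-ins for whatever RPC layer a real system would use:

```go
// Sketch of one round of the Bully algorithm from the point of view of a
// single node. The transport helpers are placeholders, not a real API.
package main

import "fmt"

// sendElection asks a higher-ID peer whether it is alive; it returns true
// if the peer responded (and will therefore take over the election).
func sendElection(peerID int) bool {
	// In a real system this would be an RPC with a timeout.
	return false
}

// announceCoordinator tells every peer that this node is now the leader.
func announceCoordinator(selfID int, peers []int) {
	fmt.Printf("node %d: I am the coordinator\n", selfID)
}

// startElection is called when this node stops hearing leader heartbeats.
func startElection(selfID int, peers []int) {
	higherAnswered := false
	for _, p := range peers {
		if p > selfID && sendElection(p) {
			// A higher-ID node is alive; it takes over the election.
			higherAnswered = true
		}
	}
	if !higherAnswered {
		// No higher-ID node responded: declare ourselves leader.
		announceCoordinator(selfID, peers)
	}
}

func main() {
	startElection(3, []int{1, 2, 4, 5})
}
```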

Pros: simple to understand and implement.
Cons: the highest-ID node always wins, which doesn't account for which node is healthiest or most up-to-date. It is also brittle under network partitions, where each side can end up electing its own leader.

Used for simple coordination in some legacy systems. Modern distributed systems use stronger algorithms.

Raft: The Modern Standard

Raft is the consensus algorithm most modern systems use. It was designed to be understandable (Paxos is famously hard). Used by etcd, Consul, CockroachDB, TiKV, Kafka KRaft mode, and many others.

The mechanics:

1. Every node is in one of three states: Follower, Candidate, or Leader.
2. Time is divided into terms, each with at most one leader.
3. Followers wait for heartbeats from the leader. If they don't hear one within an election timeout (randomized, typically 150-300ms), they become Candidates and start an election.
4. A Candidate increments the term, votes for itself, and asks others for votes.
5. A node grants its vote to the first valid Candidate it sees in that term (one vote per term).
6. A Candidate that receives a majority of votes becomes Leader and immediately sends heartbeats to stop other candidates from starting new elections.

If two candidates split the vote (rare, due to randomized timeouts), the term ends without a leader. Election restarts after another timeout.
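Here is a minimal sketch of the follower timeout and candidate transition in Go, assuming a placeholder requestVote RPC. Log comparison, vote persistence, and the leader's replication loop are omitted; this only shows the randomized timeout and the majority count:

```go
// Not a full Raft implementation: only the election timeout and the
// follower -> candidate -> leader transition.
package raftsketch

import (
	"math/rand"
	"time"
)

type State int

const (
	Follower State = iota
	Candidate
	Leader
)

type Node struct {
	id            int
	peers         []int
	state         State
	currentTerm   int
	lastHeartbeat time.Time
}

// electionTimeout returns a randomized timeout (150-300ms here) so that
// two followers rarely time out at the same moment and split the vote.
func electionTimeout() time.Duration {
	return time.Duration(150+rand.Intn(150)) * time.Millisecond
}

// requestVote stands in for the real RPC; it returns true if the peer
// granted its vote for this term.
func requestVote(peer, term, candidateID int) bool { return false }

// runFollower waits for heartbeats; if none arrives before the timeout,
// the node becomes a candidate and starts an election.
func (n *Node) runFollower() {
	if time.Since(n.lastHeartbeat) > electionTimeout() {
		n.startElection()
	}
}

func (n *Node) startElection() {
	n.state = Candidate
	n.currentTerm++ // new term
	votes := 1      // vote for ourselves
	for _, p := range n.peers {
		if requestVote(p, n.currentTerm, n.id) {
			votes++
		}
	}
	// Majority of the full cluster (peers + self) wins the term.
	if votes > (len(n.peers)+1)/2 {
		n.state = Leader // start sending heartbeats to assert leadership
	}
}
```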

The Critical Property: Majority

Raft requires a majority (more than half) of nodes to elect a leader and to commit any change. This is what prevents split brain.

If you have 5 nodes and the network splits into a group of 3 and a group of 2, only the group of 3 can elect a leader. The group of 2 cannot reach majority and stays leaderless. When the partition heals, the smaller group accepts the leader from the larger group.

This is why Raft clusters typically have an odd number of nodes (3, 5, 7). Even counts waste a node: a 4-node cluster needs 3 votes for a majority, so it tolerates the same single failure as a 3-node cluster at extra cost.
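A quick way to see this is to compute the quorum arithmetic directly (majority = n/2 + 1, tolerated failures = n - majority):

```go
// Quorum arithmetic for cluster sizes 3 through 5.
package main

import "fmt"

func main() {
	for n := 3; n <= 5; n++ {
		majority := n/2 + 1
		tolerated := n - majority
		fmt.Printf("nodes=%d majority=%d tolerated failures=%d\n", n, majority, tolerated)
	}
}
```

Running it shows that 4 nodes tolerate one failure, exactly like 3, while 5 nodes tolerate two.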

Paxos

The original consensus algorithm, formalized by Leslie Lamport. Mathematically beautiful, infamously hard to understand. Two key phases (Prepare and Accept) and three roles (Proposers, Acceptors, Learners).

Paxos handles consensus on a single value. To handle a sequence of values (the typical need), you use Multi-Paxos.
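The acceptor side of single-decree Paxos fits in a few lines. This sketch leaves out the network layer, durable storage, and the proposer's rule of adopting the highest-numbered accepted value it hears back:

```go
// Sketch of a single-decree Paxos acceptor. Real deployments persist the
// promised/accepted state and run Multi-Paxos on top.
package paxossketch

type Acceptor struct {
	promised  int    // highest proposal number we have promised
	acceptedN int    // highest proposal number we have accepted
	acceptedV string // value of that accepted proposal
}

// Prepare handles phase 1: promise not to accept anything lower than n,
// and report any value already accepted so the proposer must reuse it.
func (a *Acceptor) Prepare(n int) (ok bool, prevN int, prevV string) {
	if n > a.promised {
		a.promised = n
		return true, a.acceptedN, a.acceptedV
	}
	return false, 0, ""
}

// Accept handles phase 2: accept proposal n with value v unless we have
// already promised a higher number.
func (a *Acceptor) Accept(n int, v string) bool {
	if n >= a.promised {
		a.promised = n
		a.acceptedN = n
		a.acceptedV = v
		return true
	}
	return false
}
```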

Used by: Google Chubby and Spanner. ZooKeeper uses a closely related protocol called Zab rather than Paxos itself.

In practice, most teams choose Raft over Paxos because it's easier to implement and reason about. Paxos remains important academically and runs in some battle-tested systems, but new systems usually pick Raft.

ZooKeeper / etcd / Consul

Most teams don't implement leader election from scratch. They use a coordination service that already does it correctly.

ZooKeeper: the granddaddy. Used by Kafka (older versions), HBase, Solr.
etcd: Raft-based. Backbone of Kubernetes.
Consul: Raft-based. Service discovery + leader election.

You "elect" a leader by trying to acquire a special lock or ephemeral node. The first node to succeed is the leader. The lock is held while the leader is alive (heartbeats); if the leader dies, the lock is released and someone else acquires it.

Failure Detection: The Quiet Hard Part

Leader election starts with "the leader is dead, let's elect a new one." But how do you know the leader is dead?

You don't, really. You only know it hasn't sent a heartbeat in some time. That could mean:

The leader actually crashed.
The network between you and the leader is partitioned.
The leader is overloaded and slow but still alive.
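All a node can really do is record the last heartbeat it saw and, after some timeout, suspect (not know) that the leader is gone. A minimal sketch, with illustrative names:

```go
// A heartbeat-based failure detector can only *suspect* that the leader
// is down; it cannot distinguish a crash from a slow network or an
// overloaded leader.
package detector

import (
	"sync"
	"time"
)

type Detector struct {
	mu       sync.Mutex
	lastSeen time.Time
}

// RecordHeartbeat is called whenever a heartbeat arrives from the leader.
func (d *Detector) RecordHeartbeat() {
	d.mu.Lock()
	d.lastSeen = time.Now()
	d.mu.Unlock()
}

// SuspectsLeaderDead returns true if no heartbeat has arrived within the
// timeout. "Suspects", not "knows": the leader may still be alive.
func (d *Detector) SuspectsLeaderDead(timeout time.Duration) bool {
	d.mu.Lock()
	defer d.mu.Unlock()
	return time.Since(d.lastSeen) > timeout
}
```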

If you elect a new leader while the old one is actually still running, you have two leaders. This is the source of split-brain bugs. The usual defenses:

Fencing tokens: every leader carries an incrementing term/epoch number, and storage rejects writes stamped with an old term (a sketch follows below).
STONITH (Shoot The Other Node In The Head): forcefully power off or kill the old leader before promoting a new one.
Majority-based protocols (Raft): a new leader cannot be elected without a majority, so at most one leader can exist per term.
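Here is what a fencing-token check might look like on the storage side; the types and names are illustrative, not any particular system's API:

```go
// Fencing tokens: the storage layer remembers the highest term it has
// seen and rejects writes stamped with an older one, so a deposed leader
// that is still running cannot corrupt state.
package fencing

import "errors"

var ErrStaleLeader = errors.New("write rejected: stale leadership term")

type Store struct {
	highestTerm int64
	data        map[string]string
}

func NewStore() *Store {
	return &Store{data: make(map[string]string)}
}

// Write accepts a write only if it carries a term at least as new as any
// term seen so far.
func (s *Store) Write(term int64, key, value string) error {
	if term < s.highestTerm {
		return ErrStaleLeader // old leader still trying to write
	}
	s.highestTerm = term
	s.data[key] = value
	return nil
}
```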

Where Leader Election Shows Up

Database primary failover: Postgres + Patroni, MongoDB replica sets, Redis Sentinel.
Distributed scheduling: only one cron job runs even though many workers are deployed.
Distributed locks: the "leader" of a lock is the one who acquired it.
Coordination services: ZooKeeper, etcd, Consul each elect their own internal leader.
Microservice singletons: only one instance of a service should run at a time.

The One Thing to Remember

Don't roll your own leader election unless you really, really know what you're doing. This is one of the canonical examples of "looks simple, is treacherous." Use etcd, ZooKeeper, or Consul: they have spent years in production getting the edge cases right. The right move is almost always to delegate this problem to one of them and focus your energy on whatever your actual product does.