Why Chat Is Harder Than It Looks
"Send a message" sounds trivial. WhatsApp, Slack, Discord, and Telegram make it look easy. Behind the scenes, real-time messaging is one of the hardest distributed systems problems. You need to deliver every message exactly once, in order, to potentially millions of recipients, with sub-second latency, even when phones go offline mid-conversation.
This article walks through how to design a chat system that handles all of that.
Requirements
Functional
1-on-1 and group messaging.
Online presence indicators.
Push notifications when the app is closed.
Message history synced across devices.
Non-Functional
Latency: end-to-end message delivery in under 200ms p99.
Reliability: messages are never lost. Effectively exactly-once delivery (at-least-once plus client-side deduplication).
Order: messages appear in the order they were sent, per conversation.
Scale: 500M daily active users, 100B messages per day.
Capacity Estimation
500M users sending an average of 200 messages per day:
Messages per second: 100B / 86400 = ~1.16M/sec average. Peak ~5M/sec.
Storage per day: 100B * 200 bytes/msg = 20 TB.
Concurrent connections: ~50M simultaneous WebSockets at peak.
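The arithmetic above can be checked in a few lines. The constants come straight from the estimates; variable names are illustrative:

```python
# Constants from the capacity estimates above (illustrative check).
DAU = 500_000_000                 # daily active users
MSGS_PER_USER_PER_DAY = 200
MSG_SIZE_BYTES = 200
SECONDS_PER_DAY = 86_400

msgs_per_day = DAU * MSGS_PER_USER_PER_DAY          # 100 billion
avg_msgs_per_sec = msgs_per_day / SECONDS_PER_DAY   # ~1.16 million
storage_per_day_tb = msgs_per_day * MSG_SIZE_BYTES / 1e12

print(f"{msgs_per_day / 1e9:.0f}B msgs/day")
print(f"{avg_msgs_per_sec / 1e6:.2f}M msgs/sec average")
print(f"{storage_per_day_tb:.0f} TB/day")
```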
The Core Challenge: Real-Time Delivery
Traditional HTTP is request-response. The client asks, the server answers. That's not how chat works. Chat needs the server to push messages to the client without being asked.
Three Approaches to "Push"
| Method | How | Latency | Resource Cost |
|---|---|---|---|
| Short Polling | Client asks every N seconds: "any new messages?" | Up to N seconds | High (wasted requests) |
| Long Polling | Client asks, server holds the connection until a message arrives. | Near-instant | Medium (one connection per user) |
| WebSockets | Persistent bidirectional connection. | Instant | Lowest per message (but the server must hold every connection open) |
Modern chat systems use WebSockets. They're efficient, low-latency, and bidirectional. The client and server keep one TCP connection open and exchange messages as they happen.
The Architecture
The main components, and what each one does:
Connection servers: hold the WebSockets.
Chat service: validates messages and fans them out.
Presence service: tracks online status.
Redis: active sessions and presence data.
Kafka: the message queue for asynchronous work.
Cassandra: message history.
Connection Servers
The connection layer holds millions of open WebSockets. Each WebSocket is tied to a specific connection server. When user A sends a message to user B, the system must figure out which connection server holds B's WebSocket and route the message there.
To do this:
1. When a user connects, register them in Redis: user_id -> connection_server_id.
2. When sending to that user, look up their server in Redis.
3. Forward the message to that specific server, which sends it through the WebSocket.
If the user is on multiple devices, they have multiple WebSockets, possibly on different servers. Send to all of them.
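The registration-and-routing steps above can be sketched as follows. A plain dict stands in for Redis, and `ConnectionRegistry` is an illustrative name, not a real API. Note the set of server IDs per user, which covers the multi-device case:

```python
# In-memory sketch of the connection registry; a dict stands in for Redis.
from collections import defaultdict

class ConnectionRegistry:
    def __init__(self):
        # user_id -> set of connection_server_ids (one entry per device)
        self._sessions = defaultdict(set)

    def register(self, user_id, server_id):
        """Step 1: on connect, record which server holds this WebSocket."""
        self._sessions[user_id].add(server_id)

    def unregister(self, user_id, server_id):
        """On disconnect, drop the mapping for that device."""
        self._sessions[user_id].discard(server_id)

    def route(self, user_id):
        """Step 2: look up every server holding a WebSocket for this user."""
        return set(self._sessions.get(user_id, set()))

registry = ConnectionRegistry()
registry.register("user_b", "conn-server-3")
registry.register("user_b", "conn-server-7")   # second device
print(registry.route("user_b"))                # both servers get the message
```

Step 3 (forwarding to those servers) is just an RPC per entry in the returned set.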
Storing Messages
Messages need to be persisted forever. Cassandra is the typical choice:
Write throughput: handles millions of writes per second.
Time-series friendly: messages are naturally ordered by timestamp.
Tunable consistency: you can pick between availability and consistency per query.
Schema:
```sql
CREATE TABLE messages (
    conversation_id UUID,
    message_id      TIMEUUID,
    sender_id       UUID,
    content         TEXT,
    sent_at         TIMESTAMP,
    PRIMARY KEY (conversation_id, message_id)
) WITH CLUSTERING ORDER BY (message_id DESC);
```
Partitioning by conversation_id means all messages in one conversation live on the same shard, ordered by time. Loading "last 50 messages in this chat" is a single fast query.
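Given that schema, the read looks something like this. The DESC clustering order means the newest 50 rows come back first, with no sort step:

```sql
-- Illustrative query: one partition, already sorted by message_id.
SELECT message_id, sender_id, content, sent_at
FROM messages
WHERE conversation_id = ?
LIMIT 50;
```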
Message Delivery Guarantees
This is the trickiest part of chat systems. You want exactly-once delivery: every message reaches its recipient once and only once. In practice this is built from at-least-once delivery plus deduplication on the client.
The straightforward path looks like this:
1. User A sends message via WebSocket.
2. Server saves it to Cassandra.
3. Server pushes it through B's WebSocket.
4. B's app shows the message.
What if B is offline? Save the message anyway. When B comes back, fetch unread messages.
What if step 3 fails (B's WebSocket dies mid-push)? B reconnects and pulls the messages they missed.
What if step 4 fails (B's app crashes before showing)? Use sequence numbers and acknowledgments to detect missing messages.
Sequence Numbers and ACKs
Every message in a conversation has a monotonically increasing sequence number. Both client and server track them. When B's app receives message #42, it sends back an ACK: "got #42." If the server never sees that ACK, it knows to retry.
If the client receives #41 then #43 (skipping #42), it knows to ask the server for the missing #42.
This is how chat systems guarantee no message is lost or shown out of order.
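The client-side bookkeeping can be sketched in a few lines. Names are illustrative; the example replays the #41/#43 scenario from above, including the retransmit of the missing #42:

```python
class ConversationState:
    """Per-conversation sequence tracking on the client (illustrative)."""
    def __init__(self, synced_through=0):
        self.highest = synced_through   # highest seq received so far
        self.missing = set()            # known gaps still to re-request

    def on_message(self, seq):
        """Record an incoming seq; return any new gaps to re-request."""
        if seq <= self.highest:
            # A retransmit filling a known gap, or a pure duplicate
            # (duplicates are dropped here -- this is the dedup half
            # of "effectively exactly-once").
            self.missing.discard(seq)
            return []
        new_gaps = list(range(self.highest + 1, seq))
        self.missing.update(new_gaps)
        self.highest = seq
        return new_gaps

state = ConversationState(synced_through=40)
print(state.on_message(41))   # [] : next in order, just ACK it
print(state.on_message(43))   # [42] : gap detected, re-request #42
print(state.on_message(42))   # [] : retransmit fills the gap
print(state.missing)          # set() : conversation is whole again
```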
Group Chat: The Fan-Out Problem
1-on-1 chat is easy. Send the message to one recipient. Group chat with 1000 members means fanning out one message to 1000 WebSockets. Across thousands of groups, this scales to billions of fan-outs per day.
Approach:
1. The chat service receives the message and writes it once to Cassandra (under the group's conversation_id).
2. It looks up the member list of the group.
3. For each online member, it routes to their connection server.
4. For each offline member, it queues a notification.
Optimization: don't fan out synchronously. Push the fan-out work into Kafka. Background workers consume from Kafka and deliver. This keeps the sender's request fast.
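The asynchronous fan-out above can be sketched with an in-memory queue standing in for Kafka. All names are illustrative, and delivery is reduced to appending to a list:

```python
# Sketch of async fan-out: the sender enqueues one job, a worker delivers.
import queue

fanout_queue = queue.Queue()   # stand-in for a Kafka topic
delivered = []                 # stand-in for per-recipient delivery

def handle_send(group_members, message):
    # Fast path: persist once (not shown), enqueue ONE fan-out job, return.
    fanout_queue.put((group_members, message))

def fanout_worker():
    # Background worker: expand the member list and deliver per recipient.
    members, message = fanout_queue.get()
    for member in members:
        delivered.append((member, message))

handle_send(["alice", "bob", "carol"], "hi all")
fanout_worker()
print(delivered)
```

The sender's request finishes after `handle_send`; the per-member work happens off the hot path.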
Presence Service
Showing "online" status sounds simple but is surprisingly expensive. Every change in status (user opens app, user closes app, user types) potentially triggers updates to everyone in their contact list.
Naive approach: store a flag per user. Updates broadcast to everyone who has them as a contact. Doesn't scale.
Better approach:
Store presence in Redis with a TTL. The client sends a heartbeat every 30 seconds; if the heartbeat stops, the entry expires and the user is "offline."
Subscribers (people viewing this user's status) get push notifications when status changes, instead of polling.
Don't broadcast to everyone in the contact list. Only update people currently viewing this user's profile.
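The heartbeat-with-TTL idea can be sketched without any external store: a dict of expiry timestamps stands in for Redis key expiry, and the 30-second TTL matches the heartbeat interval above. All names are illustrative:

```python
import time

HEARTBEAT_TTL = 30  # seconds; matches the client heartbeat interval

class Presence:
    def __init__(self):
        self._expiry = {}   # user_id -> unix time when the entry lapses

    def heartbeat(self, user_id, now=None):
        """Each heartbeat pushes the expiry forward by one TTL."""
        now = time.time() if now is None else now
        self._expiry[user_id] = now + HEARTBEAT_TTL

    def is_online(self, user_id, now=None):
        """Online means a heartbeat landed within the last TTL window."""
        now = time.time() if now is None else now
        return self._expiry.get(user_id, 0) > now

p = Presence()
p.heartbeat("alice", now=1000)
print(p.is_online("alice", now=1010))   # True: within the TTL
print(p.is_online("alice", now=1040))   # False: heartbeat stopped, expired
```

With real Redis this is one `SET key value EX 30` per heartbeat; expiry then happens server-side with no cleanup code.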
Push Notifications
When the app is closed, WebSockets are dead. To deliver new messages, you need OS-level push (Apple APNs, Google FCM).
The flow:
1. Message arrives. Recipient's WebSocket isn't connected.
2. Chat service queues a push notification.
3. Push service sends it via APNs or FCM.
4. User's phone receives it, app wakes up, fetches new messages.
APNs and FCM impose their own rate limits and have their own latency. Plan around them.
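The flow above reduces to one branch at delivery time. A sketch with stub names (`connected_users` and `push_queue` stand in for the Redis session lookup and the APNs/FCM sender):

```python
# Illustrative delivery decision: WebSocket if connected, else queue a push.
def deliver(user_id, message, connected_users, push_queue):
    if user_id in connected_users:
        return "websocket"                    # route to the connection server
    push_queue.append((user_id, message))     # hand off to the APNs/FCM sender
    return "push"

push_queue = []
print(deliver("alice", "hey", connected_users={"alice"}, push_queue=push_queue))
print(deliver("bob", "hey", connected_users={"alice"}, push_queue=push_queue))
print(push_queue)   # bob's message waits for the push service
```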
End-to-End Encryption
Modern chat (WhatsApp, Signal) encrypts messages so even the service provider can't read them. The implementation is a deep topic, but the high-level design:
Each device has a public/private key pair generated locally.
Public keys are uploaded to the server. Private keys never leave the device.
Sender encrypts each message using the recipient's public key before sending. (Real protocols such as Signal's derive fresh per-message keys from an initial key exchange, but the trust model is the same.)
Server only sees ciphertext. It routes and stores it but cannot decrypt.
Recipient decrypts with their private key.
Group chat uses sender keys: each member generates a sender key and distributes it to the other members over pairwise encrypted channels, rotating it when membership changes.
The One Thing to Remember
A chat system is the intersection of nearly every distributed systems concern: real-time push, persistent storage, fan-out, ordering, exactly-once delivery, presence, and offline handling. WhatsApp scale doesn't come from any single clever trick. It comes from getting all of these right at once. The boring parts (sequence numbers, ACKs, message queues) are what actually make a chat system feel reliable.