The Problem That Hides Real Complexity
Every modern app sends notifications. Your phone buzzes when a friend likes your photo, when your package ships, when there is a security alert on your bank account, when a coworker mentions you in a chat. Each of these arrives through a different channel: push, email, SMS, in-app, or sometimes web push.
It looks simple from the outside. "When event X happens, send a message to user Y." A weekend project. But scale it to billions of notifications per day across multiple products and you end up with one of the most subtle distributed systems problems in the industry.
The challenges pile up fast: deduplication, rate limiting, user preferences, quiet hours, batching, retries, channel-specific quirks (APNs vs FCM, SMS rate limits, email deliverability), and the dreaded "notification storm" when a bug fires the same event 10,000 times. Get any of these wrong and your users disable notifications entirely, which is worse than not having them.
This article walks through how to build a notification system that handles all of it.
Step 1: Requirements
Functional Requirements
Channels: deliver notifications over push, email, SMS, and in-app feed, triggered by product events.
Preferences: per-user, per-type, per-channel settings, quiet hours, and unsubscribes.
Content: templated, localized, versioned message text.
Safety: deduplication, rate limiting, and batching of sends.
History: a queryable log of what was sent, delivered, opened, and clicked.
Non-Functional Requirements
Latency: high-priority notifications (2FA, login alerts) deliver in under 5 seconds. Marketing notifications can take minutes.
Scale: billions of notifications per day. Tens of thousands per second at peak.
Channel constraints: APNs caps at ~1000 messages per second per certificate, SMS providers have per-number rate limits, email providers monitor sending reputation. The system must respect all of them.
Compliance: GDPR, CCPA, CAN-SPAM. Users can unsubscribe per channel, and unsubscribes must be honored quickly.
Step 2: Capacity Estimation
Call it 5 billion notifications per day. Spread evenly, that is roughly 58,000 per second; with a 5x peak-to-average ratio, about 290,000 per second at peak. Two implications. First, the system handles hundreds of thousands of operations per second at peak. Second, you cannot send 290,000 push notifications per second directly to APNs (limit is ~1000/sec per certificate). The system must batch, throttle, and parallelize across many credentials.
Step 3: The Layered Architecture
The right way to think about a notification system is as a pipeline of stages. Each stage has one job. Stages are connected via queues so they scale independently and absorb backpressure.
Producers (emit events) → Event Bus (topic per event type) → Notification Service (core orchestrator) → per-channel queues → channel workers → providers, with a Notification Log alongside for history and audit.
Why This Layering
Each stage is independent and replaceable.
Producers only know how to emit a high-level event ("OrderShipped" with order_id and user_id). They never call APNs directly. They never know whether the user wants email or push. They emit, and forget.
The Notification Service reads events and orchestrates everything: load user preferences, decide which channels to use, render templates, check deduplication, enqueue per-channel jobs.
Channel workers are dumb. They pull from their queue and send. Each channel has its own worker pool because each provider has its own quirks and rate limits.
This separation means a developer adding a new feature emits one event. The notification team controls everything else: routing, batching, throttling, A/B testing message text. Adding a new channel (say WhatsApp) is one new queue and one new worker pool.
Step 4: Templates and Localization
Notification text must never be hard-coded strings in your application. You will have hundreds of notification types and thousands of variants (per channel, per locale). Hard-coding means an engineering deploy for every copy change.
The right pattern: a template store, organized by:
(notification_type, channel, locale)
For example, (OrderShipped, push, en-US) might be:
"Your {{product}} just shipped! Estimated delivery: {{eta}}."
The Template Renderer takes the event payload and the template, fills in the placeholders, and produces the final message. Templates are versioned. A/B testing is a matter of choosing between template variants based on user bucket.
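A minimal sketch of the lookup-and-render path, assuming an in-memory store and regex-based placeholder filling (TEMPLATES, lookup, and render are illustrative names, not a real templating library):

import re

# Illustrative in-memory store keyed by (notification_type, channel, locale).
TEMPLATES = {
    ("OrderShipped", "push", "en-US"):
        "Your {{product}} just shipped! Estimated delivery: {{eta}}.",
}

def lookup(notification_type, channel, locale):
    key = (notification_type, channel, locale)
    # Fall back to English if the user's locale has no translation.
    return TEMPLATES.get(key) or TEMPLATES[(notification_type, channel, "en-US")]

def render(template, payload):
    # Replace each {{name}} placeholder with the matching event payload field.
    return re.sub(r"\{\{(\w+)\}\}", lambda m: str(payload[m.group(1)]), template)

template = lookup("OrderShipped", "push", "fr-FR")  # falls back to en-US
print(render(template, {"product": "desk lamp", "eta": "Tuesday"}))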
Localization means picking the user's preferred locale and falling back to English when a translation is missing. Templates are usually managed by product/marketing through a CMS, not edited in code.
Step 5: User Preferences
Every user has a preference matrix:
For each notification type and each channel, the user has a setting (on/off). Plus quiet hours (no notifications between 11 PM and 7 AM in their local time). Plus per-channel global toggles.
Storage: a fast key-value store. Redis is typical. Key per user, value is a compact JSON or hash. Read on every notification.
{
"user_id": 42,
"preferences": {
"OrderShipped": {"push": true, "email": true, "sms": false},
"Marketing": {"push": false, "email": false, "sms": false},
"SecurityAlert": {"push": true, "email": true, "sms": true}
},
"quiet_hours": {"start": "23:00", "end": "07:00", "tz": "America/Los_Angeles"},
"global_unsubscribe": false
}
The Preference Filter checks each candidate notification: does the user accept this type on this channel right now? If no, drop it.
Critical notifications (security alerts, 2FA) typically bypass quiet hours and even per-type preferences. Users can disable security notifications, but the default is on, and the system warns them.
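A sketch of the preference check, using the document above. The CRITICAL_TYPES set and the default-off choice for unknown combinations are assumptions, not settled policy:

# Types that bypass the per-type matrix (default on; the system warns users).
CRITICAL_TYPES = {"SecurityAlert", "TwoFactorCode"}

def allowed(prefs, notification_type, channel):
    if notification_type in CRITICAL_TYPES:
        return True  # bypasses quiet hours and per-type preferences
    if prefs.get("global_unsubscribe"):
        return False
    per_type = prefs.get("preferences", {}).get(notification_type, {})
    return per_type.get(channel, False)  # default off for unknown combinations

prefs = {
    "preferences": {"Marketing": {"push": False, "email": False, "sms": False}},
    "global_unsubscribe": False,
}
assert not allowed(prefs, "Marketing", "push")
assert allowed(prefs, "SecurityAlert", "sms")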
Step 6: Deduplication
The same event can fire multiple times. A buggy retry, a user double-clicking "ship order," an event re-processed after a crash. Without deduplication, the user gets spammed.
The pattern: every event has a unique fingerprint, computed from (event_type, user_id, key_fields). For example, OrderShipped:42:order_99. Before sending, the system checks if this fingerprint has been seen recently. If yes, skip. If no, mark seen and proceed.
Storage for the dedup cache: Redis with a TTL. Key per fingerprint, value is "1", TTL of 24 hours (or whatever window makes sense for the event type).
Important: deduplication runs after the preference check but before rendering and enqueueing to channel workers. Otherwise you waste work rendering a duplicate.
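A minimal sketch with redis-py; SET with NX and a TTL makes check-and-mark a single atomic operation:

import hashlib
import redis

r = redis.Redis()

def is_duplicate(event_type, user_id, key_field, ttl_seconds=86400):
    fingerprint = f"{event_type}:{user_id}:{key_field}"  # e.g. OrderShipped:42:order_99
    key = "dedup:" + hashlib.sha256(fingerprint.encode()).hexdigest()
    # SET NX is atomic: returns None if the key already existed (a duplicate),
    # otherwise sets it with the TTL and returns True.
    return r.set(key, 1, nx=True, ex=ttl_seconds) is None

if is_duplicate("OrderShipped", 42, "order_99"):
    pass  # seen within the 24-hour window: skip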
Idempotent Channel Sends
Even within a channel worker, retries can cause double-sends to the provider. Most providers (APNs, Twilio, SendGrid) accept a client-side message ID for deduplication on their side. Use it. Send the same message ID across retries; the provider handles dedup.
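One way to get such a stable ID is to derive it deterministically from the dedup fingerprint plus the channel, so every retry of the same logical message carries the same value (which provider field or header it goes into varies by provider; the naming here is illustrative):

import uuid

def message_id(fingerprint: str) -> str:
    # Deterministic: the same fingerprint yields the same UUID on every retry.
    return str(uuid.uuid5(uuid.NAMESPACE_URL, "notif:" + fingerprint))

print(message_id("OrderShipped:42:order_99:push"))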
Step 7: Throttling and Batching
If a chatty user gets 50 events in 10 minutes (replies to a popular comment), don't send 50 push notifications. Two strategies:
Per-User Rate Limit
Cap the number of notifications per channel per user per time window. "At most 1 push per 5 minutes per user for type X." Implemented as a token bucket per (user, channel, type) in Redis.
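A true token bucket needs an atomic refill, usually a small Lua script; a fixed-window counter with INCR and EXPIRE is a common simpler approximation, sketched here (key shape and limits are illustrative):

import redis

r = redis.Redis()

def over_limit(user_id, channel, notification_type, limit=1, window_seconds=300):
    key = f"rl:{user_id}:{channel}:{notification_type}"
    count = r.incr(key)
    if count == 1:
        r.expire(key, window_seconds)  # first hit starts the window
    return count > limit

if over_limit(42, "push", "CommentReply"):
    pass  # over the cap: drop, or hand off to batching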
Batching
Instead of dropping over-the-limit notifications, batch them into a single rollup. "12 new replies on your post" instead of 12 individual pushes. Requires holding events for a brief window and aggregating.
Real systems use both: per-user, per-type rate limits as the hard cap, with batching for high-volume types.
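The shape of the rollup, as an in-process sketch. Real systems hold the buffer in a stream processor or delayed queue so a worker crash does not lose it; every name here is illustrative:

import threading
from collections import defaultdict

WINDOW_SECONDS = 300
buffer = defaultdict(list)  # (user_id, notification_type) -> pending events
lock = threading.Lock()

def send_one(user_id, event):
    print("push:", event)  # stand-in for enqueueing a single notification

def send_rollup(user_id, text):
    print("push:", text)   # stand-in for enqueueing the aggregate

def on_event(user_id, notification_type, event):
    with lock:
        pending = buffer[(user_id, notification_type)]
        pending.append(event)
        if len(pending) == 1:
            # First event opens the window; flush when it closes.
            t = threading.Timer(WINDOW_SECONDS, flush, args=(user_id, notification_type))
            t.daemon = True
            t.start()

def flush(user_id, notification_type):
    with lock:
        events = buffer.pop((user_id, notification_type), [])
    if len(events) == 1:
        send_one(user_id, events[0])
    elif events:
        send_rollup(user_id, f"{len(events)} new replies on your post")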
Step 8: Priority Queues
A 2FA code MUST deliver in under 5 seconds. A weekly digest can wait an hour. Same channel, very different SLA.
Solution: separate queues by priority within each channel.
Critical queue: 2FA, security alerts, payment failures. Dedicated worker pool with no other work. Sub-second latency target.
Transactional queue: order shipped, your appointment is in 1 hour. Important but not life-or-death. Few-second latency target.
Marketing queue: promotions, recommendations, weekly digest. Eventually delivers. Minutes-to-hours latency target.
Worker pools are sized accordingly. Critical workers are over-provisioned (idle most of the time, ready to absorb spikes). Marketing workers are sized for cost efficiency.
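Routing is usually nothing more than picking a queue name from the notification type. A minimal sketch, with an assumed PRIORITY map and queue-naming scheme:

PRIORITY = {
    "TwoFactorCode": "critical",
    "SecurityAlert": "critical",
    "PaymentFailed": "critical",
    "OrderShipped": "transactional",
    "WeeklyDigest": "marketing",
}

def queue_for(notification_type, channel):
    tier = PRIORITY.get(notification_type, "marketing")  # unknown types sink to the bottom
    return f"{channel}.{tier}"

print(queue_for("TwoFactorCode", "push"))  # push.critical, served by its own pool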
Step 9: Channel Specifics
Push (APNs / FCM)
The trickiest channel. You don't own the delivery infrastructure; Apple (APNs) and Google (FCM) do. Both have:
Device tokens. Each app install generates a token. Tokens expire and rotate. Your system must store the latest token per device per user, refresh on every app launch, and prune dead tokens (provider returns "invalid token" responses).
Rate limits. APNs caps at ~1000 messages/sec per certificate. Workaround: use multiple certificates and round-robin across them (see the sketch after this list). FCM is more permissive but still has fairness throttling per app.
Batching. Both APIs support sending many messages in one connection. The worker should batch within reason.
Quiet hours from the platform. iOS Focus mode, Android Do Not Disturb. The system respects what the OS does.
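A sketch of the certificate pool with plain round-robin; the client construction is a stand-in for whatever APNs library is in use:

import itertools

def apns_client_for(cert_path):
    # Stand-in for constructing a real APNs HTTP/2 client from a certificate.
    class Client:
        def send(self, token, payload):
            print(f"sent via {cert_path} to {token}")
    return Client()

# One connected client per certificate in the pool.
clients = [apns_client_for(c) for c in ["cert_a.pem", "cert_b.pem", "cert_c.pem"]]
pool = itertools.cycle(clients)

def send_push(token, payload):
    next(pool).send(token, payload)  # round-robin keeps each cert under its ~1000/sec cap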
Email
The most regulated channel. Deliverability is everything. Tactics:
Authentication: SPF, DKIM, DMARC must all pass. Otherwise mail goes to spam.
Reputation: ISPs (Gmail, Outlook) track senders. If too many users mark your mail as spam, your reputation drops and future mail is filtered.
Bounces: hard bounces (invalid addresses) must be removed from the send list immediately. Soft bounces (full inbox) retried later.
Unsubscribe link: legally required (CAN-SPAM, GDPR). Must be honored within 10 days.
List warming: new senders ramp up volume gradually so ISPs trust them.
Most teams use SES, SendGrid, Mailgun, or Postmark to handle reputation and infrastructure. The notification system writes "send this email" jobs; the provider does the heavy lifting.
SMS
Expensive ($0.005-$0.05 per message). Per-country regulations vary wildly. Number reputation matters. Twilio, Vonage, MessageBird are typical providers. Strict rate limits per number; production systems pool many numbers and rotate.
In-App Feed
The simplest channel mechanically. The notification is just a record in a database, displayed when the user opens the app. Storage: Cassandra or similar, sharded by user_id, ordered by sent_at. New rows trigger a count badge ("3 unread") and the user clears them by viewing.
Step 10: Storage Choices
Notification Log: every notification ever sent, for audit and "view history" features. Cassandra is typical. Sharded by user_id, ordered by sent_at descending. Schema includes notification_id, user_id, type, channel, status, sent_at, delivered_at, opened_at, clicked_at.
User Preferences: Redis (hot path, read on every send) backed by a SQL database (source of truth). Cache TTL of a few minutes; invalidated on user changes.
Templates: versioned in object storage or a CMS. Cached in-process inside the Renderer. Updates propagate within minutes.
Device Tokens: SQL or key-value, keyed by device_id. Includes user_id, token, platform, last_used.
Dedup Cache: Redis with TTL. Hot, ephemeral, fast.
Delivery State: tracking which notifications were delivered, opened, clicked. Often in a separate analytics database (ClickHouse, Druid). The notification_id is the join key back to the log.
Step 11: Retry, Failure, and Dead Letters
Providers fail. APNs has outages. SES has rejection storms. Twilio has regional issues.
The pattern: each channel worker retries on transient failures with exponential backoff. After N failures (typically 3-5), the message goes to a dead letter queue (DLQ). DLQ messages are inspected by ops; permanent failures (invalid tokens) get cleaned up; recoverable failures get re-enqueued after the issue is fixed.
Important: retries must be idempotent. If the worker retries and the previous attempt actually went through, the user must not see two notifications. Use a unique message ID at the provider level (mentioned earlier).
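The retry loop inside a channel worker, sketched with in-process stand-ins for the provider call and the DLQ (in production the DLQ is a real queue with long retention):

import queue
import time

class TransientError(Exception):
    pass

dead_letter_queue = queue.Queue()

def send_to_provider(message):
    # Stand-in for the real provider call (APNs, Twilio, SES, ...).
    raise TransientError("provider unavailable")

def deliver(message, max_attempts=4):
    for attempt in range(max_attempts):
        try:
            send_to_provider(message)  # same client message ID on every attempt
            return
        except TransientError:
            time.sleep(2 ** attempt)   # exponential backoff: 1s, 2s, 4s, ...
    dead_letter_queue.put(message)     # give up; ops inspects the DLQ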
Step 12: The Notification Storm Problem
This is the worst-case scenario. A bug causes one event to fire 10,000 times for the same user. Without protection, the user gets 10,000 notifications. They unsubscribe forever. The PR is bad. The on-call engineer is awake all night.
Defenses, layered:
Per-user rate limits (covered earlier). Already cap most spam.
Per-event volume monitoring. If "OrderShipped" fires more than 1000 times in a minute for the same order_id, alert and quarantine (a minimal version is sketched after this list).
Global circuit breaker. If the notification service detects an anomalous burst (10x normal volume in 1 minute), pause non-critical types until ops investigates.
Replay buffer. Events are buffered briefly before processing, allowing late-arriving duplicate detection.
Manual override. Operators have a "stop everything" button for catastrophic failures.
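The per-event volume monitor from the list above, as a Redis counter sketch; the threshold, key shape, and quarantine hook are all assumptions:

import redis

r = redis.Redis()

def quarantine(event_type, entity_id):
    # Stand-in: page the on-call, pause the offending type, park the events.
    print(f"ALERT: {event_type} for {entity_id} exceeded threshold; quarantined")

def storm_check(event_type, entity_id, threshold=1000, window_seconds=60):
    key = f"storm:{event_type}:{entity_id}"
    count = r.incr(key)
    if count == 1:
        r.expire(key, window_seconds)  # first event starts the window
    if count > threshold:
        quarantine(event_type, entity_id)
        return False
    return True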
Every mature notification system has been burned by this. The defenses get added incident by incident.
Step 13: Edge Cases and Gotchas
Quiet Hours and Time Zones
"Don't notify between 11 PM and 7 AM" is per-user time zone, not server time zone. Each user has a tz. The system must compute "is it quiet hours for this user right now?" before sending.
For non-critical notifications during quiet hours, the system either delays (wait until quiet hours end) or drops (user already saw nothing happen). Most teams choose delay for transactional and drop for marketing.
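Computing "is it quiet hours for this user right now?" with the standard library's zoneinfo, reusing the preference fields from Step 5 and handling windows that cross midnight:

from datetime import datetime, time
from zoneinfo import ZoneInfo

def in_quiet_hours(quiet, now_utc=None):
    now_utc = now_utc or datetime.now(ZoneInfo("UTC"))
    local = now_utc.astimezone(ZoneInfo(quiet["tz"])).time()
    start = time.fromisoformat(quiet["start"])
    end = time.fromisoformat(quiet["end"])
    if start <= end:
        return start <= local < end
    return local >= start or local < end  # window crosses midnight, e.g. 23:00-07:00

quiet = {"start": "23:00", "end": "07:00", "tz": "America/Los_Angeles"}
print(in_quiet_hours(quiet))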
Invalid Tokens
Push tokens go stale. App uninstalled, OS upgraded, etc. The push provider returns "invalid token" responses. The worker must process these and remove the dead tokens; otherwise you keep trying to deliver to ghosts and waste throughput.
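The pruning loop is mundane but easy to forget. A sketch, assuming the provider response exposes a per-token status; a set stands in for the token store:

# Reason strings vary by provider; these cover APNs ("Unregistered",
# "BadDeviceToken") and FCM ("NotRegistered").
INVALID_STATUSES = {"Unregistered", "BadDeviceToken", "NotRegistered"}

def prune_tokens(responses, token_store):
    # responses: (device_token, status) pairs from a provider batch send.
    for token, status in responses:
        if status in INVALID_STATUSES:
            token_store.discard(token)  # never try this token again

token_store = {"tok_live", "tok_dead"}
prune_tokens([("tok_dead", "Unregistered"), ("tok_live", "ok")], token_store)
print(token_store)  # {'tok_live'}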
Provider Outages
If APNs is down, the queue fills up. Critical notifications must not be lost. Solutions: long retention on the queue, fallback to email if push fails for too long, and alerts to ops.
Privacy and PII
Notifications contain user data. The notification log must be access-controlled. PII fields (full names, addresses) might be redacted in some logs. Templates that include sensitive info must consider where the rendered text lives.
Compliance: Unsubscribe Honored Quickly
When a user unsubscribes from email, the unsubscribe must propagate to the preferences system within minutes. Otherwise emails sent in the next hour are technically violations.
Step 14: Recap of Key Decisions
Decouple producers from channels via an event bus. One event, many possible channel deliveries.
Templates with versioning and locale. Marketing changes copy without engineering deploys.
Preferences as a fast read. Redis-backed, checked on every send.
Per-event fingerprint dedup. Stops same-event-twice spam.
Per-user, per-type rate limits. Stops chatty-user spam.
Priority queues per channel. 2FA codes don't wait behind weekly digests.
Idempotent channel sends. Retries don't cause double delivery.
Notification storm defenses. Layered, because they will all fail at some point.
Provider-aware workers. Each channel worker knows its provider's quirks (APNs cert pools, SMS number rotation, email reputation).
The One Thing to Remember
A notification system is more about respecting users than about technical scale. The hard problems are not throughput or fan-out; they are preferences, deduplication, batching, quiet hours, compliance, and storm prevention. Get those wrong and users disable notifications entirely, which is worse than not having them at all. The architecture (event bus, layered pipeline, per-channel queues) exists to make these policies easy to iterate on without rewriting infrastructure. Build the layers right and you can change copy, add channels, A/B test, and clean up storms without ever touching the messaging layer. Build them wrong and every change cascades through your codebase.