The Problem

Your transactional database holds the source of truth. But many other systems need to see its changes: a search index, a data warehouse, a cache, a downstream microservice, an audit log. The two naive approaches are to dual-write from your application (write to the DB and publish an event) or to poll the database periodically for changes.

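Both have problems. Dual-writes risk inconsistency (the DB write succeeds and the event publish fails, or vice versa). Polling adds latency and query load, and it sees only the latest state of each row: hard deletes and intermediate updates between polls go unnoticed.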
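To make the polling gap concrete, here is a minimal sketch using an in-memory SQLite table. The users table, the updated_at column, and the poll_changes helper are all assumptions made up for this illustration, not part of any particular system.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, updated_at INTEGER)"
)


def poll_changes(conn, since):
    """Return rows modified after `since`, as a timestamp-based poller would."""
    cur = conn.execute(
        "SELECT id, name, updated_at FROM users "
        "WHERE updated_at > ? ORDER BY updated_at",
        (since,),
    )
    return cur.fetchall()


# t=1: two rows are inserted, and the first poll sees both.
conn.execute("INSERT INTO users VALUES (1, 'Ada', 1)")
conn.execute("INSERT INTO users VALUES (2, 'Bob', 1)")
first_poll = poll_changes(conn, since=0)

# Between polls: row 1 is updated twice and row 2 is deleted.
conn.execute("UPDATE users SET name = 'Ada L.', updated_at = 2 WHERE id = 1")
conn.execute("UPDATE users SET name = 'Ada Lovelace', updated_at = 3 WHERE id = 1")
conn.execute("DELETE FROM users WHERE id = 2")
second_poll = poll_changes(conn, since=1)

print(first_poll)
print(second_poll)  # only the final state of row 1
```

The second poll returns a single row. The intermediate value 'Ada L.' and the deletion of row 2 leave no trace in the query result.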

Change Data Capture (CDC) solves this by reading the database's transaction log directly. Every committed change becomes an event you can stream to anyone.

How CDC Works

Most transactional databases keep a transaction log (Postgres WAL, MySQL binlog, SQL Server transaction log) that records every modification. The database uses this log for crash recovery and replication. CDC reads it for downstream use.

A CDC tool (like Debezium):

1. Connects to the database much as a replica would (for example, through a replication slot in Postgres or as a binlog client in MySQL).
2. Reads the transaction log entries.
3. For each entry (insert, update, delete), produces an event.
4. Publishes the event to Kafka or another message bus.
5. Tracks its position in the log so it can resume after restart.
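
The steps above can be sketched as a toy loop. This is not how Debezium is implemented; LogEntry, ToyConnector, and the in-memory lists standing in for the transaction log and the Kafka topic are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class LogEntry:
    lsn: int                 # position of this entry in the log
    op: str                  # "c", "u", or "d"
    table: str
    before: Optional[dict]
    after: Optional[dict]


@dataclass
class ToyConnector:
    log: list                                      # stand-in for WAL / binlog
    published: list = field(default_factory=list)  # stand-in for a Kafka topic
    committed_lsn: int = 0                         # last position recorded

    def run_once(self, max_entries=None):
        """Read new entries, build events, publish them, record the position."""
        pending = [e for e in self.log if e.lsn > self.committed_lsn]
        for entry in pending[:max_entries]:
            event = {
                "op": entry.op,
                "before": entry.before,
                "after": entry.after,
                "source": {"table": entry.table, "lsn": entry.lsn},
            }
            self.published.append(event)    # publish first ...
            self.committed_lsn = entry.lsn  # ... then record the position


log = [
    LogEntry(101, "c", "users", None, {"id": 42, "name": "Old Name"}),
    LogEntry(102, "u", "users", {"id": 42, "name": "Old Name"},
             {"id": 42, "name": "New Name"}),
    LogEntry(103, "d", "users", {"id": 42, "name": "New Name"}, None),
]

topic = []
first = ToyConnector(log, published=topic)
first.run_once(max_entries=2)  # process two entries, then "stop"

# Restart: a new connector instance resumes from the saved position.
second = ToyConnector(log, published=topic, committed_lsn=first.committed_lsn)
second.run_once()

print([e["source"]["lsn"] for e in topic])
```

Because the sketch publishes before recording its position, a crash between those two lines would republish one event after restart. That ordering gives at-least-once delivery, a common trade-off that consumers usually handle by being idempotent.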

The Event Format

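A CDC event typically contains the following (a simplified, Debezium-style envelope; the // comments are annotations, not valid JSON):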

{
  "op": "u",                    // c=create, u=update, d=delete
  "ts_ms": 1714896000000,
  "before": {                   // previous state (for u, d)
    "id": 42,
    "name": "Old Name"
  },
  "after": {                    // new state (for c, u)
    "id": 42,
    "name": "New Name"
  },
  "source": {
    "db": "myapp",
    "table": "users",
    "lsn": 12345               // log position
  }
}

Consumers can react to specific operations: only inserts, only deletes, only changes to certain columns.
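
As a rough sketch of that kind of filtering, the helper below accepts or rejects an event by operation and by which columns changed. The function names changed_columns and wants are assumptions for this example, and the column check presumes the event carries a full "before" image, which depends on how the source database and connector are configured.

```python
def changed_columns(event):
    """Columns whose value differs between before and after (updates only)."""
    if event["op"] != "u":
        return set()
    before = event["before"] or {}
    after = event["after"] or {}
    return {k for k in before.keys() | after.keys() if before.get(k) != after.get(k)}


def wants(event, ops=None, columns=None):
    """True if the event matches the operation filter and the column filter.

    When `columns` is given, only updates that touch one of those columns pass.
    """
    if ops is not None and event["op"] not in ops:
        return False
    if columns is not None and not (changed_columns(event) & set(columns)):
        return False
    return True


event = {
    "op": "u",
    "before": {"id": 42, "name": "Old Name"},
    "after": {"id": 42, "name": "New Name"},
}

print(wants(event, ops={"u"}, columns={"name"}))
print(wants(event, ops={"d"}))
print(wants(event, columns={"email"}))
```

The first call accepts the event because it is an update that changed name. The other two reject it: it is not a delete, and email did not change.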

Why CDC Beats Dual-Write

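Dual-write: the app writes to the DB, then publishes an event. If the app crashes between the two steps, the row is committed but the event is never published, and downstream systems silently diverge from the database.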

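CDC: only one write (to the DB). Events are derived from the transaction log, which is the source of truth, so every committed change is eventually published in commit order. Delivery is typically at-least-once, so consumers should tolerate a duplicate event after a connector restart.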

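CDC also relates closely to the outbox pattern, which exists specifically to fix dual-write inconsistency. When consumers can work with row-level changes, capturing the tables directly removes the need for an outbox. When they need domain-level events, the application writes those events to an outbox table in the same transaction as the data change, and CDC relays that table (Debezium ships an outbox event router for this purpose).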
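
The difference between the two write paths can be simulated in a few lines. This is only a toy model: the lists stand in for a database, its log, and a message bus, and Crash, dual_write, and relay_from_log are hypothetical names.

```python
class Crash(Exception):
    """Simulates the application process dying."""


def dual_write(db, bus, row, crash_between=False):
    db.append(row)                         # write 1: commit to the database
    if crash_between:
        raise Crash("died after commit, before publish")
    bus.append({"op": "c", "after": row})  # write 2: publish the event


def relay_from_log(db_log, bus):
    """Separate CDC process: publish an event for each entry not yet published."""
    for row in db_log[len(bus):]:
        bus.append({"op": "c", "after": row})


# Dual-write with a crash: the row exists, the event does not.
db, bus = [], []
try:
    dual_write(db, bus, {"id": 1}, crash_between=True)
except Crash:
    pass
dual_result = (len(db), len(bus))

# Log-based: the app only commits; the relay catches up whenever it runs.
db_log, cdc_bus = [], []
db_log.append({"id": 1})         # the application's single write
relay_from_log(db_log, cdc_bus)  # runs independently of the application
cdc_result = (len(db_log), len(cdc_bus))

print(dual_result, cdc_result)
```

In the dual-write run the database holds one row and the bus holds no events. In the log-based run the counts match, because the event is derived from what was committed rather than published separately.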

Architecture

CDC Pipeline

Source (Postgres / MySQL: WAL / binlog)
  -> reads log ->
CDC (Debezium / Maxwell)
  -> publishes events ->
Bus (Kafka)
  -> consumed by ->
Sinks (Search Index, Data Warehouse, Other Services)

Major Implementations

Debezium: the popular open-source choice, with connectors for Postgres, MySQL, MongoDB, SQL Server, Oracle, Cassandra, and others. Usually deployed as Kafka Connect source connectors and widely used in production.
Maxwell: simpler, MySQL-only. Output to Kafka, Kinesis, Pub/Sub.
AWS DMS: AWS's managed Database Migration Service, which supports ongoing replication (CDC) in addition to one-off migrations.
Built-in: some databases have native streaming. PostgreSQL logical replication, MongoDB change streams, DynamoDB Streams.
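
For a sense of what setting up Debezium involves, the sketch below builds a registration payload for a Postgres connector of the kind you would POST to the Kafka Connect REST API (the /connectors endpoint). The connector name, host, credentials, and table list are placeholders. The property names follow Debezium 2.x as best understood here (older releases used database.server.name instead of topic.prefix), so check them against the documentation for your version.

```python
import json

# Placeholder values throughout; only the property names are meant to be real.
connector = {
    "name": "users-connector",  # hypothetical connector name
    "config": {
        "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
        "plugin.name": "pgoutput",           # Postgres logical decoding plugin
        "database.hostname": "localhost",
        "database.port": "5432",
        "database.user": "cdc_user",
        "database.password": "change-me",
        "database.dbname": "myapp",
        "topic.prefix": "myapp",             # prefix for the Kafka topic names
        "table.include.list": "public.users",
    },
}

payload = json.dumps(connector, indent=2)
print(payload)
```

The payload is only printed here rather than sent, so the sketch runs without a Kafka Connect cluster.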

Common Use Cases

Database replication: Postgres to MySQL, or different versions of the same DB.
Search index sync: Postgres to Elasticsearch. New row in DB = new document in index.
Cache invalidation: when a row changes, push an event that invalidates relevant cache entries.
Microservice integration: the order service updates its DB; the analytics service reads CDC events to update its own state.
Audit logs: every change recorded immutably.
Data warehousing: stream operational changes into the warehouse continuously instead of nightly batch dumps.
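
Several of these use cases reduce to the same consumer shape: apply each event to a keyed store. Here is a minimal sketch with a plain dict standing in for a search index or cache. apply_event is a hypothetical helper, and the "r" operation is the snapshot-read code used in Debezium-style envelopes.

```python
def apply_event(index, event, key="id"):
    """Upsert on create/update/snapshot-read, remove on delete."""
    op = event["op"]
    if op in ("c", "u", "r"):
        doc = event["after"]
        index[doc[key]] = doc
    elif op == "d":
        index.pop(event["before"][key], None)  # no error if already gone
    return index


events = [
    {"op": "c", "before": None, "after": {"id": 1, "name": "Ada"}},
    {"op": "c", "before": None, "after": {"id": 2, "name": "Bob"}},
    {"op": "u", "before": {"id": 1, "name": "Ada"},
     "after": {"id": 1, "name": "Ada Lovelace"}},
    {"op": "d", "before": {"id": 2, "name": "Bob"}, "after": None},
]

index = {}
# Replay the last two events to show that duplicates are harmless.
for e in events + events[-2:]:
    apply_event(index, e)

print(index)
```

Writing the handler as an idempotent upsert or delete means a redelivered event leaves the store unchanged, which fits the at-least-once delivery most pipelines provide.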

The Snapshot Problem

CDC starts capturing from "now." But what about the existing data when you first set up CDC?

Two approaches:

Initial snapshot: CDC reads the entire current state once (a full table scan) and produces an event for every existing row (Debezium marks these as read events, op "r", rather than creates), then switches to log streaming. This is Debezium's default behavior.
Skip snapshot: only stream changes from now on. Existing data is missed; can be backfilled separately.
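
The handoff from snapshot to streaming can be sketched as follows. The key idea is to note the log position before scanning and then stream only entries after it. The snapshot and stream functions and the dict and list stand-ins are assumptions for illustration; real tools also need a consistent read of the tables during the scan, which this toy ignores.

```python
def snapshot(table, log):
    """Emit a read event for every existing row, tagged with the log position."""
    snapshot_lsn = log[-1]["lsn"] if log else 0
    events = [
        {"op": "r", "before": None, "after": dict(row), "lsn": snapshot_lsn}
        for row in table.values()
    ]
    return events, snapshot_lsn


def stream(log, after_lsn):
    """Return log entries recorded after the given position."""
    return [e for e in log if e["lsn"] > after_lsn]


# Existing data, plus the log entries that originally created it.
table = {1: {"id": 1, "name": "Ada"}, 2: {"id": 2, "name": "Bob"}}
log = [
    {"op": "c", "before": None, "after": {"id": 1, "name": "Ada"}, "lsn": 1},
    {"op": "c", "before": None, "after": {"id": 2, "name": "Bob"}, "lsn": 2},
]

snap_events, position = snapshot(table, log)

# A change arrives after the snapshot was taken.
table[1] = {"id": 1, "name": "Ada Lovelace"}
log.append({"op": "u", "before": {"id": 1, "name": "Ada"},
            "after": {"id": 1, "name": "Ada Lovelace"}, "lsn": 3})

all_events = snap_events + stream(log, position)
print([(e["op"], e["lsn"]) for e in all_events])
```

The consumer sees two read events for the existing rows followed by the one update. The two older insert entries are not replayed, because the snapshot already covers them.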

Schema Changes

What happens when you add a column? Drop one? Rename a table?

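Some CDC tools capture these as events too (DDL or schema-change events); others simply start emitting row events in the new shape. Either way, downstream consumers need to handle schema evolution. A common pattern is a schema registry alongside the event stream that consumers reference.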
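
On the consumer side, one simple way to tolerate added and dropped columns is to project each row onto the fields the consumer knows about, filling gaps with defaults. This is a sketch of that one idea, not a substitute for a schema registry; KNOWN_DEFAULTS and project are hypothetical names.

```python
# Fields this consumer understands, with the default used when one is absent.
KNOWN_DEFAULTS = {"id": None, "name": "", "email": None}


def project(row, defaults=KNOWN_DEFAULTS):
    """Keep only known fields; fill missing ones with their defaults."""
    return {k: row.get(k, d) for k, d in defaults.items()}


old_shape = {"id": 42, "name": "Old Name"}  # written before "email" existed
new_shape = {"id": 42, "name": "New Name",
             "email": "a@example.com", "phone": "555-0100"}  # extra column

print(project(old_shape))
print(project(new_shape))
```

The older row gains a default email, and the unknown phone column in the newer row is ignored, so the consumer keeps working across both shapes.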

The One Thing to Remember

CDC turns your database's transaction log into a real-time event stream that any downstream system can consume. It eliminates the dual-write problem and is the cleanest way to integrate a transactional database with the rest of your data infrastructure. If you've ever found yourself writing application code to "publish an event after writing to the database," there's a good chance CDC is what you actually want.