What "Real-Time" Actually Means

Real-time isn't a single thing. It's a spectrum of latency requirements:

Hard real-time: sub-millisecond. Trading systems, robotic control. Different architecture entirely.
Near real-time: seconds. Live dashboards, fraud detection, content moderation. This is what most "real-time data pipelines" mean.
Streaming batch: minutes. Analytics that are less time-critical but should still be fresher than a nightly batch.

This article focuses on near-real-time: events arrive at the pipeline, get processed, and become queryable within a few seconds.

The Four-Layer Architecture

[Diagram: Real-Time Pipeline Layers]

1. Sources (app events / database CDC / IoT)
      | ingest
      v
2. Buffer (Kafka / Kinesis / Pub/Sub)
      | processed by
      v
3. Stream Compute (Flink / Spark Streaming / Kafka Streams)
      | written to
      v
4. Sinks (real-time store, analytics DB, apps / dashboards)

Layer 1: Sources

Where events come from. Three main types:

Application events: your services emit them directly. "User signed up." "Order placed."
Database changes: CDC reads transaction logs and produces events. See the CDC article.
External feeds: partner APIs, IoT devices, webhooks.

The pattern: each source emits events as soon as they happen, fire-and-forget into the buffer: the producer doesn't wait on downstream consumers, and source services don't know or care who's downstream. A minimal sketch of this pattern follows.
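Here is a minimal fire-and-forget producer sketch using the confluent-kafka Python client. The topic name, event shape, and broker address are illustrative assumptions:

```python
import json
import time

from confluent_kafka import Producer

producer = Producer({"bootstrap.servers": "localhost:9092"})

def emit(event_type: str, payload: dict) -> None:
    """Serialize the event and hand it to the client's local queue; don't wait for the broker."""
    event = {"type": event_type, "ts": time.time(), **payload}
    producer.produce(
        "user-events",                        # assumed topic name
        key=str(payload.get("user_id", "")),
        value=json.dumps(event).encode(),
    )
    producer.poll(0)  # serve delivery callbacks without blocking

emit("user_signed_up", {"user_id": 42, "plan": "free"})
producer.flush()  # drain the local queue on shutdown
```

Note that produce() only enqueues the event in the client's local buffer; the flush() on shutdown is what guarantees the queue drains before the process exits.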

Layer 2: Buffer

The decoupling layer. Almost always Kafka or a managed equivalent (AWS Kinesis, GCP Pub/Sub, Azure Event Hubs).

Why it matters:

Producers and consumers run at different rates. The buffer absorbs spikes.
Multiple consumers read independently from the same source.
Consumers can be added or removed without producer changes.
Replay is possible. Bad processing logic? Reset the consumer offset and reprocess (see the sketch after this list).
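A minimal replay sketch with confluent-kafka: rewind a consumer to the beginning of a topic so the fixed logic re-reads history. The topic, group name, and partition list are illustrative assumptions:

```python
from confluent_kafka import Consumer, TopicPartition, OFFSET_BEGINNING

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "enrichment-v2",        # assumed consumer group
    "auto.offset.reset": "earliest",
})

def rewind(consumer: Consumer, topic: str, partitions: list[int]) -> None:
    """Assign every partition at OFFSET_BEGINNING so the next poll re-reads history."""
    consumer.assign([TopicPartition(topic, p, OFFSET_BEGINNING) for p in partitions])

def reprocess(msg) -> None:
    print("reprocessing", msg.value())  # the fixed logic would run here

rewind(consumer, "user-events", [0, 1, 2])  # assumed topic and partition count
while True:
    msg = consumer.poll(1.0)
    if msg is None:
        continue
    if msg.error():
        continue                        # real code would log and handle errors
    reprocess(msg)
```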

Layer 3: Stream Compute

The brains. Where the actual transformation happens. Common operations:

Filtering: drop irrelevant events.
Enrichment: add data from external sources (geo lookup, user profile).
Aggregation: count events per window. "Page views per minute."
Pattern detection: "user did X then Y within 30 seconds." Fraud, funnels.
Joins: merge two streams (clicks + impressions) or a stream with a table (events + user data).

The framework choice (Flink, Spark Streaming, Kafka Streams) depends on complexity, scale, and team expertise. Flink leads for complex stateful workloads; Kafka Streams suits simpler jobs embedded directly in application services. A framework-free sketch of the two simplest operations follows.
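To make filtering and enrichment concrete, here is a framework-free consume-transform-produce loop in Python with confluent-kafka. Topic names, the group id, and the in-memory profile lookup are illustrative assumptions; a real pipeline would use a proper lookup store and a stream framework for state and fault tolerance:

```python
import json

from confluent_kafka import Consumer, Producer

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "click-enricher",        # assumed group id
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["clicks"])           # assumed input topic
producer = Producer({"bootstrap.servers": "localhost:9092"})

USER_PROFILES = {42: {"country": "DE"}}  # stand-in for a real lookup store

while True:
    msg = consumer.poll(1.0)
    if msg is None or msg.error():
        continue
    event = json.loads(msg.value())
    if event.get("type") != "click":     # filtering: drop irrelevant events
        continue
    profile = USER_PROFILES.get(event.get("user_id"), {})
    event["country"] = profile.get("country", "unknown")  # enrichment
    producer.produce("clicks-enriched", json.dumps(event).encode())  # assumed output topic
    producer.poll(0)
```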

Layer 4: Sinks

Where the processed data lands. Multiple sinks fed by the same pipeline are common; a fan-out sketch follows the list.

Real-time analytics store: ClickHouse, Druid, Pinot. Optimized for fast aggregation queries on freshly-arrived data.
Search index: Elasticsearch. Stream of "document updated" events keeps the index in sync.
Cache: Redis. Materialized state used by serving apps.
Database: for OLTP applications. The pipeline can write rollups or denormalized views.
Lake / Lakehouse: for long-term storage and batch reprocessing. See the Lakehouse article.
Other apps: via webhooks or API calls. The pipeline triggers downstream actions.
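A sketch of that fan-out: one processed event delivered to several sinks behind a common interface. The sink classes here are illustrative stubs, not real client code:

```python
from typing import Protocol

class Sink(Protocol):
    def write(self, event: dict) -> None: ...

class AnalyticsSink:
    def write(self, event: dict) -> None:
        print("insert into analytics store:", event)  # e.g. a ClickHouse/Druid/Pinot client

class CacheSink:
    def write(self, event: dict) -> None:
        print("set in cache:", event)                 # e.g. Redis materialized state

SINKS: list[Sink] = [AnalyticsSink(), CacheSink()]

def deliver(event: dict) -> None:
    """Every sink gets the same processed event; isolate failures per sink in real code."""
    for sink in SINKS:
        sink.write(event)

deliver({"user_id": 42, "page_views_1m": 17})
```

Keeping sinks behind one interface lets you add or remove destinations without touching the compute logic; per-sink failure handling (retries, dead-lettering) is where real pipelines spend most of their effort.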

Windowing

Most real-time analytics involves windows: counts over the last 1 minute, top items in the last hour, etc.

Tumbling window: non-overlapping. 12:00-12:01, 12:01-12:02. Simple (sketched after this list).
Hopping window: fixed size, overlapping. Every 30s, emit a count over the last 1m; each event lands in multiple windows.
Sliding window: moves continuously. Always "the last 1m from right now."
Session window: based on user activity. A session stays open until a gap of inactivity (say, 30 minutes) closes it.
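A minimal tumbling-window counter in plain Python: events are bucketed by flooring their event-time timestamp to a window boundary. The 60-second window is an illustrative choice; real frameworks layer state backends, watermarks, and fault tolerance on top of this idea:

```python
from collections import defaultdict

WINDOW_SECONDS = 60
counts: dict[int, int] = defaultdict(int)

def window_start(event_ts: float) -> int:
    """Floor the event timestamp to its tumbling-window boundary."""
    return int(event_ts // WINDOW_SECONDS) * WINDOW_SECONDS

def on_event(event_ts: float) -> None:
    counts[window_start(event_ts)] += 1

for ts in [100.0, 119.0, 161.0]:  # two events in [60, 120), one in [120, 180)
    on_event(ts)
print(dict(counts))               # {60: 2, 120: 1}
```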

Late Data and Watermarks

Events don't always arrive in order. A mobile app might queue events offline and send them later.

Handling: use event time (when the event actually happened), not processing time (when it arrived). Frameworks track watermarks: a moving estimate of event-time progress, roughly "we have probably seen everything up to time T." Windows close when the watermark passes their end. Events that arrive after that either go to a side output or trigger an update of the already-emitted window. A minimal sketch follows.
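Here the watermark trails the maximum observed event time by a fixed allowed lateness; the 10-second bound and the function names are illustrative assumptions:

```python
ALLOWED_LATENESS = 10.0  # seconds; an assumed bound on how late events can be

max_event_time = float("-inf")

def update_watermark(event_ts: float) -> float:
    """Advance the high-water mark and return the current watermark."""
    global max_event_time
    max_event_time = max(max_event_time, event_ts)
    return max_event_time - ALLOWED_LATENESS

def is_late(event_ts: float, watermark: float) -> bool:
    """An event older than the watermark arrived after its window closed."""
    return event_ts < watermark

wm = update_watermark(1000.0)  # watermark is now 990.0
print(is_late(985.0, wm))      # True: route to a side output or update the window
print(is_late(995.0, wm))      # False: still within the lateness bound
```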

Operational Concerns

Backpressure: when downstream is slower than upstream. Kafka absorbs the lag, but you must monitor and provision capacity.
Schema evolution: events change format over time. Use a schema registry; consumers handle multiple versions.
Replay: design the pipeline to be re-runnable. Bug fixed in the code? Reset the offset and reprocess.
Monitoring: consumer lag, throughput, error rates. Lag is the most important metric. If it's growing, something is wrong.
Idempotency: consumers must handle duplicate events safely (a minimal dedupe sketch follows this list). See the Idempotency article.
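A minimal idempotency sketch: deduplicate on an event ID before applying side effects. The in-memory set stands in for a durable store (for example a unique-key insert or Redis SETNX), and the event shape is an illustrative assumption:

```python
seen_ids: set[str] = set()  # stand-in for a durable dedupe store

def apply_side_effects(event: dict) -> None:
    print("updating rollup for", event["event_id"])

def process_once(event: dict) -> None:
    event_id = event["event_id"]
    if event_id in seen_ids:   # duplicate delivery: skip safely
        return
    apply_side_effects(event)
    seen_ids.add(event_id)     # real systems must make this atomic with the side effect

process_once({"event_id": "evt-1"})
process_once({"event_id": "evt-1"})  # second delivery is a no-op
```

In a real consumer, the dedupe record and the side effect need to be committed together (or carefully ordered) so that a crash between the two can't apply the effect twice.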

The One Thing to Remember

A real-time data pipeline is a chain of decoupled stages: source, buffer, compute, sink. Get the buffer right (Kafka or equivalent) and most operational issues shrink. Pick a stream framework that matches your team's expertise. Always design for replay; you will need it. The hardest part isn't the latency; it's keeping the pipeline correct, observable, and maintainable as it evolves.