What Kafka Actually Is
Kafka is often described as a "distributed message queue," but that undersells it. It's better thought of as a distributed append-only log. Messages get appended; consumers read sequentially. Multiple consumers can read the same log independently. Logs are persistent; old messages stay around for hours, days, or forever, depending on configuration.
This simple primitive (durable log, shared by many readers) turns out to be the right foundation for most real-time data infrastructure.
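That primitive is small enough to sketch in a few lines. The following toy model (illustrative only, not Kafka's actual API; all names are invented) shows the two properties that matter: appends go to the end of a durable log, and each reader tracks its own position independently.

```python
# Toy append-only log with independent readers, each holding its own offset.

class Log:
    def __init__(self):
        self.messages = []              # the durable, ordered log

    def append(self, msg):
        self.messages.append(msg)
        return len(self.messages) - 1   # offset of the new message

class Reader:
    def __init__(self, log):
        self.log = log
        self.offset = 0                 # this reader's own position

    def poll(self):
        batch = self.log.messages[self.offset:]
        self.offset = len(self.log.messages)
        return batch

log = Log()
for event in ["order-1", "order-2", "order-3"]:
    log.append(event)

search_indexer = Reader(log)
fraud_detector = Reader(log)

print(search_indexer.poll())   # ['order-1', 'order-2', 'order-3']
print(fraud_detector.poll())   # same messages: the log is shared, offsets are not
```

Note that reading does not remove anything from the log, which is exactly where Kafka departs from a traditional message queue.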
Core Concepts
Topics
A topic is a named log: orders, page_views, payments. Producers write to topics; consumers read from them.
Partitions
Each topic is split into one or more partitions. Each partition is an ordered, immutable sequence of messages. Partitions are the unit of parallelism: different partitions can be read in parallel.
Messages within a partition are strictly ordered; across partitions there is no ordering guarantee. So if you need ordering, related messages must go to the same partition (typically by hashing a key like user_id).
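The key-to-partition mapping is a simple hash-and-modulo. A minimal sketch (Kafka's default partitioner actually uses murmur2; crc32 here is a stand-in for illustration):

```python
# Key-based partitioning: same key, same partition, so per-key ordering holds.

import zlib

NUM_PARTITIONS = 6

def partition_for(key: str) -> int:
    return zlib.crc32(key.encode("utf-8")) % NUM_PARTITIONS

# Every event for user-42 lands on the same partition.
p1 = partition_for("user-42")
p2 = partition_for("user-42")
assert p1 == p2
```

This is also why the partition count is sticky: change NUM_PARTITIONS and existing keys map to different partitions.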
Brokers
Kafka servers. A cluster has many brokers. Each partition lives on one broker (the leader) with copies on others (followers) for redundancy.
Consumer Groups
The killer feature. A "consumer group" is a logical consumer with multiple instances. Kafka distributes partitions across instances in the group: each partition goes to exactly one instance.
This means you can horizontally scale a consumer just by adding more instances. Kafka rebalances automatically.
Multiple consumer groups can read the same topic independently. Each group has its own offset (position in the log). The "search index updater" group reads the same orders topic as the "fraud detection" group, but each tracks its own position.
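The assignment logic can be sketched as a round-robin distribution of partitions over a group's instances; each group gets its own independent assignment, and in real Kafka, its own offsets too. (The function below is illustrative; it is not Kafka's actual assignor API.)

```python
# Distribute partitions across the instances of a consumer group:
# each partition goes to exactly one instance within the group.

def assign(partitions, instances):
    assignment = {inst: [] for inst in instances}
    for idx, p in enumerate(partitions):
        assignment[instances[idx % len(instances)]].append(p)
    return assignment

partitions = list(range(6))

# The "search index updater" group has two instances...
print(assign(partitions, ["indexer-a", "indexer-b"]))
# {'indexer-a': [0, 2, 4], 'indexer-b': [1, 3, 5]}

# ...while the "fraud detection" group independently runs three.
print(assign(partitions, ["fraud-a", "fraud-b", "fraud-c"]))
# {'fraud-a': [0, 3], 'fraud-b': [1, 4], 'fraud-c': [2, 5]}
```

Adding an instance to a group just reruns the assignment over more instances; that is the rebalance.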
Architecture
[Diagram: a three-broker cluster hosting a six-partition topic; broker 1 holds partitions 0 and 3, broker 2 holds partitions 1 and 4, broker 3 holds partitions 2 and 5.]
Stream Processing on Top
Reading messages and processing them is just the start. Stream processing means doing real work: filtering, aggregating, joining, transforming streams in flight.
Common operations:
Filter: drop events that don't match a predicate. Output a new stream.
Map: transform each event (parse, enrich, format).
Aggregate: compute counts, sums, averages over windows of time.
Join: combine two streams or a stream and a table.
Window: group events into time windows (1 minute, 1 hour) for aggregation.
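Windowed aggregation is the combination of the last two operations. A minimal sketch of a tumbling-window count, the logic a framework like Kafka Streams runs for you (event shape and names are invented for illustration):

```python
# Tumbling-window aggregation: count events per key per 1-minute window.

from collections import defaultdict

WINDOW_SECONDS = 60

def windowed_counts(events):
    """events: iterable of (timestamp_seconds, key) pairs."""
    counts = defaultdict(int)
    for ts, key in events:
        window_start = (ts // WINDOW_SECONDS) * WINDOW_SECONDS
        counts[(window_start, key)] += 1
    return dict(counts)

events = [(5, "page_views"), (30, "page_views"), (65, "page_views"), (70, "orders")]
print(windowed_counts(events))
# {(0, 'page_views'): 2, (60, 'page_views'): 1, (60, 'orders'): 1}
```

Real frameworks add the hard parts this sketch skips: late-arriving events, watermarks, and incremental emission of results.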
Frameworks
Kafka Streams: a library that runs inside your application. Lightweight. Java/Kotlin. Great for simple to medium workloads.
Apache Flink: the modern heavyweight. Handles complex topologies, exactly-once semantics, very high throughput. Best-in-class windowing.
Apache Spark Structured Streaming: if you're already on Spark for batch, streaming is a small step.
Apache Beam: a unified model. Can run on Flink, Dataflow, or Spark backends.
ksqlDB: SQL queries on Kafka streams. Easy to start.
Exactly-Once Semantics
The hard problem: ensure each message is processed exactly once even with failures. Kafka supports it through transactional writes and idempotent producers.
The pattern: a stream processor reads, processes, and writes results (including its consumed offsets) in a single transaction. If anything fails, the whole transaction is aborted and nothing is committed. On retry, the message is reprocessed, and because the aborted attempt was never committed, no duplicates become visible downstream.
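The pattern can be simulated in a toy in-memory model (this is the shape of the guarantee, not the Kafka transactions API): the output write and the offset commit happen together or not at all.

```python
# Read-process-write as one atomic step: a failed attempt leaves no trace,
# so a retry reprocesses the same message without producing duplicates.

class Processor:
    def __init__(self, input_log):
        self.input_log = input_log
        self.committed_offset = 0
        self.output = []

    def process_next(self, transform, fail=False):
        msg = self.input_log[self.committed_offset]
        result = transform(msg)
        if fail:                      # crash before commit: abort
            return                    # neither output nor offset is committed
        self.output.append(result)    # output and offset commit together
        self.committed_offset += 1

p = Processor(["a", "b"])
p.process_next(str.upper, fail=True)   # failed attempt: nothing committed
p.process_next(str.upper)              # retry reprocesses "a"
p.process_next(str.upper)
print(p.output)                        # ['A', 'B'], no duplicates
```

The failure case is the whole point: without atomicity, the crashed attempt might have written output but not committed the offset, and the retry would duplicate it.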
Note: exactly-once is within Kafka's ecosystem. Once you write to an external system (a non-Kafka database, a third-party API), you're back to at-least-once unless that system also supports transactions.
Common Patterns
Event sourcing: the database is a Kafka topic. State is derived by replaying events.
Stream-stream joins: "find users who clicked an ad and then bought within 5 minutes."
Materialized views: a stream of events maintains a continuously-updated key-value store. Reads hit the store, not the stream.
CDC sink: Kafka topics fed by Change Data Capture; downstream systems consume database changes as events.
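The event sourcing and materialized view patterns above share one mechanic: fold events into state, and rebuild the same state by replaying from offset zero. A sketch with an invented event shape:

```python
# Materialized view: fold a stream of order events into a key-value store.
# Reads hit the store; replaying the stream from the start rebuilds it.

def apply(view, event):
    if event["type"] == "order_placed":
        view[event["order_id"]] = "placed"
    elif event["type"] == "order_shipped":
        view[event["order_id"]] = "shipped"
    return view

events = [
    {"type": "order_placed", "order_id": 1},
    {"type": "order_placed", "order_id": 2},
    {"type": "order_shipped", "order_id": 1},
]

view = {}
for e in events:          # replay from offset 0 derives the current state
    apply(view, e)
print(view)               # {1: 'shipped', 2: 'placed'}
```

Because the fold is deterministic, a crashed consumer can throw its store away and replay the topic to get back to the identical state.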
Operational Concerns
Retention: how long messages live, from hours to forever. The trade-off is storage cost versus replay capability.
Compaction: alternative to retention. Keep the latest message per key, drop older ones. Useful for "current state" topics.
Partition count: can be increased but never decreased, and increasing it changes which partition a given key hashes to, breaking per-key ordering. Pick generously upfront based on expected parallelism.
Schema evolution: producers and consumers must agree on message format. Use a schema registry (Confluent's, AWS Glue Schema Registry).
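The compaction rule above (keep the latest message per key) can be sketched in a few lines; Kafka does this incrementally in the background, but the retained result looks like this:

```python
# Log compaction: keep only the latest message per key,
# preserving the log order of the survivors.

def compact(log):
    latest = {}                          # key -> index of its last occurrence
    for i, (key, _value) in enumerate(log):
        latest[key] = i
    return [log[i] for i in sorted(latest.values())]

log = [("user-1", "v1"), ("user-2", "v1"), ("user-1", "v2")]
print(compact(log))
# [('user-2', 'v1'), ('user-1', 'v2')]
```

A compacted topic is effectively a changelog of a key-value table: replaying it yields the current value for every key, which is why it suits "current state" topics.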
The One Thing to Remember
Kafka is the durable, distributed log that decouples producers from consumers. Stream processing turns that log into a continuously-running computation. Together they enable nearly all modern real-time data architectures: analytics, fraud detection, recommendation freshness, microservice integration. Most companies that scale eventually run a Kafka cluster, even if they didn't plan to.