What This Article Is About

Most Kafka articles explain how Kafka works internally: brokers, partitions, replication, the log abstraction. That's important context, but it's not what makes Kafka useful in practice. What makes Kafka useful is that a small set of patterns, applied to the right scenarios, solves problems that would otherwise require a custom messaging system, custom durability guarantees, custom ordering rules, and a lot of operational pain.

This article skips the deep internals and goes straight to the practical question: given a real situation, how would you actually use Kafka?

Each scenario below covers the problem, why Kafka fits (or doesn't), concrete topic and partition design, producer and consumer pseudocode, and the gotchas that bite in production. At the end, there's a "when NOT to use Kafka" section, because reaching for Kafka when you don't need it is one of the most expensive mistakes in distributed systems.

A Tiny Kafka Primer (Just Enough to Follow Along)

If you've never touched Kafka, here's the absolute minimum you need to follow the rest:

Topic: a named feed of messages, like a folder. Producers write to topics; consumers read from topics.

Partition: each topic is split into N partitions (you pick N when you create the topic). Each partition is an append-only log. Messages within a partition have a strict order. Across partitions there is no order.

Key-based partitioning: when you produce a message, you can give it a key. Kafka hashes the key and uses the result to pick a partition. Same key always goes to the same partition. This is how you get "all events for user 42 land in the same partition, in order".

Offset: position in a partition. Each consumer tracks where it is. Restart a consumer, it picks up from its last committed offset.

Consumer group: N consumers in a group share the work. Kafka assigns partitions across them. Add a consumer to scale up reading; remove one and the partitions rebalance to the rest. This is how horizontal scaling works.

Retention: messages stay in Kafka for as long as the retention policy says (default 7 days, can be days, weeks, forever). Consumers can replay history by rewinding their offset.

Topic with 3 Partitions and 2 Consumer Groups

[Diagram: producers App A and App B write key-hashed messages to the "orders" topic (partitions P0, P1, P2). Two consumer groups read independently: billing (C1 owns P0 and P1, C2 owns P2) and analytics (C1 owns P0, C2 owns P1 and P2).]

That's it. With those primitives, the patterns below all become concrete.
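
To make that concrete before diving in, here's a minimal produce/consume round trip. This is a sketch, assuming the kafka-python client and a broker on localhost:9092; any client library has close equivalents.

import json
from kafka import KafkaProducer, KafkaConsumer

producer = KafkaProducer(bootstrap_servers="localhost:9092")

# Same key -> same partition, so events for user 42 stay in order.
producer.send(
    "orders",
    key=b"user-42",
    value=json.dumps({"event_type": "OrderPlaced", "user_id": 42}).encode(),
)
producer.flush()

# Consumers sharing a group_id split the topic's partitions between them.
consumer = KafkaConsumer(
    "orders",
    group_id="billing",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
)
for msg in consumer:
    print(msg.partition, msg.offset, json.loads(msg.value))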

Scenario 1: Order Processing with Fanout to Multiple Services

The problem: an ecommerce site. When a user places an order, several things must happen: charge the card, update inventory, send a confirmation email, notify the warehouse, update the analytics dashboard, post to the user's order history.

If the API endpoint that handles "place order" tries to do all of these synchronously, it's slow, brittle, and tightly coupled. If the email service is down, orders fail. If a new requirement appears (loyalty points, fraud check), you edit the order endpoint again. The endpoint becomes a god method.

Why Kafka fits: Kafka decouples the order endpoint from all the downstream consumers. Order is placed once and an event is published. Every interested service consumes the event independently.

Order Event Fanout

[Diagram: the Order API publishes OrderPlaced to the orders.events topic; independent consumer groups (Billing, Inventory, Email, Warehouse, Analytics, Loyalty) each read the same event.]

Topic design:

Topic name: orders.events
Partition count: 12 to 24 to start (room to scale consumers later, since adding partitions afterwards breaks the key-to-partition mapping).
Partition key: order_id if events for the same order should arrive in order at consumers. Or customer_id if you want all events for a customer co-located.
Retention: 7 to 30 days. Long enough to replay if a downstream service had a bug.

Producer (Python pseudocode):

def place_order(request):
    order = create_order_in_db(request)

    event = {
        "event_type": "OrderPlaced",
        "event_id": str(uuid.uuid4()),
        "order_id": order.id,
        "customer_id": order.customer_id,
        "items": order.items,
        "total": order.total,
        "timestamp": now_iso(),
    }

    producer.send(
        topic="orders.events",
        key=str(order.id).encode(),
        value=json.dumps(event).encode(),
    )

    return {"order_id": order.id}

Consumer (Email service):

consumer = KafkaConsumer(
    "orders.events",
    group_id="email-service",
    enable_auto_commit=False,
)

for msg in consumer:
    event = json.loads(msg.value)
    if event["event_type"] != "OrderPlaced":
        consumer.commit()
        continue

    if email_already_sent(event["event_id"]):
        consumer.commit()
        continue

    send_order_confirmation_email(event)
    record_email_sent(event["event_id"])
    consumer.commit()

Gotchas:

Idempotency is on you. Kafka can deliver the same message twice in failure scenarios. The email consumer must track "did I already send this?" using event_id. Same for billing (don't double-charge).

The "transactional outbox" pattern. What if the DB insert succeeds but the Kafka publish fails? You've created an order with no event. The fix: write the event to an "outbox" table in the same DB transaction, then a separate publisher reads the outbox and ships to Kafka with at-least-once delivery. Or use Kafka transactions if your stack supports them. Don't try to use the order endpoint as the publisher and call it a day.

Schema evolution. Six months from now you'll add a new field to OrderPlaced. Old consumers must still parse the new event. Use a schema registry (Avro, Protobuf, JSON Schema) and follow backwards-compatibility rules: only add optional fields, never remove or rename.

Partition key choice has consequences. If you key by order_id you get even distribution but lose the guarantee that all events for a customer are processed in order. Pick based on what consumers actually need.

Scenario 2: Change Data Capture (Database to Search Index)

The problem: your app uses Postgres. You want a search feature backed by Elasticsearch (or OpenSearch, or Algolia). When data changes in Postgres, the search index must update within seconds.

The naive approach: in every place the code modifies data, also update Elasticsearch. Painful: every developer must remember; failures get out of sync; bulk DB operations bypass it entirely; bugs are silent until someone notices wrong search results.

Why Kafka fits: Kafka's log model is exactly the shape of "every change to a database". Tools like Debezium tail the Postgres write-ahead log and publish each row change as a Kafka event. Anyone who needs the latest data subscribes.

CDC Pipeline

[Diagram: Postgres (the source of truth) writes to its WAL (write-ahead log); a Debezium connector tails the WAL and publishes row changes to the db.public.products topic; consumers include an Elasticsearch sync, a cache invalidator, and a data warehouse loader.]

Topic design:

One topic per source table is the Debezium default: db.public.products, db.public.users, etc.
Partition key: the row's primary key. This way, all changes to product 42 land in the same partition, in order. Consumers see updates in the right sequence.
Retention: long. You may want to bootstrap a new consumer from the beginning of time. Use log compaction (covered in the patterns section below) or set retention to months.
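
Registering the Debezium connector is one POST to the Kafka Connect REST API. A sketch, assuming a Connect worker at connect:8083; the hostnames and credentials are placeholders, and the property names follow Debezium 2.x:

import requests

connector = {
    "name": "products-cdc",
    "config": {
        "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
        "database.hostname": "postgres",
        "database.port": "5432",
        "database.user": "debezium",
        "database.password": "********",
        "database.dbname": "app",
        "topic.prefix": "db",                    # topics become db.public.products, etc.
        "table.include.list": "public.products",
        "plugin.name": "pgoutput",               # Postgres built-in logical decoding
    },
}
requests.post("http://connect:8083/connectors", json=connector).raise_for_status()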

Consumer (Elasticsearch sync, simplified):

for msg in consumer:
    change = json.loads(msg.value)
    op = change["op"]  # "c" (create), "u" (update), "d" (delete)
    pk = change["after"]["id"] if op != "d" else change["before"]["id"]

    if op in ("c", "u"):
        es_client.index(index="products", id=pk, body=change["after"])
    elif op == "d":
        es_client.delete(index="products", id=pk)

    consumer.commit()

Gotchas:

Schema changes in the source DB. Add a column in Postgres, the Debezium event format changes. Use a schema registry and a versioned topic naming policy.

Initial snapshot. When you first start CDC, you need history (every existing row), not just changes from now on. Debezium handles this with a "snapshot phase". Plan for it; on a 100M-row table it takes hours.

Order matters. A delete followed by a re-create of the same row must arrive in order. Keying by primary key handles this. Without keying, the consumer might apply the create before the delete and end up with phantom data.

Backpressure. A 10x burst of writes in Postgres becomes a 10x burst of events. Consumers must keep up or lag accumulates. Monitor consumer lag.

Scenario 3: Activity Tracking and User Behavior Analytics

The problem: a web or mobile app. You want to track every page view, click, scroll, search, and purchase intent for analytics, A/B testing, ML training, recommendations.

Volume is huge: millions of events per minute at scale. Writing each one to a transactional database is wasteful (you don't need ACID for "user clicked button"). Sending each one synchronously to a third-party tool blocks page loads.

Why Kafka fits: Kafka was originally built at LinkedIn for exactly this. High throughput, cheap to write, batched, durable. Many independent consumers (real-time dashboards, batch ETL, ML pipelines) all read the same firehose.

Topic design:

One big topic events.user_activity or per-domain topics (events.web, events.mobile, events.search). Per-domain is more flexible.
Partition count: high (50 to 200) to allow many parallel consumers.
Partition key: user_id or session_id when you want session-level ordering. Sometimes no key, just round-robin, when you don't care about order and want maximum even distribution.
Retention: short for raw events (1-3 days). The data warehouse and lake hold history. Kafka is the bus, not the archive.

Producer (web app, browser-side via API):

// Browser sends batched events to your collector API
function track(event) {
    pendingEvents.push({...event, ts: Date.now()});
    if (pendingEvents.length >= 20) flush();
}

setInterval(flush, 5000);

function flush() {
    const batch = pendingEvents.splice(0);
    if (!batch.length) return;
    fetch("/track", {method: "POST", body: JSON.stringify(batch)});
}

# Collector service
@app.post("/track")
def track(events):
    for e in events:
        user_id = e.get("user_id")
        producer.send(
            topic="events.web",
            key=user_id.encode() if user_id else None,  # no key -> round-robin
            value=json.dumps(e).encode(),
        )
    producer.flush()
    return "ok"

Consumers: a real-time aggregation job (Flink or Kafka Streams) computes "active users in the last 5 minutes". A separate batch job sinks events to S3 in Parquet for the data warehouse. A third consumer feeds an A/B test analyzer.

Gotchas:

Time skew. Mobile clients have wrong clocks. Don't trust client timestamps for analytics; record both client time and server-receive time.

PII. User events often contain emails, IPs, behavior. Plan for privacy: tokenize user IDs, time-bound retention, GDPR delete flow (which is hard with append-only logs; one approach: store a "deleted user" tombstone topic and consumers filter).

Bot traffic. 30% of your "users" might be bots. Filter at the producer or in a downstream pipeline. Otherwise your analytics are wrong and you pay to process garbage.

Bursty traffic. Marketing campaign goes live, traffic 10x's. Producers should buffer + batch + retry; brokers should be over-provisioned for peak, not average.
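
On the producer side, buffering and batching are mostly configuration. A sketch with kafka-python parameter names (other clients expose close equivalents):

from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    acks="all",                      # wait for all in-sync replicas
    retries=5,                       # retry transient broker errors
    linger_ms=20,                    # wait up to 20 ms to fill a batch
    batch_size=64 * 1024,            # 64 KB batches
    compression_type="lz4",          # cheap CPU, big bandwidth savings
    buffer_memory=64 * 1024 * 1024,  # absorb short bursts in client memory
)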

Scenario 4: Audit Log and Event Sourcing

The problem: a financial app, a healthcare app, anything regulated. You need a permanent, immutable record of every state change: "who changed what, when, and from what to what". Auditors will ask. Customers will dispute charges.

Or a different angle: event sourcing. Instead of storing the current state of an order, store every event that happened to it. Reconstruct current state by replaying events. The benefit: full history, the ability to rebuild projections, time-travel debugging.

Why Kafka fits: Kafka topics are append-only logs. Once written, messages are immutable. Configure infinite retention (or very long) and you have a durable event log. Compaction lets you keep the latest state per key while still treating it as an event store.

Topic design:

Topic per aggregate type: accounts.events, orders.events, users.events.
Partition key: aggregate ID (account_id, order_id). Guarantees order per aggregate.
Retention: forever, or use log compaction (Kafka keeps the latest message per key indefinitely; keys are removed only when a tombstone is written for them).
Schema: strict. Use Avro or Protobuf with a schema registry. Old events must always be readable; you'll be parsing them years from now.

Event-sourced producer:

def deposit(account_id, amount, idempotency_key):
    event = {
        "event_type": "MoneyDeposited",
        "event_id": idempotency_key,
        "account_id": account_id,
        "amount": amount,
        "timestamp": now_iso(),
        "actor": current_user_id(),
    }
    producer.send("accounts.events", key=account_id, value=event)

Read-side projection (consumer):

# Rebuilds current account balances by replaying events
balances = {}

for msg in consumer:
    event = json.loads(msg.value)
    acct = event["account_id"]
    if event["event_type"] == "MoneyDeposited":
        balances[acct] = balances.get(acct, 0) + event["amount"]
    elif event["event_type"] == "MoneyWithdrawn":
        balances[acct] = balances.get(acct, 0) - event["amount"]
    else:
        continue  # unknown event types don't affect the balance view
    upsert_balance_view(acct, balances[acct])

Gotchas:

Schema evolution is critical. Events written today must be readable by code written in 2030. Plan for it: never remove fields, only add optional ones, version your event types if you must change semantics.

GDPR / right-to-be-forgotten. Append-only logs can't really delete. Workarounds: encrypt PII per user, then "delete" by destroying the key. Crypto-shredding.
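
A crypto-shredding sketch, assuming the cryptography package; the per-user key store here is a plain dict, where production would use a KMS or an encrypted table:

from cryptography.fernet import Fernet

user_keys = {}  # hypothetical key store

def encrypt_pii(user_id, plaintext: bytes) -> bytes:
    key = user_keys.setdefault(user_id, Fernet.generate_key())
    return Fernet(key).encrypt(plaintext)

def decrypt_pii(user_id, ciphertext: bytes) -> bytes:
    return Fernet(user_keys[user_id]).decrypt(ciphertext)

def forget_user(user_id):
    # Destroying the key makes every event that mentions this user
    # unreadable, without touching the append-only log.
    user_keys.pop(user_id, None)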

Replay is slow. Rebuilding a projection from 10 years of events takes hours or days. Use snapshots (periodically save the projection state and resume from the latest event after that point).
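
A snapshotting sketch: periodically persist the projection plus the offset it reflects, then on restart seek past the snapshot instead of replaying from offset zero. Assumes kafka-python; load_snapshot, apply_event, and save_snapshot are hypothetical helpers:

import json
from kafka import KafkaConsumer, TopicPartition

consumer = KafkaConsumer(bootstrap_servers="localhost:9092", enable_auto_commit=False)
tp = TopicPartition("accounts.events", 0)
consumer.assign([tp])

balances, start_offset = load_snapshot()  # hypothetical: (state, next offset to read)
consumer.seek(tp, start_offset)

for i, msg in enumerate(consumer):
    apply_event(balances, json.loads(msg.value))  # hypothetical projection logic
    if (i + 1) % 100_000 == 0:
        save_snapshot(balances, msg.offset + 1)   # hypothetical persistence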

Don't event-source everything. Event sourcing is great for domains where history matters (money, healthcare, audit). It's overkill for domains where current state is all anyone cares about (a CMS page, a user profile photo). Mixing styles in one app is fine.

Scenario 5: Buffering Bursty Writes (Smoothing Out a Spike)

The problem: your app receives bursty traffic: a viral tweet causes a flood of signups; an IoT fleet phones home every hour at the top of the hour; a Black Friday sale starts. Your downstream system (a relational DB, an external API, a payment processor) cannot keep up with the spike.

Without buffering, you either drop traffic, fail requests, or push backpressure to the user.

Why Kafka fits: Kafka acts as a giant buffer. Producers write fast, consumers process at their own pace. The log absorbs the spike. As long as you have disk and retention, you can lag for hours and catch up.

Topic design:

Topic name: whatever the work is, e.g., signups.pending.
Partition count: enough that consumers can scale out to keep up with sustained throughput.
Retention: long enough to replay if consumers fail completely.
Partition key: depends on whether order matters. For "process this signup", probably not.

Pattern:

# Front-end (fast): write to Kafka, return success
@app.post("/signup")
def signup(request):
    producer.send("signups.pending", value=json.dumps(request.json()).encode())
    return {"queued": True}

# Worker (slow): drain Kafka into the DB at a sustainable rate
for msg in consumer:
    user = json.loads(msg.value)
    create_user_in_db(user)
    send_welcome_email(user)
    consumer.commit()  # commit only after the work succeeded

Gotchas:

You changed the user-facing model. Signup is now async. The user gets "we're processing your request" instead of immediate confirmation. This must be acceptable for the domain.

Buffer can fill up too. If the spike is sustained and consumers can't catch up, lag grows without bound. Monitor lag and add capacity (more partitions, more consumers) before it becomes a customer-visible issue.

Consumer parallelism is bounded by partitions. 10 partitions = at most 10 parallel consumers within a group. More parallelism means more partitions, and you can't easily increase the partition count later without breaking key-to-partition mapping.

Failures during processing. If create_user_in_db fails, what happens? You need a retry policy and probably a dead-letter topic for messages that fail repeatedly.

Scenario 6: Multi-Service Cache Invalidation

The problem: ten microservices each cache parts of the user profile. When a user updates their email, every cache must be invalidated.

Direct calls don't scale: the user-service would need to know about every cache. New caches require user-service code changes.

Why Kafka fits: publish a "UserUpdated" event. Every service subscribes and invalidates its cache. New services subscribe; user-service doesn't change.

Topic: users.events, partition by user_id, short retention (an hour or two; cache invalidation is time-sensitive but not historical).

Gotchas: events can be delayed during incidents. Cache invalidation has a worst-case staleness window equal to your max consumer lag. For very latency-sensitive caches, combine approaches: synchronous invalidation in the same data center, plus Kafka events for cross-region or eventually consistent caches.
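
A minimal invalidator sketch, assuming redis-py for the local cache and JSON events carrying a user_id:

import json
import redis
from kafka import KafkaConsumer

cache = redis.Redis()
consumer = KafkaConsumer(
    "users.events",
    group_id="profile-cache-invalidator",
    bootstrap_servers="localhost:9092",
)

for msg in consumer:
    event = json.loads(msg.value)
    if event.get("event_type") == "UserUpdated":
        cache.delete(f"user:{event['user_id']}")  # drop the stale entry; next read repopulates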

Scenario 7: Log and Metric Aggregation

The problem: hundreds of services emit logs and metrics. You need a single pipeline that ships them to Elasticsearch, Datadog, S3 archive, and a security tool, without each service knowing about each destination.

Why Kafka fits: services or agents (Filebeat, Fluentd, Vector, OpenTelemetry collector) write to Kafka. Different sinks read from Kafka. Decouples emission from destination.

Topic design:

One or a few topics: logs.app, logs.access, metrics.raw.
Partition count: high; logs are bursty.
Partition key: usually none (round-robin) since strict order across services doesn't matter much.
Retention: 1-7 days. The downstream sinks have the long-term archive.

Gotchas: log volume can be huge. Kafka brokers need disk and network capacity. Compression (lz4 or zstd) is mandatory. Sample noisy services (don't ship every health-check log).

Scenario 8: Stream Processing and Real-Time Aggregations

The problem: compute "average ride price per city in the last 5 minutes" or "fraud score per transaction in real time" or "active session count by region". Continuous, low-latency, on streams of data.

Why Kafka fits: Kafka is the source of the streams. Stream processing frameworks (Kafka Streams, Apache Flink, Spark Structured Streaming, ksqlDB) consume from Kafka, do windowed aggregations, and write results back to other Kafka topics or to a sink.

Topic design: input topic (raw events), output topic (aggregated results). Partition by the aggregation key (city, user, region) so each consumer handles all events for its keys.

Pseudocode (Kafka Streams style):

builder
  .stream("ride.events")
  .groupBy((k, v) -> v.city)
  .windowedBy(TimeWindows.of(Duration.ofMinutes(5)))
  .aggregate(
      () -> new Stats(),
      (city, ride, stats) -> stats.add(ride.price)
  )
  .toStream()
  .to("ride.stats.5m");

Gotchas:

State is large. Windowed aggregations need to remember in-flight windows. Frameworks back this with their own state stores (RocksDB on disk, replicated to Kafka for fault tolerance).

Late-arriving events. Network delays mean "5 minutes ago" events sometimes arrive now. Frameworks handle this with watermarks and grace periods. Tune them.

Exactly-once semantics. If your consumer crashes mid-aggregation, you don't want to double-count. Modern Kafka and stream processors support exactly-once, but it costs latency and complexity. Use it when correctness matters; skip it when at-least-once with idempotent operations is fine.

Scenario 9: Async Microservice Workflows (Saga Pattern)

The problem: "place order" requires charging a card, reserving inventory, and scheduling delivery. Each is a different service. Some can fail. You can't run a distributed transaction across all of them.

Why Kafka fits: the saga pattern coordinates a multi-step workflow as a sequence of events. Each step is a service consuming an event, doing its work, and publishing the next event (or a compensating event on failure).

Order Saga

[Diagram: OrderPlaced on orders.events flows to Billing, which emits PaymentSucceeded or PaymentFailed; Inventory consumes that and emits InventoryReserved or InventoryFailed; Delivery then emits OrderCompleted, or compensating events on failure.]

Gotchas:

Compensation is hard. If billing succeeded but inventory failed, you must refund. If inventory succeeded but delivery scheduling failed, you must release the inventory. Each forward step needs a designed compensating step.

Sagas are not transactions. Intermediate states are visible. A user might briefly see "payment processed" before the order completes. Design the UI for it.

Choreography vs orchestration. Choreography: each service knows what events to emit next. Orchestration: a central coordinator drives the saga. Orchestration is easier to reason about; choreography scales better. Both are valid.

Scenario 10: Data Pipeline to a Data Lake or Warehouse

The problem: get production data into Snowflake, BigQuery, Databricks, or an S3-based lake for analytics, ML, and BI. You don't want to hammer the production DB with analytical queries.

Why Kafka fits: Kafka is the connective tissue. Producers (apps, CDC, log shippers) write to Kafka. Sink connectors (Kafka Connect, Debezium, Confluent's S3 sink, custom Flink jobs) batch and write to the warehouse.

Pattern:

Apps emit events to Kafka.
CDC topics provide DB-side data.
A connector writes to S3 in Parquet, partitioned by date.
Snowflake / Athena / Databricks reads from S3.

Gotchas: file size matters for query performance (target 128 MB to 1 GB Parquet files). Late data needs a strategy (re-write a partition, use Iceberg or Delta to handle updates). Schema evolution again: lake tables must handle older event versions.

Patterns That Show Up Everywhere

Idempotent Consumer

Kafka guarantees at-least-once by default. The same message can be delivered twice. Every consumer needs to be idempotent: track an event ID (UUID, business ID, deterministic hash) and skip if already processed. Maintain the seen-IDs in a database, Redis, or Kafka itself (with compaction).
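
A minimal dedup sketch, assuming redis-py; SET with NX and a TTL is both the check and the record. Note the mark is written before the side effect here, so a crash in between drops the event; for strict guarantees, write the mark and the side effect in one DB transaction instead:

import redis

seen = redis.Redis()

def already_processed(event_id: str) -> bool:
    # SET ... NX returns None when the key already exists, i.e. a duplicate.
    fresh = seen.set(f"processed:{event_id}", 1, nx=True, ex=7 * 24 * 3600)
    return not fresh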

Dead Letter Topic (DLT)

A message that fails repeatedly should not block the consumer forever. After N failures, route it to an orders.events.dlq topic for manual inspection. Resolve the cause, then optionally reprocess.
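
A routing sketch: bounded retries in the consumer, then park the poison message in the DLT with the error attached as a header (the header name is made up; process is hypothetical business logic):

import json
from kafka import KafkaConsumer, KafkaProducer

consumer = KafkaConsumer("orders.events", group_id="billing",
                         enable_auto_commit=False,
                         bootstrap_servers="localhost:9092")
producer = KafkaProducer(bootstrap_servers="localhost:9092")
MAX_ATTEMPTS = 3

for msg in consumer:
    for attempt in range(MAX_ATTEMPTS):
        try:
            process(json.loads(msg.value))  # hypothetical business logic
            break
        except Exception as exc:
            if attempt == MAX_ATTEMPTS - 1:
                producer.send("orders.events.dlq", key=msg.key, value=msg.value,
                              headers=[("error", str(exc).encode())])
    consumer.commit()  # either processed or parked in the DLQ; move on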

Retry Topics

Variant: an intermediate orders.events.retry topic for transient failures. Consumers route failed messages there with a delay header; a separate consumer picks them up after the delay and retries. Useful when downstream services are temporarily unavailable.

Schema Registry

Use one. Avro or Protobuf with backwards/forwards-compatibility checks enforced at write time. Without a registry, schemas drift and old consumers break in mysterious ways.

Outbox Pattern

Already mentioned. An atomic DB write plus Kafka publish is impossible without two-phase commit or a transactional outbox. Use the outbox pattern for any "make a DB change AND emit an event" flow.

CQRS

Command-Query Responsibility Segregation. Writes go to one model (the source of truth, often event-sourced via Kafka). Reads use materialized views built from those events. Each view optimized for its specific query pattern. Kafka is the connective layer.

Compaction for "Latest State" Topics

Default Kafka deletes by time. Compaction deletes by key: only the latest message per key is retained. Use it when the topic represents "the current state of all keys" (user profiles, account balances, feature flags). New consumers can bootstrap by reading the compacted topic from the beginning.
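
Compaction is a topic-level setting. A creation sketch with kafka-python's admin client:

from kafka.admin import KafkaAdminClient, NewTopic

admin = KafkaAdminClient(bootstrap_servers="localhost:9092")
admin.create_topics([
    NewTopic(
        name="users.profiles",
        num_partitions=12,
        replication_factor=3,
        topic_configs={"cleanup.policy": "compact"},  # keep the latest value per key
    )
])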

Tiered Storage

For very long retention, modern Kafka (KIP-405, also Confluent Tiered Storage) offloads old segments to object storage (S3, GCS). Hot data on broker disks, cold data in object storage, transparently. Affordable infinite retention.

Topic and Partition Design Cheatsheet

| Question | Answer |
| --- | --- |
| How many partitions? | Start with 12-24. Need more parallel consumers later? Go bigger upfront; adding partitions later breaks key-to-partition mapping. |
| What partition key? | Whatever entity needs ordered processing (user_id, order_id, account_id). No key = round-robin. |
| One topic or many? | One per logical event domain. Don't put unrelated event types in one topic just to save on topic count. |
| Retention period? | Match the use case. Buffer: hours. Logs: days. Audit/event sourcing: forever (with compaction or tiered storage). |
| Replication factor? | 3 in production. Survives one broker loss. |
| Acks setting? | acks=all for durability. acks=1 for speed at the risk of rare data loss. |
| Compression? | Always. zstd is the modern default. Saves disk and bandwidth at minor CPU cost. |

When NOT to Use Kafka

Kafka is overkill or wrong for many situations. Common anti-patterns:

Request-response. Kafka is one-way. If you need a synchronous "ask and wait for an answer", use HTTP or gRPC. People build "request topics + response topics" patterns and they're almost always painful.

Low volume. 100 events per day? Use a database table or a simple queue (SQS, RabbitMQ, even Postgres LISTEN/NOTIFY). Kafka's operational complexity is not worth it for tiny workloads.

Single producer, single consumer, no fanout. If only one service produces and one service consumes, that's a queue, not a stream. SQS or RabbitMQ are simpler.

Sub-millisecond latency requirements. Kafka end-to-end latency is typically 5-50ms. For HFT or real-time bidding, that's too slow. Look at specialized in-memory systems.

Background job queue with priorities, delays, and dead-letter retries. Kafka has no native priorities, no native delayed delivery, and the DLQ pattern is DIY. Celery, Sidekiq, or BullMQ are designed for this.

Small teams without ops capacity. A self-hosted Kafka cluster is non-trivial. ZooKeeper (legacy) or KRaft, brokers, monitoring, backups, capacity planning. Use a managed service (Confluent Cloud, AWS MSK, Aiven, Redpanda Cloud) or a simpler tool. Don't run Kafka yourself unless you have the ops chops.

The "we'll need streams someday" trap. Don't introduce Kafka because it might be useful later. Introduce it when you have a concrete problem it solves. The cost of carrying it (people, infrastructure, learning curve) is real.

Operational Realities

Monitoring is mandatory. Watch consumer lag (the number of messages consumers are behind). Alerts: lag growing, consumers not committing, broker disk filling, ISR shrinkage.
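
Lag is computable from two numbers per partition: the log end offset and the group's committed offset. A sketch with kafka-python; in practice most teams use Burrow, kafka-lag-exporter, or their vendor's dashboard instead:

from kafka import KafkaConsumer, TopicPartition

consumer = KafkaConsumer(group_id="email-service", bootstrap_servers="localhost:9092")
partitions = [TopicPartition("orders.events", p)
              for p in consumer.partitions_for_topic("orders.events")]
end_offsets = consumer.end_offsets(partitions)

for tp in partitions:
    committed = consumer.committed(tp) or 0
    print(f"partition {tp.partition}: lag = {end_offsets[tp] - committed}")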

Capacity planning. Disk = retention period × throughput × replication factor. Network = peak producer rate + (peak consumer rate × fanout). Forecast for your highest-traffic week, not the average.

Rebalancing is a stop-the-world event. When a consumer joins or leaves a group, partitions reassign. During the rebalance, the group pauses. For modern Kafka, cooperative rebalancing reduces this; still, frequent rebalances hurt throughput.

Don't share clusters across orgs without quotas. A bad producer can saturate the cluster and harm everyone. Use Kafka quotas to bound per-client throughput.

Disaster recovery. Multi-region replication (MirrorMaker 2, Confluent Replicator) is a project of its own. For most teams, single-region with strong replication factor is sufficient. Mission-critical systems need cross-region.

Upgrades. Kafka clusters have lots of moving parts (brokers, ZooKeeper / KRaft controllers, schema registry, connectors). Upgrade in a careful, staged way. Read release notes for breaking changes.

Security. Default Kafka has no authentication. Production must enable TLS, SASL (Kerberos, SCRAM, or OAUTHBEARER), and ACLs. Encrypt at rest if compliance requires.

Choosing Kafka vs Alternatives

| | Kafka | RabbitMQ | SQS / SNS | Redpanda | Pulsar |
| --- | --- | --- | --- | --- | --- |
| Model | Distributed log | Smart broker, queue | Managed queue / pub-sub | Kafka-compatible log | Distributed log + queue |
| Replay | Yes | Limited | No | Yes | Yes |
| Throughput | Very high | Moderate | High (managed) | Very high | Very high |
| Ops burden | High (or use managed) | Moderate | None (managed) | Lower than Kafka | Higher than Kafka |
| Best for | Streams, fanout, log | Work queue, RPC | Async tasks, simple fanout | Kafka replacement | Multi-tenant streams |

Pick Kafka when you have multiple consumer groups, need replay, or are operating at scale. Pick a managed queue when you have simple async workloads. Pick RabbitMQ when you want sophisticated routing (topic exchanges, headers exchanges, RPC patterns). Redpanda is a faster Kafka-compatible drop-in for many setups.

Edge Cases and Gotchas

You cannot easily increase partitions. Adding partitions changes the hash mapping and breaks per-key ordering. Plan your partition count upfront with growth in mind.

Hot partitions. If your partition key is skewed (one user generating 90% of traffic), one partition becomes a bottleneck. Solution: composite keys, or salt the key for high-volume entities, as sketched below.
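
A salting sketch: split only the known-hot keys across a few sub-keys, accepting that per-key ordering is lost for exactly those entities (HOT_KEYS is a hypothetical set fed from monitoring):

import random

HOT_KEYS = {"user-whale-1"}  # hypothetical: identified from partition metrics
SALT_BUCKETS = 8

def partition_key(entity_id: str) -> bytes:
    if entity_id in HOT_KEYS:
        # Spread a hot entity over N sub-keys; consumers must tolerate
        # out-of-order events for these entities.
        return f"{entity_id}#{random.randrange(SALT_BUCKETS)}".encode()
    return entity_id.encode()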

Message size limits. The default broker limit is 1 MB per message. Larger payloads can be configured but are expensive. The usual pattern, sketched below: store the blob in S3 and pass a reference in the Kafka message.
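
A claim-check sketch, assuming boto3 and a hypothetical bucket; the Kafka message carries only the pointer:

import json
import uuid
import boto3
from kafka import KafkaProducer

s3 = boto3.client("s3")
producer = KafkaProducer(bootstrap_servers="localhost:9092")

def publish_large(topic: str, key: str, blob: bytes):
    ref = f"payloads/{uuid.uuid4()}"
    s3.put_object(Bucket="my-kafka-blobs", Key=ref, Body=blob)  # hypothetical bucket
    producer.send(
        topic,
        key=key.encode(),
        value=json.dumps({"s3_bucket": "my-kafka-blobs", "s3_key": ref}).encode(),
    )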

Consumer rebalances during deploys. Every rolling restart triggers rebalances. Use static membership and cooperative rebalancing to minimize disruption.

Long-running message processing. If a consumer takes longer than the session timeout to process, Kafka thinks it's dead and rebalances. Either tune session timeout up, or break work into shorter units.

Auto-commit is dangerous. If you auto-commit before processing finishes, a crash loses messages. Always commit after processing succeeds.

Delivery is at-least-once by default. Idempotent consumer is mandatory. Exactly-once is achievable but costs latency and adds complexity.

Don't treat Kafka as a database. It's not optimized for random access by key. Use Kafka for the log; use a real DB or KV store for queries.

Tombstones for compaction. To "delete" a key in a compacted topic, write a message with a null value. The compactor will eventually remove the key entirely.

The "single producer per partition" rule for ordered writes. If multiple producers write to the same partition concurrently, ordering between them is broker-receive order, not your application order. For strict ordering, route writes for a key through a single producer process.

The One Thing to Remember

Kafka is not a message queue, a database, or a job system. It's a distributed, replayable log that excels at three things: fanning out events to many independent consumers, buffering bursts so producers and consumers can run at different speeds, and providing a durable, ordered history of state changes that downstream systems can replay. Almost every successful Kafka use case is a variation on those three. The most common Kafka mistake is reaching for it when a simple queue or HTTP call would do; the second most common is underestimating the operational cost of running it. Start with one or two well-designed topics, partitioned by the entity that needs ordered processing, with a schema registry from day one and idempotent consumers everywhere. Add patterns (DLT, retry topics, outbox, compaction, exactly-once) only when a real problem demands them. The systems that get the most out of Kafka are the ones that resist the temptation to over-engineer it.