Morteza Taghdisi

Software Engineering · April 13, 2026

How Kafka Partitions Decide Order and Durability

Series: Kafka Mastery
Article 4 of 6

Kafka's storage model is where ordering, throughput, and durability are all decided. Get the partition key wrong, or the replication settings wrong, and the failure modes have names worth recognizing.

Tags: kafka, partitions, replication, durability, storage

This is Part 4 of the Kafka Mastery series. Parts 1 to 3 covered when to use Kafka, how to run it locally, and how to design a multi-service event flow on top of it. This part goes one layer down. Partitions, offsets, replication, min.insync.replicas, acks, and unclean.leader.election.enable decide whether the cluster preserves order, preserves data, and tells operators when it cannot.

The article uses Apache Kafka 3.9 in KRaft mode, with a three-broker local cluster for the durability demo.

What A Partition Actually Is

A partition is the unit Kafka stores and orders data in. It is an append-only, immutable log of records. Each record has a position in that log called the offset. Offsets are per partition. They start at zero and never go backward.

A topic is a named collection of partitions. The topic itself does not order anything. Ordering exists per partition, not per topic. This is the single most important sentence in this article.

plaintext
Topic: order.created

  Partition 0: [r0, r1, r2, r3, r4, ...]  <- ordered
  Partition 1: [r0, r1, r2, r3, ...]      <- ordered
  Partition 2: [r0, r1, r2, ...]          <- ordered

  Across partitions: no order at all

Each partition lives on one broker (the leader), with copies on other brokers (the followers). Producers write to the leader. Followers replicate from the leader. Consumers read from the leader.

The Partition Key Decides Ordering

When a producer publishes a record with a key, Kafka hashes the key and uses the result to pick a partition. All records with the same key land on the same partition, and therefore preserve order with respect to each other.

java
kafkaTemplate.send("order.created", order.id().toString(), event);

Three common partition-key strategies, and what each gives up:

  • By entity ID (for example, aggregate_id or order_id): preserves per-entity ordering. The right default for any system where a single entity has a sequence of events that must be processed in order.
  • By event type: preserves no useful ordering. Spreads load by category, which is rarely what the business needs.
  • Random or null key: spreads load uniformly across partitions. Maximum throughput. No ordering guarantee anywhere.

If the system has any sentence of the form "A must happen before B for this entity," the partition key must be the entity ID. Anything else is hoping the race never lands on the wrong side.
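The mechanics are worth seeing in ten lines. Kafka's default partitioner hashes the serialized key bytes with murmur2; the sketch below uses CRC32 as a stand-in hash (an illustrative assumption, not Kafka's algorithm) purely to show the property that matters: the same key always maps to the same partition.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

// Sketch of key-based partition selection. Kafka's default partitioner
// hashes the key bytes with murmur2; CRC32 here is a stand-in to show
// the mechanics: same key -> same partition, every time.
public class KeyPartitioning {

    static int partitionFor(String key, int numPartitions) {
        CRC32 crc = new CRC32();
        crc.update(key.getBytes(StandardCharsets.UTF_8));
        // getValue() is an unsigned 32-bit value, so the modulo is non-negative.
        return (int) (crc.getValue() % numPartitions);
    }

    public static void main(String[] args) {
        int partitions = 6;
        // Every event keyed by "order-42" routes to one partition, so the
        // order's events stay ordered relative to each other.
        int first = partitionFor("order-42", partitions);
        int second = partitionFor("order-42", partitions);
        System.out.println(first == second); // deterministic routing
    }
}
```

Change the key between two sends and the routing guarantee disappears, which is exactly the failure scenario below.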

The Failure Scenario: Wrong Partition Key

A team partitions an order.events topic by event type. OrderCreatedEvent lands on partition 1. OrderCancelledEvent lands on partition 2. The reasoning at design time was "this gives us better parallel consumption."

Under load, two events for the same order arrive seconds apart: a creation and a cancellation. They go to different partitions. Two consumer threads pick them up in parallel. Partition 2 happens to be slightly ahead in processing, so the cancellation runs first. The order is cancelled, then created. The system processes the cancelled order as active. Customer support gets a complaint that an order they cancelled went out for fulfilment.

The fix is the partition key, not the consumer code. Per-entity ordering only exists if all events for that entity share a partition. Repartitioning a live topic is an operational project, not a config change. Choose the key correctly the first time.

Replication: Leaders, Followers, And The ISR

Each partition has a configured replication factor. RF3 means one leader and two followers across three brokers.

A follower is "in sync" if it has caught up to the leader within a configured lag tolerance. The set of in-sync replicas, including the leader, is the ISR.

Three settings interact to define what durability actually means:

  • replication.factor on the topic
  • min.insync.replicas on the topic
  • acks on the producer

Any one of these alone is insufficient. The combination is what guarantees data survives broker failure.

acks, min.insync.replicas, And What They Promise Together

acks controls when the producer considers a write complete:

  • acks=0: the producer does not wait for any acknowledgment. The fastest. Loses data on any leader failure during the write.
  • acks=1: the producer waits for the leader to write to its local log. Loses data if the leader fails before followers replicate.
  • acks=all: the producer waits until every replica in the ISR acknowledges the write. Combined with min.insync.replicas, this is the durable choice.

min.insync.replicas tells the broker how many replicas must be in the ISR for the write to be allowed at all. With acks=all and min.insync.replicas=2 and RF3, a write is durable as long as at least two of the three replicas are healthy and in sync.

The production target is RF3 with min.insync.replicas=2 and acks=all. That tolerates one broker failure cleanly. It refuses writes if two brokers are down, instead of accepting writes that cannot be replicated.

yaml
spring:
  kafka:
    producer:
      acks: all
      properties:
        enable.idempotence: true
bash
kafka-topics.sh --bootstrap-server localhost:9092 \
  --create \
  --topic order.created \
  --partitions 6 \
  --replication-factor 3 \
  --config min.insync.replicas=2

Named Failure Modes The Producer Will Surface

When durability settings disagree with cluster state, Kafka does not silently degrade. It throws exceptions with names worth recognizing.

  • NotEnoughReplicasException. The ISR is below min.insync.replicas before the append is even attempted. The broker refuses the write. This is correct behavior. It is also the producer's signal that the cluster is degraded.
  • NotEnoughReplicasAfterAppendException. The append happened, but the broker could not get enough in-sync replicas to acknowledge it within the timeout. The write may or may not survive a leader failure. The producer should treat it as a failure and retry under idempotence.
  • Acknowledged data loss with weaker acks. With acks=0 or acks=1, a write can be acknowledged to the producer and then disappear if the leader fails before replication. There is no exception. The data is just gone. This is what acks=all exists to prevent.
  • Write unavailability when min.insync.replicas (MISR) exceeds the live ISR. If the topic is configured with min.insync.replicas=2 and only one replica is in sync, every write fails. The producer sees NotEnoughReplicasException. This is durability working as designed: refuse the write rather than accept one that cannot be replicated.
  • Truncation under unclean leader election. When unclean.leader.election.enable=true (the dangerous setting) and all in-sync replicas are unavailable, Kafka can promote an out-of-sync replica to leader. That replica's log is shorter. Records that existed on the old leader are silently dropped on the new one. Consumers do not see an error. The data is gone. The correct production setting is unclean.leader.election.enable=false. Writes block until an in-sync replica is available, which is the right trade in any system that values correctness over uptime.

The named exceptions matter. NotEnoughReplicasAfterAppend in the logs is not a Kafka bug. It is Kafka telling the operator that the cluster's durability assumptions are no longer being met.

The Local Demo: A Cluster That Can Actually Demonstrate ISR Failure

A single-broker local cluster cannot demonstrate any of this. The durability tag (part-04-storage) replaces the single-broker compose file from Part 2 with a cluster that has three brokers and three dedicated controllers in KRaft mode.

yaml
services:
  controller-1: { image: confluentinc/cp-kafka:7.8.0, ... } # process.roles=controller
  controller-2: { image: confluentinc/cp-kafka:7.8.0, ... } # process.roles=controller
  controller-3: { image: confluentinc/cp-kafka:7.8.0, ... } # process.roles=controller
  kafka-1:     { image: confluentinc/cp-kafka:7.8.0, ... } # process.roles=broker
  kafka-2:     { image: confluentinc/cp-kafka:7.8.0, ... } # process.roles=broker
  kafka-3:     { image: confluentinc/cp-kafka:7.8.0, ... } # process.roles=broker
  kafka-ui:    { image: provectuslabs/kafka-ui:latest, ... }

Why dedicated controllers, not combined-mode nodes: the demo below stops two brokers to drop the ISR below min.insync.replicas=2. In a combined broker,controller setup, those same two nodes are also two of three controllers. Stopping them takes the controller quorum down with them, and the producer fails because the cluster is leaderless rather than because of the durability rule we are trying to demonstrate. Splitting the roles keeps the controller quorum healthy while the broker side fails the way the article describes.

Default replication factors for internal topics (__consumer_offsets, __transaction_state) are raised back to 3 so they match what the article teaches.

A single-broker compose file is preserved as part-04-storage-local for fast iteration on examples that do not need durability semantics. It is explicitly labelled local-only and uses RF1 / MISR1.

A demonstration script for the durability story:

  1. Create a topic with RF3 and min.insync.replicas=2. All three brokers are in the ISR.
  2. Start a producer with acks=all.
  3. Stop one broker. The ISR drops to 2. The producer continues; writes are still acknowledged because the ISR meets min.insync.replicas.
  4. Stop a second broker. The ISR drops to 1, below min.insync.replicas=2. The producer's next write fails with NotEnoughReplicas. Controller quorum is unaffected because the controllers are separate nodes.
  5. Restart a broker. The follower catches up and rejoins the ISR. The producer recovers.

The lesson lands the moment the producer fails on step 4. The system is not broken. It is doing exactly what was asked of it.
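The five steps can be walked through without a cluster at all. The sketch below models only the acceptance rule the broker applies under acks=all (ISR size must meet min.insync.replicas); the class and method names are illustrative, not Kafka source code.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;

// Pure-Java walk-through of the broker-failure demo. Nothing here talks
// to Kafka; it models the acks=all acceptance rule the broker applies:
// a write is accepted only while the ISR meets min.insync.replicas.
public class IsrDemo {
    static final int MIN_INSYNC_REPLICAS = 2;

    static boolean writeAccepted(int isrSize) {
        // Refuse the write rather than accept one that cannot be replicated.
        return isrSize >= MIN_INSYNC_REPLICAS;
    }

    public static void main(String[] args) {
        Deque<String> isr = new ArrayDeque<>(List.of("kafka-1", "kafka-2", "kafka-3"));
        System.out.println(writeAccepted(isr.size())); // step 2: ISR=3 -> true
        isr.pop();                                     // step 3: stop one broker
        System.out.println(writeAccepted(isr.size())); // ISR=2 -> still true
        isr.pop();                                     // step 4: stop a second broker
        System.out.println(writeAccepted(isr.size())); // ISR=1 -> false (NotEnoughReplicas)
        isr.push("kafka-2");                           // step 5: broker rejoins the ISR
        System.out.println(writeAccepted(isr.size())); // -> true, producer recovers
    }
}
```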

A Demo For Ordering

The same tag ships two consumers reading from the same topic, configured to print the partition each record was read from. The producer publishes ten events for two order IDs (five each). With keying by order_id, all five events for each order land on the same partition and arrive in order at the consumer that owns it. With a random key, the same ten events scatter across partitions, and a consumer that processes them in arrival order will routinely see them out of business order.

This is a 30-line demo. It is also the fastest way to make the partition-key lesson stick.
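The shape of that demo can be sketched without a broker. The routing below uses a plain hash in place of Kafka's murmur2 (an illustrative stand-in), but the observable result is the same: key by order ID and each order's events land on one partition, in publish order.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Stand-in for the ordering demo: route ten events (two orders, five
// each) to partitions by key. The hash is a simple stand-in for
// Kafka's murmur2; the point is that all events for one order share a
// partition and keep their publish order.
public class OrderingSketch {

    static int partitionFor(String key, int numPartitions) {
        return Math.floorMod(key.hashCode(), numPartitions);
    }

    public static void main(String[] args) {
        int numPartitions = 3;
        Map<Integer, List<String>> partitions = new HashMap<>();

        for (String orderId : List.of("order-A", "order-B")) {
            for (int seq = 1; seq <= 5; seq++) {
                int p = partitionFor(orderId, numPartitions);
                partitions.computeIfAbsent(p, k -> new ArrayList<>())
                          .add(orderId + "/event-" + seq);
            }
        }

        // Each order's five events sit on exactly one partition, in order.
        partitions.forEach((p, events) ->
            System.out.println("partition " + p + ": " + events));
    }
}
```

Swap the key for a random UUID per event and the same loop scatters each order across partitions, which is the out-of-business-order case.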

Retention And Compaction

Kafka does not store events forever by default.

  • Time-based retention (retention.ms): records older than the threshold are deleted. The default is 7 days.
  • Size-based retention (retention.bytes): records beyond the size limit are deleted, oldest first.

A class of bug worth naming: a consumer that goes down for longer than retention misses records permanently. Kafka does not page operators when retention deletes a record. The cluster looks healthy. The consumer is just behind the floor of the log. This is a real failure mode for systems that allow long consumer outages, and an argument for monitoring oldest-unread offsets, not just lag.
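The check itself is a one-liner. In a real cluster the two inputs come from the consumer API (the partition's beginning offset versus the group's committed offset); the method and class names below are illustrative, not a Kafka API.

```java
// Detects the "consumer fell behind retention" failure mode: if the
// log's start offset (the floor left after retention deletes) has moved
// past the consumer's committed offset, the gap is records that are
// permanently gone. Names here are illustrative, not a Kafka API.
public class RetentionGap {

    static long permanentlyMissed(long logStartOffset, long committedOffset) {
        return Math.max(0, logStartOffset - committedOffset);
    }

    public static void main(String[] args) {
        // Consumer committed up to 1_000; retention already deleted
        // everything below 1_250: 250 records can never be read.
        System.out.println(permanentlyMissed(1_250, 1_000)); // 250
        // Healthy consumer: committed offset is at or above the log floor.
        System.out.println(permanentlyMissed(1_250, 1_300)); // 0
    }
}
```

Alerting on this value being greater than zero is the "monitor the floor, not just lag" point above.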

Compaction (cleanup.policy=compact) is the alternative for keyed-state topics. Kafka keeps the latest record per key and discards older versions. This requires every record to have a key. It is the right choice for "current state per entity" topics and the wrong choice for event-history topics.
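Compaction's end state is easy to model: a map from key to latest value. The sketch below is a toy, not Kafka's cleaner (which compacts segments asynchronously and keeps more than the strict minimum), but the retained records match what a consumer of a fully compacted topic would see.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Compaction in miniature: for a keyed topic, only the latest value per
// key survives. A LinkedHashMap models "keep the last record per key".
public class CompactionSketch {

    record Record(String key, String value) {}

    static Map<String, String> compact(List<Record> log) {
        Map<String, String> latest = new LinkedHashMap<>();
        for (Record r : log) {
            latest.put(r.key(), r.value()); // later record replaces earlier one
        }
        return latest;
    }

    public static void main(String[] args) {
        List<Record> log = List.of(
            new Record("order-1", "CREATED"),
            new Record("order-2", "CREATED"),
            new Record("order-1", "CANCELLED"));
        // After compaction: order-1 -> CANCELLED, order-2 -> CREATED.
        System.out.println(compact(log));
    }
}
```

This also makes the "wrong choice for event-history topics" point concrete: the CREATED event for order-1 is gone.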

Practical Partition Count Guidance

Choosing partition count up front is a tradeoff:

  • Too few partitions caps consumer-group parallelism. A consumer group cannot usefully run more consumers than there are partitions; the extras sit idle.
  • Too many partitions increases broker memory, controller load, and rebalance time. It also fragments throughput per partition.

A reasonable starting heuristic: enough partitions to support peak parallel consumption with headroom. Six to twelve partitions per topic is a defensible default for most application topics on a small cluster. Increasing later is possible. Decreasing requires creating a new topic and migrating, which is operational work.
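The heuristic reduces to arithmetic. The numbers and the 2x headroom factor below are illustrative assumptions, not recommendations from the Kafka docs.

```java
// Back-of-the-envelope version of the heuristic above: enough partitions
// to cover peak parallel consumption, with headroom, rounded up. All the
// inputs here are illustrative assumptions.
public class PartitionCount {

    static int suggestedPartitions(double peakEventsPerSec,
                                   double perConsumerEventsPerSec,
                                   double headroomFactor) {
        return (int) Math.ceil((peakEventsPerSec / perConsumerEventsPerSec) * headroomFactor);
    }

    public static void main(String[] args) {
        // 3,000 events/s at peak, each consumer handles ~1,000/s, 2x headroom.
        System.out.println(suggestedPartitions(3_000, 1_000, 2.0)); // 6
    }
}
```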

Suggested Module Shape

plaintext
kafka-series/                        # storage and durability
  docker-compose.cluster.yml         # required: 3-broker KRaft cluster
  docker-compose.local.yml           # 1-broker local fast path (labelled)
  topic-setup/
    create-topics.sh                 # RF3, MISR2, partitions=6
  ordering-demo/
    OrderingProducer.java            # keyed vs random publish
    OrderingConsumer.java            # logs partition + offset
  durability-demo/
    DurabilityProducer.java          # acks=all, idempotent
    run-broker-failure.sh            # stop / start brokers

What Most People Get Wrong

  • "Kafka guarantees ordering." No, it guarantees ordering within a partition. Cross-partition ordering does not exist. If the system needs per-entity ordering, the partition key is the lever.
  • "Replication factor 3 means durable." RF3 alone is not enough. Without acks=all and min.insync.replicas=2, RF3 just means there happen to be replicas. The producer is not waiting for them to acknowledge.
  • "unclean.leader.election.enable=true is fine, it improves availability." It improves availability by quietly losing data. In a payment-class system, that is the wrong trade. Refuse writes instead.
  • "Write a topic and forget about it." Retention is a real expiry. Consumers that fall behind retention permanently miss records. Monitor for it.

Where We Go Next

Part 5 moves up from storage to consumption. Consumer groups are how Kafka turns one log into parallel work, and rebalancing is the operational risk that comes with that. The article covers consumer-group mechanics, lag as a reliability metric, the cost of a stop-the-world rebalance, and the configuration that turns a routine deploy from a 30-second pause into a three-minute outage if no one tuned it.