
How Kafka Partitions Decide Order and Durability

Kafka Mastery, part 4 of 6. A series about using Kafka in real JVM systems, with each article anchored on a concrete failure mode or design decision.

1. Choosing Between Kafka, RabbitMQ, and REST
2. Building a Minimal Kafka System You Can Actually Debug
3. From REST Calls to Event Flows
4. How Kafka Partitions Decide Order and Durability (this article)
5. Consumer Groups, Lag, and Rebalancing in Kafka
6. Delivery Semantics and Idempotency in Kafka
Kafka's storage model is where ordering, throughput, and durability are all decided. Get the partition key wrong, or the replication settings wrong, and the failure modes have names worth recognizing.
This is Part 4 of the Kafka Mastery series. Parts 1 to 3 covered when to use Kafka, how to run it locally, and how to design a multi-service event flow on top of it. This part goes one layer down. Partitions, offsets, replication, min.insync.replicas, acks, and unclean.leader.election.enable decide whether the cluster preserves order, preserves data, and tells operators when it cannot.
The article uses Apache Kafka 3.9 in KRaft mode, with a three-broker local cluster for the durability demo.
What A Partition Actually Is
A partition is the unit Kafka stores and orders data in. It is an append-only, immutable log of records. Each record has a position in that log called the offset. Offsets are per partition. They start at zero and never go backward.
A topic is a named collection of partitions. The topic itself does not order anything. Ordering exists per partition, not per topic. This is the single most important sentence in this article.
Topic: order.created
Partition 0: [r0, r1, r2, r3, r4, ...] <- ordered
Partition 1: [r0, r1, r2, r3, ...] <- ordered
Partition 2: [r0, r1, r2, ...] <- ordered
Across partitions: no order at all

Each partition lives on one broker (the leader), with copies on other brokers (the followers). Producers write to the leader. Followers replicate from the leader. Consumers read from the leader.
The Partition Key Decides Ordering
When a producer publishes a record with a key, Kafka hashes the key and uses the result to pick a partition. All records with the same key land on the same partition, and therefore preserve order with respect to each other.
kafkaTemplate.send("order.created", order.id().toString(), event);

Three common partition-key strategies, and what each gives up:
- By entity ID (for example, aggregate_id or order_id): preserves per-entity ordering. The right default for any system where a single entity has a sequence of events that must be processed in order.
- By event type: preserves no useful ordering. Spreads load by category, which is rarely what the business needs.
- Random or null key: spreads load uniformly across partitions. Maximum throughput. No ordering guarantee anywhere.
If the system has any sentence of the form "A must happen before B for this entity," the partition key must be the entity ID. Anything else is hoping the race never lands on the wrong side.
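For intuition about why the same key always means the same partition, here is a sketch of the default partitioner's keyed path: a murmur2 hash of the serialized key, mapped to a partition index. The key values and the partition count of 6 are illustrative, and UTF-8 matches what StringSerializer produces.

import org.apache.kafka.common.utils.Utils;
import java.nio.charset.StandardCharsets;

public class PartitionForKey {

    // Mirrors the keyed path of the default partitioner: hash the key
    // bytes with murmur2, force the result non-negative, then take it
    // modulo the topic's partition count.
    static int partitionFor(String key, int numPartitions) {
        byte[] keyBytes = key.getBytes(StandardCharsets.UTF_8);
        return Utils.toPositive(Utils.murmur2(keyBytes)) % numPartitions;
    }

    public static void main(String[] args) {
        // Every record keyed by order 1042 lands on the same partition,
        // so its events stay ordered relative to each other.
        System.out.println(partitionFor("1042", 6));
        System.out.println(partitionFor("7731", 6));
    }
}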
The Failure Scenario: Wrong Partition Key
A team partitions an order.events topic by event type. OrderCreatedEvent lands on partition 1. OrderCancelledEvent lands on partition 2. The reasoning at design time was "this gives us better parallel consumption."
Under load, two events for the same order arrive seconds apart: a creation and a cancellation. They go to different partitions. Two consumer threads pick them up in parallel. Partition 2 happens to be slightly ahead in processing, so the cancellation runs first. The order is cancelled, then created. The system processes the cancelled order as active. Customer support gets a complaint that an order they cancelled went out for fulfilment.
The fix is the partition key, not the consumer code. Per-entity ordering only exists if all events for that entity share a partition. Repartitioning a live topic is an operational project, not a config change. Choose the key correctly the first time.
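The fix in producer terms, as a hypothetical before/after; event.type() and event.orderId() are assumed accessors on the event payload:

// Before: keyed by event type. Creation and cancellation of the same
// order can land on different partitions and race each other.
kafkaTemplate.send("order.events", event.type(), event);

// After: keyed by order ID. Every event for one order shares a
// partition, so consumers see them in produced order.
kafkaTemplate.send("order.events", event.orderId().toString(), event);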
Replication: Leaders, Followers, And The ISR
Each partition has a configured replication factor. RF3 means one leader and two followers across three brokers.
A follower is "in sync" if it has caught up to the leader within a configured lag tolerance. The set of in-sync replicas, including the leader, is the ISR.
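To see the ISR per partition on a running cluster, a short admin-client sketch; the bootstrap address and topic name are assumptions:

import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.TopicDescription;
import org.apache.kafka.common.TopicPartitionInfo;

import java.util.List;
import java.util.Properties;

public class IsrCheck {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        try (Admin admin = Admin.create(props)) {
            TopicDescription desc = admin.describeTopics(List.of("order.created"))
                    .allTopicNames().get().get("order.created");
            for (TopicPartitionInfo p : desc.partitions()) {
                // A write with acks=all is acknowledged only once every
                // broker in this ISR list has the record.
                System.out.printf("partition %d leader=%s isr=%s%n",
                        p.partition(), p.leader().id(), p.isr());
            }
        }
    }
}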
Three settings interact to define what durability actually means:
- replication.factor on the topic
- min.insync.replicas on the topic
- acks on the producer
Any one of these alone is insufficient. The combination is what guarantees data survives broker failure.
acks, min.insync.replicas, And What They Promise Together
acks controls when the producer considers a write complete:
- acks=0: the producer does not wait for any acknowledgment. The fastest. Loses data on any leader failure during the write.
- acks=1: the producer waits for the leader to write to its local log. Loses data if the leader fails before followers replicate.
- acks=all: the producer waits until every replica in the ISR acknowledges the write. Combined with min.insync.replicas, this is the durable choice.
min.insync.replicas tells the broker how many replicas must be in the ISR for the write to be allowed at all. With acks=all and min.insync.replicas=2 and RF3, a write is durable as long as at least two of the three replicas are healthy and in sync.
The production target is RF3 with min.insync.replicas=2 and acks=all. That tolerates one broker failure cleanly. It refuses writes if two brokers are down, instead of accepting writes that cannot be replicated.
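Outside Spring, the same contract in plain client code; a sketch, with the bootstrap address as an assumption:

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.common.serialization.StringSerializer;

import java.util.Properties;

public class DurableProducerFactory {

    // acks=all + enable.idempotence: the send completes only when the
    // full ISR has the record, and client-side retries cannot duplicate it.
    public static KafkaProducer<String, String> create() {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.ACKS_CONFIG, "all");
        props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, true);
        return new KafkaProducer<>(props);
    }
}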
Producer side, in Spring Boot:

spring:
  kafka:
    producer:
      acks: all
      properties:
        enable.idempotence: true

Topic side, at creation time:

kafka-topics.sh --bootstrap-server localhost:9092 \
  --create \
  --topic order.created \
  --partitions 6 \
  --replication-factor 3 \
  --config min.insync.replicas=2

Named Failure Modes The Producer Will Surface
When durability settings disagree with cluster state, Kafka does not silently degrade. It throws exceptions with names worth recognizing.
- NotEnoughReplicasException. The ISR is below min.insync.replicas before the append is even attempted. The broker refuses the write. This is correct behavior. It is also the producer's signal that the cluster is degraded.
- NotEnoughReplicasAfterAppendException. The append happened, but the broker could not get enough in-sync replicas to acknowledge it within the timeout. The write may or may not survive a leader failure. The producer should treat it as a failure and retry under idempotence.
- Acknowledged data loss with weaker acks. With acks=0 or acks=1, a write can be acknowledged to the producer and then disappear if the leader fails before replication. There is no exception. The data is just gone. This is what acks=all exists to prevent.
- Write unavailability when MISR exceeds live ISR. If the topic is configured with min.insync.replicas=2 and only one replica is in sync, every write fails. The producer sees NotEnoughReplicas. This is durability working as designed: refuse the write rather than accept one that cannot be replicated.
- Truncation under unclean leader election. When unclean.leader.election.enable=true (the dangerous setting) and all in-sync replicas are unavailable, Kafka can promote an out-of-sync replica to leader. That replica's log is shorter. Records that existed on the old leader are silently dropped on the new one. Consumers do not see an error. The data is gone. The correct production setting is unclean.leader.election.enable=false. Writes block until an in-sync replica is available, which is the right trade in any system that values correctness over uptime.
The named exceptions matter. NotEnoughReplicasAfterAppend in the logs is not a Kafka bug. It is Kafka telling the operator that the cluster's durability assumptions are no longer being met.
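What catching these looks like from the producer's side, sketched with a plain-client send callback; the method shape and logging are placeholders:

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.errors.NotEnoughReplicasAfterAppendException;
import org.apache.kafka.common.errors.NotEnoughReplicasException;

public class DurabilityAwareSender {

    static void send(KafkaProducer<String, String> producer,
                     ProducerRecord<String, String> record) {
        producer.send(record, (metadata, exception) -> {
            if (exception == null) {
                return; // acknowledged by every replica in the ISR (acks=all)
            }
            if (exception instanceof NotEnoughReplicasException
                    || exception instanceof NotEnoughReplicasAfterAppendException) {
                // Both are retriable, so the client retries them internally
                // first. If one still surfaces here, the ISR stayed below
                // min.insync.replicas: alert rather than drop silently.
                System.err.println("Cluster degraded, write not durable: " + exception);
            } else {
                System.err.println("Send failed: " + exception);
            }
        });
    }
}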
The Local Demo: A Cluster That Can Actually Demonstrate ISR Failure
A single-broker local cluster cannot demonstrate any of this. The durability tag (part-04-storage) replaces the single-broker compose file from Part 2 with a cluster that has three brokers and three dedicated controllers in KRaft mode.
services:
controller-1: { image: confluentinc/cp-kafka:7.8.0, ... } # process.roles=controller
controller-2: { image: confluentinc/cp-kafka:7.8.0, ... } # process.roles=controller
controller-3: { image: confluentinc/cp-kafka:7.8.0, ... } # process.roles=controller
kafka-1: { image: confluentinc/cp-kafka:7.8.0, ... } # process.roles=broker
kafka-2: { image: confluentinc/cp-kafka:7.8.0, ... } # process.roles=broker
kafka-3: { image: confluentinc/cp-kafka:7.8.0, ... } # process.roles=broker
kafka-ui: { image: provectuslabs/kafka-ui:latest, ... }

Why dedicated controllers, not combined-mode nodes: the demo below stops two brokers to drop the ISR below min.insync.replicas=2. In a combined broker-plus-controller setup, those same two nodes are also two of three controllers. Stopping them takes the controller quorum down with them, and the producer fails because the cluster is leaderless rather than because of the durability rule we are trying to demonstrate. Splitting the roles keeps the controller quorum healthy while the broker side fails the way the article describes.
Default replication factors for internal topics (__consumer_offsets, __transaction_state) are raised back to 3 so they match what the article teaches.
A single-broker compose file is preserved as part-04-storage-local for fast iteration on examples that do not need durability semantics. It is explicitly labelled local-only and uses RF1 / MISR1.
A demonstration script for the durability story:
1. Create a topic with RF3 and min.insync.replicas=2. All three brokers are in the ISR.
2. Start a producer with acks=all.
3. Stop one broker. The ISR drops to 2. The producer continues; writes are still acknowledged because the ISR meets min.insync.replicas.
4. Stop a second broker. The ISR drops to 1, below min.insync.replicas=2. The producer's next write fails with NotEnoughReplicas. Controller quorum is unaffected because the controllers are separate nodes.
5. Restart a broker. The follower catches up and rejoins the ISR. The producer recovers.
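That sequence as commands, assuming the compose service names above, a broker listener published on localhost:9092, and a throwaway topic name:

# 1. Topic with RF3 and MISR 2.
kafka-topics.sh --bootstrap-server localhost:9092 --create \
  --topic durability.demo --partitions 3 --replication-factor 3 \
  --config min.insync.replicas=2

# 2. Keep a producer running with acks=all in another terminal.
kafka-console-producer.sh --bootstrap-server localhost:9092 \
  --topic durability.demo --producer-property acks=all

# 3. Stop one broker: ISR drops to 2, writes are still acknowledged.
docker compose stop kafka-3

# 4. Stop a second: ISR drops to 1, the next write fails with
#    NotEnoughReplicasException.
docker compose stop kafka-2

# Observe the ISR shrinking and recovering at each step.
kafka-topics.sh --bootstrap-server localhost:9092 --describe --topic durability.demo

# 5. Bring a broker back: the follower rejoins the ISR, the producer recovers.
docker compose start kafka-2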
The lesson lands the moment the producer fails on step 4. The system is not broken. It is doing exactly what was asked of it.
A Demo For Ordering
The same tag ships two consumers reading from the same topic, configured to print the partition each record was read from. The producer publishes ten events for two order IDs (five each). With keying by order_id, all five events for each order land on the same partition and arrive in order at the consumer that owns it. With a random key, the same ten events scatter across partitions, and a consumer that processes them in arrival order will routinely see them out of business order.
This is a 30-line demo. It is also the fastest way to make the partition-key lesson stick.
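A sketch of the consumer side with Spring Kafka; the listener class and group ID are assumptions:

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.springframework.kafka.annotation.KafkaListener;
import org.springframework.stereotype.Component;

@Component
public class PartitionPrintingConsumer {

    // Two instances of the app join the same group, so Kafka splits the
    // topic's partitions between them. Printing the partition makes the
    // keyed-versus-random difference visible immediately.
    @KafkaListener(topics = "order.created", groupId = "ordering-demo")
    public void onRecord(ConsumerRecord<String, String> record) {
        System.out.printf("partition=%d offset=%d key=%s%n",
                record.partition(), record.offset(), record.key());
    }
}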
Retention And Compaction
Kafka does not store events forever by default.
- Time-based retention (retention.ms): records older than the threshold are deleted. The default is 7 days.
- Size-based retention (retention.bytes): records beyond the size limit are deleted, oldest first.
A class of bug worth naming: a consumer that goes down for longer than retention misses records permanently. Kafka does not page operators when retention deletes a record. The cluster looks healthy. The consumer is just behind the floor of the log. This is a real failure mode for systems that allow long consumer outages, and an argument for monitoring oldest-unread offsets, not just lag.
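A sketch of that monitor with the admin client: compare each partition's committed offset for a group against the log start offset; a commit below the floor means retention already deleted unread records. Group ID, topic, and bootstrap address are assumptions.

import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.OffsetSpec;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;

import java.util.Map;
import java.util.Properties;
import java.util.stream.Collectors;

public class RetentionFloorCheck {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        try (Admin admin = Admin.create(props)) {
            Map<TopicPartition, OffsetAndMetadata> committed =
                    admin.listConsumerGroupOffsets("order-processor")
                         .partitionsToOffsetAndMetadata().get();
            // Earliest offset still present in the log, per partition.
            var earliest = admin.listOffsets(committed.keySet().stream()
                    .collect(Collectors.toMap(tp -> tp, tp -> OffsetSpec.earliest())))
                    .all().get();
            committed.forEach((tp, offset) -> {
                long floor = earliest.get(tp).offset();
                if (offset.offset() < floor) {
                    // Retention deleted records this group never read.
                    System.out.printf("%s lost records: committed=%d logStart=%d%n",
                            tp, offset.offset(), floor);
                }
            });
        }
    }
}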
Compaction (cleanup.policy=compact) is the alternative for keyed-state topics. Kafka keeps the latest record per key and discards older versions. This requires every record to have a key. It is the right choice for "current state per entity" topics and the wrong choice for event-history topics.
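Creating a compacted current-state topic, as a sketch; the topic name is an assumption:

kafka-topics.sh --bootstrap-server localhost:9092 --create \
  --topic order.state --partitions 6 --replication-factor 3 \
  --config cleanup.policy=compact \
  --config min.insync.replicas=2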
Practical Partition Count Guidance
Choosing partition count up front is a tradeoff:
- Too few partitions caps consumer-group parallelism. A consumer group cannot use more consumers than partitions usefully.
- Too many partitions increases broker memory, controller load, and rebalance time. It also fragments throughput per partition.
A reasonable starting heuristic: enough partitions to support peak parallel consumption with headroom. Six to twelve partitions per topic is a defensible default for most application topics on a small cluster. Increasing later is possible. Decreasing requires creating a new topic and migrating, which is operational work.
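Increasing later is a single command, with one caveat worth knowing: new records hash across the new partition count, so per-key ordering is not preserved across the resize boundary.

kafka-topics.sh --bootstrap-server localhost:9092 --alter \
  --topic order.created --partitions 12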
What Most People Get Wrong
- "Kafka guarantees ordering." No, it guarantees ordering within a partition. Cross-partition ordering does not exist. If the system needs per-entity ordering, the partition key is the lever.
- "Replication factor 3 means durable." RF3 alone is not enough. Without
acks=allandmin.insync.replicas=2, RF3 just means there happen to be replicas. The producer is not waiting for them to acknowledge. - "
unclean.leader.election.enable=trueis fine, it improves availability." It improves availability by quietly losing data. In a payment-class system, that is the wrong trade. Refuse writes instead. - "Write a topic and forget about it." Retention is a real expiry. Consumers that fall behind retention permanently miss records. Monitor for it.
Where We Go Next
Part 5 moves up from storage to consumption. Consumer groups are how Kafka turns one log into parallel work, and rebalancing is the operational risk that comes with that. The article covers consumer-group mechanics, lag as a reliability metric, the cost of a stop-the-world rebalance, and the configuration that turns a routine deploy from a 30-second pause into a three-minute outage if no one tuned it.