Morteza Taghdisi

Software Engineering · April 16, 2026

Consumer Groups, Lag, and Rebalancing in Kafka

Series: Kafka Mastery, Article 5 of 6

Most Kafka production incidents live on the consumer side. Consumer groups, lag as a reliability signal, and the cost of a stop-the-world rebalance are where the failure modes hide.

Tags: kafka, consumer-groups, rebalancing, lag, spring-kafka

This is Part 5 of the Kafka Mastery series. Part 4 covered how Kafka stores data and how durability is configured. This part moves up the stack to how Kafka serves it. Consumer groups are how a log becomes parallel work, lag is the reliability signal that tells you whether the work is keeping up, and rebalancing is the cost most teams discover during their first rolling deploy.

The article uses Apache Kafka 3.9 client defaults and Spring Kafka 3.3.

Consumer Groups: The Unit Of Parallel Consumption

A consumer group is a label. Multiple consumer instances joining the same group split the partitions of a subscribed topic between them. Each partition is read by exactly one consumer in the group at a time.

plaintext
Topic: order.created (6 partitions)
 
Consumer group: payment-service (3 instances)
 
  instance A -> partitions 0, 1
  instance B -> partitions 2, 3
  instance C -> partitions 4, 5

Two rules follow:

  • A consumer group cannot usefully run more consumers than there are partitions. A fourth instance in the example above would sit idle.
  • Two different groups reading the same topic do not interfere. The Order Service and the Notification Service can both consume payment.completed independently, each tracking its own committed offsets.

This is why event log fan-out is free in Kafka. Adding a new downstream consumer means creating a new group. The producer does not know or care.
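
One way to picture that fan-out, as a sketch (the class and handler names here are hypothetical), is two Spring Kafka listeners with different group ids consuming the same topic side by side:

java
import org.springframework.kafka.annotation.KafkaListener;
import org.springframework.stereotype.Component;

// Two groups, one topic. Each group receives every record and tracks its
// own committed offsets; neither slows the other down.
@Component
public class PaymentCompletedFanOut {

    @KafkaListener(topics = "payment.completed", groupId = "order-service")
    public void updateOrder(String event) {
        // mark the order as paid
    }

    @KafkaListener(topics = "payment.completed", groupId = "notification-service")
    public void sendReceipt(String event) {
        // email the customer a receipt
    }
}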

Committed Offsets: The Real Source Of Truth

Each consumer in a group commits the offset it has finished processing for each partition it owns. On restart or rebalance, consumption resumes from the committed offset, not from where the consumer stopped reading.

plaintext
partition 3
  records: [r0 r1 r2 r3 r4 r5 r6 r7 r8 ...]
                       ^committed=4
                                ^read=7

If the consumer crashes after reading r5, r6, r7 but before committing past r4, the new owner of partition 3 will reprocess r5, r6, r7. That is at-least-once delivery, and it is the default. The implications for idempotency are the entire point of Part 6.
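
To make the commit boundary concrete, here is a minimal plain-consumer sketch (broker address and deserializers are placeholder choices) that commits only after the whole batch is processed, which is exactly what produces the redelivery described above:

java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class AtLeastOnceLoop {

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "payment-service");
        // Commit manually so the committed offset only moves after processing.
        props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("order.created"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    process(record); // a crash here means redelivery after the rebalance
                }
                consumer.commitSync(); // advances only past fully processed records
            }
        }
    }

    private static void process(ConsumerRecord<String, String> record) {
        // business logic
    }
}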

Lag: The Most Important Operational Metric

Consumer lag is the difference between the latest offset on a partition and the consumer group's committed offset on that partition.

plaintext
lag(partition, group) = latestOffset(partition) - committedOffset(partition, group)

Three properties make lag the metric to watch:

  • It is unitless on its own and meaningful in context. Lag of 200 on a topic doing 5 messages per second is a 40-second backlog. On a topic doing 500 messages per second, it is barely measurable.
  • It rises before failure becomes visible to users. A consumer that slows down because of a downstream API issue accumulates lag long before the first customer complaint.
  • It is independent of consumer health checks. A consumer can be healthy by every JVM and HTTP probe, and still be falling behind because its processing is slow.

In time-sensitive systems, lag is not just a performance metric. It is a reliability signal. A payment consumer that lags by ten minutes is a payment consumer that is ten minutes late charging customers, even if every dashboard in the application is green.

Two Sources Of Lag, Different Tools

  • Per-client fetch lag. Spring Kafka exports kafka.consumer.fetch.manager.records.lag through Micrometer. This is per consumer instance, per partition currently assigned to it. It is the right tool for diagnosing one client: is this instance keeping up with the partitions it owns right now?
  • Consumer-group lag. Computed externally by querying the broker for the latest offset and the group's committed offset on every partition of the topic. Tools include Kafka Exporter, Burrow, and Cruise Control. It is the right tool for the whole group: is the group as a unit keeping up across all partitions?

The two metrics answer different questions. Client metrics are diagnostic, scoped to one consumer's currently assigned partitions. Group lag is the alerting source of truth, scoped to the entire group regardless of which member owns which partition right now. On-call alerts belong on group lag.

Part 12 returns to this with concrete metrics and a Grafana dashboard. For this article, the rule is: alert on group-level lag from an external exporter, not on per-client lag.
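
For a sense of what those exporters do, a minimal AdminClient sketch (broker address and group id hardcoded for brevity) applies the lag formula to every partition the group has committed on:

java
import java.util.Map;
import java.util.Properties;
import java.util.stream.Collectors;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.ListOffsetsResult.ListOffsetsResultInfo;
import org.apache.kafka.clients.admin.OffsetSpec;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;

public class GroupLagProbe {

    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        try (Admin admin = Admin.create(props)) {
            // The group's committed offset per partition.
            Map<TopicPartition, OffsetAndMetadata> committed = admin
                    .listConsumerGroupOffsets("payment-service")
                    .partitionsToOffsetAndMetadata()
                    .get();

            // The latest offset on each of those partitions.
            Map<TopicPartition, ListOffsetsResultInfo> latest = admin
                    .listOffsets(committed.keySet().stream()
                            .collect(Collectors.toMap(tp -> tp, tp -> OffsetSpec.latest())))
                    .all()
                    .get();

            // lag(partition, group) = latestOffset - committedOffset
            committed.forEach((tp, offset) -> System.out.printf(
                    "%s lag=%d%n", tp, latest.get(tp).offset() - offset.offset()));
        }
    }
}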

What Triggers A Rebalance

A rebalance is the process of reassigning partitions across the consumers in a group. It is triggered by:

  • A consumer joining the group (new instance starting up)
  • A consumer leaving the group (instance shutting down or crashing)
  • The broker considering a consumer dead (no heartbeat or no poll within the configured timeout)
  • The set of partitions changing (topic added, partitions added, subscription changed)

Rebalances are part of healthy operation. They become a problem when they happen often or last too long.

The Rebalance Cost: Stop-The-World

Under the classic eager rebalance protocol, every consumer in the group revokes all of its partitions, then receives a new assignment. While the rebalance is in flight, no consumer in the group is processing any record from any partition. Consumption stops.

A typical rebalance lasts a few seconds in a small group with a healthy network. It can last much longer when:

  • The group has many consumers, each with state to commit before revocation
  • Session timeout is high and the broker takes time to declare a missing consumer dead
  • Multiple rebalances stack on top of each other during a rolling deploy

A rolling restart of three instances under default settings can produce six sequential rebalances, because each instance triggers one rebalance when it leaves and another when it rejoins. If each takes 30 to 60 seconds, the group is paused for three to six minutes. A consumer that processes financial events does not get to be paused for minutes at a time during deploys.

Cooperative Incremental Rebalancing

The cooperative protocol changes the cost calculation. Instead of every consumer revoking everything and waiting, only the partitions that need to move are revoked, and they move incrementally. Consumers keep processing partitions that did not change owners.

In Kafka 3.9 client versions, the default partition.assignment.strategy is [RangeAssignor, CooperativeStickyAssignor]. The list is order-sensitive. The group negotiates the most preferred strategy that every member supports, so with the default list on every member, RangeAssignor wins and the protocol stays eager.

The migration depends on where the group is starting from:

  • From the Kafka 3.9 default [RangeAssignor, CooperativeStickyAssignor]: one rolling bounce. Each instance comes back with partition.assignment.strategy: CooperativeStickyAssignor (range removed). Once every member has rejoined under the new config, the group negotiates cooperative sticky.
  • From an older Range-only or eager-only configuration: two rolling bounces. The first bounce adds CooperativeStickyAssignor after the existing eager assignor so the group still negotiates the eager protocol but every member knows about the cooperative one. The second bounce removes the eager assignor.

For new applications, configuring CooperativeStickyAssignor directly skips the migration entirely.

yaml
spring:
  kafka:
    consumer:
      group-id: payment-service
      auto-offset-reset: earliest
      properties:
        partition.assignment.strategy: org.apache.kafka.clients.consumer.CooperativeStickyAssignor
        session.timeout.ms: 45000
        heartbeat.interval.ms: 15000
        max.poll.interval.ms: 300000

Static Membership

For long-running consumers (always-on services that do not autoscale), static membership reduces unnecessary rebalances. Setting group.instance.id per instance lets the broker treat a brief disconnection as the same member returning, instead of a member leaving and a new one joining.

yaml
spring:
  kafka:
    consumer:
      properties:
        group.instance.id: payment-service-pod-1
        session.timeout.ms: 45000

The trade is that a genuinely dead instance takes longer to be replaced. Use static membership where consumer churn is rare. Avoid it for autoscaled groups where instances come and go regularly.
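
One way to keep the id stable without hardcoding it per instance, assuming the orchestrator injects the pod name as an environment variable (the Kubernetes downward API can do this for a StatefulSet), is a property placeholder:

yaml
spring:
  kafka:
    consumer:
      properties:
        # POD_NAME is assumed to be injected into the container environment;
        # for a StatefulSet pod it is stable across restarts of the same pod.
        group.instance.id: ${POD_NAME}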

The Heartbeat And Poll-Interval Triple

Three settings control when the broker considers a consumer alive:

  • session.timeout.ms: how long the broker waits without a heartbeat before declaring the consumer dead. Default 45,000 ms in recent versions.
  • heartbeat.interval.ms: how often the consumer sends heartbeats. Should be roughly one third of session.timeout.ms.
  • max.poll.interval.ms: how long the consumer can take between calls to poll(). If processing the previous batch takes longer than this, the consumer is treated as failed, leaves the group, and a rebalance follows. Default 300,000 ms (five minutes).

The trap is that processing time, not idle time, is the variable that breaks max.poll.interval.ms. A slow downstream call can easily push processing past five minutes during a bad stretch. Part 8 returns to this in detail. For now, the rule is to size max.poll.interval.ms and max.poll.records together, based on actual processing time, not on the defaults.
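
As a sketch of that sizing (the two-second figure is an assumed worst case, not a recommendation), the budget is worked out per batch:

yaml
spring:
  kafka:
    consumer:
      properties:
        # Assumed worst case: 2 s per record against a slow downstream API.
        # 100 records x 2 s = 200 s per batch, inside the 300 s poll interval.
        max.poll.records: 100
        max.poll.interval.ms: 300000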

The Failure Scenario: Rebalance Storm During A Rolling Restart

A payment-processing consumer group has three instances, each handling two partitions of a six-partition topic. The team rolls a deploy: stop instance A, wait for it to come back, then B, then C.

With default settings and the eager protocol still in play because nothing was tuned:

  1. A stops. The group rebalances. B and C take A's partitions. 30 to 60 seconds of paused consumption.
  2. A starts. The group rebalances again. Partitions move back. Another 30 to 60 seconds paused.
  3. B stops. Rebalance. 30 to 60 seconds.
  4. B starts. Rebalance. 30 to 60 seconds.
  5. C stops. Rebalance.
  6. C starts. Rebalance.

Six rebalances. Each lasts long enough to matter. At 30 to 60 seconds apiece, total paused time runs three to six minutes.

Meanwhile, the producer keeps publishing. By the time consumption resumes, there is a backlog. The consumer fires every record at the downstream payment API as fast as it can. The payment API, which was sized for steady-state load, starts timing out. Some payments retry. Some land twice (Kafka is at-least-once). Customer support gets calls.

The on-call engineer sees the symptom (payment timeouts) before the cause (rebalance storm). The cause is invisible without a rebalance-rate metric. The symptom is half a Kafka problem and half a downstream protection problem, which is why this article also recommends:

  • A rebalance-rate alert (rebalances per minute above a small threshold); a counter sketch follows this list
  • Lag alerts at two thresholds: a soft alert for early warning, a hard alert for paging
  • A consumer-side circuit breaker or rate limiter for downstream calls during catch-up
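
One way to get the rebalance-rate number is a Micrometer counter inside a ConsumerRebalanceListener. A minimal sketch, with the metric name kafka.rebalances and the group tag chosen here purely for illustration:

java
import io.micrometer.core.instrument.Counter;
import io.micrometer.core.instrument.MeterRegistry;
import java.util.Collection;
import org.apache.kafka.clients.consumer.ConsumerRebalanceListener;
import org.apache.kafka.common.TopicPartition;

public class RebalanceCounter implements ConsumerRebalanceListener {

    private final Counter rebalances;

    public RebalanceCounter(MeterRegistry registry) {
        // Hypothetical metric name; alert when its per-minute rate crosses the threshold.
        this.rebalances = registry.counter("kafka.rebalances", "group", "payment-service");
    }

    @Override
    public void onPartitionsAssigned(Collection<TopicPartition> partitions) {
        // Invoked once per completed rebalance, even when no partitions moved
        // under the cooperative protocol.
        rebalances.increment();
    }

    @Override
    public void onPartitionsRevoked(Collection<TopicPartition> partitions) {
        // Commit or flush any per-partition state here before ownership moves.
    }
}

In Spring Kafka, the listener attaches through factory.getContainerProperties().setConsumerRebalanceListener(...) on the container factory shown in the next section.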

Implementation Pattern

java
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.kafka.config.ConcurrentKafkaListenerContainerFactory;
import org.springframework.kafka.core.ConsumerFactory;
import org.springframework.kafka.listener.ContainerProperties;

@Configuration
public class KafkaConsumerConfig {
 
    @Bean
    public ConcurrentKafkaListenerContainerFactory<String, OrderCreatedEvent>
        kafkaListenerContainerFactory(ConsumerFactory<String, OrderCreatedEvent> consumerFactory) {
 
        ConcurrentKafkaListenerContainerFactory<String, OrderCreatedEvent> factory =
            new ConcurrentKafkaListenerContainerFactory<>();
        factory.setConsumerFactory(consumerFactory);
 
        // One container thread per partition assigned. Concurrency above the
        // partition count is wasted.
        factory.setConcurrency(3);
 
        // Manual ack so a slow downstream call does not silently get committed
        // before processing finished.
        factory.getContainerProperties().setAckMode(ContainerProperties.AckMode.MANUAL_IMMEDIATE);
 
        return factory;
    }
}
java
@KafkaListener(topics = "order.created", groupId = "payment-service")
public void onOrderCreated(OrderCreatedEvent event, Acknowledgment ack) {
    paymentService.charge(event);
    ack.acknowledge(); // with MANUAL_IMMEDIATE, commits this offset right away
}

The full Spring Kafka production configuration story (commit modes, ack modes, error handling, retry topics) is in Part 8. This article keeps the focus on what consumer groups, lag, and rebalances actually do.

Suggested Module Shape

plaintext
kafka-series/                          # consumer groups
  payment-service/
    src/main/java/com/example/payment/
      KafkaConsumerConfig.java         # cooperative sticky, manual ack
      OrderCreatedListener.java        # @KafkaListener
    src/main/resources/
      application.yml                  # session, heartbeat, poll-interval triple
  ops/
    lag-alert.example.yml              # group lag alert thresholds
    rebalance-alert.example.yml        # rebalances/min threshold

What Most People Get Wrong

  • "Lag is a performance metric." In time-sensitive systems, lag is a reliability metric. A consumer that lags by an hour has dropped a service-level objective whether or not anyone has noticed.
  • "Rebalances are instantaneous." They are not. The eager protocol pauses every consumer in the group. The cost compounds during rolling deploys.
  • "More consumers means more throughput." Up to the partition count, yes. Beyond it, extra consumers sit idle. Throughput follows partition count, not consumer count.
  • "Per-client lag tells me my group is healthy." Per-client lag tells you about a single instance. Group lag is the metric that reflects what the system is actually doing. Alert on the group.

Where We Go Next

Part 6 deals with delivery semantics directly. At-least-once is the practical default, and it requires consumer-side idempotency to be safe. Kafka's exactly-once support exists for Kafka-only read-process-write flows and stops at Kafka topic boundaries. The article covers the three idempotency lanes (Kafka-only EOS, database-and-Kafka through the Outbox Pattern, external side effects through deduplication stores) and is honest about which one each problem actually needs.