Q: What are the key metrics to monitor in a Kafka cluster?

Answer:

Monitoring is essential for maintaining a healthy Kafka cluster. Here are the critical metrics organized by component.

Broker Metrics

| Metric | What It Tells You | Alert Threshold |
| --- | --- | --- |
| UnderReplicatedPartitions | Partitions with fewer in-sync replicas than the replication factor (followers are behind the leader or down) | > 0 for a sustained period |
| ActiveControllerCount | Whether this broker is the active controller (1) or not (0) | Sum across all brokers should always be exactly 1 |
| OfflinePartitionsCount | Partitions with no leader (completely unavailable) | > 0 = critical |
| RequestHandlerAvgIdlePercent | How idle the broker's request handler threads are | < 20% = broker overloaded |
| NetworkProcessorAvgIdlePercent | How idle the network threads are | < 30% = network bottleneck |
| LogFlushRateAndTimeMs | Rate and latency of log flushes to disk | Spikes indicate disk issues |
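
To make the thresholds concrete, here is a minimal sketch of how they could be applied to one cluster snapshot. It assumes the values have already been collected and aggregated into a plain dict keyed by metric name; the function name `check_cluster` and the dict shape are hypothetical. Idle percentages are assumed to be on a 0 to 100 scale (JMX typically reports them as 0 to 1 fractions, so scale first), and the "sustained period" condition is left to the alerting layer.

```python
from typing import Dict, List, Tuple


def check_cluster(snapshot: Dict[str, float]) -> List[Tuple[str, str]]:
    """Apply the broker thresholds to one aggregated snapshot.

    Returns a list of (severity, message) pairs; an empty list means healthy.
    """
    findings: List[Tuple[str, str]] = []

    if snapshot["OfflinePartitionsCount"] > 0:
        findings.append(("critical", "partitions without a leader"))

    # ActiveControllerCount is expected to be summed across all brokers.
    if snapshot["ActiveControllerCount"] != 1:
        findings.append(("critical", "controller count is not exactly 1"))

    if snapshot["UnderReplicatedPartitions"] > 0:
        findings.append(("warning", "under-replicated partitions"))

    if snapshot["RequestHandlerAvgIdlePercent"] < 20:
        findings.append(("warning", "request handlers overloaded"))

    if snapshot["NetworkProcessorAvgIdlePercent"] < 30:
        findings.append(("warning", "network threads saturated"))

    return findings


if __name__ == "__main__":
    sample = {
        "OfflinePartitionsCount": 0,
        "ActiveControllerCount": 1,
        "UnderReplicatedPartitions": 3,
        "RequestHandlerAvgIdlePercent": 55.0,
        "NetworkProcessorAvgIdlePercent": 70.0,
    }
    for severity, message in check_cluster(sample):
        print(f"{severity}: {message}")
```

In practice these checks usually live in the alerting system rather than in application code; the sketch only shows how the table maps to conditions.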

Producer Metrics

| Metric | What It Tells You | Alert Threshold |
| --- | --- | --- |
| record-send-rate | Records sent per second | Sudden drop = producer issue |
| record-error-rate | Failed sends per second | > 0 = investigate |
| batch-size-avg | Average batch size in bytes | Too small = suboptimal batching |
| request-latency-avg | Average request latency as measured by the producer | > 100 ms = potential issue |
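
A "sudden drop" in a rate metric needs a baseline to compare against. The following is a minimal sketch, assuming recent samples of the rate are kept in a list; the function name `is_sudden_drop` and the default 50% cutoff are illustrative choices, not Kafka settings.

```python
from statistics import mean
from typing import Sequence


def is_sudden_drop(history: Sequence[float], current: float,
                   drop_fraction: float = 0.5) -> bool:
    """Return True if `current` fell more than `drop_fraction` below the
    average of `history`.

    With drop_fraction=0.5, a rate under half of the recent average counts
    as a sudden drop.
    """
    if not history:
        return False  # no baseline yet, so nothing to compare against
    baseline = mean(history)
    if baseline <= 0:
        return False  # an idle producer cannot "drop" any further
    return current < baseline * (1 - drop_fraction)


if __name__ == "__main__":
    recent_rates = [1000.0, 1100.0, 900.0]  # records/sec, hypothetical samples
    print(is_sudden_drop(recent_rates, 300.0))
    print(is_sudden_drop(recent_rates, 800.0))
```

The same comparison works for records-consumed-rate on the consumer side.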

Consumer Metrics

| Metric | What It Tells You | Alert Threshold |
| --- | --- | --- |
| records-lag-max | Maximum lag, in records, across the partitions assigned to the consumer | Consistently increasing |
| records-consumed-rate | Records consumed per second | Sudden drop = consumer issue |
| commit-latency-avg | Average time to commit offsets | Spikes indicate issues |
| rebalance-rate-per-hour | How often the group rebalances | High rate = configuration issue |
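
Consumer lag for a partition is the log end offset minus the group's committed offset, and the alert condition is a trend rather than a fixed value. A minimal sketch of both ideas follows, assuming the offsets have already been fetched into plain dicts keyed by (topic, partition); the helper names are hypothetical, and a partition with no committed offset is assumed to start from offset 0.

```python
from typing import Dict, Sequence, Tuple

TopicPartition = Tuple[str, int]


def partition_lag(end_offsets: Dict[TopicPartition, int],
                  committed: Dict[TopicPartition, int]) -> Dict[TopicPartition, int]:
    """Lag per partition = log end offset - committed offset (never negative)."""
    return {
        tp: max(end - committed.get(tp, 0), 0)  # assumption: no commit means offset 0
        for tp, end in end_offsets.items()
    }


def lag_is_growing(samples: Sequence[int], min_samples: int = 3) -> bool:
    """True if each of the last `min_samples` lag samples exceeds the one before it."""
    if len(samples) < min_samples:
        return False
    recent = samples[-min_samples:]
    return all(later > earlier for earlier, later in zip(recent, recent[1:]))


if __name__ == "__main__":
    lags = partition_lag(
        {("orders", 0): 150, ("orders", 1): 90},
        {("orders", 0): 100, ("orders", 1): 90},
    )
    print(max(lags.values()))          # the value records-lag-max would track
    print(lag_is_growing([10, 20, 35]))
```

Alerting on the trend avoids paging on a large but stable lag that a batch consumer is expected to carry.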

Monitoring Stack

Kafka (JMX Metrics)
    ↓
JMX Exporter (exposes metrics over HTTP)
    ↓
Prometheus (scrapes and stores metrics)
    ↓
Grafana (Dashboards + Alerts)
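
Once metrics are in Prometheus, they can also be read programmatically through its HTTP API, which is useful for ad hoc checks outside Grafana. The following is a minimal sketch using the instant-query endpoint `/api/v1/query`; the base URL is an assumption about your deployment, and the helper names are hypothetical.

```python
import json
import urllib.parse
import urllib.request
from typing import Any, Dict, List, Tuple


def build_query_url(base_url: str, promql: str) -> str:
    """Build the URL for a Prometheus instant query."""
    query = urllib.parse.urlencode({"query": promql})
    return base_url.rstrip("/") + "/api/v1/query?" + query


def parse_instant_vector(payload: Dict[str, Any]) -> List[Tuple[Dict[str, str], float]]:
    """Turn an instant-query response into (labels, value) pairs."""
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload.get('error', 'unknown error')}")
    samples = []
    for series in payload["data"]["result"]:
        _timestamp, value = series["value"]  # value arrives as a string
        samples.append((series["metric"], float(value)))
    return samples


def query(base_url: str, promql: str, timeout: float = 5.0):
    """Run an instant query against a Prometheus server (needs network access)."""
    with urllib.request.urlopen(build_query_url(base_url, promql), timeout=timeout) as resp:
        return parse_instant_vector(json.load(resp))


if __name__ == "__main__":
    # Assumed address; replace with your Prometheus server.
    for labels, value in query("http://localhost:9090", "up"):
        print(labels, value)
```

Passing one of your alert expressions as `promql` returns the series that currently satisfy it, which is a quick way to sanity-check a rule before deploying it.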

Popular Tools:

  • Prometheus + JMX Exporter: Industry standard for metric collection.
  • Grafana: Visualization and alerting.
  • Burrow: LinkedIn's tool specifically for consumer lag monitoring.
  • CMAK (formerly Kafka Manager) / AKHQ: Web UIs for cluster management.
  • Confluent Control Center: Commercial monitoring (Confluent Platform).

Critical Alerts to Set Up

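# Example Prometheus alerting rules.
# Metric names depend on your exporter configuration. The JMX-derived names
# below assume JMX Exporter rules that lowercase MBean names; the lag metric
# assumes a separate consumer-lag exporter such as kafka_exporter, because
# consumer group lag is not a broker JMX metric.
groups:
  - name: kafka-alerts
    rules:
      - alert: KafkaOfflinePartitions
        # OfflinePartitionsCount is a KafkaController MBean, not a ReplicaManager one.
        expr: kafka_controller_kafkacontroller_offlinepartitionscount > 0
        for: 1m
        labels:
          severity: critical

      - alert: KafkaConsumerLagHigh
        # 10000 is an example threshold; tune it per consumer group.
        expr: kafka_consumergroup_lag > 10000
        for: 5m
        labels:
          severity: warning

      - alert: KafkaUnderReplicatedPartitions
        expr: kafka_server_replicamanager_underreplicatedpartitions > 0
        for: 5m
        labels:
          severity: warning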

Tip: In interviews, the most impactful metrics to mention are UnderReplicatedPartitions (replication health), consumer lag (processing health), and OfflinePartitionsCount (availability). These cover the three biggest operational concerns: data durability, throughput, and uptime.