Kafka Interview Prep

A comprehensive collection of Apache Kafka interview questions and answers, covering architecture, producers, consumers, reliability guarantees, and production operations.

Topics covered:

  • Core architecture (brokers, topics, partitions, ZooKeeper/KRaft)
  • Producers (acks, idempotency, partitioning)
  • Consumers (consumer groups, offsets, rebalancing)
  • Reliability (delivery semantics, replication, ISR)
  • Performance (zero-copy, batching, compression)
  • Kafka Connect & Streams
  • Operations (retention, compaction, monitoring)

Q: What is Apache Kafka and why would you use it?

Answer:

Apache Kafka is an open-source distributed event streaming platform originally developed at LinkedIn and donated to the Apache Software Foundation. It is designed for high-throughput, fault-tolerant, real-time data streaming.

What Kafka Does

At its core, Kafka is a distributed commit log. Producers write messages (events) to topics, and consumers read those messages. Unlike traditional message queues, messages in Kafka are persisted to disk and can be replayed.

Why Use Kafka?

1. Decoupling of Systems

Instead of service A calling service B directly (tight coupling), A publishes an event to Kafka. Any service interested in that event subscribes independently.

Without Kafka:  OrderService → PaymentService → EmailService → AnalyticsService
                (chain of synchronous calls, one failure breaks everything)

With Kafka:     OrderService → [Kafka Topic: "orders"]
                                    ├── PaymentService (consumer)
                                    ├── EmailService (consumer)
                                    └── AnalyticsService (consumer)
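
To make the publishing side concrete, here is a minimal producer sketch (assumptions: a broker at localhost:9092, a topic named "orders", and string-serialized JSON payloads):

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class OrderPublisher {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed local broker
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // OrderService publishes once; Payment/Email/Analytics consume independently.
            producer.send(new ProducerRecord<>("orders", "order-1001",
                    "{\"orderId\":\"1001\",\"status\":\"CREATED\"}"));
        }
    }
}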

2. Extreme Throughput

Kafka handles millions of messages per second with latency under 10ms. LinkedIn processes over 7 trillion messages/day through Kafka.

3. Durability & Replayability Messages are written to disk and replicated across brokers. Consumers can re-read old messages (e.g., replay the last 7 days of events to rebuild a search index).

4. Ordering Guarantees

Messages within a single partition are strictly ordered — essential for event sourcing and log-based architectures.

Common Use Cases

  • Event-driven microservices — publish domain events between services.
  • Real-time analytics — stream clickstream data to dashboards.
  • Log aggregation — centralize logs from thousands of servers.
  • Change Data Capture (CDC) — stream database changes (via Debezium + Kafka Connect).
  • ETL pipelines — replace batch processing with real-time streaming.

[!NOTE] Kafka is NOT a traditional message queue (like RabbitMQ or SQS). It's a distributed log that happens to be excellent at messaging. The key difference: messages aren't deleted after consumption — they're retained based on a time or size policy.

Q: Explain Brokers, Topics, and Partitions in Kafka.

Answer:

These are the three fundamental building blocks of Kafka's architecture.

Broker

A broker is a single Kafka server. A Kafka cluster consists of multiple brokers (typically 3+). Each broker:

  • Stores data on disk
  • Serves producer and consumer requests
  • Participates in replication
  • Is identified by a unique integer ID

Brokers are designed so that no single broker holds all the data for a topic — data is distributed across brokers via partitions.

Topic

A topic is a named category/feed to which messages are published. Think of it as a table in a database or a folder in a filesystem.

Topics: "user-signups", "order-events", "payment-transactions"

Topics are multi-subscriber — many consumer groups can read from the same topic independently without affecting each other.

Partition

Each topic is split into one or more partitions. A partition is an ordered, immutable sequence of messages (an append-only log). Each message within a partition gets a sequential ID called an offset.

Topic: "orders" (3 partitions)

Partition 0: [msg0] [msg1] [msg2] [msg3] [msg4] → 
Partition 1: [msg0] [msg1] [msg2] →
Partition 2: [msg0] [msg1] [msg2] [msg3] →

Why Partitions Matter

1. Parallelism

Each partition can be consumed by a different consumer in a consumer group. More partitions = more consumers = higher throughput.

2. Ordering

Messages are strictly ordered WITHIN a partition, but there is no ordering guarantee ACROSS partitions. If ordering matters for a specific entity (e.g., all events for user X), you must ensure all events for that entity go to the same partition using a partition key.

3. Distribution

Partitions are spread across brokers. For a topic with 6 partitions on a 3-broker cluster, each broker holds ~2 partitions.
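
For example, creating such a topic with the stock CLI (broker address and names are illustrative):

kafka-topics.sh --create \
    --topic orders \
    --partitions 6 \
    --replication-factor 3 \
    --bootstrap-server localhost:9092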

How They Relate

Kafka Cluster
├── Broker 0
│   ├── orders-partition-0 (Leader)
│   └── orders-partition-2 (Follower)
├── Broker 1
│   ├── orders-partition-1 (Leader)
│   └── orders-partition-0 (Follower)
└── Broker 2
    ├── orders-partition-2 (Leader)
    └── orders-partition-1 (Follower)

[!IMPORTANT] Choosing the right number of partitions is a critical design decision. Too few = throughput bottleneck. Too many = increased memory usage, slower leader elections, and longer recovery times. A common starting point is number of partitions = desired throughput / throughput per partition (usually a few MB/s per partition).

Q: What is the difference between ZooKeeper and KRaft mode?

Answer:

This is a hot interview topic because Kafka is in the middle of a major architectural transition.

ZooKeeper Mode (Legacy)

Historically, Kafka relied on Apache ZooKeeper — a separate distributed coordination service — to manage cluster metadata:

  • Broker registration (which brokers are alive)
  • Controller election (one broker is the "controller" that manages partition leadership)
  • Topic configuration (partition count, replication factor, ACLs)
  • Consumer group offsets (in older versions; now stored in Kafka itself)
┌─────────────────────┐
│  ZooKeeper Ensemble  │  (3-5 separate servers)
│  ┌───┐ ┌───┐ ┌───┐  │
│  │ZK1│ │ZK2│ │ZK3│  │
│  └───┘ └───┘ └───┘  │
└─────────┬───────────┘
          │ metadata
┌─────────▼───────────┐
│    Kafka Cluster     │
│  ┌──┐ ┌──┐ ┌──┐     │
│  │B0│ │B1│ │B2│     │
│  └──┘ └──┘ └──┘     │
└─────────────────────┘

KRaft Mode (New, ZooKeeper-Free)

Kafka can run without ZooKeeper using an internal Raft-based consensus protocol called KRaft (Kafka Raft), introduced as early access in Kafka 2.8 and production-ready since 3.3.

In KRaft mode, a subset of Kafka brokers act as controllers and manage all metadata internally using a replicated metadata log (__cluster_metadata topic).

┌──────────────────────────────────────┐
│             Kafka Cluster            │
│  ┌─────────┐ ┌─────────┐ ┌─────────┐ │
│  │B0 (ctrl)│ │B1 (ctrl)│ │B2 (ctrl)│ │
│  └─────────┘ └─────────┘ └─────────┘ │
│  (controllers embedded in brokers)   │
└──────────────────────────────────────┘
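
A minimal KRaft server.properties sketch for a combined broker+controller node (node IDs, hostnames, and ports below are placeholders):

process.roles=broker,controller
node.id=1
controller.quorum.voters=1@broker1:9093,2@broker2:9093,3@broker3:9093
listeners=PLAINTEXT://:9092,CONTROLLER://:9093
controller.listener.names=CONTROLLER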

Why the Migration?

| Concern | ZooKeeper | KRaft |
|---|---|---|
| Operational complexity | Separate cluster to deploy, monitor, upgrade | All-in-one, no external dependency |
| Partition limit | ~200K partitions (ZK bottleneck) | Millions of partitions |
| Controller failover | 10-30 seconds (ZK session timeout) | Seconds (Raft leader election) |
| Metadata propagation | Asynchronous, eventual consistency | Replicated log, strongly consistent |
| Security | Separate ACL system | Unified with Kafka's security |

Current Status

  • ZooKeeper mode: Deprecated as of Kafka 3.5. Will be removed entirely in Kafka 4.0.
  • KRaft mode: Production-ready. All new deployments should use KRaft.

[!TIP] In interviews, mentioning the KRaft migration shows you're up-to-date with the Kafka ecosystem. If asked "how does Kafka manage metadata?", mention both modes and note that ZooKeeper is being phased out.

Q: How does Replication work in Kafka? What is the ISR?

Answer:

Replication is how Kafka achieves fault tolerance. Each partition is replicated across multiple brokers.

Key Concepts

Replication Factor: The number of copies of each partition. A replication factor of 3 means every partition has 3 replicas across 3 different brokers.

Leader Replica: One replica is designated the leader. All producer writes and consumer reads go through the leader.

Follower Replicas: The remaining replicas continuously fetch new messages from the leader to stay in sync. They don't serve client requests (by default).

Topic: "payments" (replication-factor=3)

Broker 0: [Partition 0 - LEADER]    [Partition 1 - Follower]
Broker 1: [Partition 0 - Follower]  [Partition 1 - LEADER]
Broker 2: [Partition 0 - Follower]  [Partition 1 - Follower]

What is the ISR (In-Sync Replicas)?

The ISR is the set of replicas that are "caught up" with the leader — they have replicated all messages within the allowed lag threshold (replica.lag.time.max.ms, default 30s).

Partition 0:
  Leader (Broker 0):   offset 100
  Follower (Broker 1): offset 99   ← In ISR (close enough)
  Follower (Broker 2): offset 85   ← NOT in ISR (too far behind)

ISR = {Broker 0, Broker 1}
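
You can inspect the ISR per partition with the stock CLI (the output below is illustrative):

kafka-topics.sh --describe --topic payments --bootstrap-server localhost:9092

Topic: payments  Partition: 0  Leader: 0  Replicas: 0,1,2  Isr: 0,1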

Why ISR Matters

The ISR directly affects data durability and availability:

  1. With acks=all: The producer considers a write successful only when ALL replicas in the ISR have acknowledged it. If the ISR shrinks to just the leader, acks=all effectively becomes acks=1.

  2. Leader Election: When a leader fails, the new leader is chosen from the ISR (by default). This ensures no data loss because ISR members have all committed messages.

  3. min.insync.replicas: A critical safety net. If set to 2 (with replication-factor=3), the producer will refuse to write if the ISR drops below 2 replicas. This prevents data loss scenarios.

Common Production Configuration

# Topic-level
replication.factor=3
min.insync.replicas=2

# Producer-level
acks=all

This means:

  • 3 copies of every partition.
  • At least 2 must acknowledge before a write is confirmed.
  • If 2 brokers die, writes are rejected (protecting data integrity over availability).

[!CAUTION] Setting unclean.leader.election.enable=true allows an out-of-sync replica to become leader when all ISR members are dead. This guarantees availability but risks data loss because the new leader may be missing messages. In most production systems, this is set to false.

Q: What are the different acks settings and how do they affect durability?

Answer:

The acks (acknowledgements) producer configuration controls how many brokers must confirm receipt of a message before the producer considers the write successful. It's the primary knob for trading off between durability and latency.

acks=0 (Fire and Forget)

The producer does not wait for any acknowledgement. It sends the message and immediately considers it delivered.

  • Durability: None. Messages can be lost silently.
  • Latency: Lowest possible.
  • Use case: Metrics, logs, or any data where occasional loss is acceptable.

acks=1 (Leader Acknowledgement)

The producer waits for the leader replica to write the message to its local log and acknowledge. Followers may not have replicated it yet.

  • Durability: Message is lost if the leader crashes before followers replicate.
  • Latency: Low.
  • Use case: General-purpose, acceptable for most non-critical workloads.

acks=all (or acks=-1) (Full ISR Acknowledgement)

The producer waits for all replicas in the ISR to acknowledge. This is the strongest durability guarantee.

  • Durability: Message survives as long as at least one ISR replica survives.
  • Latency: Highest (waiting for multiple replicas).
  • Use case: Financial transactions, order processing, anything where data loss is unacceptable.

Visual Comparison

Producer → [Broker 0: Leader] → [Broker 1: Follower] → [Broker 2: Follower]

acks=0:  Producer sends, doesn't wait.        Risk: Total loss
acks=1:  Producer waits for Leader ACK.        Risk: Leader dies before replication
acks=all: Producer waits for ALL ISR ACKs.     Risk: Only if ALL replicas die

The min.insync.replicas Safety Net

acks=all alone has a subtle trap: if the ISR shrinks to just the leader (all followers are lagging), then acks=all effectively becomes acks=1.

The fix is combining it with min.insync.replicas:

acks=all
min.insync.replicas=2  # At least 2 replicas must ACK
replication.factor=3

If fewer than 2 replicas are in the ISR, the producer receives a NotEnoughReplicasException and the write is rejected — preventing the silent durability downgrade.
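
In producer code, this surfaces through the send callback. A minimal sketch (assumes a producer configured as above; the record contents are illustrative):

import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.errors.NotEnoughReplicasException;

ProducerRecord<String, String> record =
        new ProducerRecord<>("payments", "txn-1", "{\"amount\":100}");

producer.send(record, (metadata, exception) -> {
    if (exception instanceof NotEnoughReplicasException) {
        // ISR below min.insync.replicas: the broker rejected the write
        // (after internal retries) instead of silently downgrading durability.
        System.err.println("Write rejected, ISR too small: " + exception.getMessage());
    } else if (exception != null) {
        System.err.println("Send failed: " + exception.getMessage());
    }
});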

[!TIP] The gold standard production config is acks=all + min.insync.replicas=2 + replication.factor=3. This tolerates one broker failure while guaranteeing no data loss.

Q: What is an Idempotent Producer in Kafka?

Answer:

An idempotent producer guarantees that even if a message is sent multiple times (due to retries), it is written to the Kafka log exactly once per partition. This eliminates duplicate messages caused by network errors.

The Problem Without Idempotency

  1. Producer sends message A to broker.
  2. Broker writes message A and sends an ACK.
  3. The ACK is lost due to a network glitch.
  4. Producer thinks the write failed, so it retries message A.
  5. Broker writes message A again → duplicate.

How Idempotency Works

When enabled, Kafka assigns each producer a unique Producer ID (PID) and each message gets a sequence number per partition.

The broker tracks the latest sequence number for each PID+partition pair. If a message arrives with a sequence number that has already been written, the broker silently discards the duplicate and returns a success ACK.

Producer (PID=42) → Partition 0:
  Msg(seq=0) → Written ✅
  Msg(seq=1) → Written ✅
  Msg(seq=1) → Duplicate, discarded! (but ACK sent) ✅
  Msg(seq=2) → Written ✅

Enabling Idempotency

# Producer config
enable.idempotence=true

# These are automatically set when idempotence is enabled:
acks=all
retries=Integer.MAX_VALUE
max.in.flight.requests.per.connection=5  # must be ≤ 5; ordering is still preserved via sequence numbers

[!NOTE] Since Kafka 3.0, enable.idempotence=true is the default. You don't need to explicitly set it in newer versions.

Scope and Limitations

| Feature | Idempotent Producer | Transactional Producer |
|---|---|---|
| Dedup scope | Single partition, single session | Cross-partition, cross-session |
| Survives restart | ❌ (new PID on restart) | ✅ (uses transactional.id) |
| Use case | Prevent network-retry duplicates | Exactly-once across partitions |

Idempotency alone does NOT provide exactly-once semantics across multiple partitions or consumer-producer chains. For that, you need transactions (covered in the Reliability section).

[!TIP] In interviews, the key insight is: idempotency prevents duplicates from retries within a single producer session. It does NOT prevent duplicates from application-level retries (e.g., your service crashes and replays the same business logic). For that, you need application-level deduplication or Kafka transactions.

Q: How does Kafka decide which partition a message goes to?

Answer:

The partition assignment strategy determines message ordering and parallelism.

Partitioning Strategies

1. Key-Based Partitioning (Default when key is provided)

When a message has a key, Kafka applies murmur2(key) % numPartitions to determine the partition. All messages with the same key always go to the same partition, guaranteeing ordering for that key.

producer.send(new ProducerRecord<>("orders", "user-123", orderEvent));
// All events for "user-123" go to the same partition → strict ordering

2. Round-Robin (Default when key is null, Kafka < 2.4)

Messages without a key are distributed across partitions in a round-robin fashion.

3. Sticky Partitioning (Default when key is null, Kafka ≥ 2.4)

Instead of round-robin per message, the producer "sticks" to one partition for the duration of a batch, then switches. This significantly improves batching efficiency and throughput.

Round-Robin:  P0, P1, P2, P0, P1, P2  (small batches, many network calls)
Sticky:       P0, P0, P0, P1, P1, P1  (full batches, fewer network calls)

4. Custom Partitioner

You can implement your own partitioning logic:

import java.util.Map;
import org.apache.kafka.clients.producer.Partitioner;
import org.apache.kafka.common.Cluster;

public class GeoPartitioner implements Partitioner {
    @Override
    public int partition(String topic, Object key, byte[] keyBytes,
                         Object value, byte[] valueBytes, Cluster cluster) {
        String region = extractRegion(key); // assumes keys like "us-east:user-123"
        if ("us-east".equals(region)) return 0;
        if ("eu-west".equals(region)) return 1;
        return 2; // default
    }

    private String extractRegion(Object key) { return key.toString().split(":")[0]; }

    @Override public void configure(Map<String, ?> configs) {}
    @Override public void close() {}
}
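
To activate a custom partitioner, point the producer at the class via its partitioner.class config:

partitioner.class=com.example.GeoPartitioner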

The Repartitioning Trap

[!CAUTION] If you add partitions to an existing topic, the key-to-partition mapping changes (murmur2(key) % newNumPartitions). Messages for the same key may suddenly go to a different partition, breaking ordering guarantees for in-flight data. This is why you should plan partition counts carefully upfront.

Choosing the Right Strategy

| Strategy | Key Provided? | Ordering | Use Case |
|---|---|---|---|
| Key-based hash | ✅ Yes | Per-key ordering | User events, order processing |
| Sticky (null key) | ❌ No | None | Logs, metrics, high-throughput |
| Custom | Either | Custom logic | Geo-routing, priority lanes |

Q: How do Batching and Compression work in Kafka producers?

Answer:

Batching and compression are Kafka's two most impactful performance optimizations on the producer side.

Batching

Instead of sending each message individually over the network, the producer accumulates messages into batches and sends them together. This dramatically reduces network overhead.

Key Configuration:

batch.size=16384          # Max batch size in bytes (16 KB default)
linger.ms=5               # Max time to wait for a batch to fill before sending

How it works:

  1. Producer receives a send() call.
  2. The message is added to a batch buffer for the target partition.
  3. The batch is sent when either batch.size is reached or linger.ms expires — whichever comes first.

linger.ms=0 (default):  Send immediately, tiny batches, many network calls.
linger.ms=5:            Wait up to 5ms to fill the batch, fewer calls, higher throughput.
linger.ms=100:          Wait up to 100ms, maximum batching, added latency.

[!TIP] For high-throughput systems, set linger.ms=5-20 and batch.size=65536 (64KB) or higher. The small latency increase is usually negligible compared to the throughput gain.

Compression

Kafka supports compressing message batches before sending them over the network. This reduces:

  • Network bandwidth (often 50-80% reduction)
  • Disk storage on brokers (compressed data stays compressed on disk)

Producer Config:

compression.type=snappy   # Options: none, gzip, snappy, lz4, zstd

Comparison:

| Algorithm | Speed | Ratio | CPU | Best For |
|---|---|---|---|---|
| none | Fastest | 1:1 | None | Low-volume topics |
| snappy | Fast | ~2:1 | Low | General-purpose (recommended) |
| lz4 | Fast | ~2:1 | Low | High-throughput, balanced |
| zstd | Medium | ~3:1 | Medium | Best ratio, bandwidth-constrained |
| gzip | Slow | ~3:1 | High | Legacy, avoid in new systems |

How They Work Together

Application: send(msg1), send(msg2), send(msg3), send(msg4)
                      │
        ┌─────────────▼──────────────┐
        │      Batch Accumulator     │
        │  [msg1, msg2, msg3, msg4]  │
        │   (wait for linger.ms or   │
        │    batch.size reached)     │
        └─────────────┬──────────────┘
                      │ compress batch
        ┌─────────────▼──────────────┐
        │   Compressed Batch         │
        │ (e.g., snappy: 60% smaller)│
        └─────────────┬──────────────┘
                      │ single network call
                      ▼
                   Broker

Broker-Side Compression

The broker stores batches in the same compressed format they were received. It does NOT decompress and recompress. This means compression set by the producer extends to both network transfer AND disk storage — a double win.

[!NOTE] If the broker's compression.type differs from the producer's, the broker will decompress and recompress, causing significant CPU overhead. It's best to let the producer control compression and set the broker to compression.type=producer (the default).

Q: How do Consumer Groups and Offsets work in Kafka?

Answer:

Consumer groups are Kafka's mechanism for parallel consumption and load balancing.

Consumer Groups

A consumer group is a set of consumers that cooperate to consume messages from a topic. Each partition is assigned to exactly one consumer within a group. This ensures each message is processed once per group.

Topic "orders" (4 partitions)
Consumer Group "payment-service" (3 consumers):

  Consumer A ← Partition 0, Partition 1
  Consumer B ← Partition 2
  Consumer C ← Partition 3

Key Rule: If the number of consumers exceeds the number of partitions, the extra consumers sit idle.

4 partitions, 6 consumers:
  Consumer A ← Partition 0
  Consumer B ← Partition 1
  Consumer C ← Partition 2
  Consumer D ← Partition 3
  Consumer E ← IDLE ❌
  Consumer F ← IDLE ❌

Multiple Consumer Groups

Different consumer groups consume the same topic independently. Each group maintains its own offset position — they don't interfere with each other.

Topic "orders" (3 partitions) →
  Consumer Group "payment-service"  → reads all 3 partitions independently
  Consumer Group "email-service"    → reads all 3 partitions independently
  Consumer Group "analytics"        → reads all 3 partitions independently

This is what makes Kafka a publish-subscribe system, not just a queue.
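
A minimal consumer-group sketch, assuming the "orders" topic and "payment-service" group from the examples above:

import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class PaymentConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "payment-service"); // joining this group triggers partition assignment
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("orders")); // partitions assigned by the group coordinator
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(100));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("partition=%d offset=%d value=%s%n",
                            record.partition(), record.offset(), record.value());
                }
            }
        }
    }
}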

Offsets

An offset is a sequential integer that uniquely identifies each message within a partition. Offsets are how consumers track what they've already read.

Partition 0: [0] [1] [2] [3] [4] [5] [6] [7] [8]
                                  ↑
                          Consumer's current offset = 5
                          (has read 0-4, will read 5 next)

Where Are Offsets Stored?

Consumer offsets are committed to an internal Kafka topic called __consumer_offsets (50 partitions by default). This is a regular compacted topic managed by Kafka itself.

# Check committed offsets for a group
kafka-consumer-groups.sh --bootstrap-server localhost:9092 \
    --describe --group payment-service

Output:

GROUP             TOPIC    PARTITION  CURRENT-OFFSET  LOG-END-OFFSET  LAG
payment-service   orders   0          1024            1030            6
payment-service   orders   1          987             987             0

[!IMPORTANT] The LAG column is critical for monitoring. It shows how many unprocessed messages are waiting. Increasing lag = consumer can't keep up = potential backpressure issue.

Q: What are the different Offset Commit Strategies?

Answer:

How and when a consumer commits its offset determines what happens when the consumer crashes and restarts. This directly affects delivery semantics.

Auto-Commit (Default)

Offsets are committed automatically at a fixed interval, regardless of whether messages have been processed.

enable.auto.commit=true        # Default
auto.commit.interval.ms=5000   # Every 5 seconds

The Problem:

  1. Consumer fetches messages at offset 100-110.
  2. Auto-commit fires, committing offset 110.
  3. Consumer crashes while processing message 105.
  4. Consumer restarts, reads from offset 110 → messages 105-109 are lost.

This creates at-most-once semantics: messages can be lost, but are never reprocessed.

Manual Commit (Synchronous)

The application explicitly commits after successfully processing messages.

ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(100));
// Process records...
consumer.commitSync(); // Blocks until the commit is confirmed

Trade-off: If the consumer crashes after processing but before committing, it will reprocess those messages on restart → at-least-once semantics (duplicates possible, but no data loss).

Manual Commit (Asynchronous)

Same as synchronous but non-blocking:

consumer.commitAsync((offsets, exception) -> {
    if (exception != null) {
        log.error("Commit failed", exception);
    }
});

Trade-off: Higher throughput, but if the commit fails silently, you may reprocess messages.

Best Practice: Sync + Async Hybrid

try {
    while (true) {
        ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(100));
        for (ConsumerRecord<String, String> record : records) {
            processRecord(record);
        }
        consumer.commitAsync(); // Fast, non-blocking for normal flow
    }
} catch (Exception e) {
    log.error("Consumer error", e);
} finally {
    consumer.commitSync(); // Guaranteed commit on shutdown
    consumer.close();
}

Commit Granularity

You can also commit offsets for specific partitions:

Map<TopicPartition, OffsetAndMetadata> offsets = new HashMap<>();
offsets.put(new TopicPartition("orders", 0), new OffsetAndMetadata(lastOffset + 1));
consumer.commitSync(offsets);

Summary

| Strategy | Delivery Semantics | Risk |
|---|---|---|
| Auto-commit | At-most-once | Message loss after crash |
| Manual after processing | At-least-once | Duplicate processing after crash |
| Transactional (EOS) | Exactly-once | Highest complexity |

[!TIP] Most production systems use manual commits with at-least-once semantics and design their consumers to be idempotent (processing the same message twice produces the same result). This is simpler and more reliable than attempting exactly-once.

Q: What is Consumer Rebalancing and why can it be problematic?

Answer:

A rebalance is the process of redistributing partition assignments among consumers in a group. It's triggered when the group membership changes.

What Triggers a Rebalance?

  1. A consumer joins the group (new instance deployed).
  2. A consumer leaves the group (instance crashes or shuts down).
  3. A consumer fails to send a heartbeat within session.timeout.ms.
  4. A consumer's poll() calls take longer than max.poll.interval.ms.
  5. Partitions are added to the subscribed topic.

Why Rebalancing is Problematic

During a rebalance, all consumers in the group stop processing. This causes a processing pause (sometimes called "stop the world") that can last from milliseconds to minutes depending on the group size.

Normal operation:
  Consumer A ← P0, P1  (processing)
  Consumer B ← P2, P3  (processing)

Consumer B crashes → REBALANCE triggered:
  All consumers STOP processing
  Coordinator reassigns partitions
  Consumer A ← P0, P1, P2, P3 (resumes)
  
Total pause: seconds to minutes

Rebalance Strategies

1. Eager Rebalancing (Default in older versions)

All consumers give up ALL partition assignments, then get new ones. Maximum disruption.

2. Cooperative (Incremental) Rebalancing (Kafka ≥ 2.4)

Only the partitions that need to move are revoked and reassigned. Other consumers continue processing without interruption.

partition.assignment.strategy=org.apache.kafka.clients.consumer.CooperativeStickyAssignor

Consumer B crashes:
  Consumer A: keeps P0, P1 (no pause!) + receives P2, P3
  Only P2, P3 are "moved" — minimal disruption
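
Whichever strategy is used, a ConsumerRebalanceListener lets you react to partition movement, for example committing progress before partitions are revoked. A minimal sketch (assumes an existing consumer as in earlier examples):

import java.util.Collection;
import java.util.List;
import org.apache.kafka.clients.consumer.ConsumerRebalanceListener;
import org.apache.kafka.common.TopicPartition;

consumer.subscribe(List.of("orders"), new ConsumerRebalanceListener() {
    @Override
    public void onPartitionsRevoked(Collection<TopicPartition> partitions) {
        // Commit progress for partitions we are about to lose,
        // so the next owner resumes without reprocessing.
        consumer.commitSync();
    }

    @Override
    public void onPartitionsAssigned(Collection<TopicPartition> partitions) {
        // (Re)initialize any per-partition state here.
    }
});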

Avoiding Unnecessary Rebalances

Tune these consumer configs:

# Time before a consumer is considered dead (default: 45s)
session.timeout.ms=45000

# Heartbeat interval (should be ~1/3 of session.timeout)
heartbeat.interval.ms=15000

# Max time between poll() calls before consumer is evicted
max.poll.interval.ms=300000   # 5 minutes

# Reduce records per poll if processing is slow
max.poll.records=500

[!CAUTION] The most common cause of unnecessary rebalances is slow message processing. If processing a batch of messages takes longer than max.poll.interval.ms, Kafka assumes the consumer is dead and triggers a rebalance — even though it's still alive and processing. Either speed up processing, reduce max.poll.records, or increase max.poll.interval.ms.

Static Group Membership (Kafka ≥ 2.3)

Assign a fixed identity to each consumer using group.instance.id. When a consumer restarts, it rejoins with the same identity and gets its previous partitions back — no rebalance triggered during brief restarts.

group.instance.id=consumer-host-1

Q: What are the different delivery semantics in Kafka?

Answer:

This is one of the most important Kafka interview questions. There are three delivery guarantees, and understanding the trade-offs is essential.

1. At-Most-Once

Messages may be lost but are never reprocessed. The consumer commits the offset before processing the message.

1. Fetch message at offset 42
2. Commit offset 43 ✅
3. Process message... CRASH 💥
4. Restart → reads from offset 43 → message 42 is LOST

When to use: Metric collection, logging — where occasional loss is acceptable and speed matters most.

2. At-Least-Once

Messages are never lost but may be duplicated. The consumer commits the offset after processing the message.

1. Fetch message at offset 42
2. Process message ✅
3. Commit offset 43... CRASH 💥 (commit failed)
4. Restart → reads from offset 42 → message 42 is PROCESSED AGAIN

When to use: Most production systems. Design consumers to be idempotent (safe to process twice).

Idempotent consumer pattern:

// Deduplication store; in production this would be Redis, a DB table, etc.
private final Set<String> processedIds = ConcurrentHashMap.newKeySet();

void processOrder(OrderEvent event) {
    // add() returns false if the ID was already present, so duplicates are skipped
    if (!processedIds.add(event.getId())) {
        return;
    }
    executeBusinessLogic(event);
}

3. Exactly-Once Semantics (EOS)

Messages are processed exactly once — no loss, no duplicates. This is the hardest to achieve and requires specific Kafka features.

How Kafka achieves EOS:

  1. Idempotent Producer (enable.idempotence=true) — prevents duplicate writes.
  2. Transactions (transactional.id) — atomic writes across multiple partitions.
  3. Consumer read_committed isolation — consumers only see committed transactional messages.

# Producer
enable.idempotence=true
transactional.id=my-transaction-id

# Consumer
isolation.level=read_committed

When to use: Financial systems, inventory management, or when consuming from one topic, processing, and producing to another topic atomically (the "consume-transform-produce" pattern).

Summary

| Semantic | Data Loss? | Duplicates? | Complexity | Use Case |
|---|---|---|---|---|
| At-most-once | ✅ Possible | ❌ No | Low | Metrics, logs |
| At-least-once | ❌ No | ✅ Possible | Medium | Most production systems |
| Exactly-once | ❌ No | ❌ No | High | Financial, critical data |

[!IMPORTANT] Exactly-once in Kafka is scoped to the Kafka ecosystem (producer → broker → consumer within Kafka). It does NOT guarantee exactly-once when writing to external systems (like a database). For end-to-end exactly-once with external systems, you need idempotent consumers or two-phase commit patterns.

Q: How do Kafka Transactions work?

Answer:

Kafka transactions enable atomic writes across multiple partitions and topics. They are the foundation for exactly-once semantics (EOS) in the "consume-transform-produce" pattern.

The Problem

Imagine a stream processing pipeline that reads from topic A, transforms the data, and writes to topic B while also committing consumer offsets. Without transactions, a crash mid-pipeline could result in:

  • Data written to topic B but offset not committed → duplicates on retry
  • Offset committed but data not written to topic B → data loss

How Transactions Work

// 1. Configure transactional producer
Properties props = new Properties();
props.put("transactional.id", "order-processor-1");
props.put("enable.idempotence", "true");

KafkaProducer<String, String> producer = new KafkaProducer<>(props);

// 2. Initialize transactions (called once)
producer.initTransactions();

try {
    // 3. Begin transaction
    producer.beginTransaction();

    // 4. Consume from input topic
    ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(100));

    for (ConsumerRecord<String, String> record : records) {
        // 5. Process and produce to output topic
        String result = transform(record.value());
        producer.send(new ProducerRecord<>("output-topic", record.key(), result));
    }

    // 6. Commit consumer offsets AS PART OF the transaction
    producer.sendOffsetsToTransaction(
        getOffsetsToCommit(records),
        consumer.groupMetadata()
    );

    // 7. Commit transaction (atomic: either ALL writes + offset commit succeed, or NONE)
    producer.commitTransaction();

} catch (Exception e) {
    // 8. Abort transaction (all writes are rolled back)
    producer.abortTransaction();
}
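
The getOffsetsToCommit(...) helper above is not part of the Kafka API; a possible implementation (note the +1: you commit the next offset to be read, not the last one processed):

import java.util.HashMap;
import java.util.List;
import java.util.Map;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;

static Map<TopicPartition, OffsetAndMetadata> getOffsetsToCommit(
        ConsumerRecords<String, String> records) {
    Map<TopicPartition, OffsetAndMetadata> offsets = new HashMap<>();
    for (TopicPartition partition : records.partitions()) {
        List<ConsumerRecord<String, String>> partitionRecords = records.records(partition);
        long lastOffset = partitionRecords.get(partitionRecords.size() - 1).offset();
        offsets.put(partition, new OffsetAndMetadata(lastOffset + 1)); // next offset to read
    }
    return offsets;
}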

What Happens Atomically

When commitTransaction() succeeds, ALL of the following are committed together:

  1. All send() messages to output topics.
  2. The consumer offset commit.

If anything fails, abortTransaction() rolls back everything — the output messages are marked as "aborted" and the consumer offsets are not updated.

Consumer Side: read_committed

For consumers to properly participate in transactions:

isolation.level=read_committed

| Isolation Level | Behavior |
|---|---|
| read_uncommitted (default) | Consumer sees ALL messages, including those from aborted transactions |
| read_committed | Consumer only sees messages from committed transactions |

The transactional.id

  • Must be unique per producer instance but stable across restarts.
  • When a producer with the same transactional.id restarts, Kafka "fences" the old producer — any pending transactions from the old instance are aborted.
  • This prevents "zombie" producers from causing duplicates.

[!TIP] Kafka Streams uses transactions internally to provide exactly-once semantics out of the box. You just set processing.guarantee=exactly_once_v2 and the framework handles all the transactional plumbing automatically.

Q: Why is Kafka so fast?

Answer:

Kafka achieves extraordinary throughput (millions of messages/second) through several deliberate architectural decisions.

1. Sequential I/O (Append-Only Log)

Kafka writes messages to disk in a strictly sequential, append-only fashion. It never does random disk seeks.

Sequential disk writes are shockingly fast: often hundreds of MB/s on modern drives, compared to roughly 100 KB/s for random writes on a spinning disk. The OS can fully leverage read-ahead and write-behind buffering, and sequential access avoids head movement on HDDs.

2. Zero-Copy (sendfile)

When a consumer reads data, the normal path involves 4 copies:

Disk → Kernel Buffer → User Space → Socket Buffer → NIC

Kafka uses the Linux sendfile() system call to skip user space entirely:

Disk → Kernel Buffer → NIC  (zero-copy, 2 copies instead of 4)

This eliminates context switches and memory copies, reducing CPU usage and increasing throughput dramatically.

3. Page Cache (OS-Level Caching)

Kafka does NOT manage its own in-memory cache. Instead, it relies on the OS page cache. When data is written to disk, the OS caches it in RAM. When consumers read recent data, it's served directly from the page cache — essentially a free, automatically managed in-memory read cache.

Hot data (recent):  Served from OS page cache (RAM speed)
Cold data (old):    Read from disk (still fast due to sequential reads)

This is why Kafka's performance is often counter-intuitive: it writes to "disk" but reads from "memory."

4. Batching + Compression

Producers batch many messages together and optionally compress them. This means:

  • Fewer network round trips
  • Less disk I/O (one write for many messages)
  • Smaller on-disk footprint

5. Partitioning (Horizontal Scaling)

Each partition is an independent log. Multiple partitions can be read/written in parallel across different brokers and consumers. Adding partitions and brokers scales throughput linearly.

6. No Per-Message Acknowledgment to Consumers

Unlike RabbitMQ (which tracks ACK per message), Kafka consumers simply track their offset position. There's no per-message bookkeeping on the broker side, which eliminates enormous overhead.

Summary

| Technique | Benefit |
|---|---|
| Sequential I/O | Fast disk writes, no seeks |
| Zero-copy (sendfile) | Minimal CPU for data transfer |
| Page cache | Hot data served from RAM |
| Batching | Amortized network/disk overhead |
| Compression | Less bandwidth and storage |
| Partition parallelism | Linear horizontal scaling |
| Offset-based tracking | No per-message broker state |

[!TIP] In interviews, the two killer points are sequential I/O and zero-copy. These are what fundamentally differentiate Kafka's performance from traditional message brokers that rely on random I/O and per-message routing.

Q: What is Consumer Lag and how do you handle backpressure?

Answer:

Consumer lag is the difference between the latest message offset in a partition (log-end offset) and the consumer's current committed offset. It tells you how far behind a consumer is.

Measuring Lag

kafka-consumer-groups.sh --bootstrap-server localhost:9092 \
    --describe --group my-service

GROUP       TOPIC    PARTITION  CURRENT-OFFSET  LOG-END-OFFSET  LAG
my-service  orders   0          4500            5000            500
my-service  orders   1          3200            3200            0
my-service  orders   2          2800            4100            1300

Healthy: LAG = 0 or near-zero and stable.
Unhealthy: LAG increasing over time = the consumer can't keep up with producers.

Causes of Growing Lag

  1. Slow processing — each message takes too long (external API calls, heavy computation).
  2. Insufficient consumers — fewer consumers than partitions.
  3. Frequent rebalances — consumers pausing during rebalancing.
  4. GC pauses — long garbage collection stops in JVM-based consumers.
  5. Skewed partitions — one partition has significantly more data due to hot keys.

Strategies to Handle Backpressure

1. Scale consumers horizontally

Add more consumer instances (up to the number of partitions):

Before: 2 consumers for 6 partitions (3 partitions each)
After:  6 consumers for 6 partitions (1 partition each)

2. Increase partitions

More partitions = more parallelism. But be cautious of the repartitioning trap (breaks key ordering for existing data).

3. Optimize processing

  • Process messages asynchronously (decouple consumption from processing).
  • Use batch processing instead of one-at-a-time.
  • Cache external service responses.

4. Tune consumer configs

max.poll.records=100        # Fewer records per poll = less time per batch
fetch.min.bytes=1           # Don't wait for large fetches
max.poll.interval.ms=600000 # More time allowed between polls

5. Dead Letter Queue (DLQ)

If a specific message consistently fails processing, send it to a DLQ topic instead of blocking the consumer:

try {
    processMessage(record);
} catch (Exception e) {
    producer.send(new ProducerRecord<>("orders.dlq", record.key(), record.value()));
    // Continue processing next message
}

Monitoring Lag

Critical metrics to alert on:

  • kafka.consumer.lag — absolute lag (messages behind).
  • kafka.consumer.lag_rate — rate of lag change (is it growing?).
  • Consumer group state — STABLE, REBALANCING, DEAD.

Tools: Burrow (LinkedIn), Kafka Lag Exporter (Prometheus), or built-in kafka-consumer-groups.sh.

[!CAUTION] A sudden spike in consumer lag often precedes a production incident. Set up alerts for when lag exceeds a threshold (e.g., >10,000 messages) OR when lag is consistently increasing over a 5-minute window.

Q: What is Kafka Connect?

Answer:

Kafka Connect is a framework for reliably streaming data between Kafka and external systems (databases, search indexes, filesystems, cloud services) without writing any code.

How It Works

Kafka Connect runs as a separate, scalable cluster of worker processes. You configure data pipelines using JSON configurations — no Java code required.

                    Kafka Connect
External Source ──▶ [Source Connector] ──▶ Kafka Topic
Kafka Topic    ──▶ [Sink Connector]   ──▶ External Sink

Source Connectors

Read data from an external system and write it to Kafka topics.

{
    "name": "postgres-source",
    "config": {
        "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
        "database.hostname": "db.example.com",
        "database.port": "5432",
        "database.dbname": "orders_db",
        "topic.prefix": "cdc"
    }
}

This captures every INSERT/UPDATE/DELETE from Postgres and streams it to topics like cdc.public.orders, cdc.public.users.
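
Connectors are registered via Kafka Connect's REST API. For example, assuming a local Connect worker on port 8083 and the JSON above saved as postgres-source.json:

curl -X POST http://localhost:8083/connectors \
     -H "Content-Type: application/json" \
     -d @postgres-source.json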

Sink Connectors

Read data from Kafka topics and write it to an external system.

{
    "name": "elasticsearch-sink",
    "config": {
        "connector.class": "io.confluent.connect.elasticsearch.ElasticsearchSinkConnector",
        "topics": "orders",
        "connection.url": "http://es.example.com:9200",
        "type.name": "_doc"
    }
}

Common connectors:

| Connector | Direction | Use Case |
|---|---|---|
| Debezium (PostgreSQL/MySQL) | Source | Change Data Capture (CDC) |
| JDBC Connector | Source/Sink | Generic SQL database sync |
| Elasticsearch | Sink | Search indexing |
| S3 Sink | Sink | Data lake / archival |
| BigQuery Sink | Sink | Analytics warehouse |
| File Stream | Source/Sink | CSV/log file ingestion |

Standalone vs Distributed Mode

| Mode | Workers | Use Case |
|---|---|---|
| Standalone | 1 | Development, testing |
| Distributed | Multiple | Production (fault-tolerant, scalable) |

In distributed mode, if a worker dies, its connectors are automatically reassigned to surviving workers.

Why Not Just Write a Custom Producer/Consumer?

  • Built-in offset tracking — Connect tracks source positions automatically.
  • Fault tolerance — automatic failover in distributed mode.
  • Schema evolution — integrates with Schema Registry.
  • Configurable transforms — Single Message Transforms (SMTs) for lightweight data manipulation.
  • No code to maintain — just JSON config.

[!TIP] In interviews, Debezium + Kafka Connect for CDC is a particularly strong topic. It's the industry standard for streaming database changes (e.g., syncing a PostgreSQL write-model to an Elasticsearch read-model in real-time).

Q: What is the difference between Kafka Streams, Apache Flink, and Apache Spark Streaming?

Answer:

All three are stream processing frameworks, but they serve different niches.

Kafka Streams

A lightweight client library (not a cluster/framework) for building stream processing applications that read from and write to Kafka.

// props: application.id, bootstrap.servers, default serdes, etc.
StreamsBuilder builder = new StreamsBuilder();
builder.<String, String>stream("input-topic")   // typed as KStream<String, String>
    .filter((key, value) -> value.contains("important"))
    .mapValues(value -> value.toUpperCase())
    .to("output-topic");

KafkaStreams streams = new KafkaStreams(builder.build(), props);
streams.start();

Key characteristics:

  • Runs as a regular Java application — no separate cluster to deploy.
  • Exactly-once semantics built in.
  • Supports stateful operations (aggregations, joins, windowing) with local state stores (RocksDB).
  • Scales by simply running more instances of the application.

Apache Flink

A distributed stream processing framework designed for complex, low-latency event processing at massive scale.

Key characteristics:

  • Runs on its own cluster (JobManager + TaskManagers).
  • True event-time processing with watermarks.
  • Advanced windowing (tumbling, sliding, session windows).
  • Exactly-once semantics via checkpointing.
  • Can process both streams and batch data (unified model).
  • Supports multiple languages (Java, Scala, Python, SQL).

Apache Spark Streaming (Structured Streaming)

A micro-batch streaming engine built on top of Spark. It processes data in small batches rather than true record-at-a-time streaming.

Key characteristics:

  • Runs on a Spark cluster.
  • Processes streams as a series of small batch jobs.
  • Shares Spark's batch processing ecosystem (MLlib, SQL, DataFrames).
  • Higher latency than Flink (seconds vs milliseconds).

Comparison

| Feature | Kafka Streams | Flink | Spark Streaming |
|---|---|---|---|
| Deployment | Library (no cluster) | Dedicated cluster | Spark cluster |
| Latency | Low (ms) | Very low (ms) | Higher (seconds) |
| Model | True streaming | True streaming | Micro-batch |
| State management | RocksDB (local) | Managed state + checkpoints | Spark state store |
| Exactly-once | ✅ (Kafka-only) | ✅ (with any source) | ✅ (with replayable sources) |
| Source/Sink | Kafka only | Kafka, HDFS, DBs, etc. | Kafka, HDFS, DBs, etc. |
| Complexity | Low | Medium-High | Medium |
| Best for | Kafka-centric microservices | Complex CEP, large-scale | Batch + streaming unified |

When to Use What?

  • Kafka Streams: Your data is in Kafka and goes back to Kafka. You want simplicity and don't want to manage a separate cluster.
  • Flink: You need very low (millisecond-level) latency, complex event processing (CEP), or reading from non-Kafka sources.
  • Spark Streaming: Your team already uses Spark for batch and wants to add streaming. Latency of seconds is acceptable.

[!NOTE] Apache Flink is increasingly becoming the industry standard for large-scale stream processing. Many companies are migrating from Spark Streaming to Flink for its true streaming model and lower latency.

Q: What is Schema Registry and why is Avro commonly used with Kafka?

Answer:

The Problem: Schema Evolution

In a microservices architecture, producers and consumers are developed by different teams and deployed at different times. What happens when the producer changes the message format (adds a field, renames one, changes a type)?

Without schema management, the consumer breaks because it can't deserialize the new format.

Schema Registry

The Confluent Schema Registry is a centralized service that stores and manages schemas for Kafka message keys and values. It ensures that producers and consumers agree on the data format.

Producer → Schema Registry: "Here's my schema, give me an ID"
           Schema Registry → "Schema ID: 42"
Producer → Kafka: [Schema ID: 42] + [Serialized Data]
                       ...
Consumer ← Kafka: [Schema ID: 42] + [Serialized Data]
Consumer → Schema Registry: "What schema is ID 42?"
           Schema Registry → Returns the schema
Consumer: Deserializes data using the schema

Why Avro?

Apache Avro is a binary serialization format that is the dominant choice for Kafka messages. It pairs perfectly with Schema Registry.

Avro schema example:

{
    "type": "record",
    "name": "OrderEvent",
    "namespace": "com.example",
    "fields": [
        {"name": "orderId", "type": "string"},
        {"name": "amount", "type": "double"},
        {"name": "currency", "type": "string", "default": "USD"},
        {"name": "timestamp", "type": "long"}
    ]
}

Why not JSON?

| Feature | JSON | Avro | Protobuf |
|---|---|---|---|
| Size | Large (text + keys) | Compact (binary, no keys) | Compact (binary) |
| Schema | None (schema-less) | Required | Required |
| Speed | Slow (parsing text) | Fast (binary) | Fast (binary) |
| Schema evolution | Manual | Built-in | Built-in |
| Human readable | ✅ Yes | ❌ No | ❌ No |

Avro messages are typically 50-70% smaller than JSON because they don't include field names — only the values, referenced by schema position.
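
Wiring a producer to Schema Registry is mostly configuration. A sketch using Confluent's Avro serializer (the registry URL is a placeholder):

key.serializer=org.apache.kafka.common.serialization.StringSerializer
value.serializer=io.confluent.kafka.serializers.KafkaAvroSerializer
schema.registry.url=http://schema-registry:8081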

Compatibility Modes

Schema Registry enforces compatibility rules when a schema evolves:

| Mode | Rule |
|---|---|
| BACKWARD (default) | New schema can read old data. Allows: removing fields, adding fields with defaults. |
| FORWARD | Old schema can read new data. Allows: adding fields, removing optional fields. |
| FULL | Both backward and forward compatible. |
| NONE | No compatibility checks. |

Example of backward-compatible change:

// v1
{"name": "orderId", "type": "string"}
{"name": "amount", "type": "double"}

// v2 (backward compatible: new field has a default)
{"name": "orderId", "type": "string"}
{"name": "amount", "type": "double"}
{"name": "currency", "type": "string", "default": "USD"}  // ← NEW

[!TIP] In interviews, mentioning Avro + Schema Registry together shows you understand production Kafka. The key insight: Schema Registry acts as a contract between services, preventing breaking changes from deploying to production.

Q: What is the difference between Retention and Log Compaction in Kafka?

Answer:

Kafka keeps data on disk and provides two distinct retention strategies for controlling how long messages are stored.

Time/Size-Based Retention (Default)

Messages are deleted after a configured time period or when the log exceeds a size limit.

# Time-based: Delete messages older than 7 days
log.retention.hours=168        # (default: 168 hours = 7 days)

# Size-based: Delete oldest messages when partition log exceeds 1GB
log.retention.bytes=1073741824

# Segment file size (retention is applied per segment)
log.segment.bytes=1073741824   # 1GB per segment file

How it works: Kafka stores messages in segment files. When a segment's age exceeds retention.hours (or the total log size exceeds retention.bytes), the entire segment file is deleted.

Partition 0:
  segment-0.log (2 days old) ← DELETED when > 7 days
  segment-1.log (1 day old)
  segment-2.log (current, active)

Log Compaction

Instead of deleting messages by time, Kafka keeps only the latest value for each unique key. It's as if you have a table where each key's row is updated in-place.

log.cleanup.policy=compact

Before compaction:

Offset  Key      Value
0       user-1   {"name": "Alice"}
1       user-2   {"name": "Bob"}
2       user-1   {"name": "Alice Smith"}  ← newer value for user-1
3       user-3   {"name": "Charlie"}
4       user-2   null                     ← tombstone (delete marker)

After compaction:

Offset  Key      Value
2       user-1   {"name": "Alice Smith"}  ← latest value kept
3       user-3   {"name": "Charlie"}
                                          ← user-2 deleted (tombstone)

When to Use Which?

| Strategy | Policy | Use Case |
|---|---|---|
| Time/Size Retention | delete (default) | Event streams, logs, metrics (you care about events over time) |
| Log Compaction | compact | State snapshots, CDC changes, config updates (you care about latest state per key) |
| Both | compact,delete | Compacted, but also enforce a time limit on old keys |
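
For example, creating a compacted topic with the stock CLI (topic name and tuning value are illustrative):

kafka-topics.sh --create \
    --topic user-profiles \
    --partitions 6 \
    --replication-factor 3 \
    --config cleanup.policy=compact \
    --config min.cleanable.dirty.ratio=0.5 \
    --bootstrap-server localhost:9092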

Real-World Examples

Compaction:

  • __consumer_offsets — Kafka's internal topic for consumer offsets (only latest offset per group/partition matters).
  • CDC topics (Debezium) — latest row state per primary key.
  • User profile cache — latest profile per user ID.

Time retention:

  • Clickstream events, application logs, order events.

[!IMPORTANT] Compaction is not immediate. A background thread called the "log cleaner" periodically compacts segments. Between compactions, both old and new values for a key may exist. Never rely on compaction for real-time deduplication — it's an eventual cleanup mechanism.

Q: What are the key metrics to monitor in a Kafka cluster?

Answer:

Monitoring is essential for maintaining a healthy Kafka cluster. Here are the critical metrics organized by component.

Broker Metrics

| Metric | What It Tells You | Alert Threshold |
|---|---|---|
| UnderReplicatedPartitions | Partitions where followers are behind the leader | > 0 for a sustained period |
| ActiveControllerCount | Number of active controllers in the cluster | Should always be exactly 1 |
| OfflinePartitionsCount | Partitions with no leader (completely unavailable) | > 0 = critical |
| RequestHandlerAvgIdlePercent | How busy the broker's request handler threads are | < 20% = broker overloaded |
| NetworkProcessorIdlePercent | Network thread utilization | < 30% = network bottleneck |
| LogFlushLatencyMs | Time to flush logs to disk | Spikes indicate disk issues |

Producer Metrics

| Metric | What It Tells You | Alert Threshold |
|---|---|---|
| record-send-rate | Messages sent per second | Sudden drop = producer issue |
| record-error-rate | Failed sends per second | > 0 = investigate |
| batch-size-avg | Average batch size | Too small = suboptimal batching |
| request-latency-avg | Avg time broker takes to respond | > 100ms = potential issue |

Consumer Metrics

| Metric | What It Tells You | Alert Threshold |
|---|---|---|
| records-lag-max | Maximum lag across all partitions | Consistently increasing |
| records-consumed-rate | Messages consumed per second | Sudden drop = consumer issue |
| commit-latency-avg | Time to commit offsets | Spikes indicate issues |
| rebalance-rate | How often the group rebalances | High rate = configuration issue |

Monitoring Stack

Kafka (JMX Metrics)
    ↓
Prometheus (JMX Exporter)
    ↓
Grafana (Dashboards + Alerts)

Popular Tools:

  • Prometheus + JMX Exporter: Industry standard for metric collection.
  • Grafana: Visualization and alerting.
  • Burrow: LinkedIn's tool specifically for consumer lag monitoring.
  • Kafka Manager / AKHQ: Web UI for cluster management.
  • Confluent Control Center: Commercial monitoring (Confluent Platform).

Critical Alerts to Set Up

# Example Prometheus alerting rules
groups:
  - name: kafka-alerts
    rules:
      - alert: KafkaOfflinePartitions
        expr: kafka_server_replicamanager_offline_partitions_count > 0
        for: 1m
        labels:
          severity: critical

      - alert: KafkaConsumerLagHigh
        expr: kafka_consumer_group_lag > 10000
        for: 5m
        labels:
          severity: warning

      - alert: KafkaUnderReplicatedPartitions
        expr: kafka_server_replicamanager_under_replicated_partitions > 0
        for: 5m
        labels:
          severity: warning

[!TIP] In interviews, the most impactful metrics to mention are UnderReplicatedPartitions (replication health), consumer lag (processing health), and OfflinePartitionsCount (availability). These cover the three biggest operational concerns: data durability, throughput, and uptime.

Q: When would you choose Kafka over RabbitMQ or SQS?

Answer:

This is a common architectural decision question. Each messaging system serves different primary use cases.

Apache Kafka

A distributed event streaming platform designed as a durable commit log.

Best for:

  • High-throughput event streaming (millions of msg/sec)
  • Event sourcing and CQRS architectures
  • Log aggregation
  • Real-time analytics pipelines
  • When consumers need to replay old messages
  • When multiple independent consumers need the same data

RabbitMQ

A traditional message broker that implements AMQP (Advanced Message Queuing Protocol).

Best for:

  • Request/reply patterns (RPC over messaging)
  • Complex routing logic (topic exchanges, headers, fanout)
  • Priority queues (some messages should be processed first)
  • When messages should be deleted after consumption
  • Smaller scale (thousands, not millions, of msg/sec)
  • When you need per-message acknowledgement and fine-grained delivery control

Amazon SQS

A fully managed message queue on AWS.

Best for:

  • Teams that don't want to operate infrastructure
  • Simple producer-consumer patterns
  • Variable/bursty workloads (auto-scales transparently)
  • Dead letter queue support out of the box
  • When tight AWS integration is needed (Lambda triggers, IAM)

Comparison Table

| Feature | Kafka | RabbitMQ | SQS |
|---|---|---|---|
| Model | Distributed log | Message broker | Managed queue |
| Throughput | Millions/sec | Thousands/sec | Variable (managed) |
| Message retention | Days/weeks (configurable) | Until consumed | 4 days (max 14) |
| Replay | ✅ Yes | ❌ No | ❌ No |
| Ordering | Per-partition | Per-queue | FIFO variant only |
| Consumer groups | ✅ Built-in | Manual (competing consumers) | ✅ Built-in |
| Delivery semantics | At-least-once, exactly-once | At-least-once, at-most-once | At-least-once |
| Complex routing | ❌ Topic-based only | ✅ Exchanges, bindings, headers | ❌ Simple |
| Priority queues | ❌ No | ✅ Yes | ❌ No |
| Operations | Self-managed or Confluent Cloud | Self-managed or CloudAMQP | Fully managed |
| Protocol | Custom binary | AMQP, STOMP, MQTT | HTTP/SQS API |

Decision Framework

Need replay / event sourcing?           → Kafka
High throughput (>100K msg/sec)?        → Kafka
Multiple independent consumers?         → Kafka
Complex routing (headers, priorities)?  → RabbitMQ
Request/reply (RPC) pattern?            → RabbitMQ
Don't want to manage infrastructure?    → SQS (or Confluent Cloud/CloudAMQP)
Simple queue, AWS ecosystem?            → SQS

[!NOTE] These are not mutually exclusive. Many production architectures use Kafka for event streaming (inter-service communication) AND SQS/RabbitMQ for task queues (background job processing). Using the right tool for each specific use case is the mark of a mature architecture.