Q: What is the difference between Retention and Log Compaction in Kafka?

Answer:

Kafka persists messages on disk and offers two distinct cleanup policies that control how long data is kept.

Time/Size-Based Retention (Default)

Messages are deleted after a configured time period or when the log exceeds a size limit.

# Time-based: Delete messages older than 7 days
log.retention.hours=168        # (default: 168 hours = 7 days)

# Size-based: Delete oldest messages when partition log exceeds 1GB
log.retention.bytes=1073741824

# Segment file size (retention is applied per segment)
log.segment.bytes=1073741824   # 1GB per segment file

How it works: Kafka stores messages in segment files. When the newest message in a closed segment is older than log.retention.hours (or the total partition size exceeds log.retention.bytes), the entire segment file is deleted. The active segment, which is still being written to, is never deleted.

Partition 0:
  segment-0.log (2 days old) ← deleted once older than 7 days
  segment-1.log (1 day old)
  segment-2.log (current, active)
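The rule above can be sketched as a small simulation. This is a simplified model of time-based retention (segment names, ages, and the helper function are illustrative, not Kafka internals): a whole closed segment is deleted once its newest record falls outside the retention window.

```python
RETENTION_HOURS = 168  # mirrors log.retention.hours (7 days)

def expired_segments(segments, now_hours, retention_hours=RETENTION_HOURS):
    """Simplified model of time-based retention.

    `segments` is a list of (name, last_write_hours, is_active) tuples.
    A whole segment is deleted once its newest record is older than the
    retention window; the active (open) segment is never deleted.
    """
    return [
        name
        for name, last_write_hours, is_active in segments
        if not is_active and now_hours - last_write_hours > retention_hours
    ]

segments = [
    ("segment-0.log", 10, False),   # newest record written at t=10h
    ("segment-1.log", 150, False),  # newest record written at t=150h
    ("segment-2.log", 195, True),   # active segment, still being written
]

# At t=200h, only segment-0 (age 190h) exceeds the 168h window.
print(expired_segments(segments, now_hours=200))
# → ['segment-0.log']
```

Note that retention granularity is the segment, not the message: a message can outlive its configured retention until the segment containing it rolls and expires as a whole.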

Log Compaction

Instead of deleting messages by time, Kafka keeps only the latest value for each unique key. It's as if you have a table where each key's row is updated in-place.

log.cleanup.policy=compact

Before compaction:

Offset  Key      Value
0       user-1   {"name": "Alice"}
1       user-2   {"name": "Bob"}
2       user-1   {"name": "Alice Smith"}  ← newer value for user-1
3       user-3   {"name": "Charlie"}
4       user-2   null                     ← tombstone (delete marker)

After compaction:

Offset  Key      Value
2       user-1   {"name": "Alice Smith"}  ← latest value kept
3       user-3   {"name": "Charlie"}
                                          ← user-2 deleted (tombstone)
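The compaction pass above can be modeled in a few lines. This is a minimal sketch of the semantics (later records win per key, a null value is a tombstone), not Kafka's actual log-cleaner implementation:

```python
def compact(log):
    """Simulate Kafka log compaction: keep only the latest record per key.

    `log` is a list of (offset, key, value) tuples in offset order; a
    value of None is a tombstone that deletes the key entirely.
    """
    latest = {}  # key -> (offset, value); later records overwrite earlier ones
    for offset, key, value in log:
        latest[key] = (offset, value)
    # Surviving records in offset order; tombstoned keys are dropped.
    return sorted(
        (offset, key, value)
        for key, (offset, value) in latest.items()
        if value is not None
    )

log = [
    (0, "user-1", '{"name": "Alice"}'),
    (1, "user-2", '{"name": "Bob"}'),
    (2, "user-1", '{"name": "Alice Smith"}'),
    (3, "user-3", '{"name": "Charlie"}'),
    (4, "user-2", None),  # tombstone
]

print(compact(log))
# → [(2, 'user-1', '{"name": "Alice Smith"}'), (3, 'user-3', '{"name": "Charlie"}')]
```

Note that surviving records keep their original offsets (2 and 3 here); compaction removes records but never renumbers the ones that remain.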

When to Use Which?

| Strategy | Policy | Use Case |
|---|---|---|
| Time/Size Retention | delete (default) | Event streams, logs, metrics (you care about events over time) |
| Log Compaction | compact | State snapshots, CDC changes, config updates (you care about latest state per key) |
| Both | compact,delete | Compacted, but also enforces a time limit on old keys |

Real-World Examples

Compaction:

  • __consumer_offsets — Kafka's internal topic for consumer offsets (only latest offset per group/partition matters).
  • CDC topics (Debezium) — latest row state per primary key.
  • User profile cache — latest profile per user ID.

Time retention:

  • Clickstream events, application logs, order events.

> [!IMPORTANT]
> Compaction is not immediate. A background thread called the "log cleaner" periodically compacts segments. Between compactions, both old and new values for a key may exist. Never rely on compaction for real-time deduplication — it's an eventual cleanup mechanism.