Q: What is Kafka Connect?

Answer:

Kafka Connect is a framework for reliably and scalably streaming data between Kafka and external systems (databases, search indexes, filesystems, cloud services) without writing custom integration code.

How It Works

Kafka Connect runs as a separate, scalable cluster of worker processes. You define pipelines with JSON configuration submitted over the workers' REST API; no Java code is required.

```
                    Kafka Connect
External Source ──▶ [Source Connector] ──▶ Kafka Topic
Kafka Topic     ──▶ [Sink Connector]   ──▶ External Sink
```
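
Connectors are managed through that REST API (port 8083 by default). A minimal sketch of registering a connector in Python, assuming a worker is reachable at `localhost:8083` (the URL and config values are illustrative):

```python
import requests  # third-party: pip install requests

CONNECT_URL = "http://localhost:8083"  # assumed Connect worker address

# Any connector config from this answer can be submitted the same way.
connector = {
    "name": "postgres-source",
    "config": {
        "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
        "database.hostname": "db.example.com",
        "database.port": "5432",
        "database.dbname": "orders_db",
        "topic.prefix": "cdc",
    },
}

# POST /connectors creates the connector: 201 on success, 409 if it already exists.
resp = requests.post(f"{CONNECT_URL}/connectors", json=connector)
resp.raise_for_status()
print(resp.json())
```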

Source Connectors

Read data from an external system and write it to Kafka topics.

```json
{
    "name": "postgres-source",
    "config": {
        "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
        "database.hostname": "db.example.com",
        "database.port": "5432",
        "database.user": "cdc_user",
        "database.password": "cdc_password",
        "database.dbname": "orders_db",
        "topic.prefix": "cdc"
    }
}
```

This captures every INSERT/UPDATE/DELETE from Postgres and streams it to topics such as `cdc.public.orders` and `cdc.public.users`, following the `<topic.prefix>.<schema>.<table>` naming convention.
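
Downstream consumers read these change events like any other Kafka records. A sketch of inspecting the Debezium envelope with the kafka-python client, assuming the JSON converter with schemas disabled and a broker at `localhost:9092`:

```python
import json
from kafka import KafkaConsumer  # pip install kafka-python

# Assumes value.converter is JsonConverter with schemas.enable=false.
consumer = KafkaConsumer(
    "cdc.public.orders",
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda v: json.loads(v) if v else None,
    auto_offset_reset="earliest",
)

for msg in consumer:
    event = msg.value
    if event is None:
        continue  # Debezium emits null-value tombstones after deletes
    # Debezium envelope: "op" is c (create), u (update), d (delete), r (snapshot);
    # "before"/"after" hold the row state on either side of the change.
    print(event["op"], event.get("before"), event.get("after"))
```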

Sink Connectors

Read data from Kafka topics and write it to an external system.

```json
{
    "name": "elasticsearch-sink",
    "config": {
        "connector.class": "io.confluent.connect.elasticsearch.ElasticsearchSinkConnector",
        "topics": "orders",
        "connection.url": "http://es.example.com:9200",
        "type.name": "_doc"
    }
}
```
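
Once submitted, a connector's health can be checked over the same REST API. A sketch, again assuming a worker at `localhost:8083`:

```python
import requests  # pip install requests

CONNECT_URL = "http://localhost:8083"  # assumed worker address

# GET /connectors/{name}/status reports the connector state and each task's
# state (RUNNING, PAUSED, or FAILED, with a stack trace on failure).
status = requests.get(f"{CONNECT_URL}/connectors/elasticsearch-sink/status").json()
print("connector:", status["connector"]["state"])
for task in status["tasks"]:
    print(f"task {task['id']}:", task["state"])
```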

Popular Connectors

| Connector | Direction | Use Case |
|---|---|---|
| Debezium (PostgreSQL/MySQL) | Source | Change Data Capture (CDC) |
| JDBC Connector | Source/Sink | Generic SQL database sync |
| Elasticsearch | Sink | Search indexing |
| S3 Sink | Sink | Data lake / archival |
| BigQuery Sink | Sink | Analytics warehouse |
| File Stream | Source/Sink | CSV/log file ingestion |

Standalone vs Distributed Mode

| Mode | Workers | Use Case |
|---|---|---|
| Standalone | 1 | Development, testing |
| Distributed | Multiple | Production (fault-tolerant, scalable) |

In distributed mode, workers coordinate through Kafka itself, storing connector configs, offsets, and statuses in internal topics. If a worker dies, its connectors and tasks are automatically reassigned to the surviving workers.
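
Failures that rebalancing cannot mask (e.g., a task repeatedly crashing on a bad record) surface as a FAILED task state, which can be restarted via the REST API. A sketch assuming the same `localhost:8083` worker and an illustrative connector name:

```python
import requests  # pip install requests

CONNECT_URL = "http://localhost:8083"  # assumed worker address
NAME = "postgres-source"               # illustrative connector name

# Restart any task that the status endpoint reports as FAILED.
status = requests.get(f"{CONNECT_URL}/connectors/{NAME}/status").json()
for task in status["tasks"]:
    if task["state"] == "FAILED":
        # POST /connectors/{name}/tasks/{id}/restart restarts a single task.
        requests.post(f"{CONNECT_URL}/connectors/{NAME}/tasks/{task['id']}/restart")
```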

Why Not Just Write a Custom Producer/Consumer?

  • Built-in offset tracking — Connect tracks source positions automatically.
  • Fault tolerance — automatic failover in distributed mode.
  • Schema evolution — integrates with Schema Registry.
  • Configurable transforms — Single Message Transforms (SMTs) for lightweight, per-record data manipulation (see the sketch after this list).
  • No code to maintain — just JSON config.
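
SMTs are declared as ordinary keys in the connector config. A sketch using Kafka's built-in RegexRouter to rename CDC topics before they reach a sink (the alias `route` is arbitrary, and the regex matches the Debezium topics above):

```python
# RegexRouter ships with Kafka Connect and rewrites each record's topic name.
smt_config = {
    "transforms": "route",
    "transforms.route.type": "org.apache.kafka.connect.transforms.RegexRouter",
    "transforms.route.regex": "cdc\\.public\\.(.*)",   # e.g. cdc.public.orders
    "transforms.route.replacement": "$1",              # -> orders
}

# Merge into any connector's "config" block before POSTing it, e.g.:
# connector["config"].update(smt_config)
```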

> [!TIP]
> In interviews, Debezium + Kafka Connect for CDC is a particularly strong topic. It's the industry standard for streaming database changes (e.g., syncing a PostgreSQL write model to an Elasticsearch read model in real time).