System Design
Why Use Kafka?
Understand when Kafka and similar messaging systems become useful, as a system moves from direct service calls to durable, event-driven communication.
The Short Version
Before Kafka: A Small, Simple System
Imagine a small online store. At first, the checkout service does everything directly.
Initial checkout flow
- User places order
- Checkout service saves order
- Checkout service charges payment
- Checkout service sends confirmation email
This is simple and completely reasonable when the system is small. There may be only one backend service, one database, and a small amount of traffic.
The System Starts Growing
Over time, more features get added around the same order event.
More things now happen after an order
- Charge payment
- Send email
- Update inventory
- Notify warehouse
- Update loyalty points
- Send analytics event
- Trigger fraud review
- Notify shipping service
The checkout service now has to know about many other systems. Every new feature adds another dependency to checkout.
The Problem With Direct Calls
In a direct synchronous design, checkout calls each downstream service in turn and waits for every response before it can reply to the user.
Figure: Synchronous checkout
This creates several problems:
- The checkout request becomes slower as more calls are added.
- If one downstream service is slow, checkout may become slow.
- If one downstream service is down, checkout must decide whether to fail the order, retry, or continue without that step, and the logic gets messy.
- The checkout service becomes tightly coupled to many systems.
- Adding a new consumer requires changing checkout again.
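The first three problems can be sketched with a toy model. This is only an illustration: `Service`, `DownstreamError`, and `checkout_sync` are hypothetical names, and latency is simulated as a number rather than measured from real network calls.

```python
from dataclasses import dataclass


class DownstreamError(Exception):
    """Raised when a (simulated) downstream service is unavailable."""


@dataclass
class Service:
    # Hypothetical stand-in for a remote service the checkout calls.
    name: str
    latency_ms: int
    healthy: bool = True


def checkout_sync(order_id, services):
    """Call every downstream service inline, one after another."""
    waited_ms = 0
    completed = []
    for svc in services:
        waited_ms += svc.latency_ms  # the user waits for each call in turn
        if not svc.healthy:
            # One unavailable dependency fails the whole checkout request.
            raise DownstreamError(
                f"{svc.name} failed for order {order_id} after {waited_ms} ms"
            )
        completed.append(svc.name)
    return completed, waited_ms


services = [Service("payment", 120), Service("email", 80), Service("inventory", 60)]
completed, waited_ms = checkout_sync("o-1001", services)
print(completed, waited_ms)  # ['payment', 'email', 'inventory'] 260
```

Every service added to the list increases `waited_ms`, and marking any one of them `healthy=False` makes the whole checkout raise.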
The Kafka Version
With Kafka, checkout can publish an event instead of directly calling every downstream system.
One Event, Many Consumers
Now checkout only needs to announce one fact: an order was created.
Other services that subscribe to the order-created event are notified when an order is created. They can then independently decide what they need to do with that event.
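A minimal sketch of that fan-out, using an in-memory stand-in rather than a real Kafka client. `InMemoryBus`, the `order-created` topic name, and the event fields are assumptions for illustration. A real broker would store the event and let consumers pull it on their own schedule, whereas this toy calls handlers inline to stay short.

```python
from collections import defaultdict


class InMemoryBus:
    """Toy stand-in for a broker: topic name -> list of subscriber callbacks."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, topic, handler):
        self._subscribers[topic].append(handler)

    def publish(self, topic, event):
        # Deliver the same event to every subscriber of the topic.
        for handler in self._subscribers[topic]:
            handler(event)


bus = InMemoryBus()
reactions = []

# Each downstream concern registers itself; checkout is never edited.
bus.subscribe("order-created", lambda e: reactions.append(("email", e["order_id"])))
bus.subscribe("order-created", lambda e: reactions.append(("inventory", e["order_id"])))
bus.subscribe("order-created", lambda e: reactions.append(("analytics", e["order_id"])))


def checkout(order_id):
    # Checkout states one fact and names no downstream service.
    bus.publish("order-created", {"order_id": order_id})


checkout("o-1001")
print(reactions)
```

Adding a fourth consumer here means one more `subscribe` call; `checkout` itself does not change.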
Why This Helps
Loose Coupling
The producer does not need to know every service that reacts to the event.
Better Responsiveness
Checkout can finish faster instead of waiting for every downstream action.
Independent Consumers
Email, inventory, analytics, and warehouse logic can evolve separately.
Replay
A consumer can reprocess old events if it needs to rebuild state or recover from a bug.
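Replay works because events sit in an ordered log and each consumer tracks its own position (offset) in it. A rough sketch, assuming a hypothetical loyalty-points consumer that shipped with a bug; `EventLog` and `PointsConsumer` are illustrative names, not Kafka classes.

```python
class EventLog:
    """Append-only list standing in for a single topic partition."""

    def __init__(self):
        self._events = []

    def append(self, event):
        self._events.append(event)
        return len(self._events) - 1  # offset of the newly written event

    def read_from(self, offset):
        return self._events[offset:]


class PointsConsumer:
    """Hypothetical consumer that awards loyalty points per order."""

    def __init__(self, log, points_per_order):
        self.log = log
        self.points_per_order = points_per_order
        self.offset = 0  # next offset this consumer will read
        self.points = {}

    def poll(self):
        for event in self.log.read_from(self.offset):
            user = event["user"]
            self.points[user] = self.points.get(user, 0) + self.points_per_order
            self.offset += 1

    def replay_from_start(self):
        # Throw away derived state and rebuild it from the stored events.
        self.points = {}
        self.offset = 0
        self.poll()


log = EventLog()
for user in ["ana", "ben", "ana"]:
    log.append({"type": "order-created", "user": user})

consumer = PointsConsumer(log, points_per_order=1)  # bug: should have been 10
consumer.poll()
consumer.points_per_order = 10  # deploy the fix
consumer.replay_from_start()    # reprocess the old events
print(consumer.points)  # {'ana': 20, 'ben': 10}
```

The producer is not involved in the recovery at all; the consumer simply moves its offset back and reads again.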
Scalability
High-volume event streams can be partitioned and processed by multiple consumers.
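The idea behind partitioning can be sketched in a few lines: a stable hash of the event key picks a partition, and the partitions are divided among the consumers in a group. This is a simplification. `zlib.crc32` and the round-robin assignment are stand-ins chosen for the sketch (the Kafka Java client's default partitioner uses a different hash for keyed records), and the worker names are made up.

```python
import zlib


def partition_for(key, num_partitions):
    """Map a key to a partition with a stable hash (crc32 as a stand-in)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def assign(num_partitions, consumers):
    """Spread partitions round-robin over the consumers in one group."""
    assignment = {consumer: [] for consumer in consumers}
    for partition in range(num_partitions):
        owner = consumers[partition % len(consumers)]
        assignment[owner].append(partition)
    return assignment


NUM_PARTITIONS = 6
assignment = assign(NUM_PARTITIONS, ["worker-a", "worker-b", "worker-c"])

# The same key always lands on the same partition, so events for one
# order are read by one consumer and keep their relative order.
first = partition_for("order-42", NUM_PARTITIONS)
second = partition_for("order-42", NUM_PARTITIONS)
print(assignment)
```

Ordering is only preserved within a partition, which is why the choice of key (for example, the order ID) matters.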
Durability
Events are stored in Kafka for a configurable retention period, whether or not they have been consumed, instead of disappearing as soon as they are read.
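Time-based retention can be pictured as a filter on record age rather than on whether anyone has read the record. The sketch below is a toy model with made-up timestamps; `Record` and `apply_retention` are hypothetical names, and a real broker removes whole log segments rather than individual records.

```python
from dataclasses import dataclass


@dataclass
class Record:
    offset: int
    timestamp: float  # seconds, on an arbitrary clock for this sketch
    value: str


def apply_retention(records, now, retention_seconds):
    """Keep records younger than the retention window, read or unread."""
    return [r for r in records if now - r.timestamp < retention_seconds]


records = [
    Record(0, 0.0, "order-1"),
    Record(1, 50.0, "order-2"),
    Record(2, 90.0, "order-3"),
]
kept = apply_retention(records, now=100.0, retention_seconds=60.0)
print([r.offset for r in kept])  # [1, 2]
```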
When Kafka Is Probably Overkill
Kafka is powerful, but it is not automatically the right answer. It is probably overkill when:
- The system is small and synchronous calls are simple.
- You only have one producer and one consumer.
- You do not need replay or durable event history.
- The operational and development complexity Kafka introduces would outweigh its benefits.
- A simple queue or direct call would solve the problem.
What Operational Complexity Means
Running Kafka brings ongoing work that a small system may not need:
- You must run and monitor Kafka brokers.
- You must manage topics, partitions, retention, and disk usage.
- You must track consumer lag to know if consumers are falling behind.
- Consumers may receive the same message more than once, so handlers must be idempotent, which complicates the code.
- Failures become asynchronous and harder to debug than a direct API call.
- Developers must understand offsets, retries, dead letter queues, ordering, and consumer groups.
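Two items from this list are easy to sketch: idempotent handling and consumer lag. The example assumes each event carries a unique `event_id` field, which is an assumption of this sketch and not something Kafka adds for you. `EmailConsumer` and `consumer_lag` are hypothetical names, and a production consumer would keep the seen IDs in durable storage instead of memory.

```python
class EmailConsumer:
    """Hypothetical consumer that ignores events it has already handled."""

    def __init__(self):
        self.seen_ids = set()  # in-memory only; durable storage in practice
        self.sent = []

    def handle(self, event):
        if event["event_id"] in self.seen_ids:
            return False  # duplicate delivery: do nothing
        self.sent.append(event["order_id"])  # stand-in for sending an email
        self.seen_ids.add(event["event_id"])
        return True


def consumer_lag(latest_offset, committed_offset):
    """How many events the consumer still has to process."""
    return latest_offset - committed_offset


consumer = EmailConsumer()
deliveries = [
    {"event_id": "e-1", "order_id": "o-1"},
    {"event_id": "e-2", "order_id": "o-2"},
    {"event_id": "e-1", "order_id": "o-1"},  # redelivered after a retry
]
handled = [consumer.handle(event) for event in deliveries]
print(consumer.sent)  # ['o-1', 'o-2']
print(consumer_lag(latest_offset=1500, committed_offset=1200))  # 300
```

Without the `seen_ids` check, the customer for `o-1` would receive two confirmation emails. A lag that keeps growing over time is the usual sign that consumers cannot keep up with producers.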
What Interviewers Are Looking For
You Understand the Why
Kafka is not just an API. It solves coupling, fan-out, replay, buffering, and high-volume event processing problems.
You Know Event-Driven Design
Services publish facts about things that happened, and other services react independently.
You Know the Tradeoff
Kafka improves decoupling but introduces async behavior, eventual consistency, and operational complexity.
You Can Explain Growth
You can start with a simple system and explain the point where direct calls become painful.