Reliability + System Design
Health checks in system design
Health checks help infrastructure decide whether an instance is alive, whether it is ready for traffic, and whether it should be restarted. They are essential for load balancing, deployments, and auto-recovery.
The Short Answer
A health check is a small test used to decide whether a service instance is healthy enough to run, receive traffic, or stay in rotation.
The Real Problem
Imagine a load balancer sending traffic to three app servers.
No Health Checks
With Health Checks
Health checks help infrastructure stop sending traffic to instances that are crashed, stuck, overloaded, still starting, or unable to serve requests correctly.
Three Common Types of Health Checks
Liveness
Is the app alive?
If this fails, the platform may restart the container.
Readiness
Is the app ready for traffic?
If this fails, traffic should not be routed here.
Startup
Has the app finished starting?
Useful for slow-starting apps so they are not killed too early.
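To make the relationship between the three probes concrete, here is a minimal sketch. It assumes a Kubernetes-style platform where liveness and readiness results only count once the startup check has passed. The class and method names (`ProbeGate`, `recordStartup`, `shouldRestart`, `shouldRoute`) are hypothetical, and real platforms also apply failure thresholds that this sketch leaves out.

```java
// Hypothetical model of how a platform might combine the three probe results.
final class ProbeGate {
    private boolean startupPassed = false;

    // Startup probe: once it passes, the other probes begin to count.
    void recordStartup(boolean passed) {
        if (passed) {
            startupPassed = true;
        }
    }

    // A liveness failure leads to a restart, but only after startup has completed.
    boolean shouldRestart(boolean livenessPassed) {
        return startupPassed && !livenessPassed;
    }

    // A readiness failure only removes the instance from traffic; it does not restart it.
    boolean shouldRoute(boolean readinessPassed) {
        return startupPassed && readinessPassed;
    }
}
```

Under these assumptions, a slow-starting app that fails liveness during boot is not restarted, because `shouldRestart` stays false until `recordStartup(true)` has been called.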
Liveness vs Readiness: The Most Important Distinction
This distinction matters a lot in interviews and production systems.
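Liveness Failure
The process is considered dead or stuck, so the platform may restart the container.
Readiness Failure
The process keeps running, but traffic is not routed to it until the check passes again.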
Example: an app may be alive but still warming up caches, applying migrations, overloaded, or unable to reach a required dependency. That is a readiness problem, not necessarily a liveness problem.
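The example above can be sketched from the application's side. This is an illustrative model only: `InstanceState` and its fields are hypothetical names, and the `stuck` flag stands in for whatever signal a real service would use to detect that it is no longer making progress.

```java
// Hypothetical holder for the signals behind the liveness and readiness answers.
final class InstanceState {
    private volatile boolean stuck = false;
    private volatile boolean warmedUp = false;
    private volatile boolean dependencyReachable = true;

    // Liveness: the process is running and not obviously stuck.
    boolean isLive() {
        return !stuck;
    }

    // Readiness: live, finished warming up, and able to reach the required dependency.
    boolean isReady() {
        return isLive() && warmedUp && dependencyReachable;
    }

    void markStuck() { stuck = true; }

    void markWarmedUp() { warmedUp = true; }

    void setDependencyReachable(boolean reachable) { dependencyReachable = reachable; }

    // Maps a check result to the HTTP status a probe endpoint would return.
    static int statusFor(boolean ok) {
        return ok ? 200 : 503;
    }
}
```

A freshly created `InstanceState` is live but not ready, which is exactly the "alive but still warming up" case: the instance should be kept running but left out of rotation.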
Simple Spring Boot Health Endpoint
A very simple health endpoint may look like this:
    import org.springframework.web.bind.annotation.GetMapping;
    import org.springframework.web.bind.annotation.RestController;

    @RestController
    public class HealthController {

        @GetMapping("/health")
        public String health() {
            return "OK";
        }
    }
This only proves the web server can respond. It does not prove the whole application is ready to serve real user traffic.
A Better Readiness Check
A readiness check may verify important dependencies needed to serve traffic.
    import org.springframework.http.ResponseEntity;
    import org.springframework.web.bind.annotation.GetMapping;
    import org.springframework.web.bind.annotation.RestController;

    @RestController
    public class ReadinessController {

        private final DatabaseClient databaseClient;

        public ReadinessController(DatabaseClient databaseClient) {
            this.databaseClient = databaseClient;
        }

        @GetMapping("/ready")
        public ResponseEntity<String> ready() {
            // 503 Service Unavailable tells the caller not to route traffic here yet.
            if (!databaseClient.canConnect()) {
                return ResponseEntity
                        .status(503)
                        .body("Database unavailable");
            }
            return ResponseEntity.ok("READY");
        }

        // Placeholder abstraction; a real application would supply an implementation as a bean.
        interface DatabaseClient {
            boolean canConnect();
        }
    }
If this returns 503, the load balancer or orchestrator can stop routing traffic to that instance until it becomes ready again.
But Do Not Make Health Checks Too Heavy
Health checks should be useful, but they should not overload your own system.
Too shallow
Only checking that the process responds may miss broken dependencies.
Too deep
Checking every dependency on every health request can create extra load and false failures.
Good liveness check
Checks whether the process is alive and not obviously stuck.
Good readiness check
Checks whether the instance can serve the important traffic it is about to receive.
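One way to keep a dependency check useful without running it on every probe is to cache its last result for a short time. The following is a minimal sketch of that idea, not a prescribed implementation: `CachedCheck`, the TTL parameter, and the injected clock are all assumptions made for illustration.

```java
import java.util.function.BooleanSupplier;
import java.util.function.LongSupplier;

// Hypothetical wrapper that reruns an expensive check at most once per TTL window.
final class CachedCheck {
    private final BooleanSupplier expensiveCheck;
    private final long ttlMillis;
    private final LongSupplier clockMillis;

    private boolean hasResult = false;
    private boolean lastResult = false;
    private long lastRunMillis = 0L;

    CachedCheck(BooleanSupplier expensiveCheck, long ttlMillis, LongSupplier clockMillis) {
        this.expensiveCheck = expensiveCheck;
        this.ttlMillis = ttlMillis;
        this.clockMillis = clockMillis;
    }

    // Returns the cached result while it is fresh; otherwise reruns the check.
    synchronized boolean isHealthy() {
        long now = clockMillis.getAsLong();
        if (!hasResult || now - lastRunMillis >= ttlMillis) {
            lastResult = expensiveCheck.getAsBoolean();
            lastRunMillis = now;
            hasResult = true;
        }
        return lastResult;
    }
}
```

In production the clock would typically be `System::currentTimeMillis`. The trade-off is that a failure can go unnoticed for up to one TTL window, so the TTL should be short relative to how quickly the instance needs to leave rotation.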
Health Checks and Load Balancers
Load balancers use health checks to decide which instances should receive traffic.
Load Balancer
    checks /ready on App 1 → 200 OK
    checks /ready on App 2 → 503
    checks /ready on App 3 → 200 OK
Traffic goes to App 1 and App 3 only.
This is a major reason health checks improve availability: they help route around bad instances automatically.
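The routing decision in the diagram can be sketched in a few lines. This is a simplified model, assuming that only a 200 response counts as healthy and that traffic is spread round-robin; `HealthAwareBalancer` and its methods are hypothetical names, and real load balancers apply thresholds and timeouts on top of this.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

// Hypothetical balancer that routes only to instances whose last /ready probe returned 200.
final class HealthAwareBalancer {
    private final Map<String, Integer> lastStatus = new LinkedHashMap<>();
    private int next = 0;

    // Stores the HTTP status of the most recent probe for an instance.
    void recordProbe(String instance, int httpStatus) {
        lastStatus.put(instance, httpStatus);
    }

    // Lists instances currently in rotation, in registration order.
    List<String> healthyInstances() {
        List<String> healthy = new ArrayList<>();
        for (Map.Entry<String, Integer> entry : lastStatus.entrySet()) {
            if (entry.getValue() == 200) {
                healthy.add(entry.getKey());
            }
        }
        return healthy;
    }

    // Picks the next healthy instance round-robin, or empty if none is healthy.
    Optional<String> pick() {
        List<String> healthy = healthyInstances();
        if (healthy.isEmpty()) {
            return Optional.empty();
        }
        String chosen = healthy.get(next % healthy.size());
        next = (next + 1) % healthy.size();
        return Optional.of(chosen);
    }
}
```

Recording 200, 503, and 200 for App 1, App 2, and App 3 leaves only App 1 and App 3 in `healthyInstances()`, matching the diagram.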
Health Checks and Deployments
Health checks are also important during deployments.
Deploy new version
↓
New instance starts
↓
Startup/readiness checks run
↓
Only after passing readiness does it receive traffic
This helps avoid sending users to a version that has started but is not actually ready yet.
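A deployment tool can implement the last step of this flow by polling readiness before adding the new instance to rotation. The sketch below assumes a simple fixed-delay retry; `RolloutGate`, `waitUntilReady`, and the parameters are hypothetical, and the probe is passed in as a function so the example does not depend on any particular HTTP client.

```java
import java.util.function.BooleanSupplier;

// Hypothetical deployment gate: the new instance joins rotation only after readiness passes.
final class RolloutGate {

    // Polls the readiness probe up to maxAttempts times, sleeping between attempts.
    // Returns true as soon as the probe passes, false if it never does.
    static boolean waitUntilReady(BooleanSupplier readinessProbe, int maxAttempts, long delayMillis) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            if (readinessProbe.getAsBoolean()) {
                return true;
            }
            if (attempt < maxAttempts) {
                try {
                    Thread.sleep(delayMillis);
                } catch (InterruptedException e) {
                    // Preserve the interrupt and treat the rollout step as not ready.
                    Thread.currentThread().interrupt();
                    return false;
                }
            }
        }
        return false;
    }
}
```

If `waitUntilReady` returns false, a deployment would typically stop or roll back rather than shift traffic to the new version.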
Common Things to Check
What you check depends on the system, but common checks include:
- process is alive
- HTTP server can respond
- database connection pool is usable
- required cache or queue is reachable
- disk is not full
- critical config loaded correctly
- instance is not overloaded
- startup/warmup has completed
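Several of the checks in this list can be combined behind one readiness answer. The following sketch is illustrative only: `CompositeReadiness`, `register`, and `diskHasAtLeast` are hypothetical names, and a check that throws is treated as a failure by assumption. The disk check uses the standard `java.io.File.getUsableSpace()` call.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BooleanSupplier;

// Hypothetical aggregator: the instance is ready only if every registered check passes.
final class CompositeReadiness {
    private final Map<String, BooleanSupplier> checks = new LinkedHashMap<>();

    CompositeReadiness register(String name, BooleanSupplier check) {
        checks.put(name, check);
        return this;
    }

    // Returns the names of failing checks in registration order; empty means ready.
    List<String> failingChecks() {
        List<String> failing = new ArrayList<>();
        for (Map.Entry<String, BooleanSupplier> entry : checks.entrySet()) {
            boolean passed;
            try {
                passed = entry.getValue().getAsBoolean();
            } catch (RuntimeException e) {
                // Assumption: a check that throws counts as a failed check.
                passed = false;
            }
            if (!passed) {
                failing.add(entry.getKey());
            }
        }
        return failing;
    }

    boolean isReady() {
        return failingChecks().isEmpty();
    }

    // Example check: the volume containing 'path' has at least minFreeBytes usable.
    static BooleanSupplier diskHasAtLeast(File path, long minFreeBytes) {
        return () -> path.getUsableSpace() >= minFreeBytes;
    }
}
```

Keeping the names of failing checks makes them available for logs, while the probe endpoint itself can return only a status code.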
What Not to Check
A health check should not become a full integration test for the entire company.
- Do not call dozens of downstream services every few seconds.
- Do not perform expensive database queries.
- Do not mutate production data.
- Do not make health checks depend on optional features.
- Do not expose sensitive internal details publicly.
Health Checks vs Monitoring
Health checks and monitoring are related, but they are not the same.
Health check
A specific signal used by infrastructure to make an action decision, such as route traffic or restart a container.
Monitoring
A broader view of system behavior over time: metrics, logs, traces, dashboards, alerts, and trends.
A service can pass its health check but still have problems visible in monitoring, such as high p99 latency, rising error rate, or slow database queries.
Health Check Design Questions
In interviews, it helps to ask:
- Is this check for liveness, readiness, or startup?
- Who consumes the result: load balancer, Kubernetes, monitor, or human?
- What action happens when it fails?
- Should this instance be restarted or only removed from traffic?
- Which dependencies are critical for this endpoint?
- Could this check cause cascading failure?
- How often will it run?
- What timeout should the health check use?
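The questions about frequency and timeouts usually lead to a threshold: a single failed probe should not flip an instance's state. Here is a minimal sketch of that idea, assuming consecutive-failure and consecutive-success thresholds; `ProbeTracker` and its parameters are hypothetical, and a probe that exceeds its timeout is expected to be recorded by the caller as a failure.

```java
// Hypothetical tracker that changes state only after several consecutive probe results.
final class ProbeTracker {
    private final int failureThreshold;
    private final int successThreshold;
    private int consecutiveFailures = 0;
    private int consecutiveSuccesses = 0;
    private boolean healthy = true;

    ProbeTracker(int failureThreshold, int successThreshold) {
        this.failureThreshold = failureThreshold;
        this.successThreshold = successThreshold;
    }

    // Records one probe result and returns the resulting health state.
    boolean record(boolean probePassed) {
        if (probePassed) {
            consecutiveFailures = 0;
            consecutiveSuccesses++;
            if (!healthy && consecutiveSuccesses >= successThreshold) {
                healthy = true;
            }
        } else {
            consecutiveSuccesses = 0;
            consecutiveFailures++;
            if (healthy && consecutiveFailures >= failureThreshold) {
                healthy = false;
            }
        }
        return healthy;
    }

    boolean isHealthy() {
        return healthy;
    }
}
```

With a failure threshold of 3, detection takes roughly three probe intervals, so the interval and threshold together determine how long a bad instance keeps receiving traffic.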
How to Answer This in an Interview
Common Interview Follow-Ups
What is the difference between liveness and readiness?
Liveness asks whether the process is alive and should keep running. Readiness asks whether the instance is ready to receive traffic.
Should a readiness check include the database?
If the service cannot serve its main traffic without the database, then yes, a lightweight database check can make sense. But it should be cheap and should not check optional dependencies unnecessarily.
Can health checks cause outages?
Yes. If health checks are too strict or depend on a shared failing dependency, all instances may mark themselves unhealthy at once and be removed from traffic.
Is /health enough?
Usually not by itself. A basic /health endpoint only proves the process can respond. Production systems often separate /live, /ready, and sometimes /startup.
What mistake do candidates make?
They say "add a health check" but do not explain what it checks, who consumes it, what action happens on failure, or how it avoids false failures.