ADR-009: Eager MCP Server Initialization with K8s Health Check
Status
Accepted
Date
2026-05-29
Context
When deploying to a new Kubernetes cluster, the MCP server would start up, bind to the HTTP port, and report "listening" — but the first request would fail silently because the Kubernetes API was unreachable (wrong service account, missing RBAC, network policy, etc.).
The root cause was the lazy initialization pattern in startHttpServer():
- initK8sContext() only loads the kubeconfig (no network call).
- The Fastify server binds to the port immediately.
- createMcpServer() is deferred to the first HTTP request inside getMcpServerAndTransport().
- evaluatePermissions(), which actually calls the K8s API, only runs inside createMcpServer().
This created two failure modes:
- Silent startup: The server reports healthy but every request fails.
- CrashLoopBackOff: Kubernetes restarts the pod because health probes fail on the first request, so operators see the init logs repeat without ever reaching the "listening" log.
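The lazy pattern described above can be pictured with a simplified, synchronous stand-in. Only evaluatePermissions is a name from this ADR; LazyServer, start, and handleRequest are hypothetical, and the real code is asynchronous and runs inside Fastify. This is a sketch of why "listening" can appear before the first failure, not the project's implementation.

```typescript
// Simplified, synchronous stand-in for the lazy pattern (hypothetical names).
class LazyServer {
  readonly events: string[] = [];
  private server: string | undefined = undefined;
  private readonly evaluatePermissions: () => string;

  constructor(evaluatePermissions: () => string) {
    this.evaluatePermissions = evaluatePermissions;
  }

  // Startup never touches the Kubernetes API, so it always appears to succeed.
  start(): void {
    this.events.push("listening");
  }

  // The first request is where the API is contacted for the first time.
  handleRequest(): string {
    if (this.server === undefined) {
      this.server = this.evaluatePermissions(); // an unreachable API throws here
    }
    return this.server;
  }
}
```

With an evaluatePermissions stub that throws, start() still records "listening" and the error only surfaces in handleRequest(), which mirrors the silent-startup failure mode.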
Decision
Move to eager initialization at startup:
- K8s API health check: Before creating the MCP server, probe the Kubernetes API with a single listNamespacedPod(limit: 1) call. If it fails, throw immediately with an actionable error message.
- Eager MCP server creation: In non-stateless mode, createMcpServer() is called during startHttpServer(), before binding the port. This validates RBAC permissions upfront.
- Actionable error diagnostics: The main().catch() handler in server-entry.ts now logs structured error messages with HINT diagnostics for common failure modes (ECONNREFUSED, Unauthorized).
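The HINT diagnostics could look roughly like the helper below. The function name describeStartupError, the ERROR/HINT output layout, and the substring matching are assumptions made for illustration. The ECONNREFUSED hint text is the one quoted in this ADR; the Unauthorized hint wording is invented here.

```typescript
// Hypothetical helper; matching rules and output layout are illustrative only.
function describeStartupError(err: unknown): string {
  const message = err instanceof Error ? err.message : String(err);
  let hint = "";
  if (message.includes("ECONNREFUSED")) {
    // Hint text quoted in this ADR.
    hint = "Check KUBERNETES_SERVICE_HOST, service account mount, and network policies";
  } else if (message.includes("Unauthorized")) {
    // Assumed wording; the ADR only names the failure mode.
    hint = "Check the service account token and RBAC bindings";
  }
  return hint === "" ? `ERROR: ${message}` : `ERROR: ${message}\nHINT: ${hint}`;
}
```

A main().catch() handler could pass the caught error through such a helper, write the result to stderr, and exit with a non-zero code so that Kubernetes records the failed start.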
The startup sequence is now:
startHttpServer()
→ Validate K8s API connectivity (listNamespacedPod probe)
→ Create MCP server (evaluatePermissions)
→ Register UI app
→ Bind Fastify to port
→ Log "listening"
If any step fails, the process exits with a clear error before binding the port.
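The ordering above can be sketched with injected dependencies, so the sequence is visible without a real cluster. StartupDeps, startEagerly, and the field names other than createMcpServer are hypothetical. In the real server the probe would be the listNamespacedPod call and listen would be the Fastify bind. The stateless flag models the STATELESS=true mode.

```typescript
// Hypothetical dependency bundle; the real project wires these to the
// Kubernetes client and Fastify.
interface StartupDeps {
  probeK8sApi: () => Promise<void>;     // stands in for the list-pods probe
  createMcpServer: () => Promise<void>; // runs the permission evaluation
  registerUiApp: () => Promise<void>;
  listen: () => Promise<void>;          // binds the HTTP port
  stateless: boolean;
}

async function startEagerly(
  deps: StartupDeps,
  log: (msg: string) => void,
): Promise<void> {
  try {
    await deps.probeK8sApi();
  } catch (err) {
    const reason = err instanceof Error ? err.message : String(err);
    // Fail before anything else runs, so the port is never bound.
    throw new Error(`Kubernetes API health check failed: ${reason}`);
  }
  if (!deps.stateless) {
    // Eager creation is skipped in stateless mode; the probe above still ran.
    await deps.createMcpServer();
  }
  await deps.registerUiApp();
  await deps.listen();
  log("listening");
}
```

Because every step is awaited in order, a rejected probe or a failed permission evaluation propagates to the caller before listen() is reached, which is the fail-fast property this decision relies on.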
Stateless mode exception
In STATELESS=true mode, a new MCP server is created per-request by design, so eager creation is skipped. The K8s health check still runs.
Alternatives Considered
Keep lazy initialization with a separate readiness probe
- Pros: No startup delay; K8s readiness probe handles the problem.
- Cons: Requires a custom /healthz endpoint; the actual failure reason is hidden from logs; operators still have to correlate probe failures with K8s API issues.
- Rejected: Adds complexity without solving the observability problem.
Retry loop with backoff at startup
- Pros: Handles transient API server unavailability (e.g., during cluster bootstrap).
- Cons: Masks permanent failures; delays pod startup; K8s already handles restarts via CrashLoopBackOff.
- Rejected: Fail-fast is better for declarative infrastructure. K8s restartPolicy provides the retry mechanism.
Consequences
- The server will not start if the K8s API is unreachable. This is intentional — it's better to fail visibly than to start and silently break on every request.
- Startup time increases slightly (one API call + permission evaluation). This is negligible (~100ms on healthy clusters).
- getMcpServerAndTransport() still creates new servers in stateless mode and for test DI; the fallback path is preserved.
- Error messages now include actionable hints ("Check KUBERNETES_SERVICE_HOST, service account mount, and network policies").
