Graceful degradation patterns when your config service goes down mid-deploy

Question

We had an incident last week where our centralized config service (Consul-based) became unreachable during a rolling deploy. Half the pods started with stale configs; the other half failed health checks and never came up. Recovery took 45 minutes because the deploy controller couldn't distinguish between "config unavailable" and "application broken."

For teams running centralized config: what's your fallback when the config layer itself is the single point of failure (SPOF)? Do you cache configs at build time, run a local caching agent with a TTL, or design services to boot with hardcoded defaults? We're also considering splitting config reads from service discovery entirely, since right now both share the same Consul cluster.

Infra context: ~200 services, EKS, Consul KV + DNS-based service discovery.

Direct answers and proposed approaches
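
The three options in the question don't have to be exclusive; they can be chained. Below is a minimal sketch of that chain, assuming the Consul KV read is wrapped in a callable that raises a hypothetical ConfigUnavailable when the cluster is unreachable (no real Consul client is used here). It tries the live service first, then a last-known-good snapshot on local disk accepted only within a TTL, then defaults baked into the image. The names load_config, BUILTIN_DEFAULTS, and the snapshot file layout are illustrative only. The snapshot step approximates a "local agent with TTL" in-process rather than as a separate sidecar or agent.

```python
import json
import os
import time
from dataclasses import dataclass
from typing import Callable, Optional

# Hypothetical defaults baked into the image at build time.
BUILTIN_DEFAULTS = {"db_pool_size": 10, "feature_x_enabled": False}


class ConfigUnavailable(Exception):
    """Hypothetical error the live fetch raises when the config service is unreachable."""


@dataclass
class LoadedConfig:
    values: dict
    source: str         # "live", "cache", or "defaults"
    age_seconds: float  # 0.0 unless the values came from the on-disk snapshot


def _write_snapshot(cache_path: str, fetched_at: float, values: dict) -> None:
    """Best-effort: persist the last known-good config; never block boot on disk errors."""
    tmp_path = cache_path + ".tmp"
    try:
        with open(tmp_path, "w") as f:
            json.dump({"fetched_at": fetched_at, "values": values}, f)
        os.replace(tmp_path, cache_path)  # rename over the old file so readers never see a partial write
    except OSError:
        pass


def load_config(
    fetch_live: Callable[[], dict],
    cache_path: str,
    max_cache_age_s: float,
    now: Optional[float] = None,
) -> LoadedConfig:
    now = time.time() if now is None else now

    # 1. Live config service; on success, refresh the local snapshot.
    try:
        values = fetch_live()
    except ConfigUnavailable:
        values = None
    if values is not None:
        _write_snapshot(cache_path, now, values)
        return LoadedConfig(values, "live", 0.0)

    # 2. Last known-good snapshot, accepted only if younger than the TTL.
    try:
        with open(cache_path) as f:
            snapshot = json.load(f)
        age = now - snapshot["fetched_at"]
        if age <= max_cache_age_s:
            return LoadedConfig(snapshot["values"], "cache", age)
    except (OSError, ValueError, KeyError, TypeError):
        pass  # missing, unreadable, or corrupt snapshot: fall through

    # 3. Baked-in defaults, so the pod can boot in a conservative mode.
    return LoadedConfig(dict(BUILTIN_DEFAULTS), "defaults", 0.0)
```

Returning the source alongside the values is the important part: it lets the service report how it booted instead of silently running on stale or default settings. Whether step 3 is safe at all is a per-service call; where wrong defaults are worse than not starting, drop it and let the pod fail.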

Risks, gaps, and constructive pushback
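
The gap that stretched recovery to 45 minutes was that "config unavailable" and "application broken" looked identical to the deploy controller. A minimal sketch of a readiness check that keeps them apart is below, assuming the service already knows which source its config came from, labeled here with the hypothetical strings "live", "cache", and "defaults". Kubernetes probes only look at the HTTP status code (200 to 399 counts as success), so the distinct reason in the body is only useful if your deploy controller or rollout gate reads it; that integration is an assumption, not something the probe does for you.

```python
import json
from dataclasses import dataclass
from enum import Enum


class Readiness(Enum):
    OK = "ok"
    DEGRADED_CONFIG = "degraded_config"        # serving, but not on live config
    CONFIG_UNAVAILABLE = "config_unavailable"  # no acceptable config source
    APP_UNHEALTHY = "app_unhealthy"            # config is fine, the service itself is broken


@dataclass
class ProbeResult:
    status_code: int
    reason: Readiness


def readiness(config_source: str, app_healthy: bool, allow_defaults: bool = False) -> ProbeResult:
    # Check config first: an app that failed because it had no config
    # should be reported as a config problem, not as a bad build.
    config_ok = config_source in ("live", "cache") or (
        config_source == "defaults" and allow_defaults
    )
    if not config_ok:
        return ProbeResult(503, Readiness.CONFIG_UNAVAILABLE)
    if not app_healthy:
        return ProbeResult(503, Readiness.APP_UNHEALTHY)
    if config_source == "live":
        return ProbeResult(200, Readiness.OK)
    return ProbeResult(200, Readiness.DEGRADED_CONFIG)


def probe_body(result: ProbeResult) -> str:
    """JSON body a rollout gate could parse to decide pause vs. roll back."""
    return json.dumps({"status": result.reason.value})
```

With this split, a controller could pause the rollout on CONFIG_UNAVAILABLE (the config layer is the problem) and roll back only on APP_UNHEALTHY. allow_defaults is off by default because booting on hardcoded defaults can hide a misconfiguration behind a green probe.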