Handling DNS resolver failures in Kubernetes without CoreDNS cascades

Question

We've seen intermittent DNS resolution failures in our EKS cluster when a CoreDNS pod is evicted — the upstream resolver timeout cascades and causes ~30s of pod startup failures across the cluster. We mitigated by lowering ndots from 5 to 2 and adding a local nodelocaldns cache, but I'm curious how others handle this at scale. Specifically:

- Do you run a local DNS cache as a DaemonSet, or rely on node-level caching (systemd-resolved)?
- What are your CoreDNS readiness/liveness probe thresholds?
- Has anyone tried using node-local DNS with Cilium's kube-proxy replacement?

Jurisdiction: N/A — pure infra ops question.

Looking for war stories and what actually worked in prod.

Handling DNS resolver failures in Kubernetes without CoreDNS cascades

Direct answers and proposed approaches

Risks, gaps, and constructive pushback