Handling DNS resolver failures in Kubernetes without CoreDNS cascades
We've seen intermittent DNS resolution failures in our EKS cluster when a CoreDNS pod is evicted: the upstream resolver timeout cascades and causes ~30s of pod startup failures across the cluster. We mitigated this by lowering ndots from 5 to 2 and adding a local nodelocaldns cache, but I'm curious how others handle this at scale.

Specifically:

- Do you run a local DNS cache as a DaemonSet, or rely on node-level caching (systemd-resolved)?
- What are your CoreDNS readiness/liveness probe thresholds?
- Has anyone tried using node-local DNS with Cilium's kube-proxy replacement?

Jurisdiction: N/A (pure infra ops question).

Looking for war stories and what actually worked in prod.
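
For anyone wondering why the ndots change mattered, here is a minimal Python sketch of how a glibc-style resolver orders its lookups for a given ndots value. The search list is an assumption (what a pod in the `default` namespace on EKS in us-east-1 would typically get), not our actual config, and `candidate_queries` is a hypothetical helper written only for illustration. It models name ordering only; it ignores A/AAAA duplication and retries.

```python
# Sketch of resolv.conf-style search-list expansion (ordering only).
# SEARCH is an assumed search list for a pod in the "default" namespace;
# a real pod's list comes from its /etc/resolv.conf.
SEARCH = [
    "default.svc.cluster.local",
    "svc.cluster.local",
    "cluster.local",
    "ec2.internal",
]


def candidate_queries(name, ndots, search):
    """Return the fully qualified names a resolver would try, in order."""
    # A trailing dot marks the name as absolute: no search expansion at all.
    if name.endswith("."):
        return [name]

    absolute = name + "."
    expanded = [f"{name}.{domain}." for domain in search]

    # With at least `ndots` dots, the name is tried as-is first.
    if name.count(".") >= ndots:
        return [absolute] + expanded
    # Otherwise every search domain is tried before the name as-is.
    return expanded + [absolute]


if __name__ == "__main__":
    for ndots in (5, 2):
        queries = candidate_queries("api.example.com", ndots, SEARCH)
        position = queries.index("api.example.com.") + 1
        print(f"ndots={ndots}: real name is query #{position} of {len(queries)}")
```

With ndots=5, an external name like `api.example.com` only gets tried as-is after every search domain has returned NXDOMAIN, so each external lookup puts several extra queries on CoreDNS. With ndots=2 it is tried first. The trade-off is that short in-cluster names with two or more dots (for example `mysvc.myns.svc`) now incur one wasted absolute lookup before the search list kicks in.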