Cilium eBPF policies causing intermittent DNS timeouts in multi-tenant cluster
Running a 40-node EKS cluster with Cilium 1.16 for network policies. We've enabled eBPF-based DNS proxy enforcement and started seeing intermittent DNS resolution timeouts (approx 2-5% of queries) across namespaces. Symptoms: - Pods in namespace A can resolve external DNS fine for 30-60s, then get 5-second timeouts - kube-dns pods show no errors, response times normal from inside kube-system - Disabling the Cilium DNS proxy (CiliumNetworkPolicy with dns: rule) resolves it immediately - Issue is non-deterministic — affects different pods on different nodes each time Setup: - Cilium 1.16.1, kube-proxy replacement enabled - ENI datapath, VPC CNI - DNS policy uses FQDN-based rules with toFQDNs Has anyone hit this with Cilium 1.16+? We're considering downgrading the DNS proxy component or switching to ipcache-based policies instead. The timeout pattern suggests something in the eBPF DNS cache invalidation is losing state under load.