eBPF-based network policies vs Calico: trade-offs at 200+ node scale?
We're running Calico on EKS (~200 nodes, ~3K pods) and hitting policy-compilation latency during rolling deploys: on new nodegroups, policies take 8-12 seconds to fully converge. During that window, pods are running under default-deny and dropping legitimate cross-namespace traffic.

We're looking at eBPF-based alternatives (Cilium in eBPF dataplane mode, or writing eBPF programs directly for our custom policies). Key concerns:

- Policy propagation time: Cilium claims sub-second propagation, but has that been verified at 200+ nodes?
- Kernel version dependency: We're on 5.15 LTS. Cilium's eBPF mode wants 5.10+, but some features need 6.x.
- Debugging complexity: iptables rules are painful, but at least tcpdump + conntrack give you visibility. What's the equivalent story for eBPF maps?
- Upgrade risk: Calico has survived 3 EKS minor-version upgrades for us without issues. How stable is Cilium across EKS upgrades?

Has anyone made this jump in production? What broke in the first month?
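To compare the 8-12 second window against Cilium's sub-second claim on equal terms, the convergence time has to be measured the same way on both stacks. A minimal sketch of one way to do that, assuming a successful TCP connect is an acceptable proxy for "policy now allows this flow": a probe started on a freshly joined node retries a connection to a cross-namespace target and reports how long it took to get through. The function name `time_to_first_connect` and its parameters are hypothetical, not part of any Calico or Cilium tooling. A SYN dropped by default-deny shows up as a per-attempt timeout and a rejected one as a refusal; both are treated as "not converged yet".

```python
import socket
import time
from typing import Optional


def time_to_first_connect(host: str, port: int, deadline_s: float = 30.0,
                          attempt_timeout_s: float = 0.2,
                          interval_s: float = 0.05) -> Optional[float]:
    """Hypothetical convergence probe.

    Returns the seconds elapsed until the first successful TCP connect to
    host:port, or None if nothing succeeds within deadline_s.
    """
    start = time.monotonic()
    while True:
        remaining = deadline_s - (time.monotonic() - start)
        if remaining <= 0:
            return None  # never converged within the deadline
        try:
            # Cap each attempt so a dropped SYN cannot overrun the deadline.
            with socket.create_connection(
                    (host, port), timeout=min(attempt_timeout_s, remaining)):
                return time.monotonic() - start
        except OSError:
            # Timeout (packet dropped), refusal (reset), or DNS failure:
            # the flow is not allowed yet, so back off briefly and retry.
            time.sleep(interval_s)
```

For example, `time_to_first_connect("orders.payments.svc.cluster.local", 8080)` (a hypothetical service) could run at container start in a pod pinned to the new nodegroup, with the result logged; running the same probe against both dataplanes gives directly comparable numbers. The short per-attempt timeout keeps the measurement resolution to a few hundred milliseconds, which matters when the claim being checked is sub-second.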