eBPF-based service mesh vs Envoy sidecars: latency overhead at p99 under sustained 10k RPS
Running an Envoy-based service mesh (Istio 1.20) across ~80 microservices. The sidecar overhead is tolerable at p50 (~2ms), but we're seeing p99 latency spikes of 15-20ms during traffic bursts, which directly impacts our SLO budgets.

We're evaluating Cilium's eBPF-based service mesh as a replacement: no sidecar proxy, just kernel-level traffic steering. The benchmark numbers look great, but they're all synthetic.

What I need: real-world data from teams that have actually migrated. Specifically:

1. Did the p99 overhead actually drop, or did you just move the bottleneck somewhere else (e.g. conntrack table, CPU softirq)?
2. How painful was the migration? Did you do it namespace-by-namespace or big-bang?
3. Any gotchas with mTLS, observability, or policy enforcement when dropping Envoy?

We're on EKS, kernel 5.15+. If the eBPF mesh is genuinely better at p99 under load, we'll invest the migration effort. Otherwise we'd rather tune Envoy.
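To make question 1 concrete, here is a rough sketch of the kind of node-level check I have in mind for spotting a moved bottleneck. It reads the standard Linux files `/proc/softirqs` and `/proc/sys/net/netfilter/nf_conntrack_count` / `nf_conntrack_max`; the function names are my own, and it only covers the netfilter conntrack table (Cilium keeps its connection tracking in its own BPF maps, which this does not read). Treat it as an illustration, not a validated tool.

```python
from pathlib import Path


def parse_softirqs(text: str) -> dict[str, int]:
    """Sum the per-CPU counters for each softirq type in /proc/softirqs text."""
    totals: dict[str, int] = {}
    for line in text.splitlines()[1:]:  # first line is the CPU column header
        name, sep, counts = line.partition(":")
        if not sep:
            continue  # skip blank or malformed lines
        totals[name.strip()] = sum(int(c) for c in counts.split())
    return totals


def conntrack_utilization(count: int, maximum: int) -> float:
    """Return the fraction of the conntrack table currently in use."""
    if maximum <= 0:
        raise ValueError("conntrack maximum must be positive")
    return count / maximum


def read_int(path: str) -> int:
    """Read a single integer from a procfs file."""
    return int(Path(path).read_text().strip())


if __name__ == "__main__":
    # Cumulative counters since boot; sample twice and diff to get a rate.
    softirqs = parse_softirqs(Path("/proc/softirqs").read_text())
    print("NET_RX total:", softirqs.get("NET_RX"))
    print("NET_TX total:", softirqs.get("NET_TX"))

    try:
        used = read_int("/proc/sys/net/netfilter/nf_conntrack_count")
        limit = read_int("/proc/sys/net/netfilter/nf_conntrack_max")
        print(f"conntrack: {used}/{limit} ({conntrack_utilization(used, limit):.1%})")
    except FileNotFoundError:
        # These files only exist when the nf_conntrack module is loaded.
        print("netfilter conntrack not loaded on this node")
```

If you migrated and have before/after numbers from something like this taken during a burst, that is exactly the data I'm looking for.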