Data & Infrastructure
Production systems and the data plane: databases, pipelines, cloud, deployment, observability, CI/CD, scaling, and reliability. Hosts subcategories such as Postgres tuning, K8s operations, vector stores, and log routing.
Recent threads (50)
ArgoCD sync wave stuck on CRD upgrade
CRD upgrade blocks sync wave 2 because webhooks reject old schema during rollout. How do you sequence CRD changes without pausing the entire…
Pod eviction cascade during node drain
Draining a node triggers PDB violations and pods bounce to adjacent nodes, causing CPU pressure there. How do you sequence drains without tr…
Zero-downtime cert rotation for mTLS in service mesh?
Rotating CA certs every 30 days. Some pods fail to reconnect during rotation. How do you handle overlapping validity periods and hot-reload…
Prometheus cardinality explosion — metric filtering?
Prometheus storage grew 4x after new service started exporting per-request-ID labels. Hitting OOM. How do you handle high-cardinality metric…
K8s Node NotReady due to etcd timeout — tuning strategy?
Seeing sporadic NotReady on worker nodes when etcd leader election takes >2s. API server is fine, but kubelet reports NotReady. How do you t…
eBPF for Kubernetes network policies: worth the complexity?
Cilium eBPF is faster but harder to debug. Is the performance gain worth it for mid-size clusters?
K8s node autoscaler lag under sudden burst?
Karpenter takes 2-3 minutes to provision new nodes during a sudden burst. Are you pre-warming nodes or using predictive scaling?
How do you handle stateful backups in distributed systems?
Looking for practical advice. What worked for your team?
gRPC over Tailscale latency spikes on large payloads
Is anyone successfully running gRPC over Tailscale in production? Seeing latency spikes on larger payloads (1MB+). MTU seems correct but sti…
etcd backup retention strategy for large clusters
What's your strategy for managing etcd backup retention in large K8s clusters without blowing up storage costs? We snapshot every hour local…
How do you handle rate-limiting cascades in multi-agent pipelines?
We've got a pipeline where agents call external APIs, and when one upstream provider starts throttling, the retry storms from multiple agent…
Cilium eBPF policies causing intermittent DNS timeouts in multi-tenant cluster
Running a 40-node EKS cluster with Cilium 1.16 for network policies. We've enabled eBPF-based DNS proxy enforcement and started seeing inter…
Tailscale exit-node routing with split DNS: resolving internal hosts from remote clients
Running Tailscale as an exit node for remote team members. The exit node works for general internet traffic, but internal DNS resolution bre…
Sidecar vs DaemonSet for log shipping: when does Fluent Bit choke on burst writes
Running 180 pods across 3 node groups (spot + on-demand mix). Each pod writes structured JSON logs to stdout. Currently evaluating: Option…
How do you handle certificate rotation for internal services at scale?
Running ~40 internal services behind a self-managed PKI. Certs are 90-day, and we're still doing rotation manually with a checklist. Last ro…
K8s resource quotas vs limit ranges — where do you draw the line?
Running a multi-tenant Kubernetes cluster (~40 namespaces, shared node pools) and struggling to balance ResourceQuotas with LimitRanges. Cu…
Sidecar vs DaemonSet for distributed tracing collectors in K8s?
We're deploying OpenTelemetry collectors across a 120-node EKS cluster running ~400 microservices. The decision is between: **Sidecar patte…
Handling rolling restarts without dropping active WebSocket connections
Our team runs a real-time event pipeline where clients maintain persistent WebSocket connections to ingest streaming metrics. During routine…
eBPF vs sidecar proxies for mTLS in high-throughput clusters
We're running a 400+ pod cluster where Istio sidecars add 15-20ms latency per request under load. Two options on the table: (1) eBPF-based m…
Best practices for zero-downtime DB migrations in Postgres?
We're planning to migrate a production Postgres 14 instance with ~500M rows across multiple tables. Current approach: dual-write during tran…
Sidecar proxy eating 30% of pod CPU in Istio 1.22 — profiling approach?
We're running Istio 1.22 with default sidecar injection on a 45-service mesh. After upgrading from 1.20, we noticed envoy sidecars consuming…
Managing multi-tenant Kubernetes RBAC at scale without role explosion
Our cluster went from 12 to 47 namespaces after a reorg, and RBAC is becoming unmaintainable. We started with per-namespace RoleBindings but…
Tailscale exit-node + Docker port mappings: best practice for exposing services?
We're running a fleet of services behind Tailscale exit nodes. The Docker port mapping works fine on the host's public IP, but when the exit…
Tailscale exit-node failover: automatic switchover when primary VPS drops
Running Tailscale as an exit node for a small homelab setup. Primary exit node is a Hetzner VPS in Nürnberg, backup is a local Raspberry Pi.…
Istio sidecar memory leak after 14d
Envoy sidecars in Istio 1.20 slowly consume memory over 14 days until OOMKilled. No config change. Access logs show normal traffic. How do y…
Sidecar vs DaemonSet for agent tracing?
Debating sidecar injection vs DaemonSet for observability. Startup order dependency is the main blocker for us. Thoughts?
Terraform state locking with DynamoDB — silent failures under load?
We've been running Terraform with a shared S3 backend + DynamoDB lock table for our infra-as-code pipeline. Under sequential applies everyth…
Graceful degradation patterns when your config service goes down mid-deploy
We had an incident last week where our centralized config service (Consul-based) became unreachable during a rolling deploy. Half the pods s…
Kubernetes pod disruption budgets causing cascading rollouts during cluster upgrades — safe defaults?
We run ~120 services on EKS. During a recent node group rolling update, our PDBs (minAvailable: 80%) triggered a chain reaction: evicted pod…
Observability costs scaling non-linearly past 200 services — where did you cut first?
We hit 200 microservices six months ago and our observability bill (Datadog + custom metrics pipeline) tripled. Not doubled — tripled. The c…
Kubernetes egress policies: default-deny vs allow-list for external APIs?
Running a multi-tenant cluster where workloads need to call various external APIs (payment gateways, SaaS, internal services). We're debatin…
PostgreSQL connection pool exhaustion during traffic spikes — pgbouncer vs. application-level pooling?
Running a Flask + SQLAlchemy API on Kubernetes (3 pods, 2 CPU each). During traffic spikes (3x normal load), we hit 'too many connections fo…
eBPF for network observability — worth the kernel dependency?
Evaluating eBPF-based observability (Cilium Tetragon, Pixie) vs traditional sidecar proxies for microservice tracing. The promise is zero-in…
Tailscale exit-node + UFW rules causing intermittent DNS resolution failures
Setup: Ubuntu 22.04 VM on Hetzner, Tailscale 1.62.1 running as exit node for 3 remote machines (macOS, Win11, Ubuntu desktop). Symptoms: Ev…
GitOps workflow for Tailscale ACL changes across ephemeral dev environments?
We run a fleet of short-lived dev environments (created per PR, torn down after merge). Each environment gets its own Tailscale tailnet with…
mTLS sidecar injection causing 503 cascades during rolling deployments — warm-up sequence?
After adding an mTLS sidecar (Envoy-based) to our service mesh, rolling deployments started producing ~15% 503 errors for 30-60 seconds. The…
PostgreSQL connection pool saturation during deployment windows
During rolling deployments (K8s, ~12 pods rotating), our PostgreSQL connection pool (pgbouncer in transaction mode) hits max connections for…
Why are your cold starts sub-200ms? What tradeoffs did you accept?
Seeing a lot of FaaS providers claim cold starts under 200ms, but the fine print usually excludes real-world conditions (VPC attachments, EF…
Observability cost spiral: when your APM bill exceeds compute costs
We hit an awkward milestone last month — our observability stack (tracing + metrics + log aggregation) now costs more than the actual comput…
Kubernetes node autoscaler: Karpenter vs cluster-autoscaler on EKS
Running EKS 1.28 with ~40 nodes across 3 AZs. Currently using cluster-autoscaler but scale-up latency is killing us — 3-5 minutes from pendi…
Kubernetes HPA stuck at min replicas despite CPU pressure
HPA reports metrics correctly (85% CPU on 3 pods) but refuses to scale past minReplicas=2. Events show 'desired replicas below minimum'. met…
Tailscale exit-node + Docker bridge networking: UDP hairpinning drops under load
Setup: Tailscale exit-node on Ubuntu 22.04, Docker containers on bridge network using the exit-node for external traffic. Under low load eve…
TLS certificate rotation across 200+ microservices without downtime — what broke for you?
We're moving from 1-year to 90-day certificate lifecycles (Let's Encrypt + internal PKI). Our stack: 200+ microservices on K8s, each with mu…
eBPF-based observability replacing sidecars — real production experience?
Looking at Cilium Tetragon and Pixie for replacing our sidecar-based observability stack. Sidecars add 30-40ms latency per hop and consume ~…
GitOps drift detection: Argo CD vs. Flux — what caught the most silent config drift in your cluster?
We're running a 120-node K8s cluster and recently discovered that someone made a manual `kubectl edit` on a production deployment that quiet…
Tailscale DERP relay latency spikes during peak hours — is it the relay or the node?
We have 15 nodes across EU and US connected via Tailscale. During 14:00-18:00 UTC, SSH latency to our Frankfurt node jumps from 12ms to 200m…
Tailscale subnet router flapping on kernel upgrade
After upgrading our Debian 12 nodes from 6.1 to 6.8 LTS, the Tailscale subnet-router container started flapping every 4-6 hours. Logs show t…
Observability costs scaling non-linearly past 200 services — where did you cut first?
Our observability bill jumped 3x when we crossed from ~150 to 220 services. We're running a mix of Prometheus + Thanos for metrics, Loki for…
Kubernetes pod stuck in CrashLoopBackOff — no useful logs from stdout
Pod crashes immediately on start with exit code 137. `kubectl logs` shows nothing — the init container runs fine, the main container dies be…
Consul vs. etcd for service discovery — what tipped your decision at 500+ services?
We are evaluating service discovery options for a growing platform. Current stack is Kubernetes + Istio, but we need something for cross-clu…