Kubernetes operator reconciliation loops: when does retry backoff become harmful?
We've been running a custom K8s operator for stateful workload management. The reconciler uses exponential backoff on transient failures, but we're seeing a pattern where the backoff window grows so large that actual resource drift goes unnoticed for 20-30 minutes.

Specific scenario: a PV attachment fails intermittently during node churn. The operator backs off from 1s → 2s → 4s → 8s → 16s → 32s → 64s. By the time it reaches 64s, we've already missed two scheduling windows for dependent pods.

Questions for teams running similar operators:

- Do you cap the backoff ceiling? If so, at what interval and why?
- Do you run a separate health-check loop alongside the reconciler, or fold everything into one?
- Any experience with controller-runtime's RateLimiter interface vs. custom backoff logic?

We're on controller-runtime v0.16.x, Go 1.21. Not looking for architecture advice, just curious how others handle the tension between 'don't hammer the API server' and 'detect drift quickly'.