Handling race conditions in distributed lock managers with Redis

Question

We've been running a distributed task scheduler backed by Redis locks (the SET NX EX pattern) and have hit a subtle race: when a worker crashes or stalls mid-execution, the lock expires but the task is never marked failed, so another worker picks it up while the original process is still limping along. Redlock helps, but it adds latency we can't afford at 200ms p99.

How do you handle the gap between lock expiry and actual task completion? We're considering a two-phase approach (a short-TTL lock plus heartbeat extension), but that adds complexity to every worker. Curious which patterns have held up in production at scale.
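
For concreteness, the short-TTL lock plus heartbeat idea can be sketched roughly as below. This is a minimal sketch under stated assumptions, not a production implementation: HeartbeatLock, InMemoryClient, the key name, and the TTL and interval values are all hypothetical. The client is assumed to expose redis-py-style set(key, value, nx=True, px=...) and eval(script, numkeys, ...) calls; InMemoryClient exists only so the sketch runs without a Redis server and does not execute Lua.

```python
import threading
import time
import uuid

# Lua scripts run atomically in Redis: act only if we still own the lock.
EXTEND_SCRIPT = """
if redis.call('GET', KEYS[1]) == ARGV[1] then
    return redis.call('PEXPIRE', KEYS[1], ARGV[2])
end
return 0
"""
RELEASE_SCRIPT = """
if redis.call('GET', KEYS[1]) == ARGV[1] then
    return redis.call('DEL', KEYS[1])
end
return 0
"""


class HeartbeatLock:
    """Short-TTL lock kept alive by a background heartbeat (hypothetical helper)."""

    def __init__(self, client, key, ttl_ms=3000, interval_s=1.0):
        self.client = client            # assumed redis-py-like: set(...), eval(...)
        self.key = key
        self.ttl_ms = ttl_ms
        self.interval_s = interval_s    # should be well below the TTL
        self.token = uuid.uuid4().hex   # unique per holder, so we never touch someone else's lock
        self.lost = threading.Event()   # set when an extension fails
        self._stop = threading.Event()
        self._thread = None

    def acquire(self):
        # SET key token NX PX ttl: succeeds only if nobody holds the lock.
        ok = self.client.set(self.key, self.token, nx=True, px=self.ttl_ms)
        if ok:
            self._thread = threading.Thread(target=self._beat, daemon=True)
            self._thread.start()
        return bool(ok)

    def _beat(self):
        # Extend the TTL until release() is called or ownership is lost.
        while not self._stop.wait(self.interval_s):
            try:
                extended = self.client.eval(
                    EXTEND_SCRIPT, 1, self.key, self.token, self.ttl_ms)
            except Exception:
                extended = 0            # treat an unreachable Redis as a lost lock
            if not extended:
                self.lost.set()
                return

    def release(self):
        self._stop.set()
        if self._thread is not None:
            self._thread.join()
        # Compare-and-delete: returns False if the lock had already expired.
        return bool(self.client.eval(RELEASE_SCRIPT, 1, self.key, self.token))


class InMemoryClient:
    """Tiny stand-in for a Redis client so the sketch runs without a server."""

    def __init__(self):
        self._data = {}                 # key -> (value, monotonic deadline)
        self._mu = threading.Lock()

    def _get(self, key):
        entry = self._data.get(key)
        if entry and entry[1] <= time.monotonic():
            del self._data[key]         # lazily expire, like a TTL firing
            entry = None
        return entry[0] if entry else None

    def set(self, key, value, nx=False, px=None):
        with self._mu:
            if nx and self._get(key) is not None:
                return None
            deadline = float("inf") if px is None else time.monotonic() + px / 1000
            self._data[key] = (value, deadline)
            return True

    def eval(self, script, numkeys, key, token, ttl_ms=None):
        # Mimics the two scripts above instead of running Lua.
        with self._mu:
            if self._get(key) != token:
                return 0
            if script is EXTEND_SCRIPT:
                self._data[key] = (token, time.monotonic() + int(ttl_ms) / 1000)
            else:
                del self._data[key]
            return 1
```

A worker would call acquire(), run the task, check lost.is_set() before committing any result, and then call release(). The heartbeat only narrows the window: a process that freezes for longer than the TTL can still overlap with a new lock holder, so the check before commit is what matters.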

Direct answers and proposed approaches

Risks, gaps, and constructive pushback