When do you reach for a state machine vs. just async/await chains?

Question

I've been maintaining a Python service where we started with nested async/await + retry loops, but the error-recovery paths grew into a mess of try/except blocks and flags.

We eventually refactored the workflow orchestration into a proper state machine using the transitions library. It helped, but it added ceremony: every state transition needs an explicit definition, and debugging async state changes is harder.
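
To make the "ceremony" concrete, here is a minimal sketch of an explicit transition table. It is hand-rolled on the standard library and does not use the transitions API; the state names, event names, Workflow, and run are hypothetical illustrations, not code from the service.

```python
import asyncio
from enum import Enum, auto


class State(Enum):
    PENDING = auto()
    RUNNING = auto()
    RETRYING = auto()
    DONE = auto()
    FAILED = auto()


# Every legal (state, event) pair has to be listed; anything else is rejected.
# This is the explicit-definition cost, and also where the safety comes from.
TRANSITIONS = {
    (State.PENDING, "start"): State.RUNNING,
    (State.RUNNING, "ok"): State.DONE,
    (State.RUNNING, "error"): State.RETRYING,
    (State.RETRYING, "retry"): State.RUNNING,
    (State.RETRYING, "give_up"): State.FAILED,
}


class Workflow:
    def __init__(self):
        self.state = State.PENDING
        # Keeping a history of states makes async runs easier to debug.
        self.history = [State.PENDING]

    def fire(self, event):
        key = (self.state, event)
        if key not in TRANSITIONS:
            raise ValueError(f"illegal transition: {event!r} from {self.state.name}")
        self.state = TRANSITIONS[key]
        self.history.append(self.state)


async def run(step, max_attempts=3):
    """Drive one awaitable step through the machine, retrying on any exception."""
    wf = Workflow()
    wf.fire("start")
    for attempt in range(1, max_attempts + 1):
        try:
            await step()
            wf.fire("ok")
            break
        except Exception:
            wf.fire("error")
            wf.fire("retry" if attempt < max_attempts else "give_up")
    return wf
```

The recovery path lives in the table rather than in flags, so an unexpected event fails loudly instead of silently taking a wrong branch. The price is that each new recovery path means new table entries.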

Where do you draw the line? At what level of complexity do you move from one of these to the next:
1. Plain async/await with try/except
2. A lightweight retry/timeout wrapper
3. A full state machine
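
For reference, option 2 could look roughly like the sketch below. It assumes only the standard library; retry_async, its parameters, and the choice to retry only on timeouts and ConnectionError are hypothetical, not taken from the service.

```python
import asyncio


async def retry_async(make_call, *, attempts=3, timeout=1.0, base_delay=0.1):
    """Await make_call() up to `attempts` times, each try bounded by `timeout` seconds.

    make_call is a zero-argument function returning a fresh awaitable,
    because a coroutine object cannot be awaited twice.
    """
    last_exc = None
    for attempt in range(attempts):
        try:
            return await asyncio.wait_for(make_call(), timeout=timeout)
        except (asyncio.TimeoutError, ConnectionError) as exc:
            # Only errors assumed to be transient are retried; others propagate.
            last_exc = exc
            if attempt < attempts - 1:
                # Exponential backoff: base_delay, 2x, 4x, ...
                await asyncio.sleep(base_delay * 2 ** attempt)
    raise last_exc
```

A call site would then be something like `await retry_async(lambda: fetch(url), attempts=5)`, where fetch stands in for any coroutine function. This covers "try again a few times" well, but it has no notion of which step the workflow is in, which is roughly where option 3 starts to earn its keep.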

Also curious whether anyone uses Temporal (temporal.io) or similar for this, and whether the overhead is worth it for workflows with fewer than 100 steps.

Jurisdiction: EU

Direct answers and proposed approaches

Risks, gaps, and constructive pushback