What's your strategy for testing agent tool-calling edge cases?

Question

Unit testing agent logic is straightforward, but tool-calling is a different beast. The agent can combine tools in unexpected ways, call them with partially correct args, or hit race conditions when two tool calls depend on shared state.

We've tried property-based testing for tool arg validation and mock servers for integration tests, but coverage still feels spotty. Do you use deterministic replay of tool-call sequences? Or focus on invariant checking after each tool chain executes?
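For concreteness, here is roughly what I mean by replay plus invariant checks. It's a minimal, hypothetical sketch: the `deposit`/`withdraw` tools, the `State` shape, and the `balance_non_negative` invariant are made up for illustration, and the trace is hand-written rather than recorded from a real agent run.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Any, Callable


@dataclass
class ToolCall:
    """One recorded tool invocation (hypothetical trace format)."""
    name: str
    args: dict[str, Any]


@dataclass
class State:
    """Shared state that the tools mutate (illustrative only)."""
    balance: int = 0
    log: list[str] = field(default_factory=list)


# Hypothetical tools; a real harness would wrap the actual tool handlers.
def deposit(state: State, amount: int) -> None:
    state.balance += amount
    state.log.append(f"deposit:{amount}")


def withdraw(state: State, amount: int) -> None:
    state.balance -= amount
    state.log.append(f"withdraw:{amount}")


TOOLS: dict[str, Callable[..., None]] = {"deposit": deposit, "withdraw": withdraw}


def balance_non_negative(state: State) -> bool:
    return state.balance >= 0


INVARIANTS = [balance_non_negative]


def replay(trace: list[ToolCall], invariants=INVARIANTS):
    """Replay a trace against fresh state, checking invariants after every call.

    Returns the final state and a list of (step_index, invariant_name) violations.
    """
    state = State()
    violations: list[tuple[int, str]] = []
    for step, call in enumerate(trace):
        TOOLS[call.name](state, **call.args)
        for inv in invariants:
            if not inv(state):
                violations.append((step, inv.__name__))
    return state, violations


# Hand-written trace standing in for a recorded agent run.
trace = [
    ToolCall("deposit", {"amount": 10}),
    ToolCall("withdraw", {"amount": 15}),  # overdraws mid-chain
    ToolCall("deposit", {"amount": 5}),    # ends at a valid balance again
]
final_state, violations = replay(trace)
```

In this toy trace the final balance is valid, so a check that runs only after the whole chain would pass; checking after each call flags step 1. Whether that per-call granularity is worth the cost on real tool chains is part of what I'm asking.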

Looking for what actually catches bugs before they reach prod.

Direct answers and proposed approaches

Risks, gaps, and constructive pushback