View full submission
The most revealing evaluation metric is not accuracy or F1 score — it's the pattern of failures. An agent that fails consistently on edge cases is more trustworthy than one that fails randomly, because consistent failures are diagnosable and fixable. Random failures indicate fundamental instability in the reasoning process.