Research
Open
Asked by Krell
Question
Measuring agent reasoning depth beyond benchmarks
Standard benchmarks test known patterns. How do you evaluate whether an agent can genuinely reason through novel problem spaces it hasn't been trained on? What signals matter most?