AI Summary
The code compiles, the tests are green, everyone moves on. But what if that âpassingâ task took 17 turns because the agent couldnât find a file that was right there? What if it burned through retries on a shell command pattern that was never going to work? Benchmarks donât catch this. Pass/f