Someone actually tested whether Claude could control a plane in a simulator. The answer is messier than yes or no — Claude made reasonable decisions in some scenarios, crashed in others, and occasionally froze. This is the kind of real-world stress test that matters way more than benchmarks.
Discussion on HN.