AI systems great at tests, but how do they perform in real life?
Melbourne, Aug 25 (The Conversation) Earlier this month, when OpenAI released its latest flagship artificial intelligence (AI) system, GPT-5, the company said it was “much smarter across the board” than earlier models. Backing up the claim were high scores on a range of benchmark tests assessing domains such as software coding, mathematics and healthcare. Benchmark tests like these have become the standard way we assess AI systems – but they don’t tell us much about the actual performance and effects of these systems in the real world. What would be a better way to measure AI models? A group of