Already, and this is the Daily... This is the Daily Aus. Oh, now it makes sense.
Good morning, and welcome to The Daily Aus. It's Thursday, the fourteenth of August. I'm Sam Koslowski.
I'm Emma Gillespie.
This month, the tech company behind ChatGPT released what it claims is its smartest AI model yet. According to OpenAI, GPT-5 operates at the level of a PhD student. But experts are warning that the AI race has become a bit of a marketing battle, as companies manipulate test results to claim their product is the best. On today's podcast, we're going to unpack how AI companies measure intelligence and why that's become a problem.
Sam, I was originally skeptical about having this conversation with you because I, like maybe some listeners, hear "AI" and kind of roll my eyes a little bit and switch off. But if you are that person hearing this right now, hang in there, because this is actually a fascinating conversation: this idea that we're being marketed to about this arms race of who is the smartest, which AI model is the best. Let's start with the basics here, though. When we're talking about AI models, what exactly does that mean?
There is a certain brand of satisfaction that is reserved for when I can change your mind on whether a story is going to be interesting or not, especially if it's a tech story.
This is going to be awesome. So AI models are computer programs that can understand and generate human language. Just think of them as very advanced autocomplete systems, like the ones that can fill in a form for you, or, you know, the little password-remembering widgets in your browser: anything that presumes what you're about to do or want and can kind of fill in those gaps for you.
That's actually a really good way to think of it.
See, we're off to a flyer. You type in a question or a request, and those responses are generated. The most famous ones you might have heard of include ChatGPT from OpenAI, Claude from Anthropic, and Gemini from Google.
Okay, now, it does seem like every AI company out there claims that its model is the smartest, or the most capable, or better than the best one that we've ever seen. And one of the biggest players in this space, OpenAI, has just released GPT-5 this week. What are their claims about this new model?
So they're making some big statements here. They're saying that GPT-5 scored ninety-four point six percent on a test that measures its ability to solve advanced maths problems, seventy-four point nine percent on real-world coding tasks, and produces forty-five percent fewer factual errors than their previous models. The CEO of the company, Sam Altman, called it the best model in the world, which kind of sounds like those phrases you were saying before, and said it represents a significant step towards what's called artificial general intelligence, AGI, which is basically the idea that AI can actually perform an intellectual task better than humans can.
Okay, so that's when we start to imagine, like, the I, Robot future.
Yeah, and it's when we get into those examples of things like AI blackmailing you if you decide to stop using it and kind of taking on a life of its own.
So those numbers from OpenAI about this new model sound pretty impressive, like ninety-five percent on advanced maths. Particularly interesting is this idea of producing fewer factual errors, because that's always in the spotlight around the skepticism towards AI. But I'm interested in how these companies are actually measuring the intelligence of these products. You mentioned in the intro, Sam, that this is becoming a bit of an issue. So what exactly is the concern?
Well, ultimately it's the idea that AI companies are all using different tests to prove that their model is the best. It's like if all car companies claimed to make the fastest car ever or the safest car ever, but one tested on a highway, another tested on a racetrack, and another went downhill on a windy day. A major study published earlier this year into AI models actually compared the situation to Volkswagen, which was found guilty of lying about the emissions its cars were producing when it basically cheated on pollution tests. The researchers noted that when companies manipulated car testing, people were going to jail, but similar manipulation in AI isn't really coming to our attention.
Wow, it's fascinating.
I remember that Volkswagen emissions scandal. So, a good comparison, and a tick for Sam. So how can these AI models then be tested in a fair way? What does testing artificial intelligence transparently and consistently look like?
Well, naturally, the first thing to do would be to standardize the same test across every model, and that would be described as a benchmark, a global benchmark for how these models are performing. And that could be to measure a specific ability. Say, in maths, you could give all of them the same advanced maths problem and then measure not only the output, but how long it takes for them to get there and what processes they undertook to reach that final destination of the answer. You could give that a score and then actually compare these models like for like.
It kind of sounds pretty straightforward.
That to me seems like the obvious path towards getting consistent testing. So where does the manipulation come from?
Well, I think the first thing to acknowledge is that there is no centralized global body that has the respect or the ability to actually execute that sort of standardized testing. There is no equivalent of, say, the TGA for drugs; there's no government-sponsored hub that can execute that kind of stuff. So reason A: there's nobody to do it. But reason B would be that these models are still in this accelerating period of marketing, where companies are cherry-picking tests that favor their models' strengths while hiding poor performance in other areas. And one other problem that has come up is that if AI knows the problem is coming, because it's AI and it knows how tests are done, then it can actually almost train itself for the test, and so there's a bit of a data contamination problem. You'd have to keep these tests almost entirely offline so that the models see them for the first time. One study found, for example, that GPT-4, which is an older model from OpenAI, could solve coding problems from before twenty twenty-one that were published online, but it couldn't solve new problems. And so then you get a sense that, in the great big world of its brain, which is the internet, if those answers are somewhere out there, it could just regurgitate them.
So it's like if you've got an advance copy of an exam or a test at uni or in school: you can train for the test. That doesn't necessarily mean that you have the comprehension levels to speak to a certain topic or question in the same subject outside of the confines of that context.
And if we think about what all of this is for, it's about trying to work out if these models are going to be good enough in practice for us to spend twenty bucks a month on them. I mean, let's get back to the real core problem here: we're trying to work out if it's worth our money. And there was a great quote from the former British Prime Minister Rishi Sunak. He said AI models shouldn't be trusted to mark their own homework. And I think that we can all relate to that. And it kind of encapsulates what the problem is with this lack of an independent benchmarking framework.
You also mentioned that companies are testing multiple versions, or that they're cherry-picking their data and choosing the kind of findings that favor their models the most. What's happening there? Tell us a bit more.
Well, some research found that major companies, we're talking Meta, OpenAI and Google, have been privately testing dozens of different model versions on popular tests. They're only revealing the scores from their best-performing versions. So it's like, you know, you're on a night out, you take twenty selfies, you put up the best one. And I think at some stage you have to admit that all businesses would do that. You know, at TDA, if we had to report results to the stock market, we would probably highlight more the pieces that did really, really well. Not that there's ever any pieces that don't, but you know.
A flawless company that never makes mistakes.
Obviously. But I think it's good to acknowledge this bit of business reality there. I do think that in this case it's different, though, because there's no transparency at all in terms of the testing process. To continue with our university example, it's like a student taking the same exam twenty-seven times and then only reporting the best score.
So without that transparency, there's that issue around trust, and I think we see that really playing out in real time right now: there is a lack of trust in the broader community about AI models because we don't know how they come to these answers. What are some of the other consequences of this manipulation? How does this play out in the real world every day?
Well, there's definitely that marketing angle of misleading consumers: you and I signing up to an AI platform because we think it's going to be ninety-six percent great, when in fact it might be eighty-one percent great, which is still an incredible feat of technology. But then, from a government perspective, governments are looking at these benchmarks in the way that they're thinking about regulation or policy decisions. So the European Union's AI Act uses benchmarks to determine whether new AI models pose systemic risk. Can they be used by extremists? Can they be used to spread racism online? Can they be used to mislead and deliberately spread misinformation? And if companies are manipulating those scores, it could affect how these powerful technologies are regulated.
Okay, because if these scores say that eighty percent of the content is factual, or that there are these really great systems in place to catch mis- and disinformation or hate speech, then that might not concern leaders to the point where they think there needs to be certain levels of regulation.
One hundred percent.
You mentioned this idea of artificial general intelligence earlier. We used the I, Robot example, one of Will Smith's best. OpenAI is claiming that GPT-5 is a step forward in AGI, but what does that actually mean in a not-Hollywood kind of fantasy world?
Well, I gave the example before of outperforming humans. That's a very broad definition, and the problem is that I can't really give you a more specific definition, because even OpenAI can't really do that, right? One OpenAI statement said AGI is still a weakly defined term and means different things to different people. We don't really know what we don't know.
So how can GPT-5 be a step forward, then, if the company itself isn't sure?
Interesting question. It very much raises some questions about how we even know when we've got there. I mean, this is the exciting and terrifying part of living through rapidly emerging technology: we're learning as we go as a society, and that is not always pretty.
So for people listening who might be using AI kind of casually or infrequently in their work or uni life, maybe they're building up their understanding of the different platforms out there. What should we make of all these competing claims? You know, how do we make better decisions about which AI model is actually the good one, or the right one, or the best one for us?
As somebody who is known now in my friend group and in the workplace as being really interested in AI, I'm constantly asked, "Which one should I use? What's the best one?" And the answer is, it's about what you're trying to do, essentially. So one model might be better for creative writing, but another might excel more at data analysis and crunching some numbers. Studies are showing, though, that AI models often fail when you move from those controlled test conditions, or those use cases or features that are rolled out by these platforms as part of marketing campaigns, to the messy real-world uses that humans actually have for these tools.
It actually reminds me of, and I'm not even sure if this is the same thing, but when Siri was first rolled out, Apple, in their big announcements, was like, "You can ask her this, or you can ask her that, or if you want to know what the weather's like, should you take an umbrella?" And I found when I first started using Siri that, yeah, it could definitely answer those sorts of questions, but not a whole lot else outside of the almost prescribed text from Apple about how to use Siri.
When you get into the world of trying to engage with the user no matter what they're about to say, it can take a little bit of time for the technology to be refined and to keep learning from what users actually want.
So, Sam, what is the way forward in all of this?
Is there a conversation happening at a more global scale about this regulation?
Definitely, and there's no clear leader here. I mentioned the work being done by the European Union before. There's a coalition of countries, including Australia, that signed on to key principles of how to keep AI safe. That was in mid twenty twenty-three, so there's a bit of a global movement there from a government perspective. There's some really interesting work being done out of universities, particularly Stanford University. They developed an AI Index report, which does try to compare the models like for like. But I think we first need to determine who the authority is going to be in this space before we can put the burden on them to roll out this standardized testing. It will take a while, maybe a few decades, but I do think we'll get there. I mean, we have the TGA to regulate medicine. We have a central aviation authority to regulate what an airworthy plane looks like. I do think that we're going to see a central AI authority in Australia, and maybe around the world, someday. But we are very early in this story. We're like one percent through the AI story, if that, and that's really exciting. But it's also really important to continuously discuss the potential flaws and the gaps that exist in this big, new, scary world.
Yeah, I think that healthy dose of skepticism is what we will be carrying forward. But I look forward to many more conversations like this with you, Sam.
Well, we don't have a choice.
Help me understand it all.
Thank you so much for breaking that down for us, Sam, and thank you for listening to today's deep dive. We'll be back a little later on with your news headlines, but until then, have a great day.
My name is Lily Maddon and I'm a proud Arrernte Bundjalung Kalkadoon woman from Gadigal Country. The Daily Aus acknowledges that this podcast is recorded on the lands of the Gadigal people and pays respect to all Aboriginal and Torres Strait Islander nations. We pay our respects to the first peoples of these countries, both past and present.
