🎙️ EP 26: Why AI Models with Genius IQs Can't See? The Shocking Reality

00:00

Okay, let's unpack this. Imagine something that scores like way off the charts, genius level, on an actual Mensa IQ test. It's brilliant with text, with logic, you know, but then you ask it to look at something, reason about what it sees, and suddenly it's, well, it's struggling, scoring way below average. Yeah. Kind of weird, right? Like, how does that even work? It's a

00:26

really fascinating paradox, actually. And we've got this stack of sources today, articles, some research papers covering quite a range of things happening with AI right now. Right. From those cognitive quirks you mentioned to, you know, really practical tools people are building and even some pretty serious risks we need to talk

00:43

about. Totally. So our mission for this deep dive is really just to sift through all this and pull out the important stuff, the nuggets that help you get a clearer picture of what's actually going on with AI. You know, beyond the hype. Get you informed, maybe raise a few eyebrows. Exactly. OK, let's just dive right into that

00:57

first big thing the sources point out: this AI and IQ score situation. We're seeing reports that some AI models, the text-only ones... Uh-huh. Just language, no pictures. Exactly. They're nailing these standard IQ tests. Like OpenAI's o3. Right, o3. That's a big... Yeah, it scored, like, a staggering 135 on a Mensa test. I mean, that's firmly in the genius category, way above the average human, which is like 90 to 110. And what's really striking looking at the sources

01:26

is it's not just o3. You've got Claude 4 Sonnet hitting 127, Gemini 2.0 Flash Thinking right there at 126, Gemini 2.5 Pro at 124. Wow. These are consistently high scores. And the really interesting bit, the sources say the top 10 performers on these tests, all text-only models, every single one. Right. That feels a bit backwards, doesn't it? Because all the buzz is about multimodal AI, models that can see and hear and process text. Yeah, the ones that seem more human-like

01:55

in their inputs. But here's where it gets really interesting and maybe a bit confusing based on what we're reading. The models that can see, the multimodals. Like GPT-4o with vision. Precisely. When those models were given tasks needing structured reasoning, especially involving visual stuff, you know, things you'd expect them to be good at, they actually performed worse than average humans. Significantly worse sometimes. So this

02:18

raises a big question. What is going on? Why is GPT-4o Vision scoring, what was it, 63? And Grok 3 Think Vision scoring 60. Yeah, those numbers are kind of shocking. I mean, in human terms, those scores are down in the borderline intellectual disability range. So you have this massive, kind of perplexing gap opening up. Genius text models here and struggling vision models over there, at least on these specific reasoning tasks. So what's the takeaway here? The sources

02:47

seem to be pointing out this mismatch. Despite all the excitement and marketing around multimodal AI, they're just not there yet when it comes to complex logic combined with visual understanding. Right. It kind of widens that gap between the hype and today's reality. Maybe, just maybe, if your problem is pure logic or abstract reasoning, a simpler text model is actually, you know, smarter

03:08

for now. Even Meta's Llama 4 Maverick, which does handle vision, the source mentioned it scored 105, which is OK, above average, but not in that elite tier with the text-only champs. Yeah. Interesting distinction. And we should remember these IQ scores, while they tell us something about a certain kind of intelligence, they're definitely not the whole story. Right. One source highlights this Apple study looking at large

03:34

reasoning models. Yeah. Yeah. The big names again, o3, Claude 3.7 Sonnet, DeepSeek R1, Gemini, and found something kind of critical. They can actually just collapse. Completely. When you throw really complex problems at them. Wow. Collapse. You mean just stop working? Not even a bad answer? Just nothing? Pretty much. That's pretty eye-opening. So it's not just about the peak score they can hit. It's about, like, robustness. Yeah. How they handle the hard stuff. Exactly. Brittleness

04:01

is the word that comes to mind. And another source, they ran a bunch of coding tests on, like, 14 major LLMs. Oh, yeah. How'd that go? Well, they found five clear winners. Models that consistently produced good code. But they also identified models you'd probably want to avoid if you're relying on them for coding. And the interesting point was that just having "Pro" in the name doesn't guarantee better code. Performance varies wildly depending on the specific task you give it. Right.

04:31

Task-specific performance. Makes sense. And speaking of specific tasks, there was that mention of a YC-backed startup. They built a research agent. Yeah, a frontier research agent. And it scored, what, 94.9% on OpenAI's SimpleQA benchmark? Which is all about answering short, fact-seeking questions accurately. High score. So it seems like when you focus an AI on one specific thing, it can get incredibly good at that thing, even if the general-purpose models are still figuring

04:58

things out. Yeah, specialized versus generalized intelligence. That's a key theme, I think. Okay, let's shift gears maybe. From performance and limits to more practical stuff. What's cool is seeing AI features actually starting to show up in everyday tools. Or just entirely new tools based on AI. Right, like Google Gemini is apparently testing temporary chats. That lets you talk to it without the data being used for training. Kind of like ChatGPT's incognito mode.

05:25

Yeah, that feels like a response to privacy concerns, giving users more control, which is good. And Gemini is also adding recurring tasks, you know, scheduling things. Oh, like ChatGPT already has. Exactly. Catching up on features. They might seem small, but for everyday use, setting reminders, automating little things, that's actually pretty useful. Totally. And the sources listed a whole bunch of specific tools, too. Did any jump out at you? Well, Glimpse turning photos into

05:51

videos in the browser sounds kind of neat. Yeah. And Kling AI 2.1 promising faster, cheaper, better video rendering. That could be a big deal for creators. Yeah, definitely. And Moonlit for building content workflows. Fusebase AI agents for teamwork, like a smarter Notion maybe. And Agora, an AI search engine just for e-commerce. Lots of niche applications popping up. Plus those quick hits. Apple doing live translation in Messages and FaceTime. That's pretty cool right on your

06:16

phone. Microsoft putting out a free AI video creator. Oh, the wildly easy to use one. That's the one. And then on the flip side, Anthropic quietly killing its Claude Explains blog after just a month. Oh, really? I missed that. Yeah, it just shows how fast things change. Not every idea sticks. Yeah. Even for the big players. True. The pace is just relentless. And there were mentions of resources, too, like tutorials for prompt engineering, guides for building your

06:44

own AI research assistant. Yeah, it gives you a feel for what people are actively trying to do with this tech beyond just chatting with it. Right, building things. But, you know, it's not all just cool tools and high scores. The sources also included a really tough story. Okay. A 16-year-old boy who tragically died by suicide. Apparently, criminals used fake AI-generated nude photos to blackmail him for $3,000. Oh, man. That's horrific. AI-generated fakes. Yeah.

07:14

And the FBI is warning these kinds of scams are targeting more teens. It's just a really stark, heartbreaking reminder of the potential for misuse. These tools can cause devastating real-world harm. Oh, God, that's just awful. It really just, it slams home the need for more awareness, right? More protection, especially for kids. Education about this stuff and, frankly, consequences for the people doing it. It's not abstract anymore. Absolutely. The dark side is incredibly real

07:39

and dangerous. Shifting gears completely. To the business world. OK. The sources also touched on the money side. This company, Anysphere, they just raised $900 million. Wow. $900 million. Yeah. Huge funding round. Right. Puts their valuation at nearly $10 billion. And apparently their AI tool is already bringing in $200 million a year. That's serious cash. Shows the level of belief and maybe progress in certain corners of the AI business world. Definitely. And I thought

08:10

it was interesting. The Zapier CEO shared their internal chart for measuring AI fluency. Oh, yeah. What was that like? A scale from unacceptable use of AI all the way up to transformative. Kind of makes you think, doesn't it? Where do you or your company fall on that spectrum right now? Yeah, that's a good self-assessment tool. How well are you actually using this stuff? Right. Practical perspective. OK, let's wrap up with something really forward-looking, kind of mind

08:34

bending, actually. One source dives into this new biomolecular AI from MIT and a company called Recursion. It's called Boltz-2. Yeah, this was pretty wild. What's really fascinating is how it's aiming to speed up drug research dramatically. Okay, so Boltz-2 predicts something called binding affinity, which is basically how strongly a potential drug molecule will stick to its target in the body, like a protein involved in a disease. Right, that's crucial for a drug to work. Exactly. Getting

09:04

that prediction right is key. Boltz-2 does it with, they say, physics-grade accuracy. But here's the kicker. It does it a thousand times faster than the old-school computer simulations. Wait, say that again? A thousand times faster? A thousand times, yeah. Wow. That's not incremental. That's transformative speed. What exactly does it predict? Just that sticking power? No, it's more comprehensive. It's a foundation model,

09:28

kind of like an LLM, but for biology. It predicts both the 3D shapes of molecules and how they bind together. It builds on their earlier model, Boltz-1, which was already seen as an open-source competitor to AlphaFold 3 for structure prediction. OK, AlphaFold was huge for protein shapes. Right. But Boltz-2 is apparently the first AI to model both the structure and the binding affinity together, jointly, in one go. And that combined approach seems to be the key. So it's accurate, too, not

09:57

just fast. That's what the source claims. They say it matches the accuracy of really complex, slow physics-based methods and beats other methods on standard tests like OpenFE and CASP16. Those are like the Olympics for these kinds of predictions. And importantly, they apparently tested some of its predictions in the real world, prospectively, and confirmed they were strong binders. So it's validated. OK, so it's fast, accurate, validated, and built for practical use. Optimized for GPUs

10:24

and stuff. Yeah, exactly. Designed for large-scale deployment. And their goal, according to the source, is pretty ambitious. They want Boltz-2 to be the go-to open platform for structure and affinity. Kind of like what AlphaFold became just for structure. Wow. If they pull that off, it could genuinely change the pace of discovering new medicines. Totally reshape it. Okay, that was quite a journey. We went from AI being a text genius, but maybe visually challenged.

10:51

Uh-huh, the paradox. To spotting limitations, checking out all those new tools, facing the really serious risks. Yeah, the blackmail story was heavy. Seeing the huge money involved and then blasting off into predicting molecules a thousand times faster for drug discovery. It's a lot. It really shows the sheer breadth of AI today, doesn't it? From grappling with logic problems to simulating physics at incredible speeds. It really makes you wonder

11:16

about the future. Like, how do we bridge that gap between text smarts and visual reasoning? And the big one: how do we make sure these incredibly powerful tools are used for good, like finding cures, and not for horrific things like those scams? And maybe this raises an even deeper question

11:32

for you to think about. Given how AI is soaring in these super-specialized, complex areas like drug research, while still fumbling with things humans find easy, like some visual tasks, what does that split tell us about intelligence itself? You know, human intelligence versus artificial intelligence. What even is it? Definitely something that you want.

Transcript source: Provided by creator in RSS feed