
AGI is a Vibe Now

Apr 22, 2025 · 30 min

Summary

Justin and Andrew explore the complex debate around Artificial General Intelligence (AGI), discussing whether recent AI models have already achieved AGI or if the definition itself is subjective. They examine Tyler Cowen's "AGI moment," the challenges of evaluating AI capabilities, and the concept of a "jagged frontier" in AI development. The conversation also covers the practical implications of AGI, the potential for task-specific AI evaluations, and the societal impact of these advancements.

Episode description

Justin and Andrew dive into the growing debate over whether we’ve already crossed the AGI threshold — or if we’ll even know when we do. From Tyler Cowen’s “AGI moment” with OpenAI’s o3 model to the jagged frontier of modern evals, they explore why defining intelligence is more subjective than ever.


Chapters


00:00 - Intro

02:20 - The AGI Moment

10:22 - Is o3 or Gemini 2.5 AGI?

14:50 - The Pseudosingularity

20:16 - Humanity's Last Exam

Hosted on Acast. See acast.com/privacy for more information.

Transcript

Hello and welcome, everybody, to The Attention Mechanism. My name is Justin Robert Young, joined as always by Mr. Andrew Mayne. Hey, Justin. Or, as we like to call it, the NVIDIA support curve. Man, this is like some Wrath of the Titans sort of thing. The gods have punished us for our hubris. Oh, boy, our hubris. When the DeepSeek announcement came out and NVIDIA took a hit, we were all like, this is kind of crazy, because it was quite the overreaction.

Yeah. Because if anything, with DeepSeek, every eight months or so you get a pretty good development in AI optimization, and this was kind of right on target. We got that. But the real deal was that they made their own reasoning model. The thing about reasoning models is they use a lot of, when we talk about inference, that's when you actually ask the model to compute. There's training and there's inference. Inference is when ChatGPT answers a question.

And two, when you use a reasoning model, which means the model gets to kind of keep going over it and thinking about it, it uses a lot more compute. So we figured, well, this just means you're going to need more compute. So we went in big on NVIDIA. And then, yada yada yada, global trade wars. Yeah, there's been good news, there's been up news, there's been down news. And, you know, I don't think we're getting...

Totally murdered. Well, with my NVIDIA position, I was buying it way earlier, so I'm still way ahead overall. To me, it's part of the merry-go-round. And it's also like, I'm a "just don't look" guy. It's like that Simpsons episode: just don't look. That's why I'm an investor, not a trader. I don't have the stomach for trading or gambling like that. To me, I just go, yep, I believe in this, and when I need this money or want to cash out,

ten years from now, whatever, hopefully this is the right move. But I'm still very bullish on the NVIDIA play. But, man. I think the reason why is because we have a good sense of the guiding light of where we're headed, and part of that is AGI and the evaluations to get to such heights. And so we want to spend today talking about evals and where we are right now with them. And part of what brought us to this is there's been a lot of tremendous reaction to, specifically, Google Gemini 2.5

and OpenAI's o3 models. They are extraordinarily capable. The faster, cheaper versions are also very, very handy. But it led to famed economist Tyler Cowen doing a blog post saying that, for him, and he marked it as April 16th, April 16th is the day that he found AGI. Because for him, when he interacted with o3, it was a general human intelligence for the things that he wanted to

ask it about. And that has kind of set off this general question of, okay, if we are now unquestionably closer than ever to an artificial general intelligence, how do we tell? And that adds into our further question of where we are right now when it comes to evals, evaluations of the talents of these models. So take any of those bits of clay there and make your message. To give everybody kind of a grounded definition of what I think about when I think about AGI, I would say:

imagine you name a list of about 50 top professionals, you know, and you could email them. If you had a list of 50 people, professionals in different industries, and you could email them, and they could give you responses and do meaningful work. If an AI can do the same thing, then we'd call that kind of an AGI. The idea is, you know, the format doesn't have to be embodied, doesn't have to be physically present, but literally,

can I put it in Slack? And if I said this was our legal advisor or this was our coder or whatever, would you get that kind of work? And there's a term I think I first heard Ethan Mollick say, which is what we call the jagged frontier. And I saw this early on, you know, when playing with GPT-2 and GPT-3: you'd see it be really capable in some ways and then extremely dumb in others.

And some people were quick to point to where it was not good, like, oh, this is a failing, LLMs aren't going to work. Well, we kept finding out that every time we trained a new model, that frontier shifted. It moved further out. It gained new capabilities. But we would sometimes be surprised, like, why couldn't it do this? The strawberry question was a great example of that: how many R's are in strawberry?

You go, why can't it do that? And I can explain it to you: when you look at how it tokenizes a word, that R-R-R just doesn't register. It's a numerical value, not individual letters, so when you see it, like, I get why it's getting it wrong.
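To make that tokenization point concrete, here is a minimal sketch, assuming Python and OpenAI's tiktoken library; the exact token split is tokenizer-dependent, and the counting function stands in for the kind of tool an agentic model can spin up for itself, which comes up below.

```python
# Minimal sketch of the tokenization point, assuming OpenAI's tiktoken library.
# The exact split depends on the tokenizer; the point is that the model receives
# numeric token IDs for multi-letter chunks, never individual letters.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
token_ids = enc.encode("strawberry")
print(token_ids)                              # a short list of integers
print([enc.decode([t]) for t in token_ids])   # multi-letter chunks, not letters

# The kind of "tool" an agentic model can spin up for itself: plain letter counting.
def count_letter(word: str, letter: str) -> int:
    return word.lower().count(letter.lower())

print(count_letter("strawberry", "r"))  # 3
```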

And I can explain why the models had trouble with certain kinds of math, because they were trying to represent numbers with groups of tokens and stuff. But it still doesn't solve the problem. It got it wrong. And if you want it to be intelligent, well, a smart person is able to tell you how many R's are in strawberry. There was a period, I would say, about

two years ago, maybe even 18 months ago, where there was some question of like, have we reached this boundary? Is there going to be a hard boundary for what LLMs can do, like language models particularly, or transformer-based models?

For some people, they thought, yeah, there's going to be this boundary. There's going to be certain problems they won't be able to solve. And other people said, no, we think they can, which isn't an argument to say there wouldn't be better architectures or ways to do it. But I became pretty convinced like, no, because I know if I break a thing down or I let it step back and look at it from a different point of view, it's very good at solving these things.

And if I give it, you know, we talk about tool use, if I say, hey, how many R's are in strawberry, and one of the things it can do is spin up a piece of code that counts R's, it will solve that problem. And that tells me two things. One is that an agentic model, we'd call a model that uses a tool to solve a problem agentic, can solve it. And two,

I can probably then use that output to train a model later on, so it would recognize and know, like, oh yeah, for some words the tokenization is different from the spelling. And so I'd say there's two paths. One is building better and better tool-use models that do this. The second is just getting more and more training data. Because, you know, an example would be, if you asked a model, hey, I want you to generate, you know,

100 words of text about robots, right? GPT-4, Claude, one of these others, they might be off by three or four words. People would point at that, but it's like, I could go take GPT-3.5 and give it a thousand examples and train it to hit the count exactly. It just meant it had never been trained for that kind of problem, and it was good enough that an approximate solution was fine.
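A rough sketch of that second path, with hypothetical data and file names rather than anyone's real pipeline: the key idea is that code, not the model, verifies the word count before an example makes it into the fine-tuning set.

```python
# Hypothetical example: building a small fine-tuning set that teaches a model to
# hit an exact word count. A verifier keeps only examples with the right length.
import json
import random

def exact_word_count(text: str, n: int) -> bool:
    # Simple whitespace word count; a real pipeline might use a stricter definition.
    return len(text.split()) == n

def make_example(topic: str, n_words: int) -> dict:
    # Placeholder completion of exactly n_words; in practice a stronger model or a
    # human would write real text, and only verified examples would be kept.
    completion = " ".join([topic] * n_words)
    assert exact_word_count(completion, n_words)
    return {
        "messages": [
            {"role": "user", "content": f"Write exactly {n_words} words about {topic}."},
            {"role": "assistant", "content": completion},
        ]
    }

with open("word_count_finetune.jsonl", "w") as f:
    for _ in range(1000):
        example = make_example(random.choice(["robots", "volcanoes"]), random.randint(50, 150))
        f.write(json.dumps(example) + "\n")
```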

So now we're on this sort of jagged frontier. We're running into certain things where, you know, you get a vision model which is extremely capable, people are using it for geo-guessing. And then there was an image that made the rounds where somebody drew, like, four stick figures in different colors, and it had four names, and then these little lines that went from the names and looped around the page and pointed toward a stick figure.

And somebody was like, hey, well, at least the models can't solve this problem. And I'm like, well, it actually can. I get why a pure vision model won't, because of the way it visualizes stuff in patches. And, you know, when you and I try to solve that problem, we literally can't just look at it once and solve it. Our eye has to trace the paths.

When I let a tool-using model do that, so I put up a tweet yesterday where I showed, nope, o3 will solve this. I just have to say, imagine bounding boxes around the characters and trace the lines, and then it solves the problem. The point I'm getting at is that we're seeing this frontier move. We're getting to the point where someone like Tyler Cowen, you know, interacts with o3,

and it's probably read, you know, and I hesitate to say this about him specifically, maybe it's not true, but it's probably read more articles on economics than he has. It's consumed more data. It's able to generalize a thesis about things. If you said, you know, if I were a fan of Ludwig von Mises and I looked at this, versus if I were a Keynesian and I thought about this, the model understands enough of those things to be able to roughly interpret that.

And I think for the kinds of questions he's asking, and he's an extremely brilliant person... Every day I see a new thing: I asked ChatGPT about this medical issue. We saw one a couple of days ago, somebody was having a clicking in their jaw.

They said they'd spoken to doctors and never had any relief for years. They asked ChatGPT and it said right away, oh, it's probably this form of TMJ, do this with your tongue, do this. And they said it worked. And other people in the forum were like, hey, it worked for me, too. I saw another case from somebody who got a diagnosis that they had to go to the hospital because they were having some sort of torsion or something like that. So you're seeing this thing, medically,

we're getting way more positive stories about it medically than I thought we would. And I'm also glad the AI companies don't restrict it from giving you medical suggestions, you know, because they have put in a lot of safeguards, like call 911, talk to a doctor, do this. We're at this point where, yes, as Tyler Cowen sort of said, it kind of becomes subjective for each person to figure out how they're going to judge it.

And it gets hard to understand evals, because an AI company puts out a model and says, hey, on MMLU, you know, we scored a 42 over a 41, or on SWE-bench we did this. And these things are helpful. But we also have a problem, which is you can game a benchmark: one, obviously, you can train on the example set, but even if the model just learns those kinds of questions really well, it becomes very good at those kinds of questions,

but when you go outside of distribution, the model maybe isn't as capable. So it's hard to even look at benchmarks and know if a model's good. Is o3 or Gemini 2.5 AGI? Not by my definition, but... I understand. What would your eval then be? What is your line where you would say, okay, it can do this, I am willing to have it cross that Rubicon?

With these models, I would have to try, you know. I ran a problem by Gemini 2.5 Pro, which is a great model, but I came across this stubborn space where it could only do so much to solve it. It kept giving me the same answer. I'm like, no, do something different. And I got it into that stubborn cul-de-sac where it just couldn't get out of there. You know?

And not to say people can't be like that either, but it got to be a little bit frustrating. And people love the model. It's not a knock on the model, it's just saying that we're reaching that frontier. I would say with o3, I had a similar situation where I was trying to solve a particular

coding problem, where I kept getting into this sort of thing, and maybe it was on me that I wasn't as articulate about it as I could have been. But you can encounter some areas of limitation, particularly in the places where models really struggle:

the longer the conversation goes on, the harder it is to answer your question as well as it could with all that context. Like, if you use Claude, and I think Claude is a great model, but it's frustrating when you use Anthropic's Claude app and you see the little thing like, hey, your context is pretty long, maybe you should start a new conversation. I'm like, that's not helpful.

I'm in the middle of a conversation trying to solve a problem, and, oh, I'll just start a new thread? No. That's what I love about OpenAI, one of the many reasons: they're like, yeah, keep going, we'll try to figure out how to summarize in the background or whatever. Because

telling me to start a new thread, I get that it's computationally expensive, but solve that for me. Yeah, that should not be on the user. The user should not have to modify their behavior. That is something on the back end that should be serviced. So I would say we're at a point where everybody's mileage varies. I think for conversation on a bunch of stuff, yeah, these models, like if you ask me about talking through fiction and books,

is this what I expect AGI to look like? A hundred percent. And so I'd say my jagged frontier works for certain things, and for those I'd say, yeah, it's AGI-like. And I think that's what Tyler Cowen was trying to say, that for the kinds of tasks he's looking at doing... And I think that, yeah, if we hopped out of a time machine

ten years ago, six years ago, right before GPT-3, and we showed this to people, the experts would have said we're 30 years away from this, we're way further away than we actually were. But the fact that we are five years out from the release of GPT-3,

which, you know, is half a decade, but look how much progress we've made, considering a month before GPT-3 came out there were prominent people making very specific predictions about what these models wouldn't be able to do. Those predictions were then, of course, blown away months later. I wonder whether or not AGI is a useful definition right now, because obviously it's there to describe a milestone, and

I don't know if it's going to hit the way we kind of thought. I think people thought of AGI like the man on the moon: there's going to be an undeniable moment when it happens, and then there is a before AGI and an after AGI. And right now, Tyler Cowen's like, look, I talked to o3, and it's a smart grad student. Maybe not the smartest grad student I've ever had, but it is a smart grad student.

Maybe one day, I can see in a year and a half, two years, it's the smartest, if not smarter than I am. It's not just able to summarize economic theory, it can tell me the next six months of whatever if I plug the right data into it. But these models are so broad, and people use them for so many different things, that I almost wonder whether evaluations and frameworks should be AGI per task. What we should be doing is trying to parcel this out into 20 different

you know, areas, like the Olympics, and then you tell us when we've hit a gold medal. Yeah. So there are some benchmarks based on that. And I'll just throw out a phrase that I've been using for a while: I call it the pseudo-singularity. Because if you really look at how technological change happens, it's never really one day we didn't have it and the next day we did. In retrospect we always feel like one day we didn't have the internet, then we did. Well, no, we had

a decade or more of America Online and CompuServe and BBSs and stuff. So if you were online, you could have been online before Tim Berners-Lee and HTML and web pages. The web felt kind of like a thing you'd seen before, but it was a radical idea, and then it started to get an explosion. But even then, you know, there's the famous Today Show clip of them going on about "what's the World Wide Web?" It's quaint, but it just shows you

how these things actually happen. Like, I don't know when quadcopters became a thing. I literally don't know. It feels like a Mandela effect, that one day there were just these things called quadcopters. We can remember the iPhone. The iPhone moment was a pretty significant moment. But even then, there were the rumors. Somebody showed, oh, here's a prescient clip of somebody interviewing Steve Jobs and asking about the iPhone two years before there was even such a thing.

You know when people started asking about the iPhone? After the iMac. Once they put an i in front of everything, including the iPod, then it's that. And so we live with these ideas a lot longer, and we get these capabilities gradually. So with the singularity idea, I'm like, no, a lot of things are going to feel like it before the moment we actually pass it.

But, you know, there's a real idea of what that actually means. You get so close to these things, this rapid pace of advancement, that you kind of feel like you're already there. And that's certainly how we feel about AI development. It won't slow down. And I do think that when we hit the moment of, oh yes, we reached AGI, the day before and the day after are not going to feel very different.

You know, the year before and the year after may not feel like much of a shift in the pace, because I think we're already at an accelerating pace. So there are evals that do things like this: they go look at top coding problems that are put out, like, what are coding problems people are paying to solve, can you solve this thing for me? And they see if the AI can do it. So you get a metric of how much money,

how many coding problems is it able to do, what theoretical value would it be able to generate. So they have metrics, particularly when it comes to code, because those are very clear objectives to say, okay, AI can solve these sorts of things. A lot of other things too: if you really want to see what went away in a lot of Fiverr tasks, look at the market value of Fiverr, and you see an impact there.

And so there are impact studies that show you a lot of the soft-edge sort of things there. But when it comes to evals, there are some evals trying to do that. The definition Sam Altman has used for AGI is when it's able to do, basically, you know, replicate just about any knowledge task, but more efficiently.

Because if you're paying more to do it, that doesn't make as much sense. And it's a reason OpenAI has floated the idea, floated rather, that they might do a $2,000-a-month plan or a $20,000-a-month plan. And people go, oh, that's crazy. I think, well, you're thinking about the systems we have right now, and there are already people who swear by Deep Research.

And if I get an agentic system for my code, like I've been using, by the way, OpenAI released an open-source tool, Codex, which is similar to the Claude Code thing. There have been a number of CLI tools, command-line tools, that let you write code, so it's just one of the latest in a long line of these things. But I've been using this to build apps just by typing in the command line, not even using Windsurf or Cursor.

And I forget the base model it's using, but I think it's, you know, o4-mini or 4.1. It's amazing. I've built really cool stuff just going back and forth with a little systematic framework. And we're so early at figuring out how to use these tools to talk to each other. By the way, there is a memory I have in my head. It had to be around a year from when we first met each other, and this is in

the late '90s, where you were very, very excited because you had done all of your Christmas shopping on Amazon.com. And it was amazing that you could buy stuff without going anywhere, which I feel like is a good marker of where the internet was. It was viable enough that you wanted to put your credit card into it, that you wanted to brag about it, that you knew other people were going to

talk about it. There was at least half a decade of internet culture that had already existed leading up to that point. I was certainly obsessed with various websites. And it was in 1998, around the time I imagine you and I had that conversation, that Paul Krugman said the internet was not going to be any more impactful than the fax machine. So, to give us a sense of where

the timelines are, of when things are actually happening versus when we start having these conversations, I almost wonder whether AGI is something that doesn't really get internalized by the broader public until there are products demonstrating human capacity. Maybe if we're in a situation where OpenAI is selling a lot of those

$20,000-a-month things, then it's like, oh, okay, well, now AGI is here. So I'll give you an example of something that's out right now. We've talked before about how the next big step is letting the AI control a browser, like just running the browser, right? And OpenAI came out with Operator. A lot of excitement. I did a video on Operator. I've not used Operator since then, because the problem with Operator is it's slow, and it's not as smart as it could be.

Now, there's a team working on it, getting data and eventually building a better version of it. And with Operator, a lot of people came out like, ah, it's cool, but then I have to do this, and then it failed on that. Well, you know, as a guy who was trying to get people to use GPT-3 early on, I'd say this thing's great, and they'd say, yeah, but I've tried to do this, and I'm like, well, that's not a realistic thing to expect it to do.

But that's not on them, that's on me or the tool. They go, this is useful to me if it can do this, and everybody has that: this is useful to me if it can do this. And I can't tell them, well, it should be useful to you. Like, no. Hacker News is filled with people who tell you these models still aren't good at coding. You can still find that, but there are a hell of a lot more people using them to code than before. But what will happen is

they'll go from GPT-4 to 4.1, or they'll use some other model, they'll use o4-mini, or some model finally changes things for them. And you're going to see this sort of incremental thing. I was raving about ChatGPT in March, before it was called ChatGPT, when it was GPT-3.5, right? And then ChatGPT had a layer of fine-tuning put on top of it, or post-training on top of it,

to make it easier for people to understand. It worked in a conversational format, but the core capability was there with GPT-3.5, and I did an old blog post about this, showing all these things, like, oh my god, this is amazing, everybody should check this out. And people were like, yeah, this is kind of cool, this is kind of cool. Then the format changes, the context changes, and all of a sudden ChatGPT blows up the world. And I think you're going to see that with certain things, like a tool like Operator:

it's quietly going to get a lot better, and then all of a sudden a next step becomes a big, huge step. And then you're going to find that you do all your show prep with it. And then you're going to get an Operator that's able to operate a web editor or whatever. Then you're going to start thinking, oh, okay, you know, I've got three employees now, how do I make the most use of them now that I have these tools? And that's the growth mentality. It's great:

I worked hard to find some talented people I like to work with, so how do we become even more powerful? The other mentality is, how do I replace them, which I think is going to be a bad one, because the hardest thing to find is going to be talented people. The hardest thing in the world to find is trust and belief. And, you know, no matter what,

that's the thing that you're always going to need the most, especially in business, especially in small business. You need people who understand the vision and want to move things forward. As for taking people out, you know, look, there are always industries that overgrow in certain areas and wind up flexing and contracting in different places. But to me,

the benefits that come from these models are always funneled into, well, what else can I do now that this is easier? What else is possible for me now that this is easier? And every step of the road, that's been the case for me.

Yeah, I'd say on the subject of evals, my favorite eval is probably Humanity's Last Exam. And the reason is it's an eval where they go and ask experts, professors and people across wide disciplines, from ancient languages to mathematics to medicine, and say, come up with a really good question that would be like a graduate-level question, but something that's not necessarily in the course material and takes a lot of thinking.

And then they have a set that's publicly available, and they have a holdout set. And this is the thing: the public set is out there, so when you see a model and they say, oh, we scored this on HLE, you have to ask, was that on the public set or the holdout set? And so there are some models out there that I think are pretty good, but

I don't want to name names, but for one model in particular, there's a software engineering benchmark, and they show last year's score for the model and say, look how amazing it is. But then you look at this year's score and there's an eight-point drop. Why is that?

Because you trained on last year's data. It's a good model, but if I look at the most current version of the data that you couldn't have trained on, I see this huge drop, and that shows you there's dataset contamination. And I'm sure you could point to that being a common thing everywhere.
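As a rough illustration of how you would spot that kind of contamination, here is a minimal sketch with hypothetical function names and made-up numbers, not any lab's actual harness: compare accuracy on questions released before the model's training cutoff against questions written after it.

```python
# Hypothetical contamination check: a large accuracy gap between an "old" split
# (possibly seen in training) and a "new" split (written after the cutoff) is a
# red flag for dataset contamination.
from statistics import mean

def accuracy(results: list[bool]) -> float:
    return mean(results) if results else 0.0

def contamination_gap(results_old: list[bool], results_new: list[bool],
                      threshold: float = 0.05) -> bool:
    gap = accuracy(results_old) - accuracy(results_new)
    print(f"old split: {accuracy(results_old):.1%}, "
          f"new split: {accuracy(results_new):.1%}, gap: {gap:+.1%}")
    return gap > threshold

# Illustrative numbers only, echoing the roughly eight-point drop mentioned above.
old = [True] * 46 + [False] * 54   # 46% on last year's problems
new = [True] * 38 + [False] * 62   # 38% on this year's problems
print(contamination_gap(old, new)) # True -> worth investigating
```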

So Humanity's Last Exam is cool in the sense that when you go to them and say, hey, I want you to test the model, it gets the holdout set, and in theory you don't have those questions to train on. And that's been great. And Humanity's Last Exam, again, is a much harder benchmark to game because it's not any one specific kind of question. One of the example questions is an image of a Roman... tombstone.

A tablet. Yeah, an inscription originally found on a Roman tombstone, and it's in Palmyrene script, you know. And so you've all of a sudden got to figure out how to recognize the text, transcribe it, all that sort of stuff.

You go through, and they have other examples that get into linguistics, translating Hebrew and whatnot, and recognizing chemical formulas. And these are human-written, really high-level questions. So again, it's super hard to game this. In order for OpenAI to convince Bill Gates to give his blessing to the Microsoft deal, his big thing was AP Bio: could we have a model that could get passing grades on an AP Bio exam?

And I remember working at OpenAI where we were testing, and we couldn't just train it on existing answers, because the bio exam would have new questions. And finally, when it aced AP Bio, when you could give it completely new questions and it would do it, that was a big moment. So now I think Humanity's Last Exam is a great one. So when you look at that eval, how many questions does it have? Let's say we have our own LLM and we go to Humanity's Last Exam.

We're asking for the holdout questions. How many questions do we get? I think 2,500. How many? Damn, that's a lot of questions. So right now in Humanity's Last Exam, here's the ranking. Number one, o3 high. That is OpenAI's o3 model tuned to high reasoning effort, meaning using a ton of tokens to do it, which is not a cheap way to do it, it's an expensive way to do it. Then they have o3 medium, which uses fewer tokens.

So that's 20.3 for o3 high, 19.2 for o3 medium, and third is Gemini 2.5 Pro Experimental, which is at 18 percent. And that's a great score, by the way, because that's a super cheap model. So basically it's not the number one model, not the smartest model, but for price per compute, it's a great score.

Number three is, or fourth, yeah, fourth is o4-mini in high mode. Fifth is o4-mini in medium mode. And then following behind that is, you know, Gemini 2.5 Flash. And all of those, including Claude Sonnet 3.7, are ahead of where o1 was in December. Yeah, what's crazy is that when you look at the leaderboard, everything's happened in the last, like, six weeks.

We are in a boom time. Anybody who thinks this is slowing down is kidding themselves. Yeah. And I think OpenAI has been very good at setting the paradigms. Still is. Like,

every model we use today is a variant of GPT-4. And people are like, oh, the transformer was invented at Google, but yeah, the version of that we have today is very, very different, and that was following the roadmap of GPT-3, GPT-4, and so we get people who follow that. The reasoning, the idea of using inference-time compute, that was an OpenAI paradigm.

Same with the Deep Research thing, of not just answering right away but doing the research over time, which is great. That being said, Google and its optimizations, what they've been able to do to get efficiency, when you look at the price per compute,

it's incredible, you know, extremely cheap. And I think Grok 3, you know, the Grok 3 benchmarks, I have a couple of asterisks next to them, but a lot of people love the model and it's very inexpensive. Another great one; they've done some really good stuff in making that thing highly optimized. Well, we will keep an eye on it, and we would love to hear from you guys. What is your AGI benchmark? Have you seen AGI in your own life? You can go ahead and

send us, well, just hit us up on Twitter, at JustinRYoung for me, at AndrewMayne for my great co-host. Anywhere else we want to direct people? No, that's good. Just, you know, ask for updates. Might be some cool stuff in the future. Hopefully. We'll see. We'll see. Until next time, this is your old pal.

This transcript was generated by Metacast using AI and may contain inaccuracies.