Okay, today I have the pleasure of speaking with Francois Chollet, who is an AI researcher at Google and the creator of Keras. He's launching a prize in collaboration with Mike Knoop, the co-founder of Zapier, whom we'll also be talking to in a second: a million-dollar prize to solve the ARC benchmark that he created. So first question: what is the ARC benchmark, and why do you even need this prize? Why won't the biggest LLM we have in a year be able to just saturate it?
Sure, so ARC is intended as a kind of IQ test for machine intelligence, and what makes it different from most LLM benchmarks out there is that it's designed to be resistant to memorization. So if you look at the way LLMs work, they are basically this big interpolative memory, and the way you scale up their capabilities is by trying to cram as much knowledge and as many patterns as possible into them. And by contrast, ARC does not require a lot of knowledge at all.
It's designed to only require what's known as core knowledge, which is basic knowledge about things like elementary physics, objectness, counting, that sort of thing. The sort of knowledge that any four-year-old or five-year-old possesses, right? But what's interesting is that each puzzle in ARC is novel. It's something that you've probably not encountered before, even if you've memorized the entire internet. And that's what makes ARC challenging for LLMs.
And so far, LLMs have not been doing very well on it. In fact, the approaches that are working well are more towards discrete program search, program synthesis. So first of all, I'll make a comment that I'm glad that, as a skeptic of LLMs, you have put out a benchmark yourself. Is it accurate to say that, suppose the biggest model we have in a year is able to get 80 percent on this, then your view would be that we are on track to AGI with LLMs? How would you think about that? Right.
I'm pretty skeptical that you're going to see an LLM do 80 percent in a year. That said, if we do see it, you would also have to look at how this was achieved. If you just train the model on millions or billions of puzzles similar to ARC, so that you're relying on the ability to have some overlap between the tasks that you train on and the tasks that you're going to see at test time, then you're still using memorization, right?
And maybe it can work. Hopefully ARC is going to be good enough that it's going to be resistant to this sort of attempt at brute forcing. But you never know, maybe it could happen. I'm not saying it's not going to happen. ARC is not a perfect benchmark. Maybe it has flaws. Maybe it could be hacked in that way. I'm curious about what GPT-5 would have to do for you to be very confident that it's on the path to AGI.
What would make me change my mind about LLMs is basically if I start seeing a critical mass of cases where you show the model something it has not seen before, a task that's actually novel from the perspective of its training data, something that's not in the training data, and it can actually adapt on the fly. This is true for LLMs, but really this would catch my attention with any AI technique out there.
If I can see the ability to adapt to novelty on the fly, to pick up new skills efficiently, then I would be extremely interested. I would think this is on the path to AGI. So the advantage they have is that they do get to see everything. Maybe I'll take issue with how much they are relying on that, but let's suppose that they are relying on it. Obviously they're relying on that more than humans do.
To the extent that they do have so much in distribution, to the extent that we have trouble distinguishing whether an example is in distribution or not, well, if they have everything in distribution, then they can do everything that we can do. Maybe it's not in distribution for us. Why is it so crucial that it has to be out of distribution for them? Why can't we just leverage the fact that they do get to see everything? Right.
You're asking basically what's the difference between actual intelligence, which is the ability to adapt to things you've not been prepared for, and pure memorization, like reciting what you've seen before. It's not just some semantic difference. The big difference is that you can never pre-train on everything that you might see at test time, because the world changes all the time.
So it's not just the fact that the space of possible tasks is infinite, and even if you're trained on millions of them, you've only seen zero percent of the total space. It's also the fact that the world is changing every day. This is why we, the human species, have developed intelligence in the first place. If there were such a thing as a distribution for the world, for the universe, for our lives, then we would not need intelligence at all.
In fact, many creatures, many insects, for instance, do not have intelligence. Instead, what they have in their connectome, in their genes, is hard-coded behavioral programs that map some stimuli to a proper response. And they can actually navigate their lives, their environment, in a way that's very evolutionarily fit, without needing to learn anything.
And if our environment were static enough, predictable enough, what would have happened is that evolution would have found the perfect behavioral program, a hard-coded, static behavioral program. It would have written it into our genes. We would have a hard-coded brain connectome, and that's what we would be running on. But no, that's not what happened. Instead, we have general intelligence.
So we are born with extremely little knowledge about the world, but we are born with the ability to learn very efficiently and to adapt in the face of things that we've never seen before. And that's what makes us unique. And that's what is really, really challenging to recreate in machines. I want to rabbit-hole on that a little bit. But before I do that, maybe I'm going to overlay some examples of what an ARC-like challenge looks like for the YouTube audience.
But maybe for people listening on audio, can you just describe what an example ARC challenge would look like? Sure. One ARC puzzle looks kind of like an IQ test puzzle. You've got a number of demonstration input-output pairs. So one pair is made of two grids. One grid shows you an input, and the second grid shows you what you should produce as a response to that input.
And you get a couple pairs like this to demonstrate the nature of the task, to demonstrate what you're supposed to do with your inputs, and then you get a new test input. And your job is to produce the corresponding test outputs. You look at the demonstration pairs, and from that you figure out what you're supposed to do, and you show that you've understood it on this new test pair.
And importantly, the sort of knowledge basis that you need to approach these challenges is just core knowledge. And core knowledge is basically the knowledge of what makes an object, basic counting, basic geometry, topology, symmetries, that sort of thing. So extremely basic knowledge. Any child surely possesses such knowledge. And what's really interesting is that each puzzle is new.
So it's not something that you're going to find elsewhere on the internet, for instance. And that means that whether you're a human or a machine, you have to approach every puzzle from scratch. You have to actually reason your way through it. You cannot just fetch the response from your memory.
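To make the puzzle format and the "discrete program search" idea mentioned earlier concrete, here is a minimal sketch in Python. The toy task, its hidden rule (mirror each row), and the three grid primitives are all assumptions made up for illustration; the `train`/`test` and `input`/`output` keys follow the layout of the public ARC JSON files as far as I know, but nothing here is Chollet's or any competitor's actual code.

```python
from itertools import product

# A toy task in the spirit of ARC: each grid is a list of rows, each cell a symbol 0-9.
# The hidden rule (an assumption for this example) is "mirror each row left to right".
task = {
    "train": [
        {"input": [[1, 0, 0], [2, 0, 0]], "output": [[0, 0, 1], [0, 0, 2]]},
        {"input": [[0, 3], [4, 0]], "output": [[3, 0], [0, 4]]},
    ],
    "test": [{"input": [[5, 0, 0, 0], [0, 6, 0, 0]]}],
}

# Hypothetical building blocks; a real solver would need a far richer set.
def flip_horizontal(grid):
    return [row[::-1] for row in grid]

def flip_vertical(grid):
    return grid[::-1]

def transpose(grid):
    return [list(col) for col in zip(*grid)]

PRIMITIVES = {
    "flip_horizontal": flip_horizontal,
    "flip_vertical": flip_vertical,
    "transpose": transpose,
}

def run(program, grid):
    """Apply a sequence of primitive names to a grid, left to right."""
    for name in program:
        grid = PRIMITIVES[name](grid)
    return grid

def search(train_pairs, max_depth=2):
    """Brute-force the shortest primitive sequence consistent with every demonstration pair."""
    for depth in range(1, max_depth + 1):
        for program in product(PRIMITIVES, repeat=depth):
            if all(run(program, pair["input"]) == pair["output"] for pair in train_pairs):
                return program
    return None

program = search(task["train"])
prediction = run(program, task["test"][0]["input"])
print(program, prediction)
```

The point of the sketch is only that the solution program is found from the demonstration pairs at solve time rather than looked up; the search space here is tiny, whereas real ARC tasks would require many more primitives and much smarter search.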
So on the core knowledge, one contention here is that we are only now getting multimodal models which, because of the data they are trained on, are trained to do spatial reasoning, whereas obviously not only humans but, over billions of years of evolution, our ancestors have had to learn how to understand abstract physical and spatial properties and recognize the patterns there.
And so one view would be that in the next year, as we gain models that are multimodal native, where it isn't just a sort of second-class add-on but the multimodal capability is a priority, they will understand these kinds of patterns because that's something they see natively. Whereas right now what the model sees in ARC is some JSON string of 1, 0, 0, 1, 0, 0, and it's supposed to recognize a pattern there.
And even if you showed a human such a sequence of these kinds of numbers, they would have a challenge making sense of what kind of question you're asking. So why wouldn't it be the case that as soon as we get multimodal models, which we're on the path to unlock right now, they're going to be so much better at ARC-type spatial reasoning? That's an empirical question. So I guess we're going to see the answer within a few months.
But my answer to that is, you know, ARC grids are just discrete 2D grids of symbols, and they're pretty small. If you flatten an image as a sequence of pixels, for instance, then you get something that's actually very, very difficult to parse. That's not true for ARC, because the grids are very small. You only have 10 possible symbols. So these 2D grids are actually very easy to flatten as sequences, and transformers, LLMs, are very good at processing sequences.
In fact, you can show that LLMs do fine with processing ARC-like data by simply fine-tuning an LLM on some subset of the tasks and then trying to test it on small variations of these tasks. And you see that, yeah, the LLM can encode just fine solution programs for tasks that it has seen before. So it does not really have a problem parsing the input or figuring out the program. The reason why LLMs don't do well on ARC is really just the unfamiliarity aspect.
The fact that each new task is different from every other task. You cannot memorize the solution programs in advance. You have to synthesize a new solution program on the fly for each new task, and that's really what LLMs are struggling with. So before I do more devil's advocate, I just want to step back and explain why I'm especially interested in having this conversation, and obviously the million-dollar ARC Prize.
I'm excited to actually play with ARC myself. And hopefully, like the Vesuvius Challenge, which was Nat Friedman's prize for decoding the scrolls that were buried by the volcano in the Herculaneum library, and which was solved by a 22-year-old who was listening to the podcast, Luke Farritor, somebody listening will find this challenge intriguing and find a solution.
So I've recently had on a lot of people who are bullish on LLMs, and before interviewing you I've had discussions with them about how to explain the fact that LLMs don't seem to be natively performing that well on ARC. And I found their explanations somewhat contrived, and I'll try out some of the reasons on you.
But it is actually an intriguing fact that some of these problems are relatively straightforward for humans to understand, and the models do struggle with them if you just input them natively. All of them are very easy for humans. Like any smart human should be able to do 90%, 95% on ARC. Smart human. Smart human. But even a five-year-old, so with very little knowledge, could definitely do over 50%.
Hmm. So let's talk about that because you... I agree that smart humans will do very well on this test. But the average human will probably be mediocre. Not really. We actually tried with average humans. They score about 85. That was with Amazon Mechanical Turk workers. Right. I honestly don't know the demographic profile of Amazon Mechanical Turk workers, but I imagine just interacting with the platform that Amazon has set up to do remote work.
That's not the median human across the planet, I'm guessing. I mean, the broader point here being that... So we see this spectrum in humans, where humans obviously have AGI. But even within humans, you see a spectrum where some people are relatively dumber and they'll perform worse on IQ-like tests. For example, Raven's Progressive Matrices. If you look at how the average person performs on that and you look at the kind of questions that are sort of hit-or-miss.
Half of the people will get it right, half of the people will get it wrong. Some of them are pretty trivial. For us, we might think this is kind of trivial. And so humans have AGI, but with relatively small tweaks, you can go from somebody who misses these kinds of basic IQ test questions to somebody who gets them all right, which suggests that actually, if these models are doing natively... We'll talk about some of the previous performances that people have achieved with these models.
But somebody, Jack Cole, with a 240 million parameter model got 35%. Doesn't that suggest that they're on this spectrum that clearly exists within humans, and they're going to saturate it pretty soon? Yeah, so that's a bunch of interesting points here. So there is indeed a branch of LLM approaches, spearheaded by Jack Cole, that are doing quite well, that are, in fact, state of the art. But you have to look at what's going on there. So there are two things.
The first thing is that to get these numbers, you need to pre-train your LLM on millions of generated ARC tasks. And of course, if you compare that to a five-year-old child looking at ARC for the first time, the child has never done an IQ test before, has never seen something like an ARC task before. The only overlap between what they know and what they have to do in the test is core knowledge, knowing about counting and objects and symmetries and things like that.
And still, they're going to do really well. They're going to do much better than the LLM trained on millions of similar tasks. And the second thing to note about the Jack Cole approach is that one thing that's really critical to making the model work at all is test-time fine-tuning. And that's something that's really missing, by the way, from LLM approaches right now. You know, most of the time when you're using an LLM, it's just doing static inference.
The model is frozen and you're just prompting it and then you're getting an answer. So the model is not actually learning anything on the fly. Its state is not adapting to the task at hand. And what Jack Cole is actually doing is that for every test problem, he is fine-tuning on the fly a version of the LLM for that task. And that's really what's unlocking performance. If you don't do that, you get like 1%, 2%. So basically, something completely, completely negligible.
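The difference between static inference and test-time fine-tuning can be illustrated with a deliberately tiny sketch. This is not Jack Cole's method and involves no LLM: the "model" is a single hypothetical weight `w`, and the task, learning rate, and step count are assumptions chosen so the example runs in plain Python.

```python
def predict(w, x):
    """A one-parameter stand-in for a model: y = w * x."""
    return w * x

def fine_tune(w, demos, lr=0.05, steps=200):
    """Return a task-specific copy of w, adapted by gradient descent on squared error."""
    for _ in range(steps):
        grad = sum(2 * (predict(w, x) - y) * x for x, y in demos) / len(demos)
        w = w - lr * grad
    return w

pretrained_w = 1.0                # weight frozen after "pre-training" (hypothetical value)
demos = [(1.0, 3.0), (2.0, 6.0)]  # demonstration pairs for a new task: y = 3 * x
test_x = 4.0

# Static inference: the frozen model ignores the demonstrations entirely.
static_answer = predict(pretrained_w, test_x)

# Test-time fine-tuning: adapt a copy of the weight to this task, then predict.
adapted_w = fine_tune(pretrained_w, demos)
adapted_answer = predict(adapted_w, test_x)

print(static_answer, round(adapted_answer, 6))
```

The frozen weight is left untouched, so each new task would start again from the same pre-trained state and get its own adapted copy, which is the per-task adaptation being described.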
And if you do test-time fine-tuning and you add a bunch of tricks on top, then you end up with interesting performance numbers. So I think what he's doing is trying to address one of the key limitations of LLMs today, which is the lack of active inference. It's actually adding active inference to LLMs, and that's working extremely well actually. So that's fascinating to me. There are so many interesting rabbit holes there. Should I take them in sequence or deal with them all at once?
Let me just start. So the point you made about the fact that you need to unlock the adaptive compute, slash test-time compute. I think this will be an interesting rabbit hole to explore with you, because a lot of the scaling maximalists share your broader perspective, in the sense that they think that in addition to scaling, you need these kinds of things, like unlocking adaptive compute or doing some sort of RL, to get the system working.
And their perspective is that this is a relatively straightforward thing that will be added on top of the representations that a scaled-up model has greater access to. No, it's not just a technical detail. It's not a straightforward thing. It is everything. It is the important part.
And the scale maximalist argument essentially boils down to this: these people refer to scaling laws, which is this empirical relationship that you can draw between how much compute you spend on training a model and the performance you're getting on benchmarks. And the key question here, of course, is, well, how do you measure performance? What is it that you're actually improving by adding more compute and more data? And well, it's benchmark performance.
And the thing is, the way you measure performance is not a technical detail. It's not an afterthought, because it's going to narrow down the sort of questions that you're asking. And so accordingly, it's going to narrow down the sort of answers that you're looking for. If you look at the benchmarks we're using for LLMs, they're all memorization-based benchmarks. Like sometimes they're literally just knowledge-based, like a school test.
And even if you look at the ones that are explicitly about reasoning, you realize, if you look closely, that in order to solve them, it's enough to memorize a finite set of reasoning patterns. And then you just reapply them. They're like static programs. LLMs are very good at memorizing static programs, small static programs. And they've got this sort of bank of solution programs. And when you give them a new puzzle, they can just fetch the appropriate program and apply it.
And then you can actually look at the models. They are big parametric curves fitted to the data distribution with gradient descent. So they're basically these big interpolative databases, interpolative memories.
And of course, if you scale up the size of your database and you cram into it more knowledge, more patterns and so on, you are going to be increasing its performance as measured by a memorization benchmark. That's kind of obvious. But as you're doing it, you are not increasing the intelligence of the system one bit. You are increasing the skill of the system. You are increasing its usefulness, its scope of applicability, but not its intelligence, because skill is not intelligence.
And that's the fundamental confusion that people run into: they're confusing skill and intelligence. Yeah, there are a lot of fascinating things to talk about here. So skill, intelligence, interpolation. I mean, okay, so the thing about them fitting some manifold that maps the input data. There's a reductionist way to talk about what happens in the human brain that says that it's just axons firing at each other.
But we don't care about the reductionist explanation of what's happening. We care about the sort of meta, macroscopic level, what happens when these things combine. As far as the interpolation goes, okay, let's look at one of the benchmarks here. There's one benchmark that does grade school math. And these are problems that a smart high schooler would be able to solve. It's called GSM8K. And these models get 95% on these.
Like basically, they always nail it. That's a memorization benchmark. Okay, let's talk about what that means. So here's one question from that benchmark. So 30 students are in a class. One fifth of them are 12-year-olds. One third are 13-year-olds. One tenth are 11-year-olds. How many of them are not 11, 12 or 13 years old? So I agree, like this is not rocket science, right?
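For reference, the arithmetic behind that question can be written out in a few lines of Python. Only the numbers come from the question as quoted; the variable names are mine.

```python
from fractions import Fraction

class_size = 30
# Share of the class at each stated age, as given in the question.
shares = {12: Fraction(1, 5), 13: Fraction(1, 3), 11: Fraction(1, 10)}

# Number of students at each stated age: 6 twelve-year-olds, 10 thirteen-year-olds, 3 eleven-year-olds.
counts = {age: int(share * class_size) for age, share in shares.items()}

# Everyone else is not 11, 12 or 13 years old.
others = class_size - sum(counts.values())
print(counts, others)
```

So 6 + 10 + 3 = 19 students fall into the three stated age groups, leaving 11 who do not.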
You can write down on paper how you go through this problem, and a high school kid, at least a smart high school kid, should be able to solve it. Now, when you say memorization, it still has to reason through how to think about fractions, and what is the context of the whole problem, and then combine the different calculations it's doing. It depends on how you want to define reasoning, but there are two definitions you can use. So one is, I have available a set of program templates.
It's like the structure of the puzzle, which can also generate its solution. And I'm just going to identify the right template, which is in my memory. I'm going to input the new values into the template, run the program, get the solution. And you could say this is reasoning. And I say, yeah, sure, okay.
But another definition you can use is that reasoning is the ability to, when you're faced with a puzzle, given that you don't already have a program in memory to solve it, synthesize on the fly a new program based on bits and pieces of existing programs that you have. You have to do on-the-fly program synthesis. And it's actually dramatically harder than just fetching the right memorized program and reapplying it.
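The two definitions can be contrasted in a small sketch. Everything here is hypothetical: the two stored "templates", the example task, and the brute-force composition are stand-ins chosen for brevity, not a claim about how LLMs or people actually implement either kind of reasoning.

```python
from itertools import product

# A memory of ready-made programs (hypothetical examples).
TEMPLATES = {
    "double": lambda x: x * 2,
    "increment": lambda x: x + 1,
}

def solve_by_retrieval(examples):
    """Definition one: return the name of a stored template that fits every example, if any."""
    for name, fn in TEMPLATES.items():
        if all(fn(x) == y for x, y in examples):
            return name
    return None

def apply_sequence(names, x):
    """Run the named building blocks one after another."""
    for name in names:
        x = TEMPLATES[name](x)
    return x

def solve_by_synthesis(examples, max_len=3):
    """Definition two: compose stored pieces into a new program that fits every example."""
    for length in range(1, max_len + 1):
        for names in product(TEMPLATES, repeat=length):
            if all(apply_sequence(names, x) == y for x, y in examples):
                return names
    return None

# The rule is y = 2 * (x + 1); no single stored template matches it.
examples = [(1, 4), (3, 8)]
retrieved = solve_by_retrieval(examples)
synthesized = solve_by_synthesis(examples)
print(retrieved, synthesized)
```

Retrieval comes back empty because nothing in memory fits, while synthesis builds a new two-step program out of the same stored pieces. The search cost grows exponentially with program length, which is one way to see why the second kind of reasoning is so much harder.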
So I think maybe we are overestimating the extent to which humans are so sample efficient. They also need training in this way, where they have to drill in these kinds of pathways of reasoning through certain kinds of problems. So let's take math, for example. Yeah. It's not like you can just show a baby the axioms of set theory and now they know math, right? So when they're growing up, you have to do years of teaching them pre-algebra.
Then you've got to do a year of teaching them, doing drills and going through the same kind of problem in algebra, then geometry, precalculus, calculus. Absolutely. So it's training? Yeah. Isn't that like the same kind of thing, where you can't just see one example and now you have the program or whatever? You actually had to drill it. And then you also had to drill with a bunch of training data.
Sure. I mean, in order to do on-the-fly program synthesis, you actually need building blocks to work from. So knowledge and memory are actually tremendously important in the process. I'm not saying it's memory versus reasoning. In order to do effective reasoning, you need memory. But it sounds like it's compatible with your story that through seeing a lot of different kinds of examples, these things can learn to reason within the context of those examples.
And we can also see it with bigger and bigger models. So that was an example of a high school level math problem. Let's say a model that's smaller than GPT-3 couldn't do that at all. As these models get bigger, they seem to be able to pick up bigger and bigger... It's not really a size issue. It's more like a training data issue in this case.
Well, bigger models can pick up these kinds of circuits, which smaller models apparently don't do a good job of picking up, even if you were to train them on this kind of data. Doesn't that just suggest that if you have bigger and bigger models, they can pick up bigger and bigger pathways or more general ways of reasoning? Absolutely. But then isn't that intelligence? No, no, it's not.
If you scale up your database and you keep adding to it more knowledge, more program templates, then sure, it becomes more and more skillful. You can apply it to more and more tasks. But general intelligence is not task-specific skill scaled up to many skills. Because there is an infinite space of possible skills, general intelligence is the ability to approach any problem, any skill, and then quickly master it using very little data.
Because this is what makes you able to face anything you might ever encounter. This is the definition of generality. Generality is not specificity scaled up. It is the ability to apply your mind to anything at all, to arbitrary things. And this fundamentally requires the ability to adapt, to learn on the fly efficiently. So my claim is that by doing this pre-training on bigger and bigger models, you are gaining that capacity to then generalize very efficiently.
Let me give you an example. So your own company, Google, in their paper on Gemini 1.5, had this very interesting example where they would give the model, in context, the grammar book and the dictionary of a language that has fewer than 200 living speakers. So it's not in the pre-training data.
And you just give it the dictionary, and it basically is able to speak this language and translate to it, including the complex and organic ways in which language is structured. So a human, if you showed me a dictionary from English to Spanish, I'm not going to be able to pick up how to structure sentences and how to say things in Spanish. The fact is that, because of the representations that it has gained through this pre-training, it is now able to learn a new language extremely efficiently.
Doesn't that show that this kind of training actually does increase your ability to learn new tasks? If you were right, LLMs would do really well on ARC puzzles, because ARC puzzles are not complex. Each one of them requires very little knowledge. Each one of them is very low on complexity. You don't need to think very hard about it. They're actually extremely obvious for humans, like even children can do them. But LLMs cannot.
Even LLMs that have 100,000 times more knowledge than you do still cannot. The only thing that makes ARC special is that it was designed with this intent to resist memorization. This is the only thing. This is the huge blocker for LLM performance. If you look at LLMs closely, it's pretty obvious that they're not really synthesizing new programs on the fly to solve the task that they're faced with. They're very much reapplying things that they've stored in memory.
For instance, one thing that's very striking is that LLMs can solve a Caesar cipher. Like a Caesar cipher, transposing letters to code a message. That's a very complex algorithm, but it comes up quite a bit on the internet, so they've basically memorized it. What's really interesting is that they can do it for a transposition length of like three or five, because those are very, very common numbers in examples provided on the internet.
But if you try to do it with an arbitrary number, like nine, it's going to fail. Because it does not encode the generalized form of the algorithm, but only specific cases. It has memorized specific cases of the algorithm. If it could actually synthesize the solver algorithm on the fly, then the value of N would not matter at all, because it does not increase the problem complexity. I think this is true of humans as well. What was the study that you've been using?
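For readers unfamiliar with the Caesar cipher point, here is a minimal sketch of the generalized algorithm, in which the shift N is just a parameter. The function name and the sample message are my own; the shifts 3, 5 and 9 are the ones mentioned in the conversation.

```python
import string

def caesar_shift(text, n):
    """Shift each ASCII letter by n positions, wrapping around the alphabet."""
    result = []
    for ch in text:
        if ch in string.ascii_lowercase:
            result.append(chr((ord(ch) - ord("a") + n) % 26 + ord("a")))
        elif ch in string.ascii_uppercase:
            result.append(chr((ord(ch) - ord("A") + n) % 26 + ord("A")))
        else:
            result.append(ch)  # leave spaces, digits and punctuation untouched
    return "".join(result)

message = "Hello, ARC!"
for n in (3, 5, 9):
    encoded = caesar_shift(message, n)
    decoded = caesar_shift(encoded, -n)  # decoding is the same procedure with the opposite shift
    print(n, encoded, decoded)
```

The code path is identical for every N, which is the sense in which a system that had encoded the general algorithm would find a shift of nine no harder than a shift of three.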
Humans use memorization and pattern matching all the time, of course, but humans are not limited to memorization and pattern matching. They have this very unique ability to adapt to new situations on the fly. This is exactly what enables you to navigate every new day in your life. I'm forgetting the details, but there was some study that chess grandmasters will perform very well within the context of the moves that...
Excellent example, because chess at the highest level is all about memorization, chess memorization. Sure. We can leave that aside. What is your explanation for the original question of why, in context, Gemini 1.5 was able to learn a language, including the complex grammar structure? Doesn't that show that they can pick up new knowledge? I would assume that it has simply mined from its extremely extensive, unimaginably vast training data.
It has mined the required template and then it's just reusing it. We know that they have a very poor ability to synthesize new programs like this on the fly, or even adapt existing ones. They're very much limited to fetching. Suppose there's a programmer at Google, and they go into the office in the morning. At what point are they doing something that 100% cannot be due to fetching a template? Something that, if they were an LLM, they could not do by fetching some template from the program?
At what point do they have to use this so-called extreme generalization capability? Forget about Google software developers. Every human, every day of their lives, is full of novel things that they have not been prepared for. You cannot navigate your life based on memorization alone. It's impossible. I'm denying the premise. You also agree they're not doing, quote-unquote, memorization.
It seems like you're saying they're less capable of generalization, but I'm just curious about the kind of generalization they do. If you get into the office and you try to do this kind of generalization, you're going to fail at your job. What is the first point? You're a programmer. What is the first point when you try to do that generalization and you would lose your job because you can't do the extreme generalization?
I don't have any specific examples, but literally, take this situation, for instance. You've never been here in this room. Maybe you've been in this city a few times, I don't know. But there's a fair amount of novelty. You've never been interviewing me. There's a fair amount of novelty every hour of every day in your life. It's, in fact, by and large, more novelty than any LLM could handle.
Like if you just put an LLM in a robot, it could not be doing all the things that you've been doing today. Right? Or take, I don't know, self-driving cars, for instance. You take a self-driving car operating in the Bay Area. Do you think you could just drop it in New York City, or drop it in London, where people drive on the left? No, it's going to fail. So not only can you not make it generalize to a change of driving rules, but you cannot even make it generalize to a new city.
It needs to be trained on each specific environment. I mean, I agree that self-driving cars aren't AGI. But it is the same type of model. They are transformers as well. I mean, apes also have brains with neurons in them, but they're less intelligent because they're small. They're small. They're not the same object. We can get into that. But I still don't understand, like, a concrete thing. We also need training. That's why education exists.
So we had to spend the first 18 years of our life doing drills. We have a memory, but we are not a memory. We are not limited to just the memory. I'm denying the premise that that's necessarily the only thing these models are doing. And I'm still not sure what is the task that a remote worker would have to... like, suppose you just have a remote worker who is an LLM, and they're a programmer. What is the first point at which you'd realize this is not a human, this is an LLM?
What about I just send them an ARC puzzle and see how they do? That's not part of their job, you know? But you have to deal with novelty all the time. Okay. So is there a world in which all the programmers are replaced, and then we're still saying they're only doing memorization-laden programming tasks, but they're still producing a trillion dollars' worth of output in the form of code?
Software development is actually a very good example of a job where you're dealing with novelty all the time. Or if you're not, well, I'm not sure what you're doing. So I personally use generative AI very little in my software development job. And before LLMs, I was also using Stack Overflow very little. You know, some people maybe are just copy-pasting stuff from Stack Overflow, or nowadays copy-pasting stuff from an LLM. Personally, I try to focus on problem solving.
The syntax is just a technical detail. What's really important is the problem solving. The essence of programming is engineering mental models, like mental representations of the problem you're trying to solve. But you know, many people can interact with these systems themselves, and you can go to ChatGPT and say, here's the specification of the kind of program I want. It'll build it for you.
As long as there are many examples of this program on GitHub and Stack Overflow and so on, sure. They will fetch the program for you from their memory. But you can change arbitrary details: no, it doesn't work, I need it to work on this different kind of server. If that were true, there would be no software engineers today. I agree we're not at full AGI yet, in the sense that these models have, let's say, less than a trillion parameters.
A human brain has somewhere on the order of 10 to 30 trillion synapses. I mean, if you were just doing some naive math, you're at least 10x under-parameterized. So I agree we're not there yet. But I'm sort of confused about why we're not on the spectrum where, yes, I agree that there are many kinds of generalization they can't do. But it seems like they're on this kind of smooth spectrum that we see even within humans, where some humans would have a hard time doing an ARC-type test.
We see that based on the performance on Raven's Progressive Matrices-type IQ tests. I'm not a fan of IQ tests because, for the most part, you can train on IQ tests and get better at them. So they are very much memorization-based. And this is actually the main pitfall that ARC tries not to fall into. But if all remote jobs are automated in the next five years, let's say, at least the ones that don't require you to be, like, sort of a service.
It's not like a salesperson, where you want the human to be talking, but like programming, whatever. In that world, would you say that that's not possible, because a lot of what a programmer needs to do definitely requires things that would not be in any pre-training corpus? Sure. I mean, in five years, there will be more software engineers than there are today. That's true. But I just want to understand. So I'm not sure, I mean, I studied computer science.
I think if I had become a code monkey out of college, like, what would I be doing? I go to my job. What is the first thing my boss tells me to do? When does he realize I'm an LLM, if I was an LLM? Probably on the first day, you know? Again, if it's true that LLMs can generalize to novel problems like this and can actually develop software to solve a problem they've never seen before, you would not need software engineers anymore.
In practice, if I look at how people are using LLMs in their software engineering jobs today, they are using them as a Stack Overflow replacement. So they are using them as a way to copy-paste code snippets to perform very common actions. What they actually need is a database of code snippets. They don't actually need any of the abilities that actually make them software engineers.
When we talk about interpolating between Stack Overflow databases, if you look at the kinds of math problems or coding problems... maybe let's step back on interpolation and let me ask the question this way. Why isn't creativity just interpolation in a higher dimension, where a bigger model can learn a more complex manifold, if we're going to use the ML language? If you read a biography of a scientist, right?
They're not zero-shotting new scientific theories. They're playing with existing ideas. They're trying to juxtapose them in their head. Somewhere in the tree of intellectual descendants, they try out a slightly different evolutionary path. You sort of run the experiment there in terms of publishing the paper, whatever. It seems like a similar kind of thing humans are doing. It's just at a higher level of generalization.
And what you see across bigger and bigger models is that they seem to be approaching higher and higher levels of generalization. So GPT-2 couldn't do a grade-school-level math problem that requires more generalization than it has capability for, even for that skill, and GPT-3 and 4 can. So, not quite. GPT-4 has a higher degree of skill and a higher range of skills, but it's the same degree of generalization.
I don't want to get into semantics here, but the question is, why can't creativity be just interpolation in a higher dimension? I think interpolation can be creative, absolutely. And to your point, I do think that on some level humans also do a lot of memorization, a lot of reciting, a lot of pattern matching, a lot of interpolation as well. So it's very much a spectrum between pattern matching and true reasoning. And humans are never really at one end of the spectrum.
They are never doing pure pattern matching or pure reasoning. They are usually doing some mixture of both. Even if you're doing something that's very reasoning-heavy, like proving a mathematical theorem, as you're doing it, sure, you're doing quite a bit of discrete search in your mind, quite a bit of actual reasoning. But you're also very much guided by intuition, guided by pattern matching, guided by the shape of proofs that you've seen before, by your knowledge of mathematics.
So it's never really, you know, one or the other. All of our thoughts, everything we do, is a mixture of this sort of interpolative, memorization-based thinking, this sort of type one thinking, and type two thinking. Why are bigger models more sample efficient? Because they have more reusable building blocks that they can lean on to pick up new patterns in their training data. And does that pattern keep continuing as you keep getting bigger and bigger?
To the extent that the new patterns you're giving the model to learn are a good match for what it has learned before. If you present something that's actually novel, that is not in the training distribution, like an ARC puzzle for instance, it will fail. Let me make this claim. Program synthesis is, I think, a very, very useful intuition pump.
Why can't it be the case that what's happening in the transformer is that the early layers are figuring out how to represent the input tokens, and what the middle layers do is this kind of program search, program synthesis, where they combine the inputs to all the circuits in the model, where they go from the low-level representation to a higher-level representation near the middle of the model? They use these programs and they combine these concepts.
Then what comes out the other end is the reasoning based on that high-level intelligence. Possibly, why not? But, you know, if these models were actually capable of synthesizing novel programs, however simple, they should be able to do ARC. Because for any ARC task, if you write down the solution program in Python, it's not a complex program. It's extremely simple. And humans can figure that out. So why can't LLMs do it? Okay, I think that's a fair point.
And if I turn the question around to you, so suppose that it's the case that in a year, a multimodal model can solve ARC, let's say at 80%, whatever the average human would get. Then AGI? Quite possibly, yes. I think if you start. Honestly, what I would like to see is an LLM-type model solving ARC at like 80%, but after having only been trained on core-knowledge-related stuff. But human kids, I don't think we're necessarily just rated none. It's not just that we have an RG. So what's okay with it?
Let me rephrase that. Yeah, only trained on information that is not explicitly trying to anticipate what's going to be in the ARC test set. But isn't the whole point of ARC that you can't do that, that it's a new type of intelligence test? Yes, that is the point. So if ARC were a perfect, flawless benchmark, it would be impossible to anticipate what's in the test set. And ARC was released more than four years ago. And so far, it's been resistant to memorization.
So I think it has to some extent passed the test of time. But I don't think it's perfect. I think if you try to make by hand hundreds of thousands of ARC tasks, and then you try to multiply them by programmatically generating variations, then you end up with maybe hundreds of millions of tasks. Just by brute-forcing the task space, there will be enough overlap between what you've trained on and what's in the test set that you can actually score very highly.
So, you know, with enough scale, you can always cheat. If you can do this for every single thing that supposedly requires intelligence, then what good is intelligence? Apparently, you can just brute-force intelligence. If the world, if your life, were a static distribution, then sure, you could just brute-force the space of possible behaviors. You know, the way I would think about intelligence, there are several metaphors I like to use.
But one of them is, I can think of intelligence as a pathfinding algorithm in future situation space. I don't know if you're familiar with game development, like RTS game development, but you have a map. You have a 2D map. You have partial information about it. There is some fog of war on your map. There are areas that you haven't explored yet. You know nothing about them. And then there are areas that you've explored, but you only know how they were in the past.
You don't know what they are like today. And now instead of thinking about a 2D map, think about the space of possible future situations that you might encounter and how they're connected to each other. Intelligence is a pathfinding algorithm. So once you set a goal, it will tell you how to get there optimally. But of course, it's constrained by the information you have. It cannot pathfind in an area that you know nothing about. It also cannot anticipate changes.
And the thing is, if you had complete information about the map, then you could solve the pathfinding problem by simply memorizing every possible path, every mapping from point A to point B. You could solve the problem with pure memory. But the reason you cannot do that in real life is because you don't actually know what's going to happen in the future. Life is ever changing. I feel like you're using words like memorization which we would never use for human children.
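The map metaphor above can be made concrete with a small sketch. This is only an illustration under stated assumptions, not anything from the conversation itself: the grid maps, the `bfs_path` helper, and the `memorized` lookup table are all hypothetical names. It contrasts a path memorized on yesterday's map with a path searched for on today's map.

```python
from collections import deque

def bfs_path(grid, start, goal):
    """Shortest path on a grid of 0 (free) / 1 (wall) cells, or None if unreachable."""
    rows, cols = len(grid), len(grid[0])
    parents = {start: None}
    queue = deque([start])
    while queue:
        cell = queue.popleft()
        if cell == goal:
            # Walk the parent links back to the start.
            path = []
            while cell is not None:
                path.append(cell)
                cell = parents[cell]
            return path[::-1]
        r, c = cell
        for nxt in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            nr, nc = nxt
            in_bounds = 0 <= nr < rows and 0 <= nc < cols
            if in_bounds and grid[nr][nc] == 0 and nxt not in parents:
                parents[nxt] = cell
                queue.append(nxt)
    return None

def path_is_valid(grid, path):
    """A stored path still works only if every cell on it is still free."""
    return all(grid[r][c] == 0 for r, c in path)

# Yesterday's map: the only route from top-left to bottom-left goes around the right side.
old_map = [[0, 0, 0],
           [1, 1, 0],
           [0, 0, 0]]
# "Pure memory": store the answer for a known (start, goal) pair.
memorized = {((0, 0), (2, 0)): bfs_path(old_map, (0, 0), (2, 0))}

# Today's map: the walls have moved.
new_map = [[0, 0, 0],
           [0, 1, 1],
           [0, 0, 0]]
stale = memorized[((0, 0), (2, 0))]          # recalled from memory
fresh = bfs_path(new_map, (0, 0), (2, 0))    # recomputed by search

print("memorized path still valid:", path_is_valid(new_map, stale))
print("searched path:", fresh)
```

The lookup table is perfectly adequate as long as the map never changes; once it does, only the search procedure still produces a usable route.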
If you're a kid who learns to do algebra and now learns to do calculus, you wouldn't say they've memorized calculus. If they can just solve any arbitrary algebraic problem, you wouldn't say they've memorized algebra. You'd say they've learned algebra. Humans are never doing pure memorization or pure reasoning. But that's only because you're semantically labeling what the human does as a skill.
But it's memorization when the exact same skill is done by the LLM, as you can measure by these benchmarks. And you can just plug in any sort of math problem. Sometimes humans are doing exactly the same thing the LLM is doing. For instance, if you learn to add numbers, you're memorizing an algorithm. You're memorizing a program and then you can reapply it. You are not synthesizing the addition program on the fly. So obviously at some point, some human had to figure out how to do addition.
But the way a kid learns it is not that they sort of figure out from the axioms of set theory how to do addition. I think what you're learning is mostly memorization. Right. Yeah. So my claim is that, listen, these models are vastly underparameterized relative to how many flops or how many parameters you have in the human brain. And so, yeah, they're not going to be coming up with new theorems like the smartest humans can. But most humans can't do that either.
What most humans do sounds similar to what you are calling memorization, which is memorizing skills or memorizing techniques that you've learned. And so it sounds like it's compatible. Tell me if this is wrong. Is it compatible with your world if all the remote workers are gone, but they're doing skills which we can potentially make synthetic data of?
So we record every single remote worker's screen, and we sort of understand the skills they're performing there. And now we've trained a model that can do all this. All the remote workers are unemployed, and we're generating trillions of dollars of economic activity from AI remote workers. In that world, are we still in the memorization regime? So sure, with memorization you can automate almost anything.
As long as it's a static distribution, as long as you don't have to deal with change. Are most jobs part of such a static distribution? Potentially there are lots of things that you can automate. And LLMs are an excellent tool for automation. But you have to understand that automation is not the same as intelligence. I'm not saying that LLMs are useless. I've been a huge proponent of deep learning for many years. And for many years I've been saying two things.
I've been saying that if you keep scaling up deep learning, it will keep paying off. And at the same time I've been saying, if you keep scaling up deep learning, this will not lead to AGI. So we can automate more and more things. And yes, this is economically valuable. And yes, potentially there are many jobs you could automate away like that, and that would be economically valuable. But you're still not going to have intelligence.
So you can ask, you know, okay, so what does it matter if we can generate all this economic value? Maybe you don't need intelligence after all. Well, you need intelligence the moment you have to deal with change, with novelty, with uncertainty. As long as you're in a space that can be exactly described in advance, you can just rely on pure memorization, right? In fact, you can always solve any problem.
You can always display arbitrary levels of skill on any task without leveraging any intelligence whatsoever, as long as it is possible to describe the problem and its solution very, very precisely. But when they do deal with novelty, then you just call it interpolation, right? And so... No, no, no. Interpolation is not enough to deal with all kinds of novelty. If it were, then LLMs would be AGI. Well, I agree they're not AGI.
I'm just trying to figure out how we figure out whether we're on the path to AGI. And I think sort of a crux here is maybe that it seems to me that these things are on a spectrum and we're clearly covering the earliest part of the spectrum with LLMs. I think so. And, oh, okay, interesting. But here's another sort of thing that I think is evidence for this. Grokking, right?
So clearly even within deep learning, there's a difference between the memorization regime and the generalization regime, where at first, they'll just memorize the dataset of, you know, if you're doing modular addition, how to add digits. And then, at some point, if you keep training on that, they'll learn the skill. So the fact that there is a distinction suggests that, for the generalizing circuit that deep learning can learn, there's a regime it enters where it generalizes.
If you have an overparameterized model, which you don't have in comparison to all the tasks we want these models to do right now. Grokking is a very old phenomenon. We've been observing it for decades. It's basically an instance of the minimum description length principle, where, sure, given a problem, you can just memorize a pointwise input-to-output mapping, which is completely overfit. So it does not generalize at all, but it solves the problem on the training data.
And from there, you can actually keep pruning it, keep making your mapping simpler and simpler and more compressed. And at some point, it will start generalizing. And so that's something called the minimum description length principle. It says that the program that will generalize best is the shortest. Right. And it doesn't mean that you're doing anything other than memorization, but you're doing memorization plus regularization. Right. AKA generalization.
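As a rough illustration of the two endpoints described above, here is a minimal sketch using modular addition. It does not train a network or reproduce grokking itself; it only compares a pointwise lookup table with a one-line rule. The modulus `P`, the train/test split, and the function names are assumptions made for the example.

```python
import itertools
import random

P = 7  # modulus for the toy task (arbitrary choice for this sketch)

# Every (a, b) pair, shuffled with a fixed seed and split into train/test.
all_pairs = list(itertools.product(range(P), repeat=2))
rng = random.Random(0)
rng.shuffle(all_pairs)
train, test = all_pairs[:30], all_pairs[30:]

# Hypothesis A: a pointwise input-to-output table. It needs one entry per example.
table = {(a, b): (a + b) % P for a, b in train}

def memorizer(a, b):
    """Returns the stored answer, or None for any pair it has never seen."""
    return table.get((a, b))

# Hypothesis B: a much shorter description that covers the same training data.
def rule(a, b):
    return (a + b) % P

def accuracy(f, pairs):
    """Fraction of pairs on which f matches true modular addition."""
    return sum(f(a, b) == (a + b) % P for a, b in pairs) / len(pairs)

print("table entries stored:", len(table))
print("memorizer train/test:", accuracy(memorizer, train), accuracy(memorizer, test))
print("rule      train/test:", accuracy(rule, train), accuracy(rule, test))
```

Both hypotheses fit the training pairs equally well; the difference only shows up on the held-out pairs, where the shorter description is the one that keeps working.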
Yeah. And that absolutely leads to generalization. Right. And then so you do that within one skill, but then the pattern you see here of meta-learning is that it's more efficient to store a program that can perform many skills rather than one skill, which is what we might call fluid intelligence. And so as you get bigger and bigger models, you would expect it to go up this hierarchy of generalization, where it generalizes to a skill, then it generalizes across multiple skills.
That's correct. That's correct. And you know, LLMs are not infinitely large. They have only a fixed number of parameters. And so they have to compress their knowledge as much as possible. And in practice, LLMs are mostly storing reusable bits of programs, like vector programs.
And because they have this need for compression, it means that every time they're learning a new program, they're going to try to express it in terms of existing bits and pieces of programs that they've already learned before. Right. Isn't this generalization? Absolutely. Oh, wait. So this is why, you know, clearly LLMs have some degree of generalization. And this is precisely why: it's because they have to compress. And why is that intrinsically limited?
Why can't you just say that at some point it has to learn a higher level of generalization, and a higher level, and then the highest level is fluid intelligence? It's intrinsically limited because the substrate of your model is a big parametric curve. And all you can do with this is local generalization. If you want to go beyond this, towards broader or even extreme generalization, you have to move to a different type of model. And my paradigm of choice is discrete program search, program synthesis.
And if you want to understand that, you can compare and contrast it with deep learning. So in deep learning, your model is a differentiable parametric curve. In program synthesis, your model is a discrete graph of operators. So you've got a set of logical operators, like a domain-specific language. You're picking instances of it, you're structuring that into a graph, and that's a program.
And that's actually very similar to a program you might write in Python or C++ and so on. And because we are doing machine learning here, we're trying to automatically learn these models. In deep learning, your learning engine is gradient descent, right? And gradient descent is very compute-efficient because you have this very strong, informative feedback signal about where the solution is.
So you can get to the solution very quickly, but it is very data-inefficient, meaning that in order to make it work, you need a dense sampling of the operating space. You need a dense sampling of the data distribution. And then you're limited to only generalizing within that data distribution. And the reason why you have this limitation is because your model is a curve. And meanwhile, if you look at discrete program search, the learning engine is combinatorial search.
You're just trying a bunch of programs until you find one that actually meets your spec. This process is extremely data-efficient. You can learn a generalizable program from just one example, two examples, which is why it works so well on ARC, by the way. But the big limitation is that it's extremely compute-inefficient because you're running into a combinatorial explosion, of course.
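To make "trying a bunch of programs until you find one that meets your spec" concrete, here is a minimal sketch of brute-force enumeration over a toy domain-specific language of grid operations. The three operators, the `search` function, and the example grids are all hypothetical; real ARC solvers use far richer DSLs and this is not any particular team's method.

```python
from itertools import product

# A tiny, made-up DSL of grid operators (grids are lists of lists).
def flip_h(g):
    return [row[::-1] for row in g]       # mirror left-right

def flip_v(g):
    return g[::-1]                        # mirror top-bottom

def transpose(g):
    return [list(r) for r in zip(*g)]     # swap rows and columns

DSL = {"flip_h": flip_h, "flip_v": flip_v, "transpose": transpose}

def run(program, grid):
    """A program is a tuple of operator names applied left to right."""
    for name in program:
        grid = DSL[name](grid)
    return grid

def search(examples, max_len=3):
    """Enumerate programs by increasing length; return the first one that fits every example."""
    tried = 0
    for length in range(1, max_len + 1):
        for program in product(DSL, repeat=length):
            tried += 1
            if all(run(program, x) == y for x, y in examples):
                return program, tried
    return None, tried

# One demonstration pair: the output is the input rotated 90 degrees clockwise.
examples = [([[1, 2], [3, 4]], [[3, 1], [4, 2]])]
program, tried = search(examples)
print("found:", program, "after", tried, "candidates")

# The synthesized program also applies to a grid of a different shape.
print(run(program, [[5, 6, 7], [8, 9, 0]]))
```

The single demonstration pair is enough to pin down a program that transfers to an unseen grid, which is the data-efficiency point. The cost is that the candidate count grows as `len(DSL) ** length`, so longer programs or larger DSLs quickly become intractable without some way to prioritize the search.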
And so you can sort of see here how deep learning and discrete program search have very complementary strengths and limitations. Every limitation of deep learning has a corresponding strength in program synthesis, and vice versa. And I think the path forward is going to be to merge the two.
So another way you can think about it is that these parametric curves trained with gradient descent are a great fit for everything that's system one type thinking, like pattern recognition, intuition, memorization, and so on. And discrete program search is a great fit for type two thinking, system two thinking: for instance, planning, reasoning, quickly figuring out a generalizable model that matches just one or two examples, like for an ARC puzzle, for instance.
And I think humans are never doing pure system one or pure system two. They're always mixing and matching both. And right now we have all the tools for system one. We have almost nothing for system two. The way forward is to create a hybrid system. And I think the form it's going to take is it's going to be mostly system two. So the outer structure is going to be a discrete program search system.
But you're going to fix the fundamental limitation of discrete program search, which is combinatorial explosion. You're going to fix it with deep learning. You're going to leverage deep learning to guide, to provide intuition in program space, to guide the program search. And I think that's very similar to what you see, for instance, when you're playing chess or when you're trying to prove a theorem, is that it's mostly a reasoning thing.
But you start out with some intuition about the shape of the solution. And that's very much something you can get via a deep learning model. Deep learning models are very much like intuition machines. They're pattern-matching machines. So you start from this shape of the solution. And then you're going to do actual explicit discrete program search. But you're not going to do it via brute force. You're not going to try things kind of randomly.
You're actually going to ask another deep learning model for suggestions. Like, here's the most likely next step. Here's where in the graph you should be going. And you can also use yet another deep learning model for feedback. Like, here's what I have so far. Is it looking good? Should I just backtrack and try something new? So I think discrete program search is going to be the key. But you want to make it dramatically better, orders of magnitude more efficient, by leveraging deep learning.
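The hybrid idea described here, a discrete search whose next step is chosen by a learned scorer, can be sketched as a best-first search. In this sketch the "deep learning model" is replaced by a hand-written stand-in, `mismatch`, which simply counts differing cells; that substitution, the toy DSL, and all names are assumptions for illustration only.

```python
import heapq

# Same style of toy DSL as a brute-force enumerator would use (hypothetical operators).
def flip_h(g):
    return [row[::-1] for row in g]

def flip_v(g):
    return g[::-1]

def transpose(g):
    return [list(r) for r in zip(*g)]

DSL = {"flip_h": flip_h, "flip_v": flip_v, "transpose": transpose}

def run(program, grid):
    for name in program:
        grid = DSL[name](grid)
    return grid

def mismatch(grid, target):
    """Stand-in for a learned scorer: number of differing cells (lower is better)."""
    if len(grid) != len(target) or len(grid[0]) != len(target[0]):
        return len(target) * len(target[0])  # wrong shape: treat every cell as wrong
    return sum(a != b for ra, rb in zip(grid, target) for a, b in zip(ra, rb))

def guided_search(x, y, max_len=4):
    """Best-first search: always expand the partial program the scorer likes most."""
    # Heap entries are (score, tie_breaker, program, current_grid).
    frontier = [(mismatch(x, y), 0, (), x)]
    counter = 1       # unique tie-breaker so the heap never compares grids
    expanded = 0
    while frontier:
        score, _, program, grid = heapq.heappop(frontier)
        expanded += 1
        if score == 0:
            return program, expanded
        if len(program) == max_len:
            continue  # depth limit reached: abandon this branch (backtrack)
        for name, op in DSL.items():
            new_grid = op(grid)
            entry = (mismatch(new_grid, y), counter, program + (name,), new_grid)
            heapq.heappush(frontier, entry)
            counter += 1
    return None, expanded

x = [[1, 2], [3, 4]]
y = [[3, 1], [4, 2]]  # x rotated 90 degrees clockwise
program, expanded = guided_search(x, y)
print("found:", program, "after expanding", expanded, "nodes")
```

The outer loop is still an explicit, discrete search with backtracking; the scorer only decides which branch to look at next. A cell-count heuristic is a weak guide on a task this small, so the sketch shows the structure of the approach rather than a speed-up.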
And by the way, another thing that you can use deep learning for is, of course, things like common-sense knowledge and knowledge in general. And I think you're going to end up with this sort of system where you have this on-the-fly synthesis engine that can adapt to new situations. But the way it adapts is that it's going to fetch from a bank of patterns, modules that could themselves be curves, differentiable modules, and some others that could be algorithmic in nature.
It's going to assemble them via this process that's intuition-guided. And for every new situation you might be faced with, it's going to give you a generalizable model that was synthesized using very, very little data. Something like this would solve ARC.
That's actually a really interesting prompt, because I think an interesting crux here is that when I talk to my friends who are extremely optimistic about LLMs and expect AGI within the next couple of years, they also, in some sense, agree that scaling is not all you need, but that the rest of the progress is undergirded and enabled by scaling. But you still need to add the system two and the test-time compute on top of these models.
And their perspective is that it's relatively straightforward to do that because you have this library of representations that you built up from pre-training. But it's almost like it's just skimming through textbooks; you need some more deliberate way in which it engages with the material it learns.
In-context learning is extremely sample-efficient, but to actually distill that into the weights, you need the model to talk through the things it sees and then add that back to the weights. As far as the system two goes, they talk about adding some kind of RL setup so that it is encouraged to proceed on the reasoning traces that end up being correct. And they think this is relatively straightforward stuff that will be added within the next couple of years. That's an empirical question.
So I think we'll see. Your intuition, I assume, is not that. I'm curious. My intuition is, in fact, that this whole system two architecture is the hard part. It's the very hard and not obvious part. Scaling up the interpolative memory is the easy part. It's literally just a big curve. All you need is more data. It's a representation of a dataset, an interpolative representation of a dataset. That's the easy part. The hard part is the architecture of intelligence.
Memory and intelligence are separate components. We have the memory. We don't have the intelligence yet. And I agree with you that, well, having the memory is actually very useful. And if you just had the intelligence, but it was not hooked up to an extensive memory, it would not be that useful because it would not have enough material to work from.
Yeah. The alternative hypothesis here, which a former guest, Trenton Bricken, advances, is that intelligence is just hierarchical associative memory, with higher-level patterns. When Sherlock Holmes goes into a crime scene, he's extremely sample-efficient. He can just look at a few clues and figure out who the murderer was. And the way he's able to do that is he has learned higher-level sorts of associations. It's memory in some fundamental sense. But so here's one way to ask the question.
In the brain, supposedly we do program synthesis, but it is just synapses connected to each other. And so physically, it's got to be that you just query the right circuit, right? You are. Yeah. It's a matter of degree.
But if you can learn it, if training in the environment that human ancestors were trained in means you learn those circuits, then training on the same kinds of outputs that humans produce, which, to replicate, require these kinds of circuits, wouldn't that train the same kind of whatever humans have? You know, it's a matter of degree. If you have a system that has a memory and is only capable of doing local generalization from that, it's not going to be very adaptable.
To be really general, you need the memory plus the ability to search to quite some depth, to achieve broader, even extreme generalization. And you know, one of my favorite psychologists, Jean Piaget, was the founder of developmental psychology. He had a very good quote about intelligence. He said, intelligence is what you use when you don't know what to do. And as a human living your life, in most situations, you already know what to do.
Because you've been in this situation before. You already have the answer, right? And you're only going to need to use intelligence when you're faced with novelty, with something you didn't expect, with something that you weren't prepared for, either by your own experience, your own life experience, or by your evolutionary history.
This day that you're living right now is different in some important ways from every day you've lived before, but it's also different from any day ever lived by any of your ancestors. And still, you're capable of being functional, right? How is it possible? I'm not denying that generalization is extremely important and is the basis for intelligence. That's not the crux. The crux is how much of that is happening in the models. But let me ask a separate question.
We might keep going in circles here. The differences in intelligence between humans. Maybe the intelligence tests, because of the reasons you mentioned, are not measuring it well, but clearly, there are differences in intelligence between different humans. Sure. What is your explanation for what's going on there? Because I think that's sort of compatible with my story that there's a spectrum of generality and that these models are climbing up to human level.
And even some humans haven't climbed up to the Einstein level, or the Francois level. That's a great question. You know, there is extensive evidence that differences in intelligence are mostly genetic in nature, right? Meaning that if you take someone who is not very intelligent, there is no amount of training data you can expose that person to that would make them become Einstein. And this kind of points to the fact that you really need a better architecture.
You need a better algorithm. And more training data is not in fact all you need. I think I agree with that. I think maybe the way I might phrase it is that the people who are smarter have, in ML language, better initializations. The neural wiring, if you just look at it, is more efficient. They have maybe greater density of firing. And so some part of the story is scaling; there are some correlations between brain size and intelligence.
And we also see, within the context of the quote-unquote scaling that people talk about with LLMs, architectural improvements, where a model like Gemini 1.5 Flash performs as well as GPT-4 did when GPT-4 was released a year ago, but is 57 times cheaper on output. So part of the scaling story is the architectural improvements; we're in extremely low-hanging-fruit territory when it comes to those. OK, we're back.
Now with the co-founder of Zapier, Mike Knoop. We had to restart a few times there. And you're funding this prize and you're running this prize with Francois. So tell me about how this came together. What prompted you guys to launch this prize? Yeah. I guess I've been sort of AI-curious for 13 years. I co-founded Zapier and have been running it for the last 13 years. And I think I first got introduced to your work during COVID. I kind of went down the rabbit hole.
I had a lot of free time. And it was right after you published your On the Measure of Intelligence paper, where you sort of introduced the concept of AGI as this, like, efficiency of skill acquisition being the right definition, and the ARC puzzles. But I don't think the first Kaggle contest was done yet. I think it was still running. And so it was interesting, but I just parked the idea.
And I had bigger fish to fry at Zapier, where we were in the middle of this big turnaround, trying to get to our second product. And then it was January 2022 when the chain-of-thought paper came out, and that really woke me up to sort of the progress. I had given a whole presentation at Zapier on the GPT-3 paper. I sort of felt like I had priced in everything that LLMs could do. And that paper was really shocking to me.
There are all these latent capabilities that LLMs have that I didn't expect them to have. And so I actually gave up my exec team role at Zapier. I was running half the company at that point. I went back to being an individual contributor, just to go do AI research alongside Brian, my co-founder. And ultimately that led me back towards ARC. I was looking into it again. And I had sort of expected to see the saturation effect that MMLU has, that GSM8K has.
And when I looked at the scores and the progress over the last four years, I was really, again, shocked to see that we've actually made very little objective progress towards it. And it felt like a really, really important eval. And as I sort of spent the last year asking people, quizzing people about it in my network and community, very few people even knew it existed.
And that felt like, okay, if it's right that this is a really, really globally, singularly unique AGI eval, and it's different from every other eval that exists, which more narrowly measure AI skill, then more people should know about this thing. I had my own ideas on how to beat ARC as well. So I was working on that some nights and weekends. And I flew up to meet Francois earlier this year to sort of quiz him, show him my ideas.
And ultimately, I was like, well, why don't you think more people know about ARC? I think you should actually answer that. I think it's a really interesting question. Like why don't you think more people know about ARC? Sure. You know, I think benchmarks that gain traction in the research community, are benchmarks that are already fairly tractable. Because the dynamic that you see is that some research group is gonna make some initial breakthrough.
And then this is gonna catch the attention of everyone else. And so you're gonna get follow-ups with people trying to beat the first team and so on. And for ARC, this has not really happened because ARC is actually very hard for existing AI techniques. ARC kind of requires you to try new ideas. And that's very much the point, by the way. The point is not that, yeah, you should just be able to apply existing technology and solve ARC. The point is that existing technology has reached a plateau.
And if you want to go beyond that, if you want to start being able to tackle problems that you haven't memorized, that you haven't seen before, you need to try new ideas. And ARC is not just meant to be this sort of measure of how close we are to AGI. It's also meant to be a source of inspiration. Like, I want researchers to look at these puzzles and be like, hey, it's really strange that these puzzles are so simple.
And most humans can just do them very quickly, so why is it so hard for existing AI systems? Why is it so hard for LLMs and so on? And this is true for LLMs, but ARC was actually released before LLMs were really a thing. And the only thing that made it special at the time was that it was designed to be resistant to memorization.
And the fact that it has survived LLMs, and gen AI in general, so well kind of shows that yes, it is actually resistant to memorization. This is what nerd-sniped me, because I went and took a bunch of the puzzles myself. I've showed it to all my friends and family too. And they're all like, oh yeah, this is super easy. Are you sure AI can't solve this? That was the same reaction for me as well.
And the more you dig in, you're like, okay, yeah, there's not just empirical evidence over the last four years that it's unbeaten, but there are theoretical concepts behind why. And I completely agree at this point that new ideas basically are needed to beat ARC. And there are a lot of current trends in the world that are actually, I think, working against that happening. Basically, I think we're actually less likely to generate new ideas right now.
I think one of the trends is the closing up of frontier research, right? The GPT-4 paper from OpenAI had no technical detail shared. The Gemini paper had no technical detail shared, like on the longer-context part of that work. And yet that open innovation, that open progress and sharing is what got us to transformers in the first place. That's what got us to LLMs in the first place.
So it's kind of disappointing a little bit actually that so much frontier work has gone closed. It's really making a bet that these individual labs are going to have the breakthrough and not that the ecosystem is going to have the breakthrough. And the internet and open source have shown that that's the most powerful innovation ecosystem that's ever existed, probably in the entire world. I think it's actually really sad that frontier research is no longer being published.
If you look back four years ago, well, everything was just open and shared, like all the state-of-the-art results were published. And that's no longer the case. And it's very much OpenAI that single-handedly changed the game. And I think OpenAI basically set back progress towards AGI by quite a few years, probably like five to 10 years, for two reasons. One is that they caused this complete closing down of frontier research publishing. But also they triggered this initial burst of hype around LLMs.
And now LLMs have sucked the oxygen out of the room, like everyone is just doing LLMs. And I see LLMs as more of an off-ramp on the path to AGI, actually. And all these new resources, they're actually going to LLMs instead of everything else they could be going to. And if you look further into the past to like 2015, 2016, there were like a thousand times fewer people doing AI back then. And yet I feel like the rate of progress was higher because people were exploring more directions.
The world felt more open-ended, like you could just go and try, like have a cool idea, launch it, try it, and get some interesting results. So there was this energy. And now everyone is very much doing some variation of the same thing. And the big labs also tried their hand on ARC, but because they got bad results, they didn't publish anything. Like people only publish positive results.
I wonder how much effort people have put into trying to prompt or scaffold, do some sort of maybe Devin-type approach, to get the frontier models, and the frontier models of today, not just a year ago, because a lot of post-training has gone into making them better, so Claude 3 Opus or GPT-4o, to produce good solutions on ARC. I hope that one of the things this episode does is get people to try out this open competition where they have to put in an open source model to compete.
But also to figure out if maybe the capability is latent in Claude Opus, and just see if you can show that. I think that would be super interesting. So let's talk about the prize. How much do you win if you solve it, you know, get whatever percent on ARC? How much do you get if you get the best submission but don't crack it? So we've got a million dollars, actually a little over a million dollars, in the prize pool, and we're running the contest on an annual basis.
We're starting it today and running through the middle of November. And the goal is to get 85%. That's the lower bound of the human average that you guys talked about earlier. And there's a $500,000 prize for the first team that can get to the 85% benchmark. We don't expect that to happen this year, actually. One of the early statisticians at Zapier gave me this line that has always stuck with me: the longer it takes, the longer it takes.
So my prior is that ARC is gonna take years to solve. And so we're also gonna break it down and do a progress prize this year. So there's a $100,000 prize, which we will pay out to the top scores. So $50,000 is gonna go to the top objective scores this year on the Kaggle leaderboard, since we're hosting it on Kaggle. And then we're gonna have a $50,000 pot set aside for a paper award for the best paper that explains conceptually the scores that they were able to achieve.
And one of the, I think, interesting things we're also gonna be doing is, we're gonna be requiring that in order to win the prize money, you put the solution or your paper out into the public domain. The reason for this is, typically with contests, you see a lot of closed-off sharing. People are kind of private and secretive. They wanna hold their alpha to themselves during the contest period. And because we expect it's gonna be multiple years, we want an iterated game here.
So the plan is, at the end of November, we will award the $100,000 prize money to the top progress prize, and then use the downtime between December, January, February to share out all the knowledge from the top scores and the approaches folks were taking in order to re-baseline the community up to whatever the state of the art is, and then run the contest again next year. And keep doing that on a yearly basis until we get 85%.
I'll give people some context on why I think this prize is very interesting. I was having conversations with my friends who are very much believers in models as they exist today. And first of all, it was intriguing to me that they didn't know about ARC. These are experienced ML researchers. And so you show them, this happened a couple of nights ago. We went to dinner and I showed them an example problem. And they said, of course, an LLM would be able to solve something like this.
And then we take a screenshot of it, we just put it into our ChatGPT app. And it doesn't get the pattern. And so I think it's very interesting. Like it is a notable fact. I was sort of playing devil's advocate against you on these kinds of questions, but this is a very intriguing fact. And I think this prize is extremely interesting because we're going to learn something fascinating one way or another.
So with regards to the 85%, separate from this prize, I'd be very curious if somebody could replicate that result, because obviously in psychology and other kinds of fields, which this result seems to be analogous to, when you run a test on some small sample of people, the results are often hard to replicate. So I'd be very curious, if you try to replicate this, how does the average human perform on ARC? As for the difficulty and how long it will take to crack this benchmark.
It's very interesting because the other benchmarks that are now fully saturated, like MMLU and MATH, actually the people who made them, Dan Hendrycks and Collin Burns, who did MMLU and MATH, I think they were grad students or college students when they made them. And the goal when they made them just a couple of years ago was that this would be a test of AGI. And of course, it got totally saturated. I know you'll argue that these are tests of memorization.
But I think the pattern we've seen, in fact, Epoch AI has a very interesting graph that I'll sort of overlay for the YouTube version here, where you see this almost exponential, where it gets 5%, 10%, 30%, 40% as you increase the compute across models and then it just shoots up. And in the GPT-4 technical report, they had this interesting graph of the HumanEval problem set, which was 22 coding problems.
And they had to graph it on the mean log pass curve, basically because early on in training, even smaller models can have the right idea of how to solve this problem. But it takes a lot of reliability to make sure they stay on track to solve the whole problem. And so you really want to upweight the signal where they get it right at least some of the time, maybe one in a hundred times, one in a thousand. And so they go from like one in a thousand, to one in a hundred, to one in ten, and then they just totally saturate it.
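To make the point about log-scale pass rates concrete, here is a minimal sketch. The per-problem pass rates below are made-up illustrative numbers, not figures from the GPT-4 technical report, and `mean_log_pass_rate` is a hypothetical helper name. It shows that a jump from one in a thousand to one in a hundred is invisible on a linear average but is a steady step on a log average.

```python
import math

def mean_log_pass_rate(pass_rates):
    """Average of log(pass rate) across problems; every rate must be > 0."""
    return sum(math.log(p) for p in pass_rates) / len(pass_rates)

# Hypothetical per-problem pass rates for three model sizes (illustrative only).
models = {
    "small": [0.001, 0.001, 0.002],
    "medium": [0.01, 0.01, 0.02],
    "large": [0.1, 0.1, 0.2],
}

for name, rates in models.items():
    linear = sum(rates) / len(rates)
    print(f"{name}: mean pass rate = {linear:.4f}, "
          f"mean log pass rate = {mean_log_pass_rate(rates):.2f}")
```

On the linear scale the small and medium models both look close to zero; on the log scale each tenfold improvement moves the metric by the same amount, which is why it can be tracked and extrapolated across model sizes.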
I guess the question I have, that this is all leading up to, is why won't the same thing happen with ARC, where people had to try really hard with bigger models. And now they've figured out these techniques that Jack Cole has figured out, with only a 240 million parameter language model that can get 35%. Shouldn't we see the same pattern we saw across all these other benchmarks, where you just sort of eke it out, and then once you get the general idea, you just go all the way to 100?
That's an empirical question. So we'll see in practice what happens. But what Jack Cole is doing is actually very unique. It's not just pre-training an LLM and then prompting it. It's actually trying to do active inference. You're doing test-time fine-tuning, right? Test-time fine-tuning. And this is actually trying to lift one of the key limitations of LLMs, which is that at inference time, they cannot learn anything.
They cannot adapt on the fly to what they are seeing. And he's actually trying to learn. So what he's doing is effectively a form of program synthesis. Because the LLM contains a lot of useful building blocks, like programming building blocks. And by fine-tuning it on the task at test time, you are trying to assemble these building blocks into the right pattern that matches the task. This is exactly what program synthesis is about.
And the way I'd contrast this approach with discrete program search is that in discrete program search, you're trying to assemble a program from a set of primitives. You have very few primitives. So people working on discrete program search on ARC, for instance, tend to work with DSLs that have like 100 to 200 primitive programs. So a very small DSL, but then they're trying to combine these primitives into very complex programs.
So there's a very deep depth of search. And on the other hand, if you look at what Jack Cole is doing with the LLM, he has got this sort of vector program database, a DSL of millions of building blocks in the LLM that are mined by pre-training the LLM, not just on a ton of programming problems, but also on millions of generated ARC-like tasks. So you have an extraordinarily large DSL. And then the fine-tuning is a very shallow recombination of these primitives.
So discrete program search is very deep recombination with a very small set of primitive programs. And the LLM approach is the same, but on the complete opposite end of the spectrum, where you scale up the memorization by a massive factor and you're doing very, very shallow search. But they are the same thing, just different ends of the spectrum. And I think where you're going to get the most value for your compute cycles is going to be somewhere in between.
You want to leverage memorization to build up a richer, more useful bank of primitive programs. And you don't want them to be hard coded, like what we saw for the typical ARC DSL. You want them to be learned from examples. But then you also want to do some degree of deep search. As long as you're only doing very shallow search, you are limited to local generalization. If you want to generalize further, more broadly, this depth of search is going to be critical.
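As a rough illustration of the discrete program search end of that spectrum, here is a toy sketch. It assumes a hypothetical three-primitive DSL of grid transforms (real ARC DSLs are described above as having on the order of 100 to 200 primitives) and enumerates compositions of increasing depth until one reproduces every demonstration pair. The names `PRIMITIVES`, `run`, and `search` are made up for this example.

```python
from itertools import product

# Grids are tuples of tuples of ints so they can be compared directly.
def flip_h(grid):
    """Mirror each row left to right."""
    return tuple(row[::-1] for row in grid)

def flip_v(grid):
    """Mirror the grid top to bottom."""
    return grid[::-1]

def transpose(grid):
    """Swap rows and columns."""
    return tuple(zip(*grid))

# A deliberately tiny, hand-coded DSL (hypothetical, for illustration only).
PRIMITIVES = {"flip_h": flip_h, "flip_v": flip_v, "transpose": transpose}

def run(program, grid):
    """Apply a sequence of primitive names to a grid, in order."""
    for name in program:
        grid = PRIMITIVES[name](grid)
    return grid

def search(examples, max_depth=3):
    """Return the first (shortest) program that fits all examples, else None."""
    for depth in range(1, max_depth + 1):
        for program in product(PRIMITIVES, repeat=depth):
            if all(run(program, inp) == out for inp, out in examples):
                return program
    return None

# Two demonstration pairs for a hidden rule (rotate 90 degrees clockwise).
examples = [
    (((1, 2), (3, 4)), ((3, 1), (4, 2))),
    (((1, 2, 3), (4, 5, 6)), ((4, 1), (5, 2), (6, 3))),
]

program = search(examples)
print(program)
print(run(program, ((5, 6), (7, 8))))
```

The number of candidate programs grows as the number of primitives raised to the depth, which is why the trade-off described above matters: a larger, learned bank of primitives can keep the required depth small, while a small hand-coded DSL needs deep search.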
I might argue that the reason that he had to rely so heavily on the synthetic data was because he used a 240 million parameter model, because the Kaggle competition at the time required him to use a P100 GPU, which has like a tenth or something of the FLOPs of an H100. And so obviously he can't use a bigger model.
If you believe that scaling will solve this kind of reasoning, then you can just rely on the generalization, whereas if you're using a much smaller model, you can't. For context for the listeners, by the way, the frontier models today are literally 1,000x bigger than that. And so for your competition, from what I remember, the submission you have to submit can't make any API calls, can't go online. Yeah, I know. And it has to run on an Nvidia Tesla T4? P100. Oh, it's a P100?
Yeah. OK, so again, it's like significantly less powerful. There's a 12-hour runtime limit, basically. There's a forcing function of efficiency in the eval. But here's the thing. You only have 100 test tasks. So the amount of compute available for each task is actually quite a bit, especially if you contrast that with the simplicity of each task.
So it would be seven minutes per task, basically. For people who have tried to do these estimates of how many FLOPs a human brain has, and you can take them with a grain of salt, but as a sort of anchor, it's basically the amount of FLOPs that an H100 has. And I guess maybe you would argue that, well, a human brain can solve this question in faster than 7.2 minutes. So even with a tenth of the compute, you should be able to do it in seven minutes.
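The per-task budget quoted here follows directly from the two numbers mentioned in the conversation (a 12-hour runtime limit and 100 test tasks). The sketch below just spells out that arithmetic; the one-tenth ratio between a P100 and an H100 is the speaker's rough estimate, not a measured figure, and the variable names are made up for this example.

```python
RUNTIME_LIMIT_HOURS = 12       # runtime limit stated in the conversation
NUM_TEST_TASKS = 100           # size of the private test set
P100_FRACTION_OF_H100 = 10     # speaker's rough "a tenth or something" estimate

seconds_per_task = RUNTIME_LIMIT_HOURS * 3600 // NUM_TEST_TASKS
minutes_per_task = seconds_per_task / 60

# Same budget expressed as time on an H100, under the one-tenth assumption.
h100_equivalent_seconds = seconds_per_task / P100_FRACTION_OF_H100

print(f"{minutes_per_task} minutes per task on the P100")
print(f"about {h100_equivalent_seconds} seconds of H100-equivalent compute per task")
```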
Obviously, we have less memory than the petabytes of fast-access memory in the brain, with the 29 or whatever gigabytes in this H100. Anyway, I guess the broader question I'm asking is, I wish there was a way to also test this prize with some sort of scaffolding on the biggest models, as a way to test whether scaling is the path to solving ARC. Absolutely. So in the context of the competition, we want to see how much progress we can make with limited resources.
But you're entirely right that it's a super interesting open question. What could the biggest models out there actually do on ARC? So we want to actually also make available a private sort of one-off track where you can submit a VM to us. And so you can put on it any model you want. You can take one of the largest open source models out there, fine-tune it, do whatever you want. And just give us an image. And then we run it on an H100 for like 24 hours or something.
And you see what you get. I think it's worth pointing out that there are two different test sets. There is a public test set that's in the public GitHub repository that anyone can use to train, you know, put it in an OpenAI API call, whatever you'd like to do. And then there's the private test set, which is the 100 that is actually measuring the state of the art. So I think it is pretty open and interesting to have folks attempt to at least use the public test set and go try it.
Now there is an asterisk on any score that's reported against the public test set, because it is public and it could have leaked into the training data. And this is actually what people are already doing. Like you can already try to prompt one of the best models, like the latest Gemini, the latest GPT-4, with tasks from the public evaluation set. And you know, again, the problem is that these tasks are available as JSON files on GitHub. These models are also trained on GitHub.
So they're actually trained on these tasks. And yeah, that can create uncertainty: if they can actually solve some of the tasks, is that because they memorized the answer or not? You know, maybe you would be better off trying to create your own private, ARC-like, very novel test set. Don't make the tasks difficult. Don't make them complex. Make them very obvious for humans. But make sure to make them original as much as possible. Make them unique, different.
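For anyone who wants to try building such a private set, a minimal sketch follows. It assumes the same JSON layout the public ARC repository uses (a `train` list and a `test` list of input/output grid pairs) and scores by exact match on the output grid. The task itself, the `mirror_left_right` solver, and the `score` helper are invented for this example.

```python
import json

# A made-up, human-obvious task (mirror the grid left to right),
# written in the train/test layout used by the public ARC JSON files.
TASK_JSON = """
{"train": [{"input": [[1, 0], [0, 0]], "output": [[0, 1], [0, 0]]},
           {"input": [[0, 0], [2, 0]], "output": [[0, 0], [0, 2]]}],
 "test":  [{"input": [[3, 0], [0, 0]], "output": [[0, 3], [0, 0]]}]}
"""

def mirror_left_right(grid):
    """Candidate solver: reverse every row of the grid."""
    return [row[::-1] for row in grid]

def score(task, solver):
    """Fraction of test pairs where the predicted grid matches exactly."""
    pairs = task["test"]
    correct = sum(solver(pair["input"]) == pair["output"] for pair in pairs)
    return correct / len(pairs)

task = json.loads(TASK_JSON)
fits_train = all(mirror_left_right(p["input"]) == p["output"] for p in task["train"])
print("fits demonstrations:", fits_train)
print("test score:", score(task, mirror_left_right))
```

To test a model rather than a hand-written solver, replace `mirror_left_right` with a function that sends the demonstration pairs and the test input to the model and parses a grid from its reply. Keeping the task file out of any public repository is what preserves the property discussed above.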
And see how well your GPT-4 and so on, or GPT-5, does on them. Well, there have been tests on whether these models are being overtrained on these benchmarks. Scale recently did this on GSM8K, which is really interesting. They basically replicated the benchmark, but with different questions. And so some of the models actually were extremely overfit on the benchmark, like Mistral and so forth.
But the frontier models, Claude and GPT, actually did as well on their novel benchmark as they did on the specific questions that were in the existing public benchmark. So I would be relatively optimistic about them just sort of training on the JSON. I was joking with Mike that you should allow API access, but sort of keep an even more private validation set of these ARC questions. And so allow API access. People can sort of play with GPT-4 scaffolding to enter into this contest.
And maybe later on, you run the validation set on the API. And if it performs worse than the test set that you allowed API access to originally, that means that OpenAI is training on your API calls. And you go public with this and show them, like, oh my god, they've leaked your data. We do want to evolve the ARC dataset. Like, that is a goal that we want to pursue. I think Francois, you mentioned, you know, it's not perfect.
Yeah, no, ARC is not a perfect benchmark. I mean, I made it like four years ago, over four years ago, almost five now. This was in a time before LLMs. And I think we've learned a lot since then about what potential flaws there might be. I think there is some redundancy in the set of tasks, which is, of course, against the goals of the benchmark. Every task is supposed to be unique. In practice, that's not quite true.
I think also every task is supposed to be very novel, but in practice, they might not be. They might be structurally similar to something that you might find online somewhere. So we want to keep iterating and release an ARC 2 version later this year. And I think when we do that, we're going to want to make the old private test set available.
So maybe we won't be releasing it publicly, but what we could do is just create a test server where you can query, get a task, and submit a solution. And of course, you can use whatever frontier model you want there. So that way, because you actually have to query this API, you're making sure that no one is going to train on this data by accident. It's unlike the current public ARC data, which is literally on GitHub. So there's no question about whether the models are actually trained on it.
Yes, they are, because they're trained on GitHub. So by sort of gating access to querying this API, we'd avoid this issue. And then we would see, you know, for people who actually want to try whatever technique they have in mind, using whatever resources they want, that would be a way for them to get an answer. I wonder what might happen. I'm not sure.
One answer is that they come up with a whole new algorithm for AI, with some explicit program synthesis, so that now we're on a new track. And another is they did something hacky with the existing models in a way that actually is valid, which reveals that maybe intelligence is more about getting things to the right part of the distribution, but then it can reason. And in that world, I guess that will be interesting.
And maybe that'll indicate that, you know, you had to do something hacky with current models, but as they get better, you won't have to do something hacky. I'm also going to be very curious to see whether these multimodal models will perform natively much better at ARC-like tests. If ARC survives three months from here, we'll up the prize. I think we're about to make a really important moment of contact with reality by blowing up the prize, putting a much bigger prize pool against it.
We're going to learn really quickly if there's low-hanging fruit of ideas. Again, I think new ideas are needed. I think anyone listening to this might have the idea in their head. And I'd encourage everyone to give it a try. And I think as time goes on, that adds strength to the argument that we've sort of stalled out in progress and that new ideas are necessary to beat ARC. That's the point of having a money prize: you attract more people, you get them to try to solve it.
And if there's an easy way to hack the benchmark that reveals that the benchmark is flawed, then you're going to know about it. In fact, that was the point of the original Kaggle competition back in 2020 for ARC. I was running this competition because I had released this dataset, and I wanted to know if it was hackable, if you could cheat. So there was a small money prize at the time, that was like 20K. And this was right around the same time as GPT-3 was released.
So people of course tried GPT-3 on the public data. It scored zero. But I think what the first contest taught us is that there is no obvious shortcut. Right. And well, now there's more money, there's going to be more people looking into it. Well, we're going to find out. We're going to see if the benchmark is going to survive.
And if we end up with a solution that is not trying to brute-force the space of possible ARC tasks, that's just trained on core knowledge, I don't think it's necessarily going to be in and of itself AGI, but it's probably going to be a huge milestone on the way to AGI. Because what it represents is the ability to synthesize a problem-solving program from just two or three examples. And that alone is a new way to program.
It's an entirely new paradigm for software development, where you can start programming potentially quite complex programs that will generalize very well. And instead of programming them by coming up with the shape of the program in your mind and then typing it up, you're actually just showing the computer what output you want, and you let the computer figure it out. I think that is extremely powerful.
I want to riff a little bit on what kinds of solutions might be possible here, and which you would consider sort of defeating the purpose of ARC, and which are sort of valid. Here's one I'll mention, which is my friends Ryan and Buck stayed up last night, because I told them about this. And they were like, oh, of course LLMs can solve this. Thank you for spreading the word. Of course LLMs can solve this. And then so they were trying to prompt, I think, Claude Opus on this.
And they say they got 25% on the public ARC test. And what they did was have other examples of some of the ARC tests, and in context, explain the reasoning of why you went from one output to another output, and then now you have the current problem. And I think also maybe expressing the JSON in a way that is more amenable to the tokenizer. And another thing was using the code interpreter.
So I'm curious actually, if you think the code interpreter, which keeps getting better as these models get smarter, is just the program synthesis right there, because what they were able to do was get the actual output of the cells, the JSON output, through the code interpreter, like write the Python program that gets the right output here. Do you think that the program synthesis kind of research you're talking about will look like just using the code interpreter in large language models?
I think whatever solution we see that will score well is going to probably need to leverage some aspects from deep learning models, and LLMs in particular. We've shown already that LLMs can do quite well. That's basically the Jack Cole approach. We've also shown that pure discrete program search from a small DSL does very, very well. Before Jack Cole, this was the state of the art. In fact, it's still extremely close to the state of the art.
And there's no deep learning involved at all in these models. So we have two approaches that have basically no overlap that are doing quite well. And they're very much at two opposite ends of one spectrum. On one end, you have these extremely large banks of millions of vector programs, but very, very shallow recombination, like simplistic recombination.
And on the other end, you have very simplistic DSLs, very simple, like 100 or 200 primitives, but very deep, very sophisticated program search. The solution is going to be somewhere in between. So the people who are going to be winning the ARC competition, and who are going to be making the most progress towards near-term AGI, are going to be those that manage to merge the deep learning paradigm and the discrete program search paradigm into one elegant way.
You know, you ask like, what would be legitimate and what would be cheating, for instance? So if you want to add a code interpreter to the system, I think that's great. That's legitimate. The part that would be cheating is to try to anticipate what might be in the test, like brute-force the space of possible tasks, and then train a memorization system on it.
And then rely on the fact that you're generating so many tasks, like millions and millions and millions, that inevitably there's going to be some overlap between what you're generating and what's in the test set. So I think that's defeating the purpose of the benchmark, because then you can just solve it without any need to adapt, just by fetching a memorized solution. So hopefully, ARC will resist that, but you know, no benchmark is necessarily perfect.
So maybe there's a way to hack it. And I guess we're going to get an answer very soon. Well, I think some amount of fine-tuning is valid because these models don't natively think in these terms, especially the language models alone, which is what the open source models they would have to use to be competitive here are. They're natively language, so they need to be able to think in this kind of ARC-type way.
You want to input core knowledge, like ARC-like core knowledge, into the model, but surely you don't need tens of millions of tasks to do this, like core knowledge is really basic. If you look at some of these ARC-type questions, I actually do think they rely a little bit on things I have seen throughout my life. Like for example, something bounces off a wall and comes back and you see that pattern. It's like I played arcade games and I've seen Pong or something.
And I think, for example, when you see the Flynn effect and people's intelligence, as measured on Raven's Progressive Matrices, increasing on these kinds of questions, it's probably a similar story where, since childhood, we now actually see these sorts of patterns in TV and whatever, spatial patterns. And so I don't think this is sort of core knowledge.
I think actually this is also part of the quote-unquote fine tuning that humans have as they grow up of seeing different kinds of spatial patterns and trying to pattern match to them. I would definitely file that in the core knowledge. Like, core knowledge includes basic physics, for instance, bouncing or trajectories. That would be included. But yeah, I think you're entirely right.
The reason why as a human, you're able to quickly figure out the solution is because you have this set of building blocks, this set of patterns in your mind that you can recombine. Is core knowledge required to attain intelligence and any algorithm you have, does the core knowledge have to be in some sense hard coded or can even the core knowledge be learned through intelligence? Core knowledge can be learned.
And I think in the case of humans, some amount of core knowledge is something that you're born with. Like we're actually born with a small amount of knowledge about the world we live in. We're not blank slates. But most core knowledge is acquired through experience. But the thing with core knowledge is that it's not going to be acquired, like, for instance, in school. It's actually acquired very, very early, in the first like three to four years of your life.
And by age four, you have all the core knowledge you're going to need as an adult. OK, interesting. So I mean, on the prize itself, I'm super excited to see both the open source versions, maybe with a Llama 7B or something, and what people can score in the competition itself. Then, to sort of test specifically the scaling hypothesis, I'm very curious to see if you can prompt on the public version of ARC, which I guess won't be compatible.
You won't be able to submit that to this competition itself. But I'd be very curious to see if people can sort of crack that and get ARC working there, and if that would update your views on AGI. It's really going to be motivating. We're going to keep running the contest until somebody puts a reproducible open source version into the public domain.
So even if somebody privately beats ARC, we're going to still keep the prize money until someone can reproduce it and put the public reproducible version out there. Yeah, exactly. Like the goal is to accelerate progress towards AGI. And a key part of that is that any sort of meaningful bit of progress needs to be shared, needs to be public. So everyone can know about it and can try to iterate on it. If there's no sharing, there's no progress.
What I'm especially curious about is sort of disaggregating the bets of like, can we make an open version of this versus, is this a thing that's just possible with scaling? And we can, I guess, test both of them based on the public and the private version. We're making contact with reality as well with this, right?
We're going to learn a lot, I think, about what the actual limits of compute are. If someone showed up and said, hey, here's a closed source model that I'm getting 50 plus percent on, I think that would probably update us on like, OK, perhaps we should increase the amount of compute that we give on the private test set in order to balance. Some of the decisions initially are somewhat arbitrary, in order to learn about, OK, what do people want? What does progress look like?
And I think both of us are sort of committed to evolving it over time in order to make it the best, or the closest to perfect, that we can get it. Awesome. And where can people go to learn more about the prize and maybe try their hand at it? arcprize.org. Which goes live today. So one million dollars is on the line, people. Good luck. Thank you guys for coming on the podcast. It's super fun to go through all the cruxes on intelligence and get a different perspective.
And also to announce a prize here. So this is awesome. Thank you for helping break the news. Thank you for having us.