The kaleidoscope hypothesis is the idea that the world in general, and any domain in particular, follows the same structure: it appears on the surface to be extremely rich and complex and infinitely novel with every passing moment, but in reality it is made from the repetition and composition of just a few atoms of meaning.
And a big part of intelligence is the process of mining your experience of the world to identify the bits that are repeated and to extract them — to extract these unique atoms of meaning. When we extract them, we call them abstractions. And then, as we build up banks of such abstractions, we can reuse them to make sense of novel situations.
Situations that appear to be extremely unique and novel on the surface, but that can actually be interpreted by composing together these reusable abstractions. And before we start building AGI, we need to ask ourselves the hard questions. What is intelligence? How can we measure it and benchmark progress? And what are the directions we should follow to build it? So I just want to give you my take on these questions.
And I'm going to start by taking you back to peak AGI hype, which was early last year. Remember what February 2023 felt like. ChatGPT had been released just a couple of months prior. GPT-4 had just come out. Bing Chat had just come out — it was the Google killer. Anyone remember Bing Chat? And we were told that ChatGPT would make us 100x more productive, a thousand x more productive, that it would outright replace us. The existential risk of AI was becoming front-page news.
And AGI was just around the corner. It was no longer 10 years away, not even five years away — it was just a couple of years away. You could start the countdown in months. And that was one and a half years ago. Clearly, back then, AI was coming for your job right away. It could do anything you could, but faster and cheaper. And how did we know? Well, it could pass exams. And exams are the way we tell whether other humans are fit to perform a certain job.
If AI passes the bar exam, then it can be a lawyer. If it can solve programming puzzles, then it can be a software engineer, and so on. So many people were saying that all lawyers, all software engineers, all doctors, and so on were going to be out of a job, maybe even within the next year — which would be today. In fact, most desk jobs were going to disappear and we faced mass unemployment.
It's very funny to think about, because today the employment rate in the US is actually higher than it was at the time. So was it actually true, what the benchmarks were telling us back then? If you go back to the real world — away from the headlines, away from the February 2023 hype — it seemed that LLMs might be falling a little bit short of general intelligence.
I'm sure most of you in this room would agree with that. They suffer from some problems, and these limitations are inherent to curve fitting. They're inherent to the paradigm that you are using to build these models, so they're not easy to patch. In fact, there's been basically no progress on these limitations since day one. And day one was not last year — it was when we started using these transformer-based large language models, over five years ago.
And the reason we've not really made any progress on these problems is that the models we are using are still the same. They're parametric curves fitted to datasets via gradient descent, and they're still using the same transformer architecture. So I'm going to cover these limitations. I'm not actually going to cover hallucinations, because all of you are probably very familiar with them. But let's take a look at the other ones.
To start with, an interesting issue with LLMs is that because they're autoregressive models, they will always answer with something that seems likely to follow your question, without necessarily looking at the contents of your question. So for instance, for a few months after the original release of ChatGPT, if you asked what's heavier, 10 kilos of steel or one kilo of feathers, it would answer that they weigh the same.
And it would answer that because the trick question — what's heavier, one kilo of steel or one kilo of feathers — is found all over the internet, and the answer there is of course that they weigh the same. So the model would just pattern-match the question without actually looking at the numbers, without parsing the actual question you're asking. Same thing if you provide a variation of the Monty Hall problem, which is the screenshot right here.
The LLM has memorized perfectly the canonical answer to the actual Monty Hall problem. So if you ask a modified variation, it's just going to blow right through it and output the answer to the original problem. To be clear, these two specific problems have already been patched via RLHF, but they've been patched by special-casing them. And it's very, very easy to find new problems that still fit this failure mode.
So you may say, well, these examples are from last year, so surely today we are doing much better. And in fact, no, we are not. The issues have not changed since day one. We've not made any progress towards addressing them. They still plague the latest state-of-the-art models, like Claude 3.5, for instance. There's a paper from just last month that actually investigates some of these examples on state-of-the-art models, including Claude 3.5.
A closely related issue is the extreme sensitivity of LLMs to phrasing. If you change the names, places, or variable names in a text paragraph, it can break LLM performance. Or if you change the numbers in a formula. There's an interesting paper that investigates this, so you can check it out — it's called Embers of Autoregression.
And people who are very optimistic would say that this brittleness is actually a great thing, because it means that your models are more performant than you know — you just need to query them in the right way and you will see better performance. You just need prompt engineering. And the counterpoint to that statement is that for any LLM, for any query that seems to work, there is an equivalent rephrasing of the query — one that a human would readily understand — that will break it.
And to what extent do LLMs actually understand something if you can break their understanding with very simple renamings and rephrasings? It looks a lot more like superficial pattern matching than robust understanding. Besides that, there's a lot of talk about LLMs' ability to perform in-context learning, to adapt to new problems on the fly.
And what seems to actually happen is that LLMs are capable of fetching memorized programs — like problem-solving templates — and mapping them to the current task. And if they don't have a memorized program ready, if they're faced with something slightly unfamiliar, even if it's something very simple, they will not be able to analyze it from first principles the way a human could. One example is Caesar ciphers. State-of-the-art LLMs can solve a Caesar cipher, which is very impressive.
But as it turns out, they can only solve it for very specific values of the key size — specific values like 3 and 5 that you find commonly in online examples. If you show it a cipher with a key size like 13, for instance, it will fail. So it has no actual understanding of the algorithm for solving the cipher; it has only memorized the solution for very specific values of the key size.
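For reference, here is a minimal sketch of what actually knowing the algorithm looks like — a decoder that works for any key, which is the kind of first-principles procedure the model fails to apply when the key is unfamiliar:

```python
import string

def caesar_decode(ciphertext: str, key: int) -> str:
    """Shift every letter back by `key` positions. The same few lines handle
    any key size, not just the 3s and 5s that dominate online examples."""
    result = []
    for ch in ciphertext.lower():
        if ch in string.ascii_lowercase:
            result.append(chr((ord(ch) - ord("a") - key) % 26 + ord("a")))
        else:
            result.append(ch)  # leave spaces and punctuation untouched
    return "".join(result)

print(caesar_decode("uryyb jbeyq", 13))  # -> "hello world"
```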
So my hypothesis is that LLM performance depends purely on task familiarity, and not at all on task complexity. There isn't really any complexity ceiling on what you can get LLMs to solve, as long as you give them the opportunity to memorize the solution — the problem-solving template, the program that you have to run to get the answer.
Instead, LLM performance depends entirely on task familiarity — and so even very simple problems, if they are unfamiliar, will be out of reach. And lastly, LLMs suffer from a generalization issue with the programs that they did in fact memorize. Some examples include the fact that LLMs have trouble with number multiplication, as you probably know, and with list sorting and so on, even though they've seen millions of examples of these problems.
So they typically have to be aided by external symbolic systems in order to handle these things. There's an interesting paper that investigates how LLMs handle composition — it's on the limits of transformers on compositionality — and the main finding is that LLMs do not actually handle composition at all; what they're doing instead is linearized subgraph matching.
There's another paper that's also very intriguing: the Reversal Curse. The authors found that if you train an LLM on content like "A is B", it cannot actually infer the reverse, "B is A". That's a breakdown of generalization at a really deep level. And it's genuinely surprising — even I, who am particularly skeptical of LLMs, was very surprised by this result.
One thing to keep in mind about this failure case is that specific queries will tend to get fixed relatively quickly because the models are being constantly fine tuned on new data collected from human contractors based on past query history.
And so many of the examples that I showed in my slides are probably already working with some of the state-of-the-art LLMs, because they failed in the past and have been manually addressed since. But that's a very brittle way of making progress, because you're only addressing one query at a time.
And even for a query that you patched, if you rephrase it or if you change the names and variables, it's going to start failing again. So it's a constant game of whack-a-mole, and it's very, very heavily reliant on human labor. Today there are probably between 10,000 and 30,000 humans working full time on creating annotated data to train these LLMs. So, on balance, it seems a little bit contradictory: on one hand, LLMs are beating every human benchmark that you throw at them.
And on the other hand, they're not really demonstrating a robust understanding of the things they are doing. So to solve this paradox, you have to understand that skill and benchmarks are not the primary lens through which you should look at these systems.
So, let's zoom out by a lot. There have been, historically, two currents of thought to define the goals of AI. First, there's the Minsky-style view, which echoes the current big-tech view that AGI would be a system that can perform most economically valuable tasks. Minsky said AI is the science of making machines capable of performing tasks that would require intelligence if done by humans. So it's very task-centric: you care about whether the AI does well on a fixed set of tasks.
And then there's the McCarthy view. He didn't exactly say what I'm quoting here, but he was a big proponent of the idea that generality in AI is not task-specific performance scaled up to many tasks — it's about getting machines to handle problems they have not been prepared for.
And that difference echoes the Locke view of intelligence versus the Darwin view of intelligence: intelligence as a general-purpose learning mechanism, versus intelligence as a collection of task-specific skills imparted to you by evolution.
And my view is more like the Locke and McCarthy view. I see intelligence as a process, and skill — task-specific skill — as the output of that process. This is a really important point. If there's just one point you take away from this talk, it should be this.
Skill is not intelligence. And displaying skill at any number of tasks does not show intelligence. It's always possible to be skillful at any given task without requiring any intelligence. And this is like the difference between having a road network versus having a road-building company.
If you have a road network, then you can go from A to B for a very specific set of As and Bs that were defined in advance. But if you have a road-building company, then you can start connecting arbitrary As and Bs on the fly as your needs evolve.
So attributing intelligence to a crystallized behavior program is a category error. You are confusing the output of the process with the process itself. Intelligence is the ability to deal with new situations, the ability to blaze fresh trails and build new roads. It's not the road. So don't confuse the road and the process that created it. And all the issues that we are facing today with LLMs are a direct result of this misguided conceptualization of intelligence.
The way we define and measure intelligence is not a technical detail that you can leave to externally provided benchmarks. It reflects our understanding of cognition. It reflects the questions that you are asking, and through that, it also limits the answers that you could be getting. It's really the way that you measure progress; it's the feedback signal that you use to get closer to your goal. If you have the wrong feedback signal, you are not going to make progress towards actual generality.
So there are some key concepts that you have to take into account if you want to define and measure intelligence. The first thing to keep in mind is the distinction between static skill and fluid intelligence: between having access to a large collection of static programs to solve known problems — like what an LLM does — versus being able to synthesize brand-new programs on the fly to solve a problem you've never seen before.
It's not a binary, right — either you have fluidity or you don't. It's more like a spectrum, but there's higher intelligence on the fluid side of the spectrum. And the second concept is operational area. There's a big difference between being skilled only in situations that are very close to what you're familiar with, versus being skilled in any situation within a broad scope.
So, for instance, if you know how to add numbers, then you should be able to add any two numbers, not just specific numbers that you've seen before or numbers close to them. If you know how to drive, then you should be able to drive in any city. You should even be able to learn to drive in the US and then move to London and drive in London, where you're driving on the other side of the road.
If you know how to drive, but only in very specific geofenced areas, that's less intelligent. So again, there's a spectrum here — it's not a binary — but there's higher intelligence on the higher-generalization side of the spectrum. And lastly, the third concept is information efficiency. How much information, how much data, was required for your system to acquire a new skill program? If you're more information-efficient, you are more intelligent.
And all three of these concepts, these three quantities, are linked by the concept of generalization. Generalization is really the central question in AI — not skill. Forget about skill, forget about benchmarks. And that's really the reason why using human exams to evaluate AI models is a terrible idea: exams were not designed with generalization in mind.
Or rather, they were designed with generalization assumptions that are appropriate for human beings but are not appropriate for machines. Most exams assume that humans haven't memorized the exam questions and answers beforehand. They assume that the questions being asked are going to be at least somewhat unfamiliar to the test taker.
Otherwise it's a pure memorization exam, which it would obviously make sense that LLMs could ace, since they've memorized a good chunk of the internet. So, to get to the next level of capabilities, we've seen that we want AI to have the ability to adapt, to generalize to new situations that it has not been prepared for. And to get there, we need a better way to measure this ability, because it's by measuring it that we'll be able to make progress. We need a feedback signal.
So, in order to get there, we need a clear understanding of what generalization means. Generalization is the relationship between the information you have — the priors that you're born with and the experience that you've acquired over the course of your lifetime — and your operational area, the space of potential future situations that you might encounter as an agent.
Future situations that feature uncertainty, that feature novelty — they're not going to be like the past. And generalization is basically the efficiency with which you operationalize past information in order to deal with the future. So you can interpret it as a conversion ratio. If you enjoy math, you can in fact use algorithmic information theory to try to characterize and quantify this ratio precisely.
And if that's interesting to you, you can check out the paper. One of the things I argue there is that to measure generalization power — to measure intelligence — you should control for priors and experience. Since intelligence is a conversion ratio, you need to know what you're dividing by.
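As a loose schematic — not the paper's exact formalism, just the shape of the idea — you can think of it like this:

```latex
\text{intelligence} \;\propto\;
  \frac{\text{skill displayed across the scope of situations you can handle}}
       {\text{priors} \;+\; \text{experience}}
```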
If you're interested specifically in comparing AI to human intelligence, then you should standardize on a shared set of cognitive priors — which should be, of course, human cognitive priors, what we call core knowledge. So, as an attempt to fulfill these requirements for a good benchmark of intelligence, I've put together a dataset. It's called the Abstraction and Reasoning Corpus for Artificial General Intelligence, or ARC-AGI for short.
You can think of it as an IQ test — a kind of intelligence test that can be taken by humans (it's actually very easy for humans) or by AI agents. You can also think of it as a program synthesis dataset. A key idea is that in ARC-AGI, every task you get is novel. It's different from any other task in the dataset.
It's also different from anything you may find online, for instance. So you cannot prepare in advance for ARC. You cannot solve ARC by memorizing the solutions in advance — that just doesn't work. You're doing few-shot program learning: you're seeing two or three examples of a thing, and then you must infer from that the program that links the input to the output. And we're also controlling for priors, in the sense that ARC-AGI is grounded purely in core knowledge priors.
So it's not going to use any sort of acquired knowledge, like the English language, for instance. It's only built on top of four core knowledge systems: there's objectness, there's basic geometry and topology, there's numbers, and there's agentness.
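Concretely, here's roughly the shape of a single ARC-AGI task — the public dataset stores each task as a JSON file along these lines, although the tiny grids below are made up for illustration:

```python
# A single ARC-AGI task: a few demonstration pairs plus test inputs whose
# outputs your program must produce. Grids are lists of lists of integers,
# where each integer stands for a color.
toy_task = {
    "train": [
        {"input": [[0, 1], [1, 0]], "output": [[1, 0], [0, 1]]},
        {"input": [[2, 0], [0, 5]], "output": [[0, 2], [5, 0]]},
    ],
    "test": [
        {"input": [[3, 0], [0, 4]]},  # a solver should produce [[0, 3], [4, 0]]
    ],
}

# Solving the task means inferring, from just these demonstrations, the
# program that maps inputs to outputs (here: mirror each row left-to-right).
```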
So we first ran a Kaggle competition on this dataset in early 2020, and it produced several very interesting solutions, all based on program synthesis. Right now, the state of the art is about 40% of the tasks solved. And that's very much baby steps.
This dataset is from before the age of LLMs, but it has actually become even more relevant in the age of LLMs, because most benchmarks based on human exams and so on have already saturated in the age of LLMs — but not ARC-AGI. And that's because ARC-AGI is designed to be resistant to memorization, while all the other benchmarks can be hacked by memorization alone.
So in June this year we launched a much more ambitious competition around ARC-AGI. We call it the ARC Prize. Together with Mike Knoop, we're offering over a million dollars in prizes to get researchers to solve ARC-AGI and open-source the solution. The competition has two tracks. There's a private track that takes place on Kaggle — it's the largest Kaggle competition at the moment. You get evaluated on 100 hidden tasks, and your solution must be self-contained.
It must be able to run on a GPU VM within 12 hours, so it needs good speed as well. There's a grand prize for getting over 85%. Then there are prizes for the top scores as well, and there's even a best-paper prize.
So even if you don't have a top result, but you have good ideas, just write the paper and submit it — you can win some money. And there's also a public track, which we added because people kept asking, OK, but how do the state-of-the-art LLMs like GPT-4 or Claude do on this dataset?
So we launched this sort of semi-private eval, where the tasks are not public, but they're also not quite private either, because they are being sent to these remote APIs. And surprisingly, the state of the art on this track is pretty much the same as on the private track, which is actually quite interesting. So how are the LLMs doing? Yeah, exactly — not very well. Most of the state-of-the-art LLMs are doing between 5% and 9%.
And then there's one that's doing better: Claude 3.5. Claude 3.5 is a big jump — it's at 21%. And meanwhile, basic program search should be able to get you to at least 50%. How do we know it's 50%? That's where you'd get if you ensembled all of the submissions that were made in the 2020 competition, which are all brute-force program search.
So basically, if you scale up brute-force program search with more compute, you should get at least 50%. And meanwhile, humans easily do over 90%. The private test set was verified by two people, and each of them scored 97 to 98%. Together, they solved 100%. So, 5 to 21% is not great, but it's also not zero. It implies that LLMs have non-zero intelligence according to the benchmark, and that's intriguing.
But one thing you have to keep in mind is that the benchmark is far from perfect. There's a chance that you could achieve this score by purely memorizing patterns and reciting them — it's possible. So we have to investigate where this performance comes from. Because if it comes from a kernel of reasoning, then you could scale up the approach to become more general over time and eventually get to general AI.
But if that performance actually comes from memorization, then you'll never reach generality. You will always have to keep applying these one-time, human-guided, pointwise fixes to acquire new skills. It's going to be this perpetual game of whack-a-mole, and it's not going to scale to generality. So to better understand what LLMs are doing, we have to talk about abstraction. Abstraction and generalization are closely tied, because abstraction is the engine through which you produce generalization.
So let's take a look at abstraction in general, and then we'll look at abstraction in LLMs. To understand abstraction, you have to start by looking around — zoom out, look at the universe. An interesting observation about the universe is that it's made of many different things that are all similar, all analogous to each other. One human is similar to other humans because they have a shared origin. Electromagnetism is analogous to hydrodynamics, which is also analogous to gravity, and so on.
So everything is similar to everything else. We are surrounded by isomorphisms. I call this the kaleidoscope hypothesis. You know what a kaleidoscope is: a tube with a few bits of colored glass that are repeated and amplified by a set of mirrors, and that creates this remarkable richness of complex patterns out of just a few kernels of information.
And the universe is like that. In this context, intelligence is the ability to mine the experience that you have, to identify bits that are reusable, and to extract these bits — you call them abstractions, and they take the form of programs, patterns, representations. And then you recombine these bits together to make sense of novel situations. So intelligence is sensitivity to abstract analogies.
And in fact, that's pretty much all there is to it. If you have a high sensitivity to analogies, then you will extract powerful abstractions from little experience, and you will be able to use these abstractions to make sense of a maximally large area of future experience space. And one really important thing to understand about abstraction ability is that it's not a binary thing, where either you're capable of abstraction or you're not. It's a matter of degree.
There's a spectrum, from factoids, to organized knowledge, to abstract models that can generalize broadly and accurately, to meta-models that enable you to generate new models on the fly when facing a new situation. Degree zero is when you purely memorize pointwise factoids. There's no abstraction involved; it doesn't generalize at all beyond what you memorized.
Here we're representing our factoids as functions with no arguments — you're going to see why in a bit. The fact that they have no arguments means that they're not abstract at all. Once you have lots of related factoids, you can organize them into something that's more like an abstract function that encodes knowledge. So this function has a variable x; it is abstract for x.
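Here's a toy illustration of those first two rungs — not literally how an LLM stores knowledge, just the shape of the idea:

```python
# Degree zero: pointwise factoids, written as functions with no arguments.
# No abstraction, no generalization -- each one only "knows" itself.
def three_plus_four():
    return 7

def three_plus_five():
    return 8

# One rung up: related factoids organized into a function abstract for x.
# The "knowledge" here is still just a lookup plus a nearest-point fallback,
# so it is approximate and breaks down far from what was memorized.
MEMORIZED = {4: 7, 5: 8, 6: 9}  # memorized values of 3 + x

def three_plus_x(x):
    if x in MEMORIZED:
        return MEMORIZED[x]
    nearest = min(MEMORIZED, key=lambda k: abs(k - x))
    return MEMORIZED[nearest]   # e.g. three_plus_x(20) -> 9, which is wrong
```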
This kind of thing — organized knowledge based on pointwise factoids, or interpolations between pointwise factoids — doesn't generalize very well. It's kind of like the way LLMs add numbers. It looks like abstraction, but it's a relatively weak form of abstraction. It may be inaccurate; it may not work on points that are far from the points you've seen before.
And the next degree of abstraction is to turn your organized knowledge into models that generalize strongly. A model is not an interpolation between factoids anymore. It's a concise, causal way of processing inputs to obtain the correct output. A model of addition, using just binary operations, returns the correct result 100% of the time. It's not approximate, and it works with any inputs whatsoever, regardless of how large they might be.
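A minimal version of that kind of model in Python looks something like this — exact addition built from binary operations alone (for non-negative integers here):

```python
def add(a: int, b: int) -> int:
    """Exact addition from binary operations only: XOR gives the carry-less
    sum, AND plus a left shift gives the carries. Loops until no carry is
    left. Works for any non-negative integers, however large."""
    while b != 0:
        carry = (a & b) << 1
        a = a ^ b
        b = carry
    return a

assert add(3, 4) == 7
assert add(123456789, 987654321) == 1111111110
```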
So this is strong abstraction. LLMs, as we know, still fall short of that. But the next stage after that would be the ability to generate abstractions autonomously. That's how you're going to be able to handle novel problems — things you've not been prepared for. And that's what intelligence is.
The last stage is being able to do so in a way that's maximally information-efficient. That would be AGI. It means you should be able to master new tasks using very little experience, very little information about the task. Not only do you display high skill at the task — meaning the model you produce generalizes strongly — but you only had to look at a few examples, a few situations, to produce that model.
That's the holy grail of AI. And if you want to situate LLMs on this spectrum of abstraction, they're somewhere between organized knowledge and generalizable models. They're clearly not quite at the model stage, as per the limitations we just discussed. If LLMs were at the model stage, they could actually add numbers or sort lists.
But they have a lot of knowledge, and that knowledge is structured in a way that lets them generalize some distance from previously seen situations. It's not just a collection of pointwise factoids. And if you solved all the limitations of LLMs — hallucinations, brittleness, and so on — you would get to the next stage.
But in order to get to actual intelligence, to on-the-fly model synthesis, there's still a massive jump. You cannot just scale the current approach and get there; you actually need brand-new directions. And of course, AGI after that is still a pretty long way off. So how do we build abstraction in machines? Let's take a look at how abstraction works. I said that intelligence is sensitivity to analogies, but there's actually more than one way to draw analogies. There are two ways.
There are two key categories of analogies, from which arise two categories of abstraction: there's value-centric abstraction and there's program-centric abstraction. They're pretty similar to each other — they mirror each other. They're both about comparing things and then merging individual instances into common abstractions by erasing certain details about the instances that don't matter.
So you take a bunch of things, you compare them to each other, you erase the stuff that doesn't matter, and what you're left with is an abstraction. The key difference between the two is that the first one operates in a continuous domain and the other one operates in a discrete domain. Value-centric abstraction is about comparing things via a continuous distance function — like dot products in an LLM, for instance, or the L2 distance.
And this is basically what powers human perception, intuition, and pattern recognition. Meanwhile, program-centric abstraction is about comparing discrete programs, which are graphs. And instead of computing distances between graphs, you are looking for exact subgraph isomorphisms — exact subgraph matching.
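A toy side-by-side of the two kinds of comparison — continuous distance on one side, exact structural matching on the other (the vectors and mini-programs are made up for illustration):

```python
import numpy as np

# Value-centric: compare instances with a continuous distance / similarity.
a = np.array([1.0, 0.2, 0.0])
b = np.array([0.9, 0.3, 0.1])
cosine_similarity = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Program-centric: compare discrete programs (here, nested tuples standing in
# for expression graphs) by looking for an exact shared sub-structure.
def contains(program, fragment):
    if program == fragment:
        return True
    if isinstance(program, tuple):
        return any(contains(arg, fragment) for arg in program[1:])
    return False

p1 = ("add", ("mul", "x", 2), 1)     # x * 2 + 1
p2 = ("sub", ("mul", "x", 2), "y")   # x * 2 - y
shared_subprogram = ("mul", "x", 2)
both_contain_it = contains(p1, shared_subprogram) and contains(p2, shared_subprogram)
# both_contain_it == True: the two programs share the abstraction "x * 2"
```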
So if you ever hear a software engineer talk about abstraction, this is actually what they mean — when you're refactoring something to make it more abstract, that's what you're doing. And both of these forms of abstraction are driven by analogy-making; they're just different ways of making analogies.
Analogy-making is the engine that produces abstraction. Value-centric analogy is grounded in geometry — you compare instances via a distance function — and program-centric analogy is grounded in topology — you're doing exact subgraph matching. And the whole of cognition arises from an interplay between these two forms of abstraction. You might also remember the left-brain versus right-brain analogy here, or the type 1 versus type 2 thinking distinction from Kahneman.
Of course, the left-brain versus right-brain stuff is just an image — it's not how lateralization of cognitive function actually works in the brain — but it's a fun way to remember it. Transformers are actually great at type 1, at value-centric abstraction. They do everything that type 1 is effective for: perception, intuition, pattern recognition.
So in that sense, transformers represent a major breakthrough. But they are not a good fit for type 2 abstraction, and that is where all the limitations we listed come from. This is why they cannot add numbers, or why they cannot infer from "A is B" that "B is A" — even with a transformer trained on all the data on the internet. So how do you go forward from here? How do you get to type 2? How do you solve problems like ARC-AGI, or any reasoning or planning problem?
The answer is that you have to leverage discrete program search, as opposed to purely manipulating continuous, interpolative embedding spaces learned with gradient descent. There's an entirely separate branch of computer science that deals with this, and in order to get to AGI, we have to merge discrete program search with deep learning. So, quick intro: what is discrete program search exactly? It's basically combinatorial search over graphs of operators taken from a domain-specific language, a DSL.
There are many flavors of that idea, like genetic programming, for instance. And to better understand it, you can draw a side-by-side analogy between what machine learning does and what program synthesis does. In machine learning, your model is a differentiable parametric function; in program synthesis, it's a graph of operators taken from a DSL. In ML, the learning engine is gradient descent; in PS, it's combinatorial search. And in ML, you have a continuous loss function as your feedback signal,
while in PS you only have a binary correctness check. The big hurdle in ML is data density: your model is a curve, so to fit it you need a dense sampling of the problem space. Meanwhile, PS is extremely data-efficient — you can fit a program using just a couple of examples. But the key hurdle is combinatorial explosion: the size of the program space you have to search through to find the correct program increases
combinatorially with DSL size and program length. Program synthesis has been the most successful approach on ARC-AGI so far, even though it's just baby steps. All the top program synthesis solutions on ARC follow the same template: basically brute-force program search. Even though that's very primitive, it still outperforms the state-of-the-art LLMs — with much less compute, by the way.
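To make that template concrete, here's a minimal, hypothetical sketch — a toy DSL of integer operations standing in for grid transformations, searched exhaustively against the demonstration pairs, with only a binary correctness check as feedback:

```python
from itertools import product

# Toy DSL of unary operations (stand-ins for grid transformations in ARC).
DSL = {
    "inc":    lambda x: x + 1,
    "double": lambda x: x * 2,
    "square": lambda x: x * x,
    "negate": lambda x: -x,
}

def run(ops, x):
    for name in ops:
        x = DSL[name](x)
    return x

def brute_force_search(examples, max_depth=3):
    """Enumerate every operator sequence up to max_depth; return the first
    one that maps every demonstration input to its output. The search space
    grows as |DSL| ** depth -- the combinatorial explosion problem."""
    for depth in range(1, max_depth + 1):
        for ops in product(DSL, repeat=depth):
            if all(run(ops, i) == o for i, o in examples):
                return ops
    return None

# Two demonstrations are enough to pin down the program: double, then inc.
print(brute_force_search([(3, 7), (5, 11)]))  # -> ('double', 'inc')
```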
So now we know what the limitations of LLMs are, and we know where they come from: LLMs are great at type 1, but they lack type 2. So where do we go next? The answer is that we have to merge machine learning — this sort of type 1 thinking — with the type 2 thinking provided by program synthesis. And I think that's really how intelligence works in humans. That's what human intelligence is really good at, that's what makes us special: we combine perception and intuition with explicit step-by-step reasoning. We really combine both forms of abstraction. For instance, when you're playing chess,
you're using type 2 when you calculate, step by step, a few specific interesting moves. But you're not doing this for every possible move, because there are far too many of them — that's combinatorial explosion. You only do this for a handful of options, and you use your intuition, which you built up by playing lots of games, to narrow down the discrete search that you perform.
So when you're calculating, you're merging type 1 and type 2. And that's why you can actually play chess using a very small quantity of resources compared to what a computer needs. This blending of type 1 and type 2 is where we should take AI next: combining deep learning and discrete search into a single, unified approach.
How does that work? Well, the key type 2 technique is discrete search over a space of programs, and the key wall that you run into is combinatorial explosion. Meanwhile, the key type 1 technique is curve fitting, and generalization via interpolation: you embed lots of data into an interpolative manifold, and this manifold can make fast but approximate judgment calls about the target space.
So the big idea is to leverage these fast but approximate judgment calls to fight combinatorial explosion. You use them as a form of intuition about the structure of the program space that you're trying to search over, and you use that intuition to make the search tractable.
A simple analogy, if that sounds a little too abstract, is drawing a map. You take a space of discrete objects with discrete relationships that would normally require combinatorial search — pathfinding in a metro network is a good example of that kind of problem — and you embed these discrete objects and their relationships into a geometric manifold, where you can compare things via a continuous distance function and make fast but approximate inferences about relationships.
You can pretty much draw a line on this map, look at what it intersects, and that gives you a rough candidate path that restricts the set of discrete paths you have to look at one by one. This enables you to keep combinatorial explosion in check.
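Here's a toy sketch of that map idea with made-up stations and coordinates: the continuous step shortlists stations near the straight line between origin and destination, and the exact discrete search then only runs inside that shortlist:

```python
import numpy as np

# Hypothetical stations with 2D coordinates (the geometric embedding) and
# discrete connections (the relationships that normally require search).
stations = {
    "A": (0.0, 0.0), "B": (1.0, 0.2), "C": (2.0, -0.1),
    "D": (3.0, 0.1), "E": (1.2, 3.0), "F": (2.5, 2.8),
}
edges = {("A", "B"), ("B", "C"), ("C", "D"), ("B", "E"), ("E", "F"), ("F", "D")}

def near_line(point, start, end, tol=0.5):
    """Fast, approximate filter: is `point` within `tol` of segment start-end?"""
    p, s, e = (np.array(v) for v in (point, start, end))
    t = np.clip(np.dot(p - s, e - s) / np.dot(e - s, e - s), 0.0, 1.0)
    return np.linalg.norm(p - (s + t * (e - s))) <= tol

def neighbors(node):
    return [b if a == node else a for a, b in edges if node in (a, b)]

def find_path(current, goal, allowed, visited=()):
    """Exact discrete search (DFS), restricted to the `allowed` shortlist."""
    if current == goal:
        return list(visited) + [current]
    for nxt in neighbors(current):
        if nxt in allowed and nxt not in visited:
            path = find_path(nxt, goal, allowed, visited + (current,))
            if path:
                return path
    return None

origin, goal = "A", "D"
shortlist = {name for name, xy in stations.items()
             if near_line(xy, stations[origin], stations[goal])}
print(shortlist)                           # contains A, B, C, D; E and F are pruned
print(find_path(origin, goal, shortlist))  # ['A', 'B', 'C', 'D']
```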
But of course you cannot draw maps of every kind of space — program space, for instance, is very nonlinear. In order to use deep learning on a problem, you need two things: you need an interpolative problem — it needs to follow the manifold hypothesis — and you need lots of data. If you look at a single ARC task, for instance, it's not interpolative, and you only have something like two to four examples. So you cannot use deep learning directly.
You cannot solve it purely by reapplying a memorized pattern either, so you cannot use an LLM. But if you take a step lower down the scale of abstraction and look at core knowledge — the core knowledge systems that ARC is built upon — each core knowledge system is interpolative and could be learned from data. And of course you can collect lots of data about them. So you could use deep learning at that level, to serve as a perception layer that parses the ARC world into discrete objects.
And likewise, if you take a step higher up the scale of abstraction and look at the space of all possible ARC tasks and all the possible programs that solve them, then again you will find continuous dimensions of variation. So you can leverage interpolation in that space to some extent — you can use deep learning there to produce intuition over the structure of the space of ARC tasks and the programs that solve them.
So based on that, I think there are two exciting research directions for combining deep learning and program synthesis. The first one is leveraging discrete programs that incorporate deep learning components — for instance, using deep learning as a perception layer to parse the real world into discrete building blocks that you can feed into a program synthesis engine.
You can also add symbolic add-ons to deep learning systems, which is something I've been talking about for a very long time, but it's actually starting to happen now with things like external verifiers used alongside LLMs and so on.
And the other direction is deep learning models used to inform discrete search and improve its efficiency — using deep learning as a driver, as a guide, for program synthesis. For instance, it can give you intuitive program sketches to guide your program search, or it can reduce the space of possible branching decisions you need to consider at each node, and so on.
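As a toy sketch of that second direction, here's the brute-force enumerator from earlier turned into a guided search. The op_scores function is a hand-written, hypothetical stand-in for where a trained model's intuition would plug in:

```python
from heapq import heappush, heappop

# Same toy DSL as before; in a real system the operators would act on grids.
DSL = {
    "inc":    lambda x: x + 1,
    "double": lambda x: x * 2,
    "square": lambda x: x * x,
    "negate": lambda x: -x,
}

def run(ops, x):
    for name in ops:
        x = DSL[name](x)
    return x

def op_scores(examples):
    """Hypothetical stand-in for a learned prior over operators: a crude
    heuristic that favors growth operators when outputs exceed inputs."""
    grows = all(o > i for i, o in examples)
    return {name: (0.9 if grows and name in ("inc", "double") else 0.1)
            for name in DSL}

def guided_search(examples, max_depth=3):
    """Best-first search over operator sequences: shorter, higher-prior
    programs are expanded first, so low-prior branches are rarely visited."""
    scores = op_scores(examples)
    frontier = [(0.0, 0.0, ())]  # (priority, cumulative prior, program)
    while frontier:
        _, prior, ops = heappop(frontier)
        if ops and all(run(ops, i) == o for i, o in examples):
            return ops
        if len(ops) < max_depth:
            for name in DSL:
                new_ops, new_prior = ops + (name,), prior + scores[name]
                heappush(frontier, (len(new_ops) - new_prior, new_prior, new_ops))
    return None

print(guided_search([(3, 7), (5, 11)]))  # -> ('double', 'inc')
```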
So what would that look like on ARC-AGI? I'm going to spell out for you how you could crack ARC-AGI and maybe win a million dollars. There are two directions you can go. First, you can use deep learning to draw a map of grid state space — grid space.
So, in the limit, this solves program synthesis because you take your initial grid input, you embed it on your manifold, then you look at the grid output, you embed it, and then you draw a line between the two points on your manifold, and you look at the grids that it interpolates.
And this gives you, approximately, the series of transformations to go from input to output. You still have to do local search around them, of course, because this is fast but very approximate — it may not be correct. But it's a very good starting point: you are turning program synthesis into a pure interpolation problem. The other direction you can go is program embedding. You can use deep learning to draw a map of program space this time, instead of grid space.
And you can use this map to generate discrete programs and make your search process more efficient. A very good example of how you can combine LLMs with discrete program search is a paper on hypothesis search with language models. It uses an LLM to first generate a number of hypotheses about an ARC task, in natural language.
Then it uses another LLM to implement candidate programs corresponding to each hypothesis, in Python. And by doing this, they get roughly a 2x improvement on ARC, so that's very promising. Another very good example is the submission from Ryan Greenblatt on the ARC-AGI public leaderboard. It uses a very sophisticated prompting pipeline based on GPT-4o, where GPT-4o generates candidate Python programs to solve ARC tasks.
He then runs an external verifier — he's generating thousands of candidate programs per task — and also uses it to refine programs that seem to be getting close to what you want. This scores 42% on the public leaderboard, and that's the current state of the art there.
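For intuition, here's a stripped-down, hypothetical sketch of that kind of generate-and-verify loop. The generate_candidates function stands in for a real LLM prompting pipeline; here it just returns two hand-written candidates so the loop runs end to end on a toy task:

```python
toy_task = {
    "train": [{"input": [[0, 1], [1, 0]], "output": [[1, 0], [0, 1]]}],
    "test":  [{"input": [[3, 0], [0, 4]]}],
}

def generate_candidates(task, n):
    # Hypothetical stand-in: a real pipeline would prompt an LLM for
    # thousands of candidate Python programs per task.
    return [
        "def solve(grid):\n    return grid",                          # identity (wrong)
        "def solve(grid):\n    return [row[::-1] for row in grid]",   # mirror rows
    ][:n]

def passes_demonstrations(source, task):
    """External verifier: keep a candidate only if it reproduces every
    demonstration pair exactly -- the binary correctness check again."""
    namespace = {}
    try:
        exec(source, namespace)
        solve = namespace["solve"]
        return all(solve(p["input"]) == p["output"] for p in task["train"])
    except Exception:
        return False

def solve_task(task, n_candidates=1000):
    for source in generate_candidates(task, n_candidates):
        if passes_demonstrations(source, task):
            namespace = {}
            exec(source, namespace)
            return [namespace["solve"](p["input"]) for p in task["test"]]
    return None

print(solve_task(toy_task))  # -> [[[0, 3], [4, 0]]]
```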
So again, here's where we are. We know that LLMs fall short of AGI: they're great at type 1 thinking, but they lack type 2, and as long as progress on type 2 is stalled, progress towards AGI is stalled. The limitations that we are dealing with in LLMs are still the exact same ones we were dealing with five years ago. We need new ideas, we need new breakthroughs. And my bet is that the next breakthrough will likely come from an outsider, while all the big labs are busy training bigger LLMs. Maybe it could even be someone in this room. Maybe you have the new ideas. So, see you on the leaderboard for ARC-AGI. Thank you.