¶ Introduction: Robert Lange, Sakana AI and Shinka Evolve
So I think a lot of analogies from evolution transfer to scientific research, right? In the sense that we traverse a tree of different ideas or different experiments, and then in a paper we report one path through that tree.
When we run LLMs autonomously, they tend to just... nothing interesting happens.
But oftentimes innovation for a specific problem might require first inventing a different problem, right? Automatically coming up with this reduction, or this recursive nature of problem solving, is something these systems right now don't necessarily have built in intrinsically, right? Oftentimes it's easier to generate a lot of solutions than to actually hard-verify them, right?
The reason why I'm not that worried yet about labour market disruption is I still believe deeply that humans are the source of deep understanding and creativity in the world. If I didn't believe that, I would be very worried.
So I think it's gonna be an amplifier of these latent dimensions humans are great at, right?
And I think one of the Rubicon moments is when the new Transformer architecture or something massive is discovered by AI and we're all using it. NVIDIA GTC starts Monday in San Jose and it's free to attend virtually online. There's already been a leak this week of something called Nemo Claw, which is an open source agent platform. And if it's real, it could be one of the bigger announcements this year. So it's definitely worth watching Jensen's keynote for that alone.
I'm giving away a DGX Spark. NVIDIA just hiked the price by seven hundred dollars. You probably heard about these memory shortages, right? So yeah, it's now forty-seven hundred dollars, which is very, very expensive. And Merve from Hugging Face, by the way, she got one for her birthday and she said she literally cried. So it's a really cool bit of kit. If you register through my link in the description and you attend at least one session, then you are in the draw. This is a massive conference. Physical AI and robotics are gonna be the breakout theme. And Jensen does the keynote Monday at eleven AM Pacific. The link is in the description. Don't miss it.
Robert Lange, it's amazing to have you on MLST.
Thank you Tim.
So are you working for Sakana?
Sakana AI is a Japanese AI startup working mostly on AI for Japan and at the same time exploring novel or ambitious ideas on the research side.
It's been around for over a year now. You're one of the founding researchers, right?
Exactly. So Sakana has been around for almost two years now, like one and three quarters, I would say. And it's pretty fascinating to look back at the early days and at how much the company has changed organizationally. But in spirit we're trying to embrace Ken Stanley's open-endedness idea and explore many different ideas which might not get the resources right now in the ML community in general.
And we've got a few interviews coming out with Sakana that we filmed here in Japan. So I won't spoil the surprise, but the CEO is David Ha. And you know, there are these giants out there like Clune and Stanley. David Ha is one of these people.
David's work has had a lot of influence on my personal PhD, right? He did a lot of fascinating work on hypernetworks and modulation in neural networks, but also on evolutionary computation and evolutionary optimization. And that also painted my path during the PhD.
You've released a paper called ShinkaEvolve, and we were just saying that it kind of means "evolve evolve", because in Japanese shinka means evolve. But that's quite common. It's a common thing to do, to have these multilingual double namings in Japanese.
¶ ShinkaEvolve: Towards Open-Ended And Sample-Efficient Program Evolution
Just before we get there, we interviewed the AlphaEvolve team and I also interviewed Jeremy Berman a few weeks ago. And your paper is very much like a more sophisticated version of those, in the sense that it's using language models to generate programs and it's taking an evolutionary approach where we generate the program, we refine the generated program, we have an evaluator, and we do this over several steps.
And your approach does many things that the other ones don't do. Tell me about the paper.
First off, of course, this was partially inspired by AlphaEvolve. I think it's great work. I know Alex and Matej and I think they're doing incredible science.
¶ AlphaEvolve: A Coding Agent for Scientific and Algorithmic Discovery
One thing that is important about using all of these evolutionary LLM-driven methods is sample efficiency, right? Many of these systems sample, let's say, a thousand programs for a given task. And what we tried to do with ShinkaEvolve was essentially to cut down costs, as well as computation and evaluation time, by introducing a set of technical innovations to this evolutionary search.
And we showed that it's possible, with very few program evaluations, to improve upon, for example, the canonical circle packing result that they showed in their paper. And more generally speaking, I think we're right now at an inflection point where these evolutionary-driven LLM systems can really revolutionize scientific discovery. And we hope to have made a step forward to making this more democratically accessible, right? So the code is available open source and, by its sample-efficient nature, we hope that many people can interact with the system and can make their own scientific discoveries.
Yeah, that's actually a really important point, because I suppose we can use these foundation models. And first of all, isn't it just fascinating to reflect on that? We have these amazing models out there that we can access, like GPT-5 and Grok 4, and they are so much better when you get them to refine their solution in several steps. Why is that? I mean, I suppose a naive question would be: why aren't they just good out of the box?
Potentially, with enough random samples, right? It's this monkey typing on the keyboard; they would potentially be able to get there, right? But in principle, that's coming back to the principles of evolution, right? In the sense that you need to collect a bunch of stepping stones first and then build on top of them to really find innovations, or to tune innovations down the line. And I think language models with the right evolutionary harness are extremely powerful in terms of scaling up to make discoveries.
¶ Darwin Godel Machine: Open-Ended Evolution of Self-Improving Agents
And I think Jeremy's work, as well as the AlphaEvolve paper, as well as work we've done on the Darwin Gödel Machine, for example, shows that this stepping stone accumulation, plus iterative verification and collecting information and evidence from the real world or a real or synthetic evaluator, is really important.
Very cool. And stepping stone collection: this came from Kenneth Stanley and his wonderful book, Why Greatness Cannot Be Planned. And he said that it's better to have systems that don't converge. So in natural evolution we are just trying all of these different things.
And greatness quite often follows a diverse path, which means you have to do things which initially seem quite stupid and then later on turn out to be incredibly useful. We're trying to design algorithms that allow for a population of slightly weird things, and then we lock in and converge a little bit. So we're still converging, though. We're still building systems that don't diverge forever.
What are we losing?
One thing I find extremely important after having done ShinkaEvolve is this "problem problem", right? With all of these systems so far, maybe except for the AI Scientist, which we can also talk about, the problem is given, right? You have an evaluator, you have a correctness checker, and you sample programs only on that single problem, right?
But oftentimes innovation for a specific problem might require first inventing a different problem, right? For example, I think in the matrix multiplication result that the AlphaEvolve people show, you can recursively apply the algorithm to larger matrices. So it's actually an important result, right?
But automatically coming up with this reduction, or this recursive nature of problem solving, is something these systems right now don't necessarily have built in intrinsically, right? So I think going forward it's gonna be really important to not only do open-ended optimization of solutions, but to do the co-evolution of problem and solution together in order to collect even more diverse stepping stones and to really kick off this open-ended process. Because also, to me, one of the big life goals or achievements I would want to see is really having a process that can run not only for, let's say, a week or many weeks, but potentially even for years, right? Collecting even more diverse, interesting stepping stones.
Yeah, I spoke to Joel Lehman and he was talking about Knightian uncertainty, which is that machine learning algorithms aren't very good with unknown unknowns. And in a sense the unknown unknown is talking about these stepping stones that might be useful later.
¶ Paired Open-Ended Trailblazer (POET)
And when we run these algorithms at the moment, and it's the same with LLMs and reasoning systems, they're very, very good when we give them a specific thing. And what you're pointing to is that we might need to invent new unrelated problems and find solutions which might then be related to what we're trying to do. So that feels like a bit of a catch-22 situation.
Right. So we're saying, you know, circle packing: here's my evaluation function, and I want you to diversify and then converge towards this solution. I had the same thought with Genie, by the way, that it gives you exactly what you ask for. So you put a prompt in, you know, like a Swiss lake with boats on the water and mountains on the side, and I was thinking, where are the birds? Oh, I forgot to put birds in the prompt.
¶ PowerPlay: Training an Increasingly General Problem Solver by Continually Searching for the Simplest Still Unsolvable Problem
Right. So how can we meaningfully build systems that actually kind of bring in other unknown things that might be useful?
I think one inspiration, or thing I would personally want to research, are systems like those outlined in PowerPlay or POET, by Schmidhuber, by Jeff Clune and others, right? Where there is essentially a set of tasks and a solution generator, and both of them co-evolve in this almost autocurriculum, play-like style, right? In POET the natural first application was reinforcement learning. But I think this can now be broadened to science more generally, right? At least when there's a simulator available for running these evaluations.
¶ Automated Capability Discovery via Foundation Model Self-Exploration
And by doing such a co-evolution, you always try to max out the capabilities of that generator while increasing this convex hull, or potentially even more diverse problems, while doing so.
I know there's always the thought that even with POET, which was this thing where you had a load of environments and agents, and the environments were complexified so the agents would have a kind of effective curriculum to learn things in increasing complexity...
Even then, isn't there a kind of design bias in the system, where there's some code somewhere which complexifies the environment step by step? And wouldn't that also just be designed by the humans, so it would also just give you exactly what you ask for?
Ultimately this comes down to the hypothesis that language models can potentially do extrapolation or interpolation, right? In the sense that even though these things might in the end be designed by humans, there are many unknown unknowns that we humans didn't think of while designing them, right? So potentially it is possible for an LLM to find a novel discovery simply by us not having thought about it before, right?
When we run LLMs autonomously, they tend to just... nothing interesting happens. Depending on the prompt you give them, they'll go a few steps in that direction and then no new interesting novelty emerges. And I think even if you wire them up agentically with environmental feedback, they still seem quite parasitic on their starting conditions. With an LLM, could we build a system which actually adapted to novelty, that could actually discover new things?
I think it really also depends on what you give the LLM as a starting point, right? For example, in ShinkaEvolve we saw from time to time that if you give an initial solution program which is already pretty optimized on the problem at hand, you still get stuck in a local optimum, right? Where not a lot of novelty is introduced. While if you start off from an impoverished solution, there's much more room for diversity. And I think this is coming back to what I did before in my research, namely meta-learning. It's this classical trade-off where you can either start out from something very unconstrained, from a very simple solution, and give much more room for the optimization. But this might actually require open-endedness and a long time to find a good solution. Or you start out from something that is already very constrained by inductive biases, let's say, and then you might be much more efficient in terms of convergence, but you don't have this sort of open-ended...
Yes, and I suppose where we want to get to is building systems which are not designed by humans. So for example, if I'm leveraging my deep understanding, you know, LLMs are really good if you understand something deeply. And similarly we could kick off a ShinkaEvolve run and we could put a starting solution in there which leverages my understanding. We want to have AI systems that...
We should talk about the evolutionary approach, right? So to maintain diversity, you had a population of programs and they were separated into islands. Tell me about that.
The way ShinkaEvolve works, similar to AlphaEvolve, is that you keep an archive, like a database of programs, and then you sample parent programs together with a set of inspiration programs. And then you ask an LLM to make an improvement to that program, right? So to provide code edits, or rewrite an entire program, or potentially even cross over different programs.
And then you query the LLM, you get a program out, and you evaluate it on the problem at hand, right? For example, increasing the sum of the radii of a bunch of circles in a square. You run this, each time collecting evidence from the evaluator, adding it to the database, and then repeating this process. And you don't do this sequentially, but in parallel for many different programs.
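The loop described above can be sketched in a few lines. This is only an illustration under stated assumptions, not ShinkaEvolve's actual code: `Program`, `sample_parent`, `evolve`, `toy_propose` and `toy_evaluate` are hypothetical names, the rank-based parent sampling is one possible choice, the LLM call is replaced by a random perturbation, and the loop runs sequentially rather than in parallel.

```python
import random
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Program:
    code: str                        # program source (here: a toy string)
    score: float                     # fitness returned by the evaluator
    parent_id: Optional[int] = None  # index of the parent in the archive


def sample_parent(archive: List[Program], rng: random.Random) -> int:
    """Rank-based sampling: better programs are picked more often,
    but every archived program keeps a non-zero chance."""
    order = sorted(range(len(archive)), key=lambda i: archive[i].score)
    weights = [0] * len(archive)
    for rank, idx in enumerate(order):
        weights[idx] = rank + 1  # worst gets weight 1, best gets len(archive)
    return rng.choices(range(len(archive)), weights=weights, k=1)[0]


def evolve(initial_code, propose, evaluate, generations, n_inspirations=2, seed=0):
    """Archive-based loop: sample parent + inspirations, mutate, evaluate, store."""
    rng = random.Random(seed)
    archive = [Program(initial_code, evaluate(initial_code))]
    for _ in range(generations):
        parent_id = sample_parent(archive, rng)
        others = [p for i, p in enumerate(archive) if i != parent_id]
        inspirations = rng.sample(others, k=min(n_inspirations, len(others)))
        # In a real system this would be an LLM call returning edited code.
        child_code = propose(archive[parent_id], inspirations)
        archive.append(Program(child_code, evaluate(child_code), parent_id))
    return archive


# Toy stand-ins (assumptions): a "program" is a number written as a string,
# and the evaluator rewards being close to 3.0.
def toy_evaluate(code: str) -> float:
    return -(float(code) - 3.0) ** 2


_noise = random.Random(1)


def toy_propose(parent: Program, inspirations: List[Program]) -> str:
    return repr(float(parent.code) + _noise.gauss(0.0, 0.5))


archive = evolve("0.0", toy_propose, toy_evaluate, generations=50)
best = max(archive, key=lambda p: p.score)
print(len(archive), best.code, round(best.score, 4))
```

The `parent_id` field is what turns the flat archive into the tree of programs that the conversation keeps coming back to.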
¶ Illuminating Search Spaces by Mapping Elites (MAP-Elites)
And each time a program is added, you essentially try to diffuse the knowledge that was collected by the program across the entire database, right? One way to think about this is you have a tree, where each node in the tree represents a program, and then you branch off of it based on the parent nodes, right? And interestingly, these approaches do tend to scale.
Yeah.
But ideally, we can make the scaling happen at a faster rate, right? And this is something we tried in ShinkaEvolve with a bunch of innovations, including model ensembling. So we're not using just Gemini, but basically all frontier model providers, and figuring out a smart way to use each model for a given parent, right? If you have a certain program, in some situations it might be better to use a GPT model, in other settings it might be better to use a Gemini model. And we introduce an adaptive prioritization scheme that can adapt the evolutionary algorithm on the fly while running it. And this also comes back to the naming, right? ShinkaEvolve, "evolve evolve", kind of means that this evolutionary algorithm that we apply using LLMs also co-evolves at the same time while we optimize.
And while we're on this circle packing problem: you had this plot showing how it converged, and it seemed to converge quite quickly. We'll show the plot on the screen now. So very quickly the performance jumped up and then it slowly converged. And you said in the paper that it was using, I think, three core innovations.
And my thinking was, if you ran this 50 times, would it be the same every single time? And to what extent is it thinking outside the box? You know, Sebastien Bubeck is always posting on Twitter about how GPT-5 has discovered new things. And there's always the question of, well, is it just searching the internet? Is it just finding things that have been found before and combining things together in a new way? But could it really think outside the box?
Mm. Yeah. I think this is almost a subjective question, right? First off, I don't know all problems on the internet that try doing circle packing, right? But what I can see in the tree that we also depict is that there's, for example, a crossover operation between two programs happening, where different concepts are combined, right? One important part is, for example, the initialization of the circles. Another one is the optimization, so basically a constrained optimization program is executed. And then the final part is basically a reheating stage, right? Where noise is added and a bit more is squeezed out. And to me, this propagation of information through the tree is one that's really, really fascinating, right? Where in some sense these stepping stones are actually used in a complementary fashion, right?
And with regards to rerunning the program multiple times: of course, there's some stochasticity in it, right? We're using language models, and due to the queuing and device scheduling on their server side, we can't get rid of all the noise. We've seen that, at least for the general quality of the solution, so what is arrived at afterwards, it is possible to reobtain this. But sometimes with a different program, or most of the time, just by stochasticity, right? For many problems there's not one solution that achieves that score, but there's a spectrum, or a region, let's say, in the program space that achieves the same, right?
I think one thing that was very interesting about the circle packing problem, also coming back to the problem problem that I discussed initially, was that originally we used a formulation where the correctness is checked with a very tiny amount of slack, right? So the circles could overlap a tiny little bit.
And then afterwards we reduced the radii and the solution was exact, right? This didn't change the score by too much, so it's still state of the art. But it was essentially a proxy problem. We then reran ShinkaEvolve on the exact setting and we found that it took a little bit longer to actually obtain the same quality of solution.
So I think this already points a little bit in the direction of what I discussed in the beginning. Sometimes surrogate problems might actually be extremely valuable in making such discoveries. And having an automated way of designing these surrogate problems efficiently might be something really important going forward.
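To make the "slack" idea concrete, here is a small sketch of a circle packing checker with a tolerance, followed by the radius reduction that turns a slightly overlapping packing into an exact one. The function names, the slack value, the safety margin and the two-circle example are all assumptions for illustration; the conversation does not say what tolerance or procedure was actually used.

```python
import math
from typing import List, Tuple

Circle = Tuple[float, float, float]  # (x, y, radius) inside the unit square


def max_violation(circles: List[Circle]) -> float:
    """Largest constraint violation: how far any circle pokes out of the
    unit square or overlaps another circle. A value <= 0 means exactly valid."""
    worst = -math.inf
    for i, (x, y, r) in enumerate(circles):
        # Containment in [0, 1] x [0, 1].
        worst = max(worst, r - x, r - y, x + r - 1.0, y + r - 1.0)
        # Pairwise non-overlap: sum of radii must not exceed centre distance.
        for (x2, y2, r2) in circles[i + 1:]:
            worst = max(worst, r + r2 - math.hypot(x - x2, y - y2))
    return worst


def is_valid(circles: List[Circle], slack: float = 0.0) -> bool:
    """Proxy problem when slack > 0, exact problem when slack == 0."""
    return max_violation(circles) <= slack


def shrink_to_exact(circles: List[Circle], margin: float = 1e-12) -> List[Circle]:
    """Reduce every radius by the worst violation (plus a tiny margin against
    floating-point rounding) so that all constraints hold without slack."""
    v = max(0.0, max_violation(circles))
    return [(x, y, r - v - margin) for x, y, r in circles]


def score(circles: List[Circle]) -> float:
    return sum(r for _, _, r in circles)


# Two circles that overlap by about 2e-4: fine under slack, invalid exactly.
proxy = [(0.25, 0.5, 0.2501), (0.75, 0.5, 0.2501)]
exact = shrink_to_exact(proxy)
print(is_valid(proxy, slack=1e-3), is_valid(proxy), is_valid(exact))
print(score(proxy), score(exact))
```

Shrinking every radius by the worst violation is deliberately crude: it costs a little score everywhere, which matches the remark that the exact solution scored only slightly lower than the proxy one.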
That's absolutely fascinating. It reminds me of support vector machines, where we make the optimisation tractable by introducing slack variables, and you can think of that as a kind of surrogate problem. But then I'm thinking, well, would ShinkaEvolve or AlphaEvolve know to introduce a surrogate problem? Because, you know, as designers who understand, we can think outside the box and we can do stuff like that.
Because presumably if the fitness function had the constraint that there were no circle intersections, then it wouldn't occur to the algorithm to come up with a surrogate problem.
Exactly. Yeah. And we optimise for that problem. But when you think about humans, we're really, really good at inventing our own problems, right? Or reformulating the problem so that we can actually work with it, right? So I think a lot of the innovations in, let's say, mathematics come from taking a very different perspective on a problem, right? Taking number theory and applying it to linear algebra, or the other way around. And I think right now these systems are not yet at the point of...
Yes. And it reminded me, I spoke to Llion about this. You've got this Sudoku-Bench, and a lot of folks watch the Cracking the Cryptic YouTube channel, and that's exactly what they do. They invent new problems based on abstractions that capture the essence or aspects of the problem you're solving.
And then they do something which is similar to ShinkaEvolve. They do this kind of evolution where they take these different solutions and combine the best aspects of both of them and forge a divergent path to a new solution. And that seems to be the essence of what we need to do.
Yeah, for sure. I mean, there is some work also by Jeff Clune, Shengran Hu and Cong Lu on automated capability discovery. There they look at language models that generate tasks, right? But it's in a, let's say, unstructured way, in the sense that it's not done in order to enable the solution to one target problem, right? And I think making these connections is gonna be very fruitful down the line.
Very cool. Now the other thing: we'll show the graph on the screen, the evolutionary graph for the circle packing problem. I was looking at that, and first of all it looked incredibly parsimonious, which is good. It looked like it had found an optimal path to the solution very quickly.
And I was thinking in my mind, well, maybe there's some natural pattern, something about that which we could use in the abstract to guide the evolution in the future. But the other thing I'm thinking about is this.
Right now the problem with machine learning is that we don't really have semantics baked in. So what we're doing is we have a verifier, we're looking at the rewards, we're doing pattern exploration and we're taking steps towards the target.
And I love mechanistic forms of reasoning where we actually know something about what the program components mean. And the reason this is important is when we're merging together the best performing programs from two different islands. That's a kind of first-order interaction and it might not make sense to merge them together. It's wonderful that with LLMs you can give them any pair of programs and they will find a way to merge them together.
But wouldn't a more principled way be that there are some kind of semantic primitives here and we know they fit together? So this Lego analogy, that we're building up based on principles rather than forging a path based on the performance. Yeah.
That's a good point. One thing we do in ShinkaEvolve as well is we keep essentially a scratch pad. Each program is summarized, and then from the program summaries we keep a set of global insights, let's say, that were shared by or extracted from these programs. And then based off of this scratch pad, we construct meta recommendations that then become part of the system prompt, right?
So that way you can try to semantically grasp some of the discoveries. But a general problem, which is again task dependent, is that you thereby diffuse that knowledge across the tree, right? And sometimes you want things to be much more isolated, right? It's always a trade-off where you somehow have to find, for your problem, the right position on the spectrum of how much knowledge diffusion you want to have and how much, let's say, hard islands of programs you want to have, right?
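The diffusion-versus-isolation spectrum has a familiar knob in classic island-model evolutionary algorithms: the migration rate. The sketch below is a generic ring migration step, not ShinkaEvolve's mechanism (which is not described here); `migrate`, `migration_rate` and the tuple representation of programs are assumptions for illustration.

```python
import random
from typing import List, Tuple

Individual = Tuple[float, str]  # (score, program identifier)


def migrate(islands: List[List[Individual]], migration_rate: float,
            rng: random.Random) -> int:
    """Ring migration: with probability `migration_rate`, copy each island's
    best individual to the next island. Returns the number of copies made.

    migration_rate = 0.0 keeps islands fully isolated ("hard islands");
    migration_rate = 1.0 diffuses every island's best to its neighbour."""
    # Pick the bests first so one migration step cannot cascade around the ring.
    bests = [max(island, key=lambda ind: ind[0]) for island in islands]
    moved = 0
    for i, best in enumerate(bests):
        if rng.random() < migration_rate:
            target = islands[(i + 1) % len(islands)]
            if best not in target:
                target.append(best)
                moved += 1
    return moved


rng = random.Random(0)
islands = [[(1.0, "a")], [(2.0, "b")], [(3.0, "c")]]
frozen = migrate(islands, migration_rate=0.0, rng=rng)  # nothing moves
moved = migrate(islands, migration_rate=1.0, rng=rng)   # every best moves
print(frozen, moved, islands)
```

A shared scratch pad of insights acts like a much faster diffusion channel than this, since it reaches every island at once through the prompt.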
And we're trying to make steps in the direction of automatically adjusting this in an optimal way. But again, it's very problem sensitive. And then I think another point that you're already going into is Jeremy's solution to ARC-AGI, right? Doing solution evolution in the instruction space instead of the program space. I do think that this is something important and, like I said, with the construction of this meta scratch pad we're trying to do both at the same time. Again, it's problem dependent. I played around a little bit with ARC-AGI-1 and ARC-AGI-2. And I think on ARC-AGI-1 the transformation-program direction is actually quite effective, right?
Like Jeremy said, it's deterministic and it's easier to get a clear signal to improve on during your evolution process. While on others, like ARC-AGI-2, this whole semantic evolution seems to be more efficient. So I think ideally we can get a system that can automatically, in some sense, decide whether it wants to take a programmatic approach, in settings where that's actually feasible and easier to bootstrap off, or take the semantic approach of evolving instructions or LLM-driven input-output.
Yeah, it's so interesting, because, you know, a symbolic AI person would say: oh, I don't like connectionism because the only semantics in connectionism is this notion of similarity. It doesn't really understand things. So they would say, well, just start with an entity relationship graph and then build up using composition and first principles.
That doesn't work. Right. So we're using neural networks because they're incredibly flexible and they understand a lot of things about the world, but they don't have the kind of constraints that we want. So what we do is we use these tricks. Jeremy evolved program descriptions. On your program selection you had a semantic novelty detection, you know, using like a...
Embedding-based similarity.
So you had a kind of self-similarity matrix based on the cosines. And indeed you've got this meta scratch pad. So what we're seeing is this fascinating spectrum of possibilities, where still using neural networks you can imbue semantics using all of these different tricks, but they all come with trade-offs.
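A novelty check of the kind mentioned here can be sketched with plain cosine similarity between embedding vectors. This is an illustration only: the threshold of 0.95, the function names and the hand-written three-dimensional "embeddings" are assumptions; a real system would obtain the vectors from a code embedding model.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def is_novel(candidate: Sequence[float],
             archive_embeddings: List[Sequence[float]],
             threshold: float = 0.95) -> bool:
    """Accept a candidate only if it is not too similar to anything archived."""
    return all(cosine_similarity(candidate, e) < threshold
               for e in archive_embeddings)


def self_similarity(embeddings: List[Sequence[float]]) -> List[List[float]]:
    """Pairwise cosine similarities of all archived programs."""
    return [[cosine_similarity(a, b) for b in embeddings] for a in embeddings]


# Toy vectors standing in for program embeddings.
archive_embeddings = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
near_duplicate = [0.99, 0.05, 0.0]
different = [0.0, 0.0, 1.0]
print(is_novel(near_duplicate, archive_embeddings))
print(is_novel(different, archive_embeddings))
print(self_similarity(archive_embeddings))
```

Rejecting near-duplicates before evaluation is one way such a filter could save evaluator calls, which fits the sample-efficiency goal discussed earlier.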
Yeah, for sure. I think it's kind of interesting. We've had a long period of computer science where algorithms were designed by humans, right? Then we had this Andrej Karpathy Software 2.0 paradigm, where we trained neural networks that then performed a certain function. And now we're at this point where we're using LLMs to design algorithms, or solutions more generally. And I think, even though large frontier language models are extreme black boxes, let's say, or it's very hard to get a full mechanistic understanding of them, the outputs can be understood, right? The programs, the instructions. So I think it opens up a very new paradigm of doing research, or basically doing anything, if you think about it. But I think we're just at the starting point of figuring out the right user interface for...
So the other innovation in the paper was using UCB, which is upper confidence bound. It comes from the multi-armed bandit literature, which is this problem where you can pull these levers, and at the beginning you don't know which levers to pull, and over time you reduce your uncertainty and you can pull the ones that work. But there's this exploration-exploitation dilemma.
And you've implemented that for figuring out which LLM to use, so it could be Gemini, it could be, you know, Grok 4 or something.
We're using a model ensemble, right? To propose program mutations. And intuitively one could say the best frontier model on SWE-bench is always the best mutation proposer. But in practice that's actually not always the case, right? And in general it's extremely hard in this evolutionary setting to assign clear credit to a single model, right? For example, one improvement is implemented by GPT-5, and then the next one is implemented by Sonnet 4.5.
It's unclear whether the performance gain you get from the second mutation actually originated from GPT-5 collecting the first stepping stone or from Sonnet 4.5. Instead of uniformly sampling models, what we do is implement this bandit-based approach, where each model is basically one arm of a bandit. And then we look at how often this model improved the performance of a parent node by creating a mutation. And we then adjust this posterior probability to first explore all arms once, right? And then essentially change it over the course of time in order to prefer...
The great thing about using a UCB-like algorithm is that it actually has a theoretical regret bound, which means it's only logarithmically worse than the optimal switching path, if that makes sense. But if I understand correctly, UCB is based on a sort of global rating, like a mean score for every single LLM.
And I think what we want is to have more of a contextual switching decision, which means we know that for this particular program Gemini is better. And do I understand correctly that at the moment it might converge to a single frontier model, and then in a nuanced situation we might still get the wrong model?
So in general there is some amount of probability allocated to all models, right? So it's not like it can just peak on one model and then you stop using the others, right? So there's still a chance for open-endedness and serendipity, if you will. And in general, for the problems we consider, we haven't seen that one model clearly dominates all the others, right? We've seen that it really depends on the course
of this evolutionary process which model is better, and UCB, or the bandit approach that we take, dynamically adjusts this in an efficient
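[Editor's sketch] The bandit idea described above can be illustrated with textbook UCB1, where each LLM is one arm and the reward is whether its mutation improved on the parent. This is a minimal sketch under stated assumptions, not Shinka's actual selection rule: the class name, the binary reward, and the exploration constant `c` are all hypothetical, and plain UCB1 picks deterministically rather than keeping a probability on every model as described in the conversation.

```python
import math


class UCBModelSelector:
    """Textbook UCB1 over a set of LLM names (illustrative only)."""

    def __init__(self, models, c=1.0):
        self.models = list(models)
        self.c = c  # exploration strength (assumed value)
        self.pulls = {m: 0 for m in self.models}
        self.wins = {m: 0.0 for m in self.models}

    def select(self):
        # First explore every arm once.
        for m in self.models:
            if self.pulls[m] == 0:
                return m
        total = sum(self.pulls.values())

        def score(m):
            mean = self.wins[m] / self.pulls[m]  # global mean improvement rate
            bonus = self.c * math.sqrt(2 * math.log(total) / self.pulls[m])
            return mean + bonus

        return max(self.models, key=score)

    def update(self, model, improved):
        # Reward is 1 if the mutation beat its parent, else 0.
        self.pulls[model] += 1
        self.wins[model] += 1.0 if improved else 0.0
```

Because the score uses a single global mean per model, this sketch has exactly the limitation raised in the question above: it carries no information about which model suits which particular program.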
And would it be possible in the future to use an LLM to make this judgment?
Potentially. In that case you again think of the LLM as a surrogate model, right? In some sense, you can think of a Gaussian process as a surrogate regression model, and there has been some work showing that language models can act
as surrogate models. And the real question to me is how you represent the information to the LLM, right? In the sense that if you use the raw programs and their fitness evaluations, you quickly run out of context, right? So you need some amount of compression in order to present the information the right way to the LLM in order to do this prioritization of the models.
I hadn't appreciated how long the context is. I mean, I was thinking, you know, could we use an 8-billion-parameter Llama model and do active fine-tuning? So we're saying, I just ran this program on Grok and it got this score. Yeah. And then over time, for the given run of this evolution, it will kind of know that Grok is good at these problems.
Yeah, potentially. I'm not sure how efficient this fine-tuning is if we're only evaluating a hundred and fifty programs. But in principle one could imagine it. I think on the engineering side it's not necessarily the prettiest to do. Yeah, it could in fact happen. But I think for all of these things, we started out with the, let's say, most intuitive algorithmic component that we had, and UCB was
what really did the job here. And yeah, much credit to Edoardo Cetin, who introduced this to Shinka.
So let's talk about the diffs and the mutations. So we generate programs, and I think you folks were inspired a bit by AlphaEvolve. So they actually had this gating where you kind of gate the part of the code which is mutable. Tell me about all of that.
A program is just, let's say, a long string, right? And in order to make sure that certain parts which are essential to the evaluation, for example, and to the imports and so on are not deleted by the LLM mutations, there are so-called markers, which basically state which parts of the code are mutable and evolvable. And it's easy to programmatically make the protected parts actually immutable when you get a
proposal, so these will not be changed. So only the rest of the code snippet will be changed. We implement a type of rejection sampling with reflection, where if an LLM by chance, for example, tries to mutate this part, the proposal is going to be rejected and you resample a new one. And thereby you can somewhat mitigate certain security or safety problems and get robust mutations.
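[Editor's sketch] The marker and rejection-sampling mechanism described above can be illustrated as follows. This is a sketch under assumptions, not the Shinka implementation: the marker strings follow the `EVOLVE-BLOCK` convention known from AlphaEvolve and are assumed here, and `propose` is a hypothetical callable standing in for an LLM call that takes the parent program plus optional feedback text.

```python
START = "# EVOLVE-BLOCK-START"  # assumed marker text
END = "# EVOLVE-BLOCK-END"      # assumed marker text


def frozen_parts(source):
    """Return the lines outside the mutable blocks, markers included."""
    frozen, inside = [], False
    for line in source.splitlines():
        stripped = line.strip()
        if stripped == START:
            frozen.append(stripped)
            inside = True
        elif stripped == END:
            frozen.append(stripped)
            inside = False
        elif not inside:
            frozen.append(line)
    return frozen


def is_valid_mutation(parent, child):
    # A proposal is valid only if everything outside the markers is untouched.
    return frozen_parts(parent) == frozen_parts(child)


def mutate_with_rejection(parent, propose, max_attempts=3):
    """Resample proposals, feeding the rejection reason back (reflection)."""
    feedback = None
    for _ in range(max_attempts):
        child = propose(parent, feedback)
        if is_valid_mutation(parent, child):
            return child
        feedback = "Only edit code between the EVOLVE-BLOCK markers."
    return None  # give up after max_attempts rejected proposals
```

Comparing the frozen lines of parent and child keeps the check independent of how the proposal was produced, so it applies equally to a diff patch and to a full rewrite.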
One of the bigger questions, I think, is how you can turn this from a single-file mutation setup into a multi-file mutation setup, so working on entire codebases. In principle you can
represent many codebases in a single file, right? But the hierarchical structure might actually be useful. And there are some ideas from, let's say, Aider, this coding tool, where you construct a repository map and have some level of abstraction, but they also come again with positive and negative
basically. I love Aider, by the way. It feels that in the future the code generation systems will actually resemble ShinkaEvolve.
And if you think about it, it'll be using some kind of Git repo. Maybe Cursor already does this, because in Cursor you can restore previous checkpoints. But it can be exploring different branches and merging checkpoints together, and, you know, obviously you just say in natural language what you want to do.
But we didn't talk about mutation, by the way, so we just spoke about diffs, and there's also an option to do a full file rewrite. Exactly. But there's also this notion of crossover. So how does that work?
A small innovation on top of AlphaEvolve, where I believe they only use these diff-based mutations, is that here we wanted to have more flexibility to entirely rewrite the program, right? To come up with a completely different stepping stone, if you will. So again you can make parts of the code mutable, but instead of proposing, let's say, a patch to change certain parts of it, we essentially rewrite the entire program. And
this sometimes is helpful, right? It's not always a clear benefit, but it allows you to essentially get more diversity into the search, right? So this is one type of mutation next to this diff, patch-based approach.
And the other one is a crossover mutation, where we sample not only a single parent program but two different ones, and we ask the system to make a complementary improvement. And again, on some problems this is really helpful and on others it's not.
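[Editor's sketch] The three operators mentioned above (diff patch, full rewrite, crossover) could be mixed by sampling an operator per generation and then sampling the number of parents it needs. This is only an illustration: the function name, the operator labels, and the default weights are hypothetical and not taken from the paper.

```python
import random

DEFAULT_WEIGHTS = {"diff": 0.6, "full": 0.3, "cross": 0.1}  # assumed mix


def sample_mutation(archive, rng, weights=None):
    """Pick a mutation operator and the parent program(s) it acts on."""
    weights = weights or DEFAULT_WEIGHTS
    ops = list(weights)
    op = rng.choices(ops, weights=[weights[o] for o in ops], k=1)[0]
    if op == "cross" and len(archive) >= 2:
        # Crossover combines two distinct parents.
        parents = rng.sample(archive, 2)
    else:
        if op == "cross":
            op = "diff"  # fall back when only one program exists
        parents = [rng.choice(archive)]
    return op, parents
```

Keeping the operator choice separate from the LLM choice means each can be tuned, or put under its own bandit, independently.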
But in general we found that having diversity in terms of operators is also helpful in discovering new things. And I wanted to follow up on the point you made before about this being a new paradigm. I think so too. I'm really convinced. I think right now we're at the beginning, where we still think a lot about this chat assistant interface
as the way we interact with LLMs, but it's most of the time inherently single-threaded, right? So we're sitting in front of the computer, we're interacting with the chat, we're seeing changes as they occur in the editor, we accept them and so on.
But I think this is also just a stepping stone towards a more, let's say, distributed way of thinking about research, optimization and so on. So I like to think of vibe coding and vibe chatting on the one hand, and on the other hand we have vibe optimization and vibe researching, where my ideal future scenario is one in which you as a researcher co-work during the day with a
system like Shinka or the AI Scientist, and you steer the ship like a shepherd in some sense. And then during the night you press play and you go to bed, and in the background you have multiple experiments running and new ones automatically being proposed by LLMs, evidence being accumulated, and then in the morning you come back and you have a
multi-threaded system running in parallel. And you're more the shepherd of this ship than the person actually executing experiments and analyzing. Oh yeah, you're still analyzing, but you're not executing. This is happening by the system itself.
Yes. And increasingly this might be semi-supervised or even proactive. I mean, you know, there's that new product from OpenAI where it knows what you're interested in and while you sleep it's going off and, you know... ChatGPT Pulse, that's right. And, you know, we're in the situation now where we're reasonably technical people. So you know, MATLAB and Mathematica, they're supremely powerful. But you need to know how to express problems precisely.
Whereas I can imagine a future where we express problems just in natural language, or maybe just based on our interactions with language models the platform knows what we're interested in and it can just go and find things on our behalf, because this is about democratizing this technology to people who perhaps
for. I think one of the bigger problems there is this verification aspect, right? In the sense that oftentimes it's easier to generate a lot of solutions than to actually hard-verify them, right? Language models are capable of doing
soft verification, looking at code and latently running something like a stack trace of execution, right? But it's not exact, right? And I think these notions of reward hacking, of not doing real discoveries but shortcutting them, is one where we need to put more
time and effort into figuring out how to make sure that this actually moves in the right direction, right? And I would hope that language models at some point can do this efficiently themselves, right? So either implementing it in code or latently doing it. But this is also part of the problem, right? It's not only coming up with the problem, but also with the automatic verification at the same
Yeah, isn't it a tantalizing idea that there are natural patterns in the world and the building blocks to construct novel solutions are already there? Right. And maybe they're there for a reason. Maybe they just reflect natural regularities in the universe. 'Cause there's always this question of, you know, intelligence is about adapting to novelty. So the world is always changing,
and the world tomorrow will have things that we can't explain with our knowledge today. But we do have abstract knowledge that could be easily recombined to explain the future, and LLMs might already have those building blocks.
Yeah, for sure. I think in some sense the more you think about Occam's razor applying to everything in our world, be it language or be it science, it is pretty interesting, because these artifacts now go into our language models of today and potentially there is some amount of this being captured. I think, though, it might also be an inductive bias that leads to a local optimum at some point, right? And you need more complexity.
But I do think with systems that do this evolutionary, mutation-style approach, you might still push the system out of these locations
eventually. Yes. And there's also the notion of the importance of adaptivity. So this is what Chollet says intelligence is.
And since we've had these models that actually do adaptivity at inference time, so things like test-time active fine-tuning and the reasoning models and so on, they started getting non-trivial performance on ARC. Now it's very, very expensive to adapt huge foundation models. You know, it's just a practical concern, and we haven't done that yet.
But what we can do is build systems like ShinkaEvolve that leverage the best of both worlds. So they leverage frozen foundation models, but they give you adaptivity, and the purpose of adaptivity is to respond to novelty, to synthesize new building blocks in this principled tree-like structure that allows us to adapt to novelty. So we are having our cake and eating it.
I have to say I found it very interesting that Jeremy, in your podcast, when you asked him about Shinka, was basically saying he doesn't believe that there are a lot of percentage points to be gained by using a system like Shinka, but you can make it much more efficient.
Right. That was the gist of his answer. And to me it's like, once you have made it much more efficient, you can scale it up again, right? So if you essentially have a cheaper system that can generate many more instructions, I would expect that by the
nature of open-endedness you might get some amount of improvement out of it. Right now I don't have any evidence for it. I would love to collect that evidence. It's again the magic of open-endedness that comes into play. As long as these training examples of ARC-AGI give you a good signal for a final test submission, you should be able to
progress. Yes, and that is a great segue, because certainly on the circle packing problem it was so sample efficient that in less than two hundred interactions with an LLM you converged on the solution. But I was thinking that
Great.
But it's still quite dependent on the starting conditions. You know, we talk about this design bias and so on. So what we put in is very important. But now what we could do is scale out. So we could run this a thousand times and we could have another process which
prompts, generates, breeds the starting conditions. Because every time we run ShinkaEvolve, what it's doing is searching parts of the epistemic tree. Sure. And what would happen if we just scaled that out massively?
We haven't tried, but you could even start with an empty program, right? Which would be basically the same, right? And then you would branch off of that empty program, I would expect. Yeah, we haven't done this simply for cost and time reasons. But I do think in many ways
this is the question that will push us towards this true open-ended vision of running a system for a month or so, right? Really trying to squeeze this out. Yeah, I'm not sure if we're entirely there yet, but I will do my best that we
And the reason this is interesting is we know as a practical matter that we can't start with nothing. If we were just starting from the most primitive building blocks, the search space would just be huge and there'd be no learning signal. So we know we need to start a little way up the stack, but we can massively parallelize that.
So let's say we have a thousand different instantiations of ShinkaEvolve. It doesn't have to be embarrassingly parallel. We could still have some sharing. So during their execution we could still have a little bit of crossover, and maybe then we could run all the ShinkaEvolve instantiations in a similar kind of meta-evolution loop, and
my suspicion is, contra Jeremy, I agree with you. We know there are diverse stepping stones out there that could dramatically improve many of these solutions. We simply haven't scaled it up.
Yeah. Yeah. I also believe that a system like ShinkaEvolve could be able to automatically detect whether an instruction-based optimization approach or a transform-based approach is actually the right thing to do for a given problem. And sometimes potentially it's even the mixture, right? There are some things you can probably articulate more easily in Python than you can articulate
in language, right? So I would be really interested in exploring that.
Yeah, I mean, you mentioned Jeff Clune earlier. What was Jeff Clune's paper, the thing that generates property.
Capability discovery.
I did speak to him about this at NeurIPS, but something like that could be fascinating as well. You know, where we're also generating the problems and solutions and then kind of moving them back in. But I think the way this will land commercially is there'll be a new type of GPT
where everyone is solving different types of problems, and the system will be like a kind of ShinkaEvolve but a massively distributed version, where mathematicians are using the platform over here to solve this problem and it will see commonalities. And it will kind of link them together. Because you need to leverage human creativity in this process as well, I think.
A big challenge going forward is going to be how we change our incentive system for this to actually scale, right? I think, for example, some amount of economy will be needed, or some amount of mechanism design, in order to make sure that everyone is still happy to engage in it, right? So maybe we're going to have many more leaderboards for whatever is numerically
scorable. And I think it will be really, really interesting to see how compute, these automated agents, human shepherding and steering will ultimately change and revolutionize science and, I guess, society more
generally. And Rob, looking at the future, we've got a load of people in San Francisco that want to scale language models, and they are adding in implicit forms of adaptivity and composition. So they're building controllers and they're doing
reinforcement learning with verifiable feedback and so on. I think that you subscribe to the slightly different idea that we need to be far more open-ended and we need to be using evolutionary algorithms and so on. But do you think that they are on a path to nowhere? Do you think they might change tack? I mean, where is this going?
So I actually think that these things can be complementary, right? In the sense that, let's say, you fine-tune your model to be a circle packing
expert, right? So I do believe that mixing different RL fine-tuned models into the ensemble of models and then having a good way to adaptively select which model to use is not a bad idea, right? So to me, I just very fully subscribe to this philosophy of open-endedness, and reading Ken's and Joel's book was
really a fundamental moment in my life, and I want to see how far we can push this. And I think we're not yet at convergence, where either the capabilities of the models have converged, or the way we scaffold around them, or the way we humans interface with them. So to me there are really these three
points: model capability, model scaffolding, and the user interface. And I think we still have a lot to push on all three angles.
¶ Automated Design of Agentic Systems (ADAS)
Beautiful. The only thing we didn't talk about was we spoke about the circle packing problem, but you also applied it to a few other things. Can you tell us about that?
So one thing we did was we used a framework called ADAS, Automated Design of Agentic Systems, where basically instead of manually writing an agent scaffold, you use an LLM to write agent scaffolds for a specific task, right? So what we did is we looked at mathematics tasks, so AIME, and we used Shinka to evolve basically an agent, right? So using an agent to evolve an agent.
And we found that there we could dramatically improve the performance of very cheap models like GPT-4.1 nano. But the agent scaffold was also able to generalize either to other language models
or to different years of AIME, right? That was one application. One other important application that we did was ALE-Bench. ALE-Bench is basically work done by other folks at Sakana, including Yuki, who's also part of the paper, which considers heuristic programming contests previously run by AtCoder, which is this famous
Japanese competitive programming organization. And we showed that Shinka can also work very well as a co-scientist. So basically we took initial solutions obtained by an ALE agent that was previously designed, and then we optimized on top of these initial solutions with Shinka and showed that on one of these
programming tasks, if the combination of this agent and Shinka had competed in the challenge, it would have ranked second place, basically. So I think there's some evidence that Shinka can work as a co-scientist, and not only for LLM agents but potentially even for humans, like we discussed. And then the final application that we looked at was designing mixture-of-experts load balancing loss functions.
So at Sakana we've done some previous work called DiscoPOP. I think we discussed this during the last podcast we did, where we're using LLMs to design objective functions. And back then we did it for preference optimization and post-training. And here we did it for load balancing
of experts. Also there we found that within, I think, even only 20 generations we were able to explore, let's say, not only a single objective function but, let's say, a convex hull where there are different trade-offs between performance and load balance. So I think this is another application of Shinka where it's not only about finding the best solution, but essentially illuminating
a program space where there are always potential trade-offs, for example between runtime and the quality of the circle packing, right? And having a system that can explore all of these is
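[Editor's sketch] The trade-off set described above can be made concrete as a Pareto front: the candidates for which no other candidate is at least as good on both objectives and strictly better on one. The sketch below assumes two objectives that are both minimized (for example, task loss and load imbalance); the function name and the tuple layout are hypothetical, and this is not the analysis code from the paper.

```python
def pareto_front(points):
    """points: list of (name, loss, imbalance); lower is better for both."""
    front = []
    for name, a, b in points:
        # A point is dominated if another is no worse on both axes
        # and strictly better on at least one.
        dominated = any(
            (a2 <= a and b2 <= b) and (a2 < a or b2 < b)
            for _, a2, b2 in points
        )
        if not dominated:
            front.append(name)
    return front
```

For instance, with candidates `("A", 1.0, 3.0)`, `("B", 2.0, 2.0)`, `("C", 3.0, 1.0)` and `("D", 3.0, 3.0)`, the first three each represent a different trade-off, while `D` is dominated by `B`.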
I'm very excited to see you apply this to the ARC challenge. What are your thoughts about that?
I still need to collect results.
Ha ha.
hard claims before having done this. But I would hope that there is some chance of improving the cost of these systems, for sure.
Oh, very exciting. So you've done some experiments. Exciting news is potentially coming.
I've started looking into it.
Yeah, and I mean, what are your thoughts in general about ARC though?
I think it's great. I think it's really important and it fills an important gap, and I do really deeply respect Francois. I read the paper when it first came out, and no one thought of actually being able to get numbers above ten percent, right? And
it's also pretty fascinating on a society level how far we've come since then. And sometimes while you're deep in, let's say, battle mode or work mode, you kind of forget where you were one year ago, and then just looking back it's pretty amazing. Also how far we've come since
one. It's insane. I mean, I think Francois doesn't get enough credit, because it's such a good benchmark. And not necessarily for the reasons people think, because Francois is always saying that we need to have a benchmark which is easy for humans and hard for AI.
And in a sense, that's not quite the case. I said when ARC v2 came out that it's actually very difficult for humans. You know, there was one task where Duggar was stumped for about fifteen minutes, and there were three of us looking at it and we just
And it's one of those things that, depending on your perspective, you might get it straight away or you might not. So there's that criticism. And people have said that ARC v3 is even harder. Yeah. You know, but I think that's rather missing the point. I think he's saying that with a lot of these competitive coding problems
the dataset is contaminated. These are problems that have been solved before in part or in whole, which means when you look at the epistemic tree, many of the building blocks for solving them are very high up in the tree. He's looking at problems where there is very little dataset contamination.
They need to be solved from very abstract building blocks. So you're starting much lower down the tree and you're synthesizing a model by composing together very abstract building blocks, which is the essence of intelligence. Yeah. And I think for that reason ARC is really pushing us to build adaptive systems which we could say are intelligent. Yeah.
I agree. I mean, in many ways I'm really looking forward to the next years and seeing how far we can push this, and then also how much generalization we can get afterwards. 'Cause I believe when you look at the more recent models, they're getting much better at the transform-style code evolution or outputting for ARC than they are on the instruction-based level. And I think this might already be a small sign of some amount of
overtraining, on ARC-AGI-1 at least, right? I do believe there are some aspects of work which will be automated before it comes to full science automation and the type of work I'm doing. But I could imagine that certain parts of the dimensions that I deal with every day are for sure going to be hit by AI. And then the question is, are there going to be new dimensions opened up that we as humans will fill in, right? And I think what I said before about shepherding and so on,
I really hope that that's the way forward, right? In the sense that humans are the ones steering the ship while just being massively amplified in their productivity.
Right now, I am not really seeing the kind of job market disruption that was being predicted. I know from personal experience that in a sense it's made it very difficult to hire people. You know, when script writers use ChatGPT, I can spot it instantly. And writers and copy editors are actually in more demand than they were before, fixing all of the crap that has been generated with ChatGPT.
And there's the cloud analogy as well. So, you know, IT system administrators who were earning sixty thousand pounds a year in the UK rebranded as cloud DevOps engineers and more than doubled their pay.
And people are very adaptive. They see new trends, new bandwagons, and they just adapt and add value on top. And that has been the trend for a very long time. Do you think that AI is going to be so transformative that it will transcend people's ability to adapt?
I think it's just a question of speed, right? So I was talking about cultural evolution and technological evolution, and it seems like we humans need more adaptation and more time to get used to the technology, to carve out these niches where we can fill in and it's complementary, right?
So first off, I think we're still not at the ceiling of the technological progression, right? So maybe in a couple of years we will need less of this slop editing, like you said. But I do think we need some more time to adapt to the different modalities of interacting with these systems, right? I think everyone can interact with a chat assistant. But I think this is the most naive form of interacting with
AI agents, for example, right? So yeah, I think we need to get the pacing of all of this right, and we need to do much more exploration in human-machine interfaces, UI/UX design, and how to make sure that humans feel fulfilled during this experience.
particularly relevant because, you know, you were behind the AI Scientist paper and there's now version two of that. Allow me to be a tiny bit skeptical. You know, we were talking about when we evolve systems to do a particular thing. And at the moment it feels like, as good as they are, they are still quite parasitic on the instructions and intentions of the human supervisor. So it's very much
an exchange between the humans and the system. Because the implication is that in the future we might have systems that are so autonomous and so open-ended and can figure out valuable things to research that humans wouldn't be needed anymore. And the reason why I'm not that worried yet about labor market disruption is I still believe deeply that humans are the source of deep understanding and creativity in the world. If I didn't believe that, I would be very worried.
I agree. To me, the AI Scientist, v1 and now v2, are glimpses into a potential transformation. But I fully agree that in order to make really big scientific breakthroughs, like multiple of them every day or whatever, you still need humans in the loop to either
seed or guide the direction in which to explore, or to verify, check and actually transfer these insights, right? So I think it's not going to be that all ML PhDs will be unemployed. It's more going to be a co-evolution of humans with this technology. And potentially, in an ideal future for me, it will allow humans to focus on what they're really, really great at, right? So I think it's going to be an amplifier
of these latent dimensions humans are great at, right? I think something that's critical is that we as humans try to interact with these systems as early as possible in order to actually have influence and ownership over this development process, right? It's ultimately collective intelligence that will shape all of these systems together.
And do you think these systems can become incredibly sophisticated, such that they are, you know, somewhat detached from humans?
Well, I mean, with the AI Scientist v2 we released that one paper that we submitted to an ICLR workshop was able to pass the acceptance threshold before meta-review. So I do think at least for workshop-level contributions we're getting there. While not every AI Scientist paper submission
is reaching that threshold, we're at the point where we can even talk about noisy review processes and this actually being something where, as long as you have a large budget, you might get something out of it. I think going forward, for the bigger innovations and so on, for now you still need humans. But we're at the GPT-1 moment of making this
a reality, and potentially in ten years this is going to look very, very different once the infrastructure for it has also been built up, right? So there are places like Periodic Labs, right, which are now building real physical labs with robotic systems to automatically execute experiments. This will take some time.
But it is imaginable for sure that as we do RL on these types of systems and we actually also account for negative results and for actual hypothesis testing, so getting these systems to be really good hypothesis testers with verifiers in the loop, we might be able to unlock many more capabilities.
Yeah, I mean, I suppose I don't want to sound like a Luddite. So it's entirely possible that I just don't have the imagination to think about the future. So it is possible that in the future these systems might understand very deeply and be creative. You know, I think right now the problem is they only understand things a few levels down in the epistemic tree.
So they can do some surface-level recombination and they can discover new things in the basin of things that they've already discovered. But we understand things very deep down in the epistemic tree, which means our cone of creative potential is much wider. It's possible that that gap might be closed. What would happen then?
The way I kind of think about the scientific process is as a tree search, ultimately, right? So I think a lot of analogies from evolution transfer to scientific research, right? In the sense that we traverse a tree of different ideas or different experiments, and then in the paper we report one path through that tree.
And I think, as I alluded to before, we need many more full-tree datasets for training these LLM systems to actually learn how to do this exploration and this foraging, basically. At the same time I feel like evolution will also take place on a cultural level, for us, right? We will get better at steering the ship, and
I can imagine that in a future world the way we do research will be completely different. And I'm pretty sure that right now already ninety-nine percent of machine learning research is done with AI assistance, right? Think about ChatGPT brainstorming, Cursor coding, Claude Code, et cetera. In the long run, we're going to move on that spectrum from "with AI" closer to "by AI", and then more high-level orchestration and overseeing by humans.
There's also the notion of how intrinsically coupled to humans the value function is. So one school of thought is that AI will develop a mind of its own and it will basically transcend humanity and just have agency which is not parasitic on ours.
I personally don't subscribe to that view. The other view is that, let's say, the AI Scientist version ten is going to be continually doing epistemic foraging. It's going to be finding new things that are useful. And they kind of have to be useful to us, because if it finds things that are not useful to us, then we just won't use them and then nothing will happen. So do you think there'll always be a kind of coupled value function to humans?
Jeff Clune had this work on OMNI, right? Using LLMs as amortized notions of interestingness for humans, right? And I think ultimately the way we train these systems is coupled to human data, right? And going forward it will also be coupled with human data that is collected using verifiers, right? So I have a hard time believing that in the long run, when you run this open-endedness paradigm with AI scientist agents, it's going to completely divert to something that's either fully non-interpretable or unrelated to problems we as humans care about, right?
And then again, humans can steer to a certain degree where the search happens, right? So you can tell a system, okay, do cancer research, right? And work on problems that we care about. And ultimately, we are the ones who control how many FLOPs are being pushed in.
Yeah, because as a thought experiment, I can imagine, let's say, in the world of mathematics, what if an AI scientist could come up with entirely new problem formulations and then solve them? These are things that humans had never conceived of before, and maybe they would be less interested in the answer because humans hadn't spent time thinking about it. And if you think about it, we could just explore the phylogeny of mathematics to the nth degree.
Yeah. And at some point maybe we just wouldn't care anymore. Maybe we can just carve out that space just forever and ever.
Yeah, but maybe down the road there is a stepping stone that enables a new innovation in a different field that we actually care about, right? So it's very hard to say a priori whether or not something is interesting, right?
Yes. And I love this idea of diverse intelligences and diverse minds. Maybe we could just create artifacts in a space which is completely alien to us, and we might even ascribe moral value to them, and we might not want to turn off the power because we want these alien artifacts to stay alive.
Maybe. I read a lot of science fiction, but I would shy away from speculating about all of this. But one thing I'm extremely certain of is that the way we conduct research in science is going to fundamentally change in the next five years, ten years, and twenty years. And I hope that we're going to be able to tackle some of the biggest problems which are still seemingly unreachable right now, with and by AI.
So Terence Tao has posted that he's been using GPT-5 and it's been speeding him up. It's taking away a lot of the drudgery. Scott Aaronson posted something similar as well. But the cynical take is that maybe laziness is stepping in, and in some pernicious way using AI models is actually stopping us from thinking outside the box. So it's encouraging us to kind of search in the neighborhood of things that are known. And that is very useful. It's very useful to have an artifact that knows all of the experiments, all of the things that were ever done by people twenty years ago. But now we don't have people really applying their brilliance, their talent, in completely new areas.
So first off, it's great that these experts are already using the technology in their day-to-day work, right? And I think it's also important that really, really top-level scientists try to push what's possible with these systems, or squeeze out where there might be blind spots or things these systems can't do. Second off, I think it comes down to discipline and how you raise the next generation, right? So discipline on a personal level, like how much do you just tab-accept everything that's being proposed by these systems, and responsibility in terms of educating the next generation, in the sense that we need to teach our kids that ultimately what comes out of these systems might not always be true, that facts can be subjective, if you will, and that there needs to be more research about what's being given to you. And I think this will be, like I said, this cultural evolution that we have to step through and try to make the best out of.
Yeah, the autopilot thing is very interesting, because there is a tendency using Cursor where at some point the models are getting so quick that you can't even read the tokens coming at you, and then you just press accept and you press accept. It's the same thing in cars: as soon as you have too strong an autopilot, you just completely switch off.
And then you see a divergence, because there's something about thinking, that it must be grounded in your path. There's this path dependence. And when you start becoming parasitized by this other train of thought, then you stop thinking about your path and you're not in the driver's seat anymore.
Now, it's a bit of a harsh statement, but sometimes I wonder if these systems, like these coding assistants, are almost like drugs, right? In the sense that you become addicted, you use up all your budget and then you need to load up again, and once you've fully reached the budget limit you feel like, okay, what am I going to do now? And I think once that happens to you, you should really rethink the way you work, right? To me right now there are certain parts where auto-accepting is acceptable, and there are certain parts where it's definitely not, and you really need to go deep into it.
And I think right now we're in this weird non-equilibrium state where things are moving constantly, right? So the systems or the models are changing, the features are changing, the points where the systems are good are changing all the time. And we humans need to constantly adapt to that, right? I think it's a big cognitive challenge. And I think we all just need to be aware that there are certain problems and certain challenges that we have to adapt to. I think the best way to do so is just to interact with the technology as much as you can and maybe find new research ideas out of that.
And how is AI Scientist v2 different from v1?
In v1 we used a template-based approach. So we had a base experiment, and for that base experiment we asked an LLM to generate ideas with Semantic Scholar calls and literature search, and then it implemented these ideas based on the template, right? It basically did code diffs. And then it linearly executed an experiment plan and wrote a paper in the end. And so what could happen was that there was an idea and that idea didn't work out, right? But then in the end, the experiments were still executed linearly and it wrote a paper. This was already impressive in the sense that it looked very much like science. But if you think about human science and the scientific method, it's much more like tree search, like I said before, right? You adapt what you're going to execute next.
And you refine based on evidence that you've accumulated, right? So this is the notion of falsificationism from Karl Popper, right? In the sense that we collect evidence for some hypotheses and reject others, and we do so in a loop, basically, until we want to publish or we find something. And we tried to take this notion and directly build it into the agentic scaffolding for the AI Scientist v2. So now it's basically a parallelizable agentic tree search, where there's no longer a template experiment needed; instead this is drafted up by the LLM itself. And thereby the AI Scientist v2 can be applied to many more settings, if you will. So at the core is this new agentic tree search paradigm.
And then we made a couple of minor technical changes, like using a VLM reviewer for figuring out if the captions of a paper are aligned with the figures. And we scale this up to many more computational nodes and then write a paper in the end.
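The tree search idea described here can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not the AI Scientist v2 implementation: `tree_search`, `Node`, `path_to`, `toy_propose` and `toy_run` are all hypothetical names, the "LLM" that drafts follow-up experiments and the "experiment runner" are replaced by deterministic toy functions, and best-first expansion with a thread pool is just one plausible way to make the search parallelizable.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple


@dataclass
class Node:
    plan: str                        # description of the experiment to run
    score: float = float("-inf")     # metric obtained by executing the plan
    parent: Optional["Node"] = None
    children: List["Node"] = field(default_factory=list)


def tree_search(
    root_plan: str,
    propose: Callable[[str, int], List[str]],   # stand-in for an LLM drafting follow-ups
    run_experiment: Callable[[str], float],     # stand-in for executing code (the verifier)
    steps: int = 3,
    branching: int = 3,
    workers: int = 3,
) -> Tuple[Node, Node]:
    """Best-first search: always expand the most promising experiment so far."""
    root = Node(plan=root_plan, score=run_experiment(root_plan))
    best = root
    counter = 1                                  # unique tie-breaker for the heap
    frontier = [(-root.score, 0, root)]          # max-heap via negated scores
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for _ in range(steps):
            if not frontier:
                break
            _, _, node = heapq.heappop(frontier)
            plans = propose(node.plan, branching)
            # Sibling experiments are independent, so run them in parallel.
            scores = list(pool.map(run_experiment, plans))
            for plan, score in zip(plans, scores):
                child = Node(plan=plan, score=score, parent=node)
                node.children.append(child)
                heapq.heappush(frontier, (-score, counter, child))
                counter += 1
                if score > best.score:
                    best = child
    return root, best


def path_to(node: Optional[Node]) -> List[str]:
    """The single path through the tree that a paper would report."""
    path = []
    while node is not None:
        path.append(node.plan)
        node = node.parent
    return path[::-1]


# Toy stand-ins: a plan is a string, each follow-up appends one "change".
def toy_propose(plan: str, k: int) -> List[str]:
    return [plan + change for change in "abc"[:k]]


def toy_run(plan: str) -> float:
    # Pretend change "a" helps, "c" hurts, "b" does nothing.
    return plan.count("a") - 0.5 * plan.count("c")


root, best = tree_search("x", toy_propose, toy_run, steps=3)
print(path_to(best), best.score)
```

The whole tree, including the branches that did not work out, stays reachable from `root`, which is the kind of full-tree record mentioned earlier; `path_to(best)` is only the one path that would end up in a write-up.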
So I'm trying to say this in the most polite way possible. A critic might say, and I don't want to use the word slop, but a critic might say we are producing papers which appear like papers. So they have figures and they have results and they have things written in a certain way, but they're not grounded deep down the epistemic phylogeny, which means that near the top of the tree we're seeing some novelty and composition happening, but it doesn't reflect a deep understanding. What would you say to that charge?
It's for sure that not every paper that comes out of the AI Scientist v2 is a Nature-worthy publication, right? That's for sure the case. So definitely there is some amount of, let's say, slop, or content that is not a big scientific discovery, being written up by the AI Scientist v2.
But ultimately we showed that it was possible to obtain a workshop-level paper. I do think this is the first time, basically, where we can see that we're able to fully autonomously spend compute, spend API calls, to obtain some amount of scientific insight. And for me, at least right now, it's a good way to prototype ideas or to investigate a certain field, get an initial starting point, initial results, and then to work on top of it.
But for sure more work needs to be done to make this entire process more robust, more efficient, and essentially produce many more true positives.
Yeah, and it might be one of these things, you know, like when we moved from GPT-3 to GPT-4, there was just a massive increase in fidelity. The thing is, with slop, to me it simply means lack of deep grounded understanding. And there's no reason in principle why these things couldn't have a deep grounded understanding. They just don't have it yet.
Yeah. So it's something that could improve over time. But it's likely to improve quite slowly. And then at some point we might just think, oh my God, we've got an AI scientist.
Yeah, I mean, to me this comes back to what we were discussing before. So first off, there is a verifier in the loop, right? In the sense that experiments are actually executed on a computer, right? So the numerical results are fed back into the system to come up with the next thing to explore.
But we haven't made, let's say, a discovery like residual connections, or something that has diffused into everything in machine learning. And I think what we really need is to make these systems much better at integrating knowledge over multiple experiments and at formulating the next hypothesis based on previous insights.
And yeah, this might require some amount of post-training on these traces, basically. But I'm pretty positive that we might also get there with just diversity and scaling these systems up in an efficient way.
I'm just thinking about the first breakthrough discovery: would it resemble the AI Scientist paper or would it resemble Shinka Evolve? So for example, we could do a massively scaled-up Shinka Evolve and we could say, I want to discover a new architectural design. Would that happen, and then we would get the AI Scientist to kind of write it up and do ablations and stuff? Maybe that would be the pattern of it.
To a certain degree. I've been thinking a lot about how you can potentially even combine these two paradigms, right? The AI Scientist and Shinka- or AlphaEvolve-style optimization algorithms. And I do think there is some amount of work to be done on the auto-verification aspect of it, and on the problem formulation aspect of it. The paper writing part is actually the least important thing about the AI Scientist, right? It's a form factor that we humans are used to, and it helps anchor our mental model of a scientific discovery. But ultimately I'm not sure if the paper is going to be the knowledge transmission medium in, let's say, twenty years, right? Something else I've been thinking about a lot is whether or not we can make papers much more easily accessible to agents, right? In the sense that right now it's a LaTeX document.
But you could imagine equipping every paper with several Model Context Protocol (MCP) interfaces so that every figure is reproducible and the data is accessible, and essentially make it much easier for LLM agents to either replicate work or to work off of it afterwards, right? Doing epsilon improvements and ablations yourself through that interface to a paper. But to be entirely honest, I'm not sure if it's going to happen, because there have been many great ideas for improving, let's say, the format of scientific artifacts out there, and people still seem to like the paper format, which has existed for, let's say, hundreds of years, right?
So I think it's a question of incentives again, and of really showing that if something like that existed, it would enable much faster progress of AI agents.
Yeah, the paper is a great human interface. It's a similar thing with automated driving, right? We could revolutionize the road network to have sensors and we could dramatically improve the monitoring and observability and optimization. But I'm fascinated by that idea. So you're saying not just reproducibility of the experiments, but also the way that the figures are designed, and the code, and so on. Because then we could create this huge playground where agents can repurpose, recombine, restudy work that has been published by other scientists. And it also made me think: does having an automated scientist make peer review more or less important?
I do think it actually makes it more important, at least for now, right? In the sense that we now have a mechanism, or could have a mechanism, that generates many, many papers, right? First, it increases the workload on human reviewers, and we need some effective way of filtering and then essentially only taking the cream of the crop for human verification afterwards. Right. So I think for now the ultimate verification is still the human and the diffusion of the result through the community. And we need better tools for doing this filtering and verification automatically. We have the AI reviewer that comes with the AI Scientist. But you actually probably need some form of experiment execution for actually verifying everything.
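As a rough illustration of the filtering idea, here is a minimal two-stage triage sketch in Python. Everything in it is an assumption made for the example: `triage`, `auto_review` and `reproduces` are hypothetical names, the reviewer score is faked with text length, and the "experiment execution" check is faked with a keyword test. It is not how the AI Scientist's reviewer or PaperBench work; it only shows a cheap automated score shortlisting papers, with a more expensive re-execution check deciding what reaches a human.

```python
import heapq
from typing import Callable, Dict, List


def triage(
    papers: Dict[str, str],
    auto_review: Callable[[str], float],   # cheap automated reviewer score (higher is better)
    reproduces: Callable[[str], bool],     # expensive check, e.g. re-running the experiments
    top_k: int,
) -> List[str]:
    """Return the ids of papers worth passing on to human reviewers."""
    scored = [(auto_review(text), paper_id) for paper_id, text in papers.items()]
    # Stage 1: keep only the top_k papers by automated score.
    shortlist = heapq.nlargest(top_k, scored)
    # Stage 2: of those, keep only the ones whose results can be re-executed.
    return [paper_id for _, paper_id in shortlist if reproduces(papers[paper_id])]


# Toy data and toy stand-ins for the two stages.
papers = {
    "p1": "short",
    "p2": "a longer paper with seed",
    "p3": "the longest paper of all, no code",
}
for_humans = triage(
    papers,
    auto_review=lambda text: float(len(text)),
    reproduces=lambda text: "seed" in text,
    top_k=2,
)
print(for_humans)
```

The ordering of the stages is the design choice: the cheap score runs on everything, and the costly re-execution runs only on the shortlist.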
Yeah. But there is, for example, work by OpenAI on PaperBench, trying to go in that direction using LLM soft verification and these types of things. So I'm hopeful that we're going to figure this out in the next few years.
Yeah, and I think one of the Rubicon moments is when the new Transformer architecture, or something massive, is discovered by AI and we're all using it. My worry, I suppose, is that folks like Google, who have enough compute power, are going to be running AI scientists and they're going to own many of these discoveries, which is why it's so important to have work which can efficiently discover new things in science.
And it's important to have work that's openly available, right? I think with the AI Scientist and Shinka, we're really trying to make sure that we can apply the collective intelligence of all of us to shape how this might look in the future.
Amazing. Well, Rob, it has been so fantastic to have you on the show. Sakana is hiring amazing engineers, by the way. So if this sounds like an amazing opportunity, and it is, get in touch with Rob and the guys. And I trust you're working on some exciting new things that are coming up.
Yes.
Absolutely. Rob, thank you so much for coming. Thank you so much, Tim.
